Is your feature request related to a problem? Please describe.
Currently, Ray DAG mode runs the supervisor in the client, which causes several issues:
The client and the Ray cluster are usually not in the same cluster or data center. Communication between them can be unstable and slow, which adds significant latency to Ray task scheduling and results in insufficient pipelining, low task throughput, and low resource utilization.
The Mars Dashboard address cannot be reached from a browser. Because the supervisor is created in the notebook, the Mars Dashboard cannot be accessed through a proxy in the Ray cluster.
The notebook container may be too small to meet the supervisor's resource requirements. A notebook container is typically sized at 2C4G or 4C8G (2 cores with 4 GB, or 4 cores with 8 GB), so the supervisor may run out of memory (OOM) when the data scale is large.
Large-scale failover, which needs a distributed supervisor, cannot be supported. Large-scale failover has to save the lineage of a large number of subtasks, so the supervisor may need to run across multiple Ray actors in order to store that many fine-grained lineages. If the supervisor runs in the client, it is not an independent instance, and it is difficult to extend it and make it distributed.
The Ray driver's resource usage is not managed by the Ray cluster, which also increases the likelihood of OOM.
Describe the solution you'd like
We should support scheduling Ray tasks from within Ray actors, so that the supervisor runs inside the Ray cluster instead of in the client.
In Ray task mode, the supervisor holds a large number of ObjectRefs. If the supervisor is created in the client and the client connects to the Ray cluster through Ray Client, the quadratic number of ObjectRefs produced by intermediate stages must all pass through the Ray Client server, which then becomes the cluster bottleneck.
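To make the "quadratic" growth concrete, here is a small back-of-the-envelope sketch. It assumes an all-to-all shuffle stage in which every upstream subtask produces one output per downstream subtask; that shape is an assumption for illustration, not something stated above, and `shuffle_object_refs` is a hypothetical helper.

```python
def shuffle_object_refs(n_upstream: int, n_downstream: int) -> int:
    """Number of intermediate ObjectRefs in one all-to-all stage.

    Assumes each upstream subtask emits exactly one object per
    downstream subtask.
    """
    return n_upstream * n_downstream


if __name__ == "__main__":
    # With equal parallelism on both sides the count grows as n squared.
    for n in (10, 100, 1000):
        print(n, shuffle_object_refs(n, n))
```

Under this assumption, raising the parallelism from 100 to 1000 chunks raises the number of refs that a client-side supervisor would push through the Ray Client server from 10,000 to 1,000,000.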