[Ray] Support supervisor on oscar for ray task mode #3164

Closed
chaokunyang opened this issue Jun 23, 2022 · 1 comment · Fixed by #3165

Comments

@chaokunyang
Contributor

Is your feature request related to a problem? Please describe.
Currently, Ray DAG mode runs the supervisor in the client, which causes several issues:

  • The client and the Ray cluster are usually not in the same cluster or data center. Communication between them can be unstable and slow, which adds significant latency to Ray task scheduling and results in poor pipelining, low task throughput, and low resource utilization.
  • The Mars Dashboard cannot be opened in a browser. Because the supervisor is created in the notebook, the dashboard cannot be reached through a proxy in the Ray cluster.
  • The notebook container is often too small to meet the supervisor's resource requirements. A typical notebook container has 2 cores and 4 GB of memory (2C4G) or 4 cores and 8 GB (4C8G). When the data scale is large, the supervisor may run out of memory (OOM).
  • Large-scale failover, which needs a distributed supervisor, cannot be supported. Large-scale failover has to save the lineage of a large number of subtasks, so the supervisor may need to run across multiple Ray actors to store that many fine-grained lineages. A supervisor running in the client is not an independent instance, so it is difficult to extend and distribute.
  • The Ray driver's resource usage is not managed by the Ray cluster, which also increases the likelihood of OOM.

Describe the solution you'd like
We should support running the supervisor on Oscar inside the Ray cluster, so that Ray tasks are scheduled from Ray actors rather than from the client.

@chaokunyang
Contributor Author

chaokunyang commented Aug 9, 2022

Another issue is the Ray client server bottleneck:

In Ray task mode, the supervisor holds a large number of ObjectRefs. If the supervisor is created in the client, and the client connects to the Ray cluster through Ray Client, then every one of the quadratically many ObjectRefs produced by the intermediate stages must pass through the Ray client server once. The client server becomes the bottleneck of the whole cluster.
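
To give a feel for the scale involved, here is a rough back-of-the-envelope sketch. It assumes the quadratic count comes from an all-to-all stage (for example a shuffle) in which each of n upstream subtasks emits one ObjectRef per downstream subtask; the issue does not state the exact source, and the function name `intermediate_ref_count` is hypothetical.

```python
def intermediate_ref_count(num_upstream: int, num_downstream: int) -> int:
    """Count ObjectRefs in a hypothetical all-to-all stage.

    Assumption (not stated in the issue): every upstream subtask emits
    exactly one ObjectRef for every downstream subtask.
    """
    if num_upstream < 0 or num_downstream < 0:
        raise ValueError("subtask counts must be non-negative")
    return num_upstream * num_downstream


# With equal fan-in and fan-out the count grows as n squared.
for n in (10, 100, 1000):
    print(f"{n} x {n} subtasks -> {intermediate_ref_count(n, n)} ObjectRefs")
```

Under this assumption, a stage with 1000 subtasks on each side already implies a million ObjectRefs, all of which would be relayed by a single client server when the supervisor lives in the client.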
