Use ray start block in Pod's entrypoint #77
Signed-off-by: chenk008 <[email protected]>
@akanso @Jeffwan ray-project/ray#19546 is merged. I think this PR is a workaround to support raylet failover until we get liveness exec probes.
I have removed |
Sounds good. One last comment: since Ali raised the question, can we keep at least one example file that uses code.py (with a custom command to override it)? That way users will know how to use the current solution to submit jobs, etc. What's more, if some users do not like |
Yes, that is a good idea: have one example using --block and the others without it.
@chenk008 Did you get a chance to verify the changes? I used the following steps to verify the restarts.
raylet.out logs
settings
Am I missing something here?
I think |
I am good with the PR. Can we add a YAML comment in the file to explain to users of the example the impact of --block?
Yeah, I will add a YAML comment.
@Jeffwan I think there is an issue in Ray core. I ran the same test on ray:1.8: when the head node exited and restarted, the other raylets stayed alive but did not reconnect to the head node; only the raylet on the head node itself reconnected. The dashboard shows only one host, which is the head. The other raylets exit 14 minutes after the head exited. The log is shown below.
* use ray start block
  Signed-off-by: chenk008 <[email protected]>
* add block into rayStartParams
* fix ut
* add block in sample config
* add sample without block
Co-authored-by: wuhua.ck <[email protected]>
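The commit list above mentions adding block into rayStartParams. As a rough illustration of how such a parameter map could be turned into the Pod's ray start arguments, here is a minimal sketch. The helper name build_ray_start_command and the BOOLEAN_FLAGS set are hypothetical, and this is not the operator's actual code; it only assumes that block is a value-less flag and that other parameters are rendered as --key=value.

```python
# Hypothetical sketch: render a rayStartParams-style mapping as a `ray start` command.
# This is NOT the operator's real implementation; names here are illustrative only.

# Assumption: flags in this set take no value and are emitted only when set to "true".
BOOLEAN_FLAGS = {"block"}


def build_ray_start_command(ray_start_params, head=False):
    """Return a `ray start ...` command string built from a dict of string params."""
    parts = ["ray", "start"]
    if head:
        parts.append("--head")
    # Sort keys so the generated command is deterministic.
    for key in sorted(ray_start_params):
        value = ray_start_params[key]
        if key in BOOLEAN_FLAGS:
            if value.lower() == "true":
                parts.append(f"--{key}")
            # A boolean flag set to anything else is simply omitted.
        else:
            parts.append(f"--{key}={value}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_ray_start_command({"block": "true", "port": "6379"}, head=True))
```

Sorting the keys is a design choice for reproducible Pod specs; a sample that does not want blocking behavior would just leave block out of the map.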
Why are these changes needed?
When a Ray process (e.g. raylet, GCS) exits, the Pod should restart so that the Ray process can fail over.
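The reasoning here is that the container's entrypoint has to stay in the foreground and exit once one of its processes dies, so that the Pod's restart policy can take over. A conceptual sketch of that pattern follows. It is not Ray's implementation; supervise and wait_for_first_exit are hypothetical names, and the child commands are placeholders.

```python
# Conceptual sketch of a blocking entrypoint: stay in the foreground while child
# processes run, and exit non-zero as soon as any of them dies.
# Hypothetical helper names; this does not reproduce Ray's own code.
import subprocess
import sys
import time


def wait_for_first_exit(procs, poll_interval=0.1):
    """Block until any child process has exited; return (index, return code)."""
    while True:
        for index, proc in enumerate(procs):
            code = proc.poll()  # None while the child is still running
            if code is not None:
                return index, code
        time.sleep(poll_interval)


def supervise(commands):
    """Start every command, block, and return a non-zero code once any child exits."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    try:
        _, code = wait_for_first_exit(procs)
    finally:
        # Stop the surviving children so the whole container can be restarted.
        for proc in procs:
            if proc.poll() is None:
                proc.terminate()
        for proc in procs:
            proc.wait()
    # Any exit of a supervised process counts as a failure, even a clean one.
    return code if code != 0 else 1


if __name__ == "__main__":
    placeholder = [sys.executable, "-c", "import time; time.sleep(5)"]
    sys.exit(supervise([placeholder]))
```

Returning a non-zero code even for a clean child exit is deliberate in this sketch: a long-running process that stops for any reason should cause the container to be restarted.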
Generate the Pod args like ray start --block. This PR also removes the test script ray-code: when ray starts with block, ray-code will not be executed.
Related issue number
Close #62
Checks