[Feature] Clean up init container configuration and startup sequence. #476
Comments
I think shielding the user from the init-container logic has its advantages. We can have the Operator add the init-container to the pod. (We need to make sure that we append it to the list of init containers that the user may have already defined for other purposes.) Since Helm is not the only way to deploy, I don't think we should add the init-container there: users of KubeRay who do not use Helm might miss this logic completely. One question here is how the init container "waits for the GCS to be ready".
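To illustrate the "append, don't overwrite" point, here is a minimal sketch in Python. It is not the KubeRay operator's code (the operator is written in Go); the function name `add_wait_init_container` and the container name `wait-gcs-ready` are hypothetical, and the pod spec is modeled as a plain dict.

```python
import copy


def add_wait_init_container(pod_spec, wait_container):
    """Return a copy of pod_spec with wait_container appended to initContainers.

    Hypothetical helper: user-defined init containers are kept, in order,
    and the injected container is added at most once.
    """
    spec = copy.deepcopy(pod_spec)
    # Treat a missing or null initContainers field as an empty list.
    init_containers = spec.get("initContainers") or []
    already_present = any(
        c.get("name") == wait_container["name"] for c in init_containers
    )
    if not already_present:
        init_containers.append(copy.deepcopy(wait_container))
    spec["initContainers"] = init_containers
    return spec


if __name__ == "__main__":
    user_spec = {
        "containers": [{"name": "ray-worker"}],
        "initContainers": [{"name": "user-setup"}],
    }
    # "wait-gcs-ready" is an illustrative name, not one KubeRay defines here.
    wait = {"name": "wait-gcs-ready", "image": "rayproject/ray"}
    print(add_wait_init_container(user_spec, wait)["initContainers"])
```

Checking for an existing container by name keeps the injection idempotent, so reconciling the same pod template twice would not add a duplicate.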
All Ray versions since ~1.4.0 have a cli command One idea is to use the Ray image for the init container and loop on There's no overhead from pulling the Ray image, since you need it anyway to run Ray.
It's not urgent to fix in the next release -- it works well enough to copy-paste the extra configuration.
Some recent discussion: Just a basic ping to the GCS server should do the trick.
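A "basic ping" could be sketched as a TCP connect loop like the one below. This is only an illustration under assumptions: the function name `wait_for_gcs` is hypothetical, the host is whatever name the head service resolves to, and the port defaults to 6379 on the assumption that the GCS listens on Ray's default port. A successful TCP connect only shows that something is accepting connections on that port, not that the GCS is fully healthy.

```python
import socket
import time


def wait_for_gcs(host, port=6379, timeout_s=300.0, interval_s=2.0):
    """Poll host:port with TCP connects until one succeeds or timeout_s elapses.

    Returns True once a connection is accepted, False on timeout.
    The default port is an assumption (Ray's default GCS port).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A completed TCP handshake is treated as "reachable".
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

An init container could run something like this and exit non-zero when the function returns False, so the pod stays in its init phase until the head is reachable.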
Hi @DmitriGekhtman, would you mind explaining some details about this issue? I have read this issue and the related Slack discussions. As I understand it, the current solution is to
Is that correct? Thank you!
I'd recommend the following. First, remove initContainers from all sample configs. They accomplish nothing for recent Ray versions but might have been necessary for successful startup with very old Ray versions, for which we do not guarantee compatibility. Next, investigate whether we need to change anything in the worker startup sequence.
One option is to modify the Ray code to make the number of retries adjustable via env variable. I think the retry logic is here.
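As a rough sketch of what "adjustable via env variable" could look like, the snippet below reads a retry count from the environment and falls back to a default. It is not Ray's actual retry code: the variable name `HYPOTHETICAL_RAY_CONNECT_RETRIES`, the default of 5, and the function name are all made up for illustration.

```python
import os

# Hypothetical names: Ray does not necessarily define this variable or default.
RETRIES_ENV_VAR = "HYPOTHETICAL_RAY_CONNECT_RETRIES"
DEFAULT_RETRIES = 5


def get_num_retries(default=DEFAULT_RETRIES, env_var=RETRIES_ENV_VAR):
    """Return the retry count from env_var, or default if unset or invalid."""
    raw = os.environ.get(env_var)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        # Ignore non-numeric values rather than failing at startup.
        return default
    # Zero or negative counts are treated as invalid.
    return value if value > 0 else default
```

With something like this in place, a worker pod spec could raise the retry count through its `env` section instead of relying on an init container.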
What's the difference between "recent Ray versions" and "very old Ray versions" mentioned above? Thank you!
Uh, good question. I actually don't know - @akanso mentioned that the init containers were necessary to get things to work for older Ray versions, roughly two years ago.
Search before asking
Description
At the moment, we instruct users to include an init container in each worker group spec.
The purpose of the init container is to wait for the service exposing the head GCS server to be created before the worker attempts ray start.
There are two issues with the current setup:
After the initContainer determines that the head service is ready, the Ray worker container immediately runs ray start, whether or not the GCS is ready. ray start has internal retry logic that eventually gives up if the head pod does not start quickly enough -- the worker container will then crash-loop. (This is not that bad given the typical time scales for provisioning Ray pods and given ray start's internal timeout.)
The tasks are to simplify configuration and correct the logic.
Two ways to correct the logic:
Advantage of 2. is that it's simpler.
Advantage of 1. is that it's perhaps more idiomatic and gives more feedback to a user who is examining worker pod status with kubectl get pod -- the user can distinguish "Initializing" and "Running" states for the worker container.
If we stick with an initContainer (option 2), we can either
Use case
Interface cleanup.
Related issues
This falls under the generic category of "interface cleanup", for which we have this issue:
#368
Are you willing to submit a PR?