-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] Ray container must be the first application container #1379
[Feature] Ray container must be the first application container #1379
Conversation
87beb6c
to
860765d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused by the PR description. Can you summarize the changes in the PR? It looks like:
-
Before, the Ray container could be any container and the code tries to handle that (but it handles it poorly).
-
After this PR, it's still possible that the ray container could be any container, but now the code explicitly assumes it's the first container.
Does this PR just make it fail faster if the Ray container is not the first container? What's the error message?
Without this PR, users can only set the Ray container as the first app container in a Pod because of bugs. This PR defines that the Ray container must be the first app container explicitly. This can avoid a lot of complexity in the implementations. |
That part makes sense, so in my understanding, if the user set the Ray container to be the second container, then before this PR, it would fail in some confusing and subtle way. How does it fail after this PR? What's the error message? |
Currently, we don't print any related error message from KubeRay side because we lack a method to verify if the image is a Ray image. However, it is pretty easy for users to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification! Ideally we should document this requirement (that ray be the first container) if it isn't already documented.
The RayService test also fails in the master branch, so it is not related to this PR. I will address the failure in a separate PR. |
…project#1379) Ray container must be the first application container
Why are these changes needed?
Unlike init containers, which start one by one in a sequence, all application containers in the same Pod can start simultaneously. Therefore, it is fine to enforce the Ray container to be the first application container.
Backward compatibility (v0.6.0):
[Case 1]:
The head Pod crashes repeatedly because KubeRay identifies the first app container (i.e., Nginx container) as the Ray container and injects
ray start ...
command into the Nginx container.[Case 2]:
ray
and value"true"
in both head and workers.The function
getRayContainerIndex
identifies the Ray container as the second app container. While the head Pod starts successfully, the worker's init container hangs indefinitely. This happens becauseFindRayContainerIndex
always recognizes the first app container as the Ray container, and this function helps generate the spec for the head service. As a result, the head service exposes the Nginx port instead of the ports defined in the Ray container. Hence, the worker Pods cannot connect with the Ray head Pod successfully.In conclusion, KubeRay has never supported a Ray container with an index other than 0.
Related issue number
Checks