[Feature] Ray container must be the first application container #1379

kevin85421 · 2023-08-31T07:29:14Z

Why are these changes needed?

Unlike init containers, which run one at a time in sequence, all application containers in the same Pod can start simultaneously. Therefore, it is fine to require that the Ray container be the first application container.

Backward compatibility (v0.6.0):

[Case 1]:

Add an Nginx sidecar container as the first application container, and the Ray container as the second.
Gist

The head Pod crashes repeatedly because KubeRay treats the first app container (i.e., the Nginx container) as the Ray container and injects the `ray start ...` command into it.
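The failure in Case 1 can be sketched in a few lines of Python. This is an illustrative model only: Pods are plain dicts, `inject_ray_start` is a hypothetical stand-in for KubeRay's command injection (not its actual Go code), and the container names and images are placeholders.

```python
# Illustrative model only: inject_ray_start is a hypothetical helper,
# not a KubeRay function. Pods are modelled as plain dicts.

RAY_CONTAINER_INDEX = 0  # the first app container is assumed to be Ray


def inject_ray_start(pod: dict) -> dict:
    """Return a copy of the Pod with `ray start ...` set on container 0."""
    containers = [dict(c) for c in pod["containers"]]  # avoid mutating the input
    containers[RAY_CONTAINER_INDEX]["command"] = ["/bin/bash", "-c", "ray start ..."]
    return {**pod, "containers": containers}


head_pod = {
    "containers": [
        {"name": "nginx", "image": "nginx"},              # sidecar listed first
        {"name": "ray-head", "image": "rayproject/ray"},  # Ray listed second
    ]
}

patched = inject_ray_start(head_pod)
for c in patched["containers"]:
    # The Nginx container receives the Ray command; the Ray container gets none.
    print(c["name"], c.get("command"))
```

In this model the Nginx container ends up with a command it cannot run, which matches the crash loop described above.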

[Case 2]:

Add an Nginx sidecar container as the first application container, and the Ray container as the second.
Add an env var named `ray` with the value `"true"` in both head and workers.
```
env:
- name: ray
  value: "true"
```
Gist

The function `getRayContainerIndex` identifies the second app container as the Ray container. While the head Pod starts successfully, the worker's init container hangs indefinitely. This happens because `FindRayContainerIndex`, which is used to generate the spec for the head service, always treats the first app container as the Ray container. As a result, the head service exposes the Nginx port instead of the ports defined in the Ray container, so the worker Pods cannot connect to the Ray head Pod.
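The mismatch in Case 2 can be modelled as two lookups that disagree. The sketch below is in Python and mirrors only the behaviour described above. The function bodies, the assumption that the `ray` env var sits on the Ray container, and the port numbers are all illustrative assumptions, not KubeRay's Go source.

```python
# Illustrative model only. The two index functions mirror the behaviour the
# description attributes to getRayContainerIndex and FindRayContainerIndex;
# their bodies are assumptions.


def get_ray_container_index(containers: list[dict]) -> int:
    """Pick the container whose env has ray="true"; fall back to index 0."""
    for i, c in enumerate(containers):
        for env in c.get("env", []):
            if env.get("name") == "ray" and env.get("value") == "true":
                return i
    return 0


def find_ray_container_index(containers: list[dict]) -> int:
    """Always treat the first app container as the Ray container."""
    return 0


def head_service_ports(containers: list[dict]) -> list[int]:
    """Collect service ports from the container find_ray_container_index picks."""
    ray = containers[find_ray_container_index(containers)]
    return [p["containerPort"] for p in ray.get("ports", [])]


head_containers = [
    {"name": "nginx", "ports": [{"containerPort": 80}]},
    {
        "name": "ray-head",
        "env": [{"name": "ray", "value": "true"}],
        "ports": [{"containerPort": 6379}, {"containerPort": 8265}],
    },
]

print(get_ray_container_index(head_containers))  # index used to start Ray
print(head_service_ports(head_containers))       # ports the service exposes
```

Here Ray starts in the second container, but the service ports are taken from the first one, so workers would have no Ray port to reach through the head service.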

In conclusion, KubeRay has never supported a Ray container with an index other than 0.

Related issue number

Checks

I've made sure the tests are passing.
Testing Strategy
- Unit tests
- Manual tests
- This PR is not tested :(

architkulkarni

I'm a little confused by the PR description. Can you summarize the changes in the PR? It looks like:

- Before, the Ray container could be any container, and the code tried to handle that (but handled it poorly).
- After this PR, the Ray container could still be any container, but the code now explicitly assumes it is the first container.

Does this PR just make it fail faster if the Ray container is not the first container? What's the error message?

kevin85421 · 2023-08-31T17:53:24Z

Without this PR, users can in practice only set the Ray container as the first app container in a Pod, because bugs break every other position. This PR makes the requirement explicit: the Ray container must be the first app container. This avoids a lot of complexity in the implementation.
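One way to picture the simplification is a single shared constant that every consumer goes through, so the command injection and the service ports can no longer disagree about which container is Ray. This is a hedged Python sketch: `RAY_CONTAINER_INDEX`, `ray_container`, `ray_ports`, and `with_ray_command` are hypothetical names, not KubeRay's API.

```python
# Illustrative model only: all names here are hypothetical.

RAY_CONTAINER_INDEX = 0  # the convention made explicit: Ray is always first


def ray_container(pod: dict) -> dict:
    """Return the Ray container under the first-app-container rule."""
    return pod["containers"][RAY_CONTAINER_INDEX]


def ray_ports(pod: dict) -> list[int]:
    """Ports for the head service, read from the same container as below."""
    return [p["containerPort"] for p in ray_container(pod).get("ports", [])]


def with_ray_command(pod: dict, command: str) -> dict:
    """Return a copy of the Pod with the command set on the Ray container."""
    containers = [dict(c) for c in pod["containers"]]
    containers[RAY_CONTAINER_INDEX]["command"] = ["/bin/bash", "-c", command]
    return {**pod, "containers": containers}


pod = {
    "containers": [
        {"name": "ray-head", "ports": [{"containerPort": 6379}]},  # Ray first
        {"name": "nginx", "ports": [{"containerPort": 80}]},       # sidecar after
    ]
}

patched = with_ray_command(pod, "ray start ...")
print(ray_container(patched)["name"], ray_ports(patched))
```

With the sidecar placed after the Ray container, both helpers resolve to the same container without any per-Pod lookup logic.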

architkulkarni · 2023-08-31T17:58:25Z

That part makes sense. So, in my understanding, if the user sets the Ray container to be the second container, then before this PR it would fail in some confusing and subtle way. How does it fail after this PR? What's the error message?

kevin85421 · 2023-08-31T18:47:17Z

> How does it fail after this PR? What's the error message?

Currently, we don't print any related error message from the KubeRay side because we lack a way to verify whether an image is a Ray image. However, users can easily run `kubectl logs ...` against the head/worker Pods, where they will see error messages complaining that the image has no `ray` binary.

architkulkarni

Thanks for the clarification! Ideally we should document this requirement (that the Ray container be the first container) if it isn't already documented.

kevin85421 · 2023-08-31T19:53:55Z

The RayService test also fails in the master branch, so it is not related to this PR. I will address the failure in a separate PR.

kevin85421 marked this pull request as ready for review August 31, 2023 08:12

kevin85421 added 2 commits August 31, 2023 08:13

update

e330a4a

update

860765d

kevin85421 force-pushed the ray-container-index branch from 87beb6c to 860765d Compare August 31, 2023 08:13

kevin85421 requested a review from architkulkarni August 31, 2023 16:56

architkulkarni reviewed Aug 31, 2023

update

d116d78

architkulkarni approved these changes Aug 31, 2023

kevin85421 merged commit ad06bbd into ray-project:master Aug 31, 2023

kevin85421 mentioned this pull request Aug 31, 2023

[Bug] Environment variable to select Ray container is lower case sensitive. #1078

Closed

2 tasks

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023

[Feature] Ray container must be the first application container (ray-…

4835361

…project#1379) Ray container must be the first application container
