[GCS FT] Consider the case of sidecar containers #1386

Merged: 5 commits merged on Sep 5, 2023

@kevin85421 kevin85421 commented Sep 3, 2023

Why are these changes needed?

PR #1341 configured KubeRay to delete Pods that have a restart policy of Never and are in a terminal state (i.e., Succeeded or Failed). However, that logic misses an edge case: Pods with sidecar containers.

According to the Kubernetes documentation on Pod phases, a Pod status of Running indicates that "at least one container is still running, or is in the process of starting or restarting." As a result, the Ray cluster may never recover from a failure, as the following steps illustrate:

  • Create a Ray Pod with two containers, a Ray container and a sidecar container, and set the Pod's restart policy to Never.
  • Kill the ray start process in the Ray container. Because the restart policy is Never, the container is not restarted.
  • The Pod stays in the Running status because the sidecar container is still running, so KubeRay never deletes and recreates it.

This PR checks the status of the Ray container in addition to the Pod's status, instead of relying on the Pod's status alone.
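
The check described above can be sketched in Go roughly as follows. This is a minimal illustration, not KubeRay's actual shouldDeletePod implementation: the Pod and ContainerStatus types are simplified, hypothetical stand-ins for the k8s.io/api/core/v1 types, the rayContainerName parameter is an assumption about how the Ray container is identified, and the rule of acting only on a Never restart policy follows the wording of this PR description.

```go
package main

// Hypothetical, simplified stand-ins for the Kubernetes Pod types.
// A real controller would use k8s.io/api/core/v1 instead.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

type RestartPolicy string

const (
	RestartPolicyAlways RestartPolicy = "Always"
	RestartPolicyNever  RestartPolicy = "Never"
)

// ContainerStatus records only whether a named container has terminated.
type ContainerStatus struct {
	Name       string
	Terminated bool
}

type Pod struct {
	RestartPolicy     RestartPolicy
	Phase             PodPhase
	ContainerStatuses []ContainerStatus
}

// rayContainerTerminated reports whether the container named
// rayContainerName has terminated. It returns false if no status
// exists for that container yet.
func rayContainerTerminated(pod Pod, rayContainerName string) bool {
	for _, cs := range pod.ContainerStatuses {
		if cs.Name == rayContainerName {
			return cs.Terminated
		}
	}
	return false
}

// shouldDeletePod sketches the decision described in the PR text.
func shouldDeletePod(pod Pod, rayContainerName string) bool {
	// Assumption: only Pods with restart policy Never are candidates,
	// because Kubernetes will not restart their containers.
	if pod.RestartPolicy != RestartPolicyNever {
		return false
	}
	// Terminal Pod phases: the case already covered before this PR.
	if pod.Phase == PodSucceeded || pod.Phase == PodFailed {
		return true
	}
	// Sidecar case: the Pod is still Running because another container
	// is alive, but the Ray container itself has terminated.
	return pod.Phase == PodRunning && rayContainerTerminated(pod, rayContainerName)
}
```

With these definitions, a Running Pod whose Ray container has terminated while its sidecar is still alive is reported as deletable, whereas a Running Pod with a healthy Ray container is left alone.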

Related issue number

Closes #1355

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 kevin85421 changed the title [WIP][GCS FT] Consider the case of sidecar containers [GCS FT] Consider the case of sidecar containers Sep 5, 2023
@kevin85421 kevin85421 marked this pull request as ready for review September 5, 2023 17:40
@architkulkarni architkulkarni self-assigned this Sep 5, 2023

@architkulkarni architkulkarni left a comment

Looks great! A couple of very minor notes:

  • Currently, the definition of "when does KubeRay restart a Ray pod" appears only in the PR description and in the implementation of shouldDeletePod. I think it should also appear somewhere in the user-facing docs. What do you think? Or is that too technical?
  • [Nit] Consider parametrizing the new test with 6 cases; I think it can be done using "table driven tests".
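
For readers unfamiliar with the pattern, a table-driven layout might look roughly like the sketch below. Everything in it is illustrative: shouldDelete is a hypothetical, simplified stand-in for the function under test, and the rows are example inputs, not necessarily the six cases meant above. In a real Go test file the loop would typically live in a func taking *testing.T and call t.Run(tc.name, ...) for each row; a plain helper is used here so the block stays self-contained.

```go
package main

// shouldDelete is a hypothetical, simplified stand-in for the function
// under test. It takes plain values instead of a Kubernetes Pod object.
func shouldDelete(restartPolicy, phase string, rayContainerTerminated bool) bool {
	if restartPolicy != "Never" {
		return false
	}
	if phase == "Succeeded" || phase == "Failed" {
		return true
	}
	return phase == "Running" && rayContainerTerminated
}

// testCase is one row of the table: inputs plus the expected result.
type testCase struct {
	name                   string
	restartPolicy          string
	phase                  string
	rayContainerTerminated bool
	want                   bool
}

// cases holds illustrative rows only.
var cases = []testCase{
	{"never/failed", "Never", "Failed", false, true},
	{"never/succeeded", "Never", "Succeeded", false, true},
	{"never/running, ray container terminated", "Never", "Running", true, true},
	{"never/running, ray container alive", "Never", "Running", false, false},
	{"never/pending", "Never", "Pending", false, false},
	{"always/running, ray container terminated", "Always", "Running", true, false},
}

// failedCases runs every row and returns the names of the rows whose
// result differs from the expected value.
func failedCases() []string {
	var failed []string
	for _, tc := range cases {
		got := shouldDelete(tc.restartPolicy, tc.phase, tc.rayContainerTerminated)
		if got != tc.want {
			failed = append(failed, tc.name)
		}
	}
	return failed
}
```

Adding a new scenario then only requires appending one row to the table; the loop itself does not change.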

@kevin85421

Created issues #1392 and #1393 to track these.

@kevin85421 kevin85421 merged commit 1c9de23 into ray-project:master Sep 5, 2023
z103cb pushed a commit to z103cb/kuberay that referenced this pull request Sep 11, 2023
z103cb pushed a commit to z103cb/kuberay that referenced this pull request Sep 11, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023