Do not update pod labels if they haven't changed #1304
Conversation
Signed-off-by: Josh Karpel <[email protected]>
Branch updated from 74a4485 to 899e752.
Thank you for the PR! I will review it today.
Thank you for the contribution! Would you mind running some e2e tests with the following command in the root directory of the kuberay repository? These tests are currently not in the CI pipeline due to some CI infra issues.

RAY_IMAGE=rayproject/ray:2.5.0 OPERATOR_IMAGE=controller:latest pytest -vs tests/test_sample_rayservice_yamls.py --log-cli-level=INFO
This PR looks good to me, provided it can pass the e2e tests in your local environment. In the future, I might consider using a readiness probe to determine whether a worker Pod is ready to serve requests. Relying on the KubeRay operator to check the availability of each Pod doesn't seem scalable.
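As a rough illustration of that readiness-probe idea (not something this PR implements), the operator could attach a probe like the sketch below to Serve worker pods. The health path and port are assumptions for illustration only, and the field layout assumes a recent k8s.io/api where the handler field is named ProbeHandler.

package probes

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// serveReadinessProbe returns a hypothetical readiness probe for a Serve
// worker pod. If the probe fails, Kubernetes removes the pod from the
// Service's endpoints so it stops receiving new traffic, without the
// KubeRay operator having to poll each pod itself.
func serveReadinessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/-/healthz",         // assumed Serve proxy health route
				Port: intstr.FromInt(8000), // assumed Serve HTTP proxy port
			},
		},
		InitialDelaySeconds: 10,
		PeriodSeconds:       5,
		FailureThreshold:    3,
	}
}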
The autoscaling test is actually failing on both master and my branch, at least for me 😬
So that doesn't seem related to my change... but maybe related to something else in my local setup? Thoughts?
I was actually just asking about something related on Ray Slack, and @edoakes was talking about some other changes to the Serve HTTP proxy related to autoscaling, where the proxy now returns a 503 error if the Ray worker doesn't have any Serve replicas on it. We were seeing these 503s in the KubeRay operator, which triggered some of our alerting. Moving to a k8s readiness check would fix that (no 503s between the operator and the worker pods)... but then our pods would be unready, and we also alert on that! I haven't thought deeply about it, but it seems like the problem is that we need to distinguish between a variety of states:
But maybe those states will change with the work @edoakes was talking about?
cc @zcin, do you have any thoughts?
I'm confused about this. In my understanding, if a node lacks the requested replica, the HTTP Proxy Actor forwards the request to a node containing that replica rather than returning 503. I will check with the Ray Serve team and keep you updated.
Would you mind explaining the difference between (2) and (3)? From my understanding, a failed readiness probe causes the Pod to be removed from the Service's endpoints, so it will not receive any new traffic. Hence, (2) will not receive any new traffic.
Sorry, I should have linked more context - this is very specifically about the question I asked here: https://ray-distributed.slack.com/archives/CNCKBBRJL/p1691532723786239?thread_ts=1690284353.203859&cid=CNCKBBRJL

re: (2) vs (3): My thought was that a pod with nothing on it but the proxy actor would actually report unready and thus be taken out of the
Would you mind adding more details on how the "monitoring schemes" work for unready Pods? Thanks!
Btw, I can reproduce this error. I thought I broke something recently 😭. I will fix it ASAP.
[Update] Regarding the "503 error": when a Ray node is not properly serving traffic (either the node is unhealthy or there are no replicas on the node), the health check queries to
[Update] I found that the tests failed because the image

RAY_IMAGE=rayproject/ray:2.5.0 OPERATOR_IMAGE=kuberay/operator:nightly pytest -vs tests/test_sample_rayservice_yamls.py --log-cli-level=INFO

I cloned your branch, built the image, and ran the tests:

RAY_IMAGE=rayproject/ray:2.5.0 OPERATOR_IMAGE=controller:latest pytest -vs tests/test_sample_rayservice_yamls.py --log-cli-level=INFO
LGTM
Sure - I will DM you on Ray Slack so I can provide some more details.
Why are these changes needed?
See #1303 for more context.
I decided on some light future-proofing by checking if any labels have changed before updating the pod (though there is only one label being changed right now).
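A minimal sketch of that check, assuming the reconciler already has the pod and a map of desired labels in hand (the names here are illustrative, not the PR's actual code):

package controllers

import corev1 "k8s.io/api/core/v1"

// labelsChanged reports whether any desired label is missing from the pod or
// differs from the pod's current value. The reconciler only issues an update
// when this returns true, which avoids no-op writes to the API server even
// though only one label is being compared today.
func labelsChanged(pod *corev1.Pod, desired map[string]string) bool {
	for key, want := range desired {
		if got, ok := pod.Labels[key]; !ok || got != want {
			return true
		}
	}
	return false
}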
Related issue number
Closes #1303
Checks
I followed DEVELOPMENT.md and was able to stand up a RayService in a local kind cluster using an operator image built from this PR:

So cluster formation looks good, but I'm not really sure how to test the opposite case where the label should go from true to false. Open to suggestions!

Also, I don't really know anything about Go - please review carefully for style and correctness :)
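One way to watch for the label flip in the kind cluster would be to poll the worker pods' labels while Serve replicas are scaled down. This is a hedged sketch only: the ray.io/serve and ray.io/node-type label keys and the default namespace are assumptions and may not match the operator's constants exactly.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (e.g. the kind cluster).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List worker pods and print the serve label, so a change from "true" to
	// "false" (or vice versa) is easy to spot between runs.
	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{
		LabelSelector: "ray.io/node-type=worker", // assumed worker-pod label
	})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		fmt.Printf("%s: ray.io/serve=%s\n", pod.Name, pod.Labels["ray.io/serve"])
	}
}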