[RayService] Revisit the conditions under which a RayService is considered unhealthy and the default threshold #1293
Overall looks good to me, just a few minor questions. Will leave it to Serve folks to decide if the new defaults and new behavior make sense. They seem reasonable to me.
Should this behavior appear in the docs somewhere?
Actually, there is one question Kai-Hsun asked earlier that I'm curious about: should we restart the cluster if the app is stuck in DEPLOYING for too long? Or does Serve guarantee that after some time it transitions from DEPLOYING to DEPLOY_FAILED?
Co-authored-by: Archit Kulkarni <[email protected]> Signed-off-by: Kai-Hsun Chen <[email protected]>
Looks good.
Is there relevant documentation that should be updated? If not, we should add some. This is an important behavior for folks to understand.
@architkulkarni @edoakes I updated the doc in 875c7f3.
…dered unhealthy and the default threshold (ray-project#1293) Revisit the conditions under which a RayService is considered unhealthy and the default threshold
Why are these changes needed?
Before Ray 2.6.0, the status of a Serve deployment would be UPDATING if it was attempting to scale up additional replicas. The Serve deployment could remain in the UPDATING status for longer than serviceUnhealthySecondThreshold seconds if the cluster lacked sufficient resources to accommodate the new replicas. In such cases, KubeRay would mark the RayCluster as unhealthy and proceed to prepare a new RayCluster instead. We should not trigger the preparation of a new RayCluster while the Serve deployment is still scaling.
Definition of "unhealthy" without this PR
In the RayService, KubeRay can mark a RayCluster as unhealthy in two possible scenarios.
Case 1: The KubeRay operator cannot connect to the dashboard agent on the head Pod for more than the duration defined by the deploymentUnhealthySecondThreshold parameter.
deploymentUnhealthySecondThreshold: If it is not set, the default value is 60 seconds. In sample YAML files, we typically set the value to 300 seconds.
Case 2: The KubeRay operator will mark a RayCluster as unhealthy if the status of a serve application is not RUNNING or if a serve deployment isn't HEALTHY for a duration exceeding the serviceUnhealthySecondThreshold parameter.
serviceUnhealthySecondThreshold: If it is not set, the default value is 60 seconds. In sample YAML files, we typically set the value to 300 seconds.
After KubeRay marks a RayCluster as unhealthy, it initiates the creation of a new RayCluster. Once the new RayCluster is ready, KubeRay redirects network traffic to it, and subsequently deletes the old RayCluster.
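The Case 2 rule as it stood before this PR can be sketched in Python as follows. This only illustrates the rule as described above and is not KubeRay's actual implementation; the function name, its parameters, and the choice to pass the elapsed time as a single number are assumptions made for the example.

```python
from typing import Sequence

# Illustrative sketch only: the names below are hypothetical and are
# not taken from the KubeRay codebase.
OLD_DEFAULT_SERVICE_THRESHOLD_SECONDS = 60  # default before this PR, per the text above


def serve_unhealthy_before_pr(
    app_status: str,
    deployment_statuses: Sequence[str],
    seconds_in_bad_state: float,
    threshold_seconds: float = OLD_DEFAULT_SERVICE_THRESHOLD_SECONDS,
) -> bool:
    """Case 2 before this PR: anything other than RUNNING / HEALTHY counts."""
    not_ok = app_status != "RUNNING" or any(
        status != "HEALTHY" for status in deployment_statuses
    )
    # The cluster is only flagged once the bad state outlasts the threshold.
    return not_ok and seconds_in_bad_state > threshold_seconds


# A deployment that is waiting for resources while scaling up stays in
# UPDATING; after 61 seconds the old rule already flags the cluster.
print(serve_unhealthy_before_pr("RUNNING", ["HEALTHY", "UPDATING"], 61))
```

With the old 60-second default, a slow scale-up alone is enough to make this check return True, which is the behavior this PR sets out to avoid.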
Definition of "unhealthy" with this PR
In the RayService, KubeRay can mark a RayCluster as unhealthy in two possible scenarios.
Case 1: The KubeRay operator cannot connect to the dashboard agent on the head Pod for more than the duration defined by the deploymentUnhealthySecondThreshold parameter.
deploymentUnhealthySecondThreshold: Both the default value and values in sample YAML files are 300 seconds.
Case 2: The KubeRay operator will mark a RayCluster as unhealthy if the status of a serve application is DEPLOY_FAILED or UNHEALTHY for a duration exceeding the serviceUnhealthySecondThreshold parameter.
serviceUnhealthySecondThreshold: Both the default value and values in sample YAML files are 900 seconds.
After KubeRay marks a RayCluster as unhealthy, it initiates the creation of a new RayCluster. Once the new RayCluster is ready, KubeRay redirects network traffic to it, and subsequently deletes the old RayCluster.
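The two cases with this PR can be sketched in Python as a single check. This is a rough illustration of the rule as written above, not the operator's real code; the function name, the parameter names, and the way elapsed time is passed in are all assumptions.

```python
# Illustrative sketch only: the names below are hypothetical and are
# not taken from the KubeRay codebase.
DEPLOYMENT_UNHEALTHY_SECOND_THRESHOLD = 300  # new default, per the text above
SERVICE_UNHEALTHY_SECOND_THRESHOLD = 900     # new default, per the text above
UNHEALTHY_APP_STATUSES = frozenset({"DEPLOY_FAILED", "UNHEALTHY"})


def cluster_unhealthy_with_pr(
    dashboard_unreachable_seconds: float,
    app_status: str,
    seconds_in_app_status: float,
    deployment_threshold: float = DEPLOYMENT_UNHEALTHY_SECOND_THRESHOLD,
    service_threshold: float = SERVICE_UNHEALTHY_SECOND_THRESHOLD,
) -> bool:
    # Case 1: the dashboard agent on the head Pod has been unreachable too long.
    if dashboard_unreachable_seconds > deployment_threshold:
        return True
    # Case 2: only DEPLOY_FAILED or UNHEALTHY application statuses count,
    # and only once they outlast the service threshold.
    return (
        app_status in UNHEALTHY_APP_STATUSES
        and seconds_in_app_status > service_threshold
    )


print(cluster_unhealthy_with_pr(0, "UNHEALTHY", 901))   # unhealthy past the threshold
print(cluster_unhealthy_with_pr(0, "DEPLOYING", 3600))  # DEPLOYING is not in the set
```

In this sketch an application that stays in DEPLOYING never trips Case 2 by itself, however long it waits, which is the situation the review question about DEPLOYING raises.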
To conclude, there are two main differences in this PR:
1. Increase the default values of deploymentUnhealthySecondThreshold and serviceUnhealthySecondThreshold to decrease the possibility of triggering new cluster preparation.
2. The health check no longer treats a Serve deployment in the UPDATING status as unhealthy. Hence, the RayCluster will not be considered unhealthy when the Serve application tries to scale up more replicas.
Related issue number
Closes #1277
Checks