-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RayService] Allow updating WorkerGroupSpecs without rolling out new cluster #1734
Conversation
Signed-off-by: Archit Kulkarni <[email protected]>
…worker-group Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
…into worker-group
…test.go` Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
} else if clusterAction == Update { | ||
// Update the active cluster. | ||
r.Log.Info("Updating the active RayCluster instance.") | ||
if activeRayCluster, err = r.constructRayClusterForRayService(rayServiceInstance, activeRayCluster.Name); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will happen if there is inconsistency between the RayCluster and RayService's RayClusterSpec? For example, what if a worker group's replica count is set to 0 in RayClusterSpec, but the Ray Autoscaler has already scaled it up to 10? Updating the RayCluster may cause a lot of running Pods to be killed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't do any special handling for this case. So that means in this case, the user-specified replica count will take precedence over the Ray Autoscaler's case.
One alternative would be to never update the replicas
and workerstodelete
when updating the RayCluster. The downside is that the user can never override replicas
.
My guess is that the current approach (the first approach) is better, because the user should always have a way to set replicas
, and this inconsistent case is an edge case, not the common case. But what are your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My guess is that the current approach (the first approach) is better
From users' perspectives, it is much better to disregard the users' settings regarding replicas
than to delete a Pod that has any running Ray task or actor.
On second thought, if Ray Autoscaling is enabled, Ray Autoscaler is the only decision maker to delete Ray Pods after #1253. Hence, users can increase the number of Pods, but can't delete Pods by updating replicas
.
the user should always have a way to set replicas,
Users can directly update RayCluster. In addition, setting replicas
for existing worker groups is not common. Most RayService users use Ray Autoscaler as well. If users still need to manually update replicas
with Ray Autoscaling, Ray Autoscaler needs to be improved.
return err | ||
} | ||
|
||
// Update the fetched RayCluster with new changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We almost replace anything in currentRayCluster
. Why do we need to get currentRayCluster
? Do we need any information from it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My first approach was to not get currentRayCluster
, but then I got this error when calling r.Update()
:
rayclusters.ray.io \"rayservice-sample-raycluster-qb8z4\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update
After some research it seemed that a common approach was to first get the current object, then apply changes to the object, and then call Update()
.
Do you know what the best practice is? Is it better to use Patch()
here, or is there some third approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rayclusters.ray.io "rayservice-sample-raycluster-qb8z4" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update
Got it. This makes sense.
Do you know what the best practice is? Is it better to use Patch() here, or is there some third approach?
In my understanding, Patch
isn't protected by the Kubernetes optimistic concurrency model. I don't understand the use case for Patch
. We should avoid using Patch
until we understand it more.
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does the PR description use RayCluster instead of RayService for manual testing?
Ah typo, it should say |
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
@kevin85421 I've updated the PR and the PR description per our discussion offline. There is one change: For the case where the existing workergroup has a change outside of In this PR, we instead just do a rolling upgrade in this case. It's not a regression, and we can add the rejection behavior in the future in an independent PR if a user asks for it. This might require some more design, such as using an admission webhook or storing the previous spec somehow. (If we just store the hash as we currently do, we cannot reconstruct the previous spec because all we have is the hash.) The current state of the PR implements the minimal way to enable the feature request in the linked issue, so I think it can be merged as is. |
Good point! |
@@ -1137,8 +1245,41 @@ func (r *RayServiceReconciler) labelHealthyServePods(ctx context.Context, rayClu | |||
return nil | |||
} | |||
|
|||
func generateRayClusterJsonHash(rayClusterSpec rayv1.RayClusterSpec) (string, error) { | |||
// Mute all fields that will not trigger new RayCluster preparation. For example, | |||
func getClusterAction(old_spec rayv1.RayClusterSpec, new_spec rayv1.RayClusterSpec) (ClusterAction, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function's coding style (old_spec
-> oldSpec
) is inconsistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, fixed e62ac34
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commit still has some snake case style. Ex: newSpec_without_new_worker_groups
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed 8b0056b
I wonder if there's a linter that can check for this
Expect(err).NotTo(HaveOccurred(), "failed to update test RayService resource") | ||
|
||
// Confirm it didn't switch to a new RayCluster | ||
Consistently( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RayCluster will only switch if the new head Pod is ready, and the Ray Serve applications in the new RayCluster are prepared to serve requests. There is no Pod controller in envtest. Hence, the head Pod will not become ready if you don't manually update it. We may need to check whether the pending RayCluster has been created, rather than whether the active RayCluster has changed.
I checked the existing tests. Some tests check whether the active RayCluster has changed or not to determine whether the rollout is triggered or not without manually updating the head Pod status. Would you mind opening an issue to track the progress? Thanks!
) | ||
}) | ||
It("should update the pending RayCluster in place when WorkerGroupSpecs are modified by the user in RayServiceSpec", func() { | ||
// Trigger a new RayCluster preparation by updating the RayVersion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, our tests are not following good practices. They are all coupled together, so we may need to ensure that there is no pendingRayCluster before proceeding with the following logic. In the future, we might need to make them more independent.
Would you mind opening an issue to track the progress of https://github.com/ray-project/kuberay/pull/1734/files#r1424680579 if we decide not to cover this issue in this PR? Thanks! |
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Thanks for the review - will respond to the remaining comments soon |
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
@kevin85421 Thanks for the review, I believe all comments are addressed now.
|
Signed-off-by: Archit Kulkarni <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't tested this PR manually. Feel free to merge it after CI passes if the PR works as expected.
@kevin85421 Thanks! I will run through the manual test on the final commit before merging. By the way, are Also the |
The manual test described in the PR description passed. Merging |
Why are these changes needed?
Previously, when the RayCluster spec of a RayService was updated, one of two things would happen:
Replicas
andWorkersToDelete
fields changed, nothing would happen. This behavior was added by [Bug] RayService restarts repeatedly with Autoscaler #1037 to prevent the Autoscaler from inadvertently triggering rollouts when modifying these fields.)This PR adds a third case: If
WorkerGroupSpecs
is modified in the following specific way and it doesn't fall into the case above, then the RayService controller will update the RayCluster instance in place without rolling out a new one.Here is the specific way that triggers the third case:
In general, the updating happens in two places:
Either of these clusters two may see an update to the spec, so we must handle both of these places.
In a followup, we may add the following optimization: If an existing worker group is modified and one or more entries to WorkerGroupSpecs is added, we should reject the spec. This will require using an admission webhook or storing the previous spec somehow. (If we just store the hash as we currently do, we cannot reconstruct the previous spec because all we have is the hash.)
Other followup issues for this PR:
TODO:
rayservice_controller_test.go
Related issue number
Closes #1643
Checks
Wait for the worker pod to come up.
Add a new entry to the WorkerGroupSpec, copying the first one but changing the name to
worker-group-2
.See the second worker pod come up.
Update
worker-group-1
outside ofreplicas/WorkersToDelete
. For example change the name toworker-group-111
. At the same time, add another entry toWorkerGroupSpecs
.Watch a zero-downtime upgrade get triggered.
Finally, edit a Replicas field of a workergroup. Watch no pods get added or terminated.