Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RayService] Allow updating WorkerGroupSpecs without rolling out new cluster #1734

Merged
merged 25 commits into from
Dec 26, 2023

Conversation

architkulkarni
Copy link
Contributor

@architkulkarni architkulkarni commented Dec 11, 2023

Why are these changes needed?

Previously, when the RayCluster spec of a RayService was updated, one of two things would happen:

  1. A new cluster would be rolled out via "zero-downtime-upgrade", or
  2. In the case where only the Replicas and WorkersToDelete fields changed, nothing would happen. This behavior was added by [Bug] RayService restarts repeatedly with Autoscaler  #1037 to prevent the Autoscaler from inadvertently triggering rollouts when modifying these fields.)

This PR adds a third case: If WorkerGroupSpecs is modified in the following specific way and it doesn't fall into the case above, then the RayService controller will update the RayCluster instance in place without rolling out a new one.

Here is the specific way that triggers the third case:

The existing worker groups are not modified except for Replicas and WorkersToDelete, and one or more entries to WorkerGroupSpecs is added.


In general, the updating happens in two places:

  1. For the active RayCluster
  2. For the pending RayCluster

Either of these clusters two may see an update to the spec, so we must handle both of these places.


In a followup, we may add the following optimization: If an existing worker group is modified and one or more entries to WorkerGroupSpecs is added, we should reject the spec. This will require using an admission webhook or storing the previous spec somehow. (If we just store the hash as we currently do, we cannot reconstruct the previous spec because all we have is the hash.)

Other followup issues for this PR:

TODO:

  • Add/update rayservice_controller_test.go

Related issue number

Closes #1643

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
kubectl apply -f /Users/architkulkarni/kuberay/ray-operator/config/samples/ray-service.autoscaler.yaml
watch kubectl get pod

Wait for the worker pod to come up.

Add a new entry to the WorkerGroupSpec, copying the first one but changing the name to worker-group-2.

kubectl apply -f /Users/architkulkarni/kuberay/ray-operator/config/samples/ray-service.autoscaler.yaml
watch kubectl get pod

See the second worker pod come up.

Update worker-group-1 outside of replicas/WorkersToDelete. For example change the name to worker-group-111. At the same time, add another entry to WorkerGroupSpecs.

Watch a zero-downtime upgrade get triggered.

Finally, edit a Replicas field of a workergroup. Watch no pods get added or terminated.

  • This PR is not tested :(

Archit Kulkarni added 7 commits December 11, 2023 12:10
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
@architkulkarni architkulkarni marked this pull request as ready for review December 12, 2023 17:03
@kevin85421 kevin85421 self-assigned this Dec 12, 2023
} else if clusterAction == Update {
// Update the active cluster.
r.Log.Info("Updating the active RayCluster instance.")
if activeRayCluster, err = r.constructRayClusterForRayService(rayServiceInstance, activeRayCluster.Name); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will happen if there is inconsistency between the RayCluster and RayService's RayClusterSpec? For example, what if a worker group's replica count is set to 0 in RayClusterSpec, but the Ray Autoscaler has already scaled it up to 10? Updating the RayCluster may cause a lot of running Pods to be killed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't do any special handling for this case. So that means in this case, the user-specified replica count will take precedence over the Ray Autoscaler's case.

One alternative would be to never update the replicas and workerstodelete when updating the RayCluster. The downside is that the user can never override replicas.

My guess is that the current approach (the first approach) is better, because the user should always have a way to set replicas, and this inconsistent case is an edge case, not the common case. But what are your thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess is that the current approach (the first approach) is better

From users' perspectives, it is much better to disregard the users' settings regarding replicas than to delete a Pod that has any running Ray task or actor.

On second thought, if Ray Autoscaling is enabled, Ray Autoscaler is the only decision maker to delete Ray Pods after #1253. Hence, users can increase the number of Pods, but can't delete Pods by updating replicas.

the user should always have a way to set replicas,

Users can directly update RayCluster. In addition, setting replicas for existing worker groups is not common. Most RayService users use Ray Autoscaler as well. If users still need to manually update replicas with Ray Autoscaling, Ray Autoscaler needs to be improved.

ray-operator/controllers/ray/rayservice_controller.go Outdated Show resolved Hide resolved
ray-operator/controllers/ray/rayservice_controller.go Outdated Show resolved Hide resolved
ray-operator/controllers/ray/rayservice_controller.go Outdated Show resolved Hide resolved
return err
}

// Update the fetched RayCluster with new changes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We almost replace anything in currentRayCluster. Why do we need to get currentRayCluster? Do we need any information from it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first approach was to not get currentRayCluster, but then I got this error when calling r.Update():

rayclusters.ray.io \"rayservice-sample-raycluster-qb8z4\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update

After some research it seemed that a common approach was to first get the current object, then apply changes to the object, and then call Update().

Do you know what the best practice is? Is it better to use Patch() here, or is there some third approach?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rayclusters.ray.io "rayservice-sample-raycluster-qb8z4" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update

Got it. This makes sense.

Do you know what the best practice is? Is it better to use Patch() here, or is there some third approach?

In my understanding, Patch isn't protected by the Kubernetes optimistic concurrency model. I don't understand the use case for Patch. We should avoid using Patch until we understand it more.

Archit Kulkarni added 2 commits December 14, 2023 10:48
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the PR description use RayCluster instead of RayService for manual testing?

@architkulkarni
Copy link
Contributor Author

architkulkarni commented Dec 14, 2023

Why does the PR description use RayCluster instead of RayService for manual testing?

Ah typo, it should say ray-service.autoscaler.yaml. Updated

Archit Kulkarni added 3 commits December 14, 2023 12:33
@architkulkarni
Copy link
Contributor Author

@kevin85421 I've updated the PR and the PR description per our discussion offline. There is one change:

For the case where the existing workergroup has a change outside of Replicas/WorkersToDelete AND we append new workergroup(s) at the end, we had discussed we should reject the new spec.

In this PR, we instead just do a rolling upgrade in this case. It's not a regression, and we can add the rejection behavior in the future in an independent PR if a user asks for it. This might require some more design, such as using an admission webhook or storing the previous spec somehow. (If we just store the hash as we currently do, we cannot reconstruct the previous spec because all we have is the hash.)

The current state of the PR implements the minimal way to enable the feature request in the linked issue, so I think it can be merged as is.

@kevin85421
Copy link
Member

In this PR, we instead just do a rolling upgrade in this case. It's not a regression, and we can add the rejection behavior in the future in an independent PR if a user asks for it. This might require some more design, such as using an admission webhook or storing the previous spec somehow. (If we just store the hash as we currently do, we cannot reconstruct the previous spec because all we have is the hash.)

Good point!

ray-operator/controllers/ray/rayservice_controller.go Outdated Show resolved Hide resolved
ray-operator/controllers/ray/rayservice_controller.go Outdated Show resolved Hide resolved
ray-operator/controllers/ray/rayservice_controller.go Outdated Show resolved Hide resolved
@@ -1137,8 +1245,41 @@ func (r *RayServiceReconciler) labelHealthyServePods(ctx context.Context, rayClu
return nil
}

func generateRayClusterJsonHash(rayClusterSpec rayv1.RayClusterSpec) (string, error) {
// Mute all fields that will not trigger new RayCluster preparation. For example,
func getClusterAction(old_spec rayv1.RayClusterSpec, new_spec rayv1.RayClusterSpec) (ClusterAction, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function's coding style (old_spec -> oldSpec) is inconsistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, fixed e62ac34

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit still has some snake case style. Ex: newSpec_without_new_worker_groups.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed 8b0056b

I wonder if there's a linter that can check for this

ray-operator/controllers/ray/rayservice_controller.go Outdated Show resolved Hide resolved
Expect(err).NotTo(HaveOccurred(), "failed to update test RayService resource")

// Confirm it didn't switch to a new RayCluster
Consistently(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RayCluster will only switch if the new head Pod is ready, and the Ray Serve applications in the new RayCluster are prepared to serve requests. There is no Pod controller in envtest. Hence, the head Pod will not become ready if you don't manually update it. We may need to check whether the pending RayCluster has been created, rather than whether the active RayCluster has changed.

I checked the existing tests. Some tests check whether the active RayCluster has changed or not to determine whether the rollout is triggered or not without manually updating the head Pod status. Would you mind opening an issue to track the progress? Thanks!

ray-operator/controllers/ray/rayservice_controller_test.go Outdated Show resolved Hide resolved
)
})
It("should update the pending RayCluster in place when WorkerGroupSpecs are modified by the user in RayServiceSpec", func() {
// Trigger a new RayCluster preparation by updating the RayVersion.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, our tests are not following good practices. They are all coupled together, so we may need to ensure that there is no pendingRayCluster before proceeding with the following logic. In the future, we might need to make them more independent.

ray-operator/controllers/ray/rayservice_controller_test.go Outdated Show resolved Hide resolved
ray-operator/controllers/ray/rayservice_controller_test.go Outdated Show resolved Hide resolved
@kevin85421
Copy link
Member

Would you mind opening an issue to track the progress of https://github.com/ray-project/kuberay/pull/1734/files#r1424680579 if we decide not to cover this issue in this PR? Thanks!

Signed-off-by: Archit Kulkarni <[email protected]>
Archit Kulkarni added 3 commits December 20, 2023 15:55
@architkulkarni
Copy link
Contributor Author

Thanks for the review - will respond to the remaining comments soon

Signed-off-by: Archit Kulkarni <[email protected]>
Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested this PR manually. Feel free to merge it after CI passes if the PR works as expected.

@architkulkarni
Copy link
Contributor Author

@kevin85421 Thanks! I will run through the manual test on the final commit before merging.

By the way, are rayservice-sample-yamls tests known to be flaky? They are somewhat flaky on this PR. If the tests were 100% green before, then that could indicate an issue with the PR, but if they were flaky before, then I'm less worried.

Also the raycluster-sample-yamls tests are flaky on this PR, but those are almost certainly unrelated.

@architkulkarni
Copy link
Contributor Author

The manual test described in the PR description passed. Merging

@architkulkarni architkulkarni merged commit a6cf6e0 into master Dec 26, 2023
25 checks passed
@architkulkarni architkulkarni deleted the worker-group branch December 26, 2023 20:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Add new worker groups to Ray Service without rollout
2 participants