REP: KubeRay RayService Incremental Rollout #58

Open. Wants to merge 4 commits into base: main.

ryanaoleary

Add REP for RayService incremental upgrade strategy using max_surge_percent and Gateway API.

Signed-off-by: ryanaoleary <[email protected]>
@ryanaoleary
Author

cc: @kevin85421 @andrewsykim

metadata:
  name: example-rayservice
spec:
  upgradeStrategy: "NewCluster"
Contributor

@andrewsykim Dec 20, 2024

It's not too late to change this API since it hasn't been released yet. Should we change it now, before v1.3, so that it plays nicely with the new fields you're proposing here?

spec:
  upgradeStrategy:
    type: "NewCluster"
    targetCapacity: 50
    maxSurgePercent: 50
...

Contributor

At least for v1.3, we can just support:

spec:
  upgradeStrategy:
    type: NewCluster

That way, it's easy to add new fields in v1.4 that are compatible with upgradeStrategy.

Author

Oh yeah, I like that idea. I'll edit the REP to reflect that and put out a PR with the API fix.

Member

#2678 has already been merged. Should we update the example here? In addition, should we add a new type of upgrade instead of reusing NewCluster?

Signed-off-by: ryanaoleary <[email protected]>
Signed-off-by: ryanaoleary <[email protected]>
@kevin85421 self-assigned this on Jan 7, 2025

### Gateway API

Kubernetes Gateway API is a set of resources designed to manage HTTP(S), TCP, and other network traffic for Kubernetes clusters. It provides a more flexible and extensible alternative to the traditional Ingress API, offering better support for service mesh integrations, routing policies, and multi-cluster configurations. Gateway API already has a well-defined interface for traffic splitting, so KubeRay would not need to implement backends for different Ingress controllers.
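
As a minimal sketch of the traffic splitting this paragraph refers to, an HTTPRoute can list two backends with relative weights. The resource names, the 80/20 split, and the choice of port are assumptions for illustration only; the REP text shown here does not specify which objects KubeRay would actually create.

```yaml
# Hypothetical HTTPRoute that splits requests between the Serve Services
# of an old and a new RayCluster. All names and weights are illustrative.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-rayservice-route          # hypothetical name
spec:
  parentRefs:
    - name: example-rayservice-gateway    # hypothetical Gateway in the same namespace
  rules:
    - backendRefs:
        # Weights are relative: 80 / (80 + 20) of requests go to the old cluster.
        - name: example-rayservice-old-serve-svc   # hypothetical Service name
          port: 8000                               # assumed Ray Serve HTTP port
          weight: 80
        - name: example-rayservice-new-serve-svc   # hypothetical Service name
          port: 8000
          weight: 20
```

Shifting traffic during an upgrade would then amount to updating the two `weight` values, since Gateway API treats them as proportions rather than absolute percentages.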
Member

Which persona is responsible for which Gateway resources? For example, GatewayClass is created and maintained by the K8s cluster admin.

| v1.GRPCRoute | Specifies routing behavior of gRPC requests from a Gateway listener to an API object | GA (v1.1+) | v1.25+ |

### Example Upgrade Process

Member

How do the Ray Serve API, the KubeRay API, and the K8s Gateway API interact with each other? For example, what actually happens under the hood for "The KubeRay controller switches target_capacity percent of the requests to the new RayCluster once it is ready."?

3. The upgraded RayCluster scales up to (1 + `max_surge_percent` - `target_capacity_old`), and the old RayCluster decreases its `target_capacity` until `target_capacity_new` plus `target_capacity_old` equals 1.
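
As a worked illustration of step 3, write $s$ for `max_surge_percent`, $t_{\text{new}}$ for `target_capacity_new`, and $t_{\text{old}}$ for `target_capacity_old`. The values $s = 0.2$ and an initial $t_{\text{old}} = 1$ are assumptions chosen for illustration, and the expression in step 3 is read here as the new cluster's target. Each round first raises the new cluster's target, then lowers the old cluster's target until the two sum to 1:

$$
\begin{aligned}
t_{\text{new}} &= 1 + s - t_{\text{old}}, \qquad t_{\text{old}} \leftarrow 1 - t_{\text{new}} \\
\text{round 1:}\quad t_{\text{new}} &= 1 + 0.2 - 1.0 = 0.2, \qquad t_{\text{old}} = 0.8 \\
\text{round 2:}\quad t_{\text{new}} &= 1 + 0.2 - 0.8 = 0.4, \qquad t_{\text{old}} = 0.6 \\
\text{round 3:}\quad t_{\text{new}} &= 1 + 0.2 - 0.6 = 0.6, \qquad t_{\text{old}} = 0.4 \\
\text{round 4:}\quad t_{\text{new}} &= 1 + 0.2 - 0.4 = 0.8, \qquad t_{\text{old}} = 0.2 \\
\text{round 5:}\quad t_{\text{new}} &= 1 + 0.2 - 0.2 = 1.0, \qquad t_{\text{old}} = 0.0
\end{aligned}
$$

Under this reading, the combined target right after the new cluster scales up is $t_{\text{new}} + t_{\text{old}} = 1 + s$, so the extra capacity in any round never exceeds the surge fraction.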

## Compatibility, Deprecation, and Migration Plan

Member

Summarize the limitations and assumptions of the incremental upgrade. For example, the new RayCluster should have the capacity to handle at least as many requests as the old RayCluster. If users don't follow this, some requests may be dropped.

| ------- | ------ | ---------- | ------------- |
| v1.GatewayClass | Defines a Gateway Cluster level resource | GA (v0.5+) | v1.24+ |
| v1.Gateway | Infrastructure that binds Listeners to a set of IP addresses | GA (v0.5+) | v1.24+ |
| v1.HTTPRoute | Provides a way to route HTTP requests | GA (v0.5+) | v1.24+ |
Member

Add examples of the HTTPRoute resources that KubeRay needs to create; see https://gateway-api.sigs.k8s.io/api-types/httproute/
