REP: KubeRay RayService Incremental Rollout #58
Signed-off-by: ryanaoleary <[email protected]>
Signed-off-by: ryanaoleary <[email protected]>
metadata:
  name: example-rayservice
spec:
  upgradeStrategy: "NewCluster"
It's not too late to change this API since it's not released yet, should we change this now before v1.3 so it'll play nicely with the new fields you're proposing here?
spec:
  upgradeStrategy:
    type: "NewCluster"
    targetCapacity: 50
    maxSurgePercent: 50
    ...
At least for 1.3, we can just support
spec:
  upgradeStrategy:
    type: NewCluster
So it's easy to add new fields in v1.4 that are compatible with the upgradeStrategy
Oh yeah I like that idea, I'll edit the REP to reflect that and put out a PR with the API fix.
a5ea9db, API change: ray-project/kuberay#2678
#2678 has already been merged. Should we update the example here? In addition, should we add a new type of upgrade instead of reusing NewCluster?
Signed-off-by: ryanaoleary <[email protected]>
Signed-off-by: ryanaoleary <[email protected]>
### Gateway API

Kubernetes Gateway API is a set of resources designed to manage HTTP(S), TCP, and other network traffic for Kubernetes clusters. It provides a more flexible and extensible alternative to the traditional Ingress API, offering better support for service mesh integrations, routing policies, and multi-cluster configurations. Gateway API already has a well-defined interface for traffic splitting, so KubeRay would not need to implement backends for different Ingress controllers.
Which persona is responsible for which Gateway resources? For example, GatewayClass is created and maintained by the K8s cluster admin.
| v1.GRPCRoute | Specifies routing behavior of gRPC requests from a Gateway listener to an API object | GA (v1.1+) | v1.25+ |

### Example Upgrade Process

How do the Ray Serve API, the KubeRay API, and the K8s Gateway API interact with each other? For example, what is actually happening under the hood for "The KubeRay controller switches target_capacity percent of the requests to the new RayCluster once it is ready."?
3. The upgraded RayCluster scales up to (1 + `max_surge_percent` - `target_capacity_old`), and the old RayCluster decreases its `target_capacity` until `target_capacity_new` plus `target_capacity_old` equals 1.
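
A worked illustration of step 3, under assumptions that are not stated in the REP: capacities are expressed as fractions of the full serving capacity, $s$ stands for `max_surge_percent`, and $c_{\text{old}}$, $c_{\text{new}}$ stand for `target_capacity_old` and `target_capacity_new`. With the hypothetical values $s = 0.2$ and $c_{\text{old}} = 1$, one round would look like:

$$
c_{\text{new}} = 1 + s - c_{\text{old}} = 1 + 0.2 - 1 = 0.2
$$

$$
c_{\text{old}} = 1 - c_{\text{new}} = 1 - 0.2 = 0.8
$$

Repeating the same two updates would give $c_{\text{new}} = 0.4$ and $c_{\text{old}} = 0.6$ on the next round, and so on until the old RayCluster reaches 0.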

## Compatibility, Deprecation, and Migration Plan

Summarize the limitations and assumptions of the incremental upgrade. For example, the new RayCluster should have the capacity to handle at least as many requests as the old RayCluster. If users don't follow this, some requests may be dropped.
| ------- | ------ | ---------- | ------------- |
| v1.GatewayClass | Defines a Gateway Cluster level resource | GA (v0.5+) | v1.24+ |
| v1.Gateway | Infrastructure that binds Listeners to a set of IP addresses | GA (v0.5+) | v1.24+ |
| v1.HTTPRoute | Provides a way to route HTTP requests | GA (v0.5+) | v1.24+ |
Add examples of HTTPRoute that KubeRay needs to create https://gateway-api.sigs.k8s.io/api-types/httproute/.
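
For reference, a minimal sketch of what such an HTTPRoute might look like, using the Gateway API's weighted `backendRefs` to split traffic between the old and new RayCluster. The resource, Gateway, and Service names below are hypothetical and not taken from the REP, the 50/50 weights are only an example, and port 8000 assumes Ray Serve's default HTTP port.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-rayservice-httproute      # hypothetical name
spec:
  parentRefs:
    - name: example-rayservice-gateway    # hypothetical Gateway that owns the listener
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        # Serve Service of the old RayCluster (hypothetical name)
        - name: example-rayservice-old-serve-svc
          port: 8000
          weight: 50
        # Serve Service of the upgraded RayCluster (hypothetical name)
        - name: example-rayservice-new-serve-svc
          port: 8000
          weight: 50
```

Under this sketch, shifting traffic during the rollout would amount to updating the two `weight` values; how KubeRay would actually derive them from `target_capacity` is for the REP to specify.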
Add REP for RayService incremental upgrade strategy using max_surge_percent and Gateway API.