
Consolidate different operators into a unified golang ray operator #17

Closed
Jeffwan opened this issue Sep 5, 2021 · 7 comments
Labels: enhancement (New feature or request)

@Jeffwan
Collaborator

Jeffwan commented Sep 5, 2021

After talking with @chenk008 @caitengwei @akanso, I think we already have some consensus on this plan. Before we move forward, let's discuss some items that are still not clear. Please have a review and leave your comments.

/cc @chenk008 @caitengwei @akanso @chaomengyuan @wilsonwang371
/cc @DmitriGekhtman @yiranwang52

API

It might be easiest to follow one of the three styles and make the required changes to meet everyone's needs. Ant Group uses some low-level objects in its API to gain granular control and reusability. MSFT and Bytedance both use PodTemplate for strong consistency with Pod specs. The distinctions between the two are mostly in the details.

Let's use the MSFT API as the template here and see if there's anything else that needs to be added.

  1. Scale in arbitrary pods
    Ant and MSFT use different ways to support this feature in the API. I personally think there are tradeoffs. IdList can exactly reflect the accurate state of the cluster, but the IdList could be long for large clusters. workersToDelete overcomes the large-cluster pain point, although the information on RayClusterSpec could be outdated, which is still acceptable. I think at this moment we both agree to support workersToDelete first. If we have further requirements, we can extend ScaleStrategy for richer configurations (a sketch follows the links below).

Ant Group
https://github.com/ray-project/ray-contrib/blob/881b59aa44d184a7bdf177b8782d4e1be358fcdf/antgroup/deploy/ray-operator/api/v1alpha1/raycluster_types.go#L89-L93

MSFT
https://github.com/ray-project/ray-contrib/blob/881b59aa44d184a7bdf177b8782d4e1be358fcdf/msft-operator/ray-operator/api/v1alpha1/raycluster_types.go#L56-L60
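
For concreteness, a minimal sketch of that direction; the workersToDelete field follows the MSFT API linked above, but the exact shape is illustrative:

package v1alpha1

// ScaleStrategy controls how a cluster scales down.
type ScaleStrategy struct {
	// WorkersToDelete lists the names of worker pods the controller should
	// remove first when scaling down; richer options can be added later.
	WorkersToDelete []string `json:"workersToDelete,omitempty"`
}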

  2. Remove HeadService
    https://github.com/ray-project/ray-contrib/blob/881b59aa44d184a7bdf177b8782d4e1be358fcdf/msft-operator/ray-operator/api/v1alpha1/raycluster_types.go#L17-L19

I feel the ideal way to do this is to add one more field to HeadGroupSpec and ask the controller to generate the Service. The only thing the user wants to control is what kind of service (including ingress) to create. It could be ClusterIP, NodePort, LoadBalancer or Ingress depending on the network environment. Ports should be exposed based on the head pod settings (a sketch is shown below).
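
A rough sketch of that idea, with illustrative field names (serviceType plus an optional ingress flag), not a final API:

package v1alpha1

import corev1 "k8s.io/api/core/v1"

// HeadGroupSpec sketches the head group after dropping the user-provided
// HeadService; the controller generates the Service from these hints.
type HeadGroupSpec struct {
	// ServiceType selects the kind of Service the controller creates for the
	// head pod: ClusterIP, NodePort or LoadBalancer.
	ServiceType corev1.ServiceType `json:"serviceType,omitempty"`
	// EnableIngress optionally asks the controller to also create an Ingress.
	EnableIngress *bool `json:"enableIngress,omitempty"`
	// Template is the head pod template; exposed Service ports are derived
	// from the container ports declared here.
	Template corev1.PodTemplateSpec `json:"template"`
}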

  3. Add more status in the RayClusterStatus

https://github.com/ray-project/ray-contrib/blob/881b59aa44d184a7bdf177b8782d4e1be358fcdf/msft-operator/ray-operator/api/v1alpha1/raycluster_types.go#L63-L71

This is not enough to reflect the cluster status. I feel it's better to extend the status to concentrate more on cluster-level status, as Bytedance did here: https://github.com/ray-project/ray-contrib/blob/881b59aa44d184a7bdf177b8782d4e1be358fcdf/bytedance/pkg/api/v1alpha1/raycluster_types.go#L87-L102 (a rough sketch follows).
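
As a starting point for that discussion, an illustrative cluster-level status in the spirit of the Bytedance API above; the exact fields are open:

package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// RayClusterStatus sketches a cluster-level status.
type RayClusterStatus struct {
	// State is a coarse cluster phase such as creating, ready or failed.
	State string `json:"state,omitempty"`
	// DesiredWorkerReplicas is the number of worker pods requested in the spec.
	DesiredWorkerReplicas int32 `json:"desiredWorkerReplicas,omitempty"`
	// AvailableWorkerReplicas is the number of worker pods that are ready.
	AvailableWorkerReplicas int32 `json:"availableWorkerReplicas,omitempty"`
	// LastUpdateTime records when the status was last reconciled.
	LastUpdateTime *metav1.Time `json:"lastUpdateTime,omitempty"`
}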

  4. Support in-tree and out-of-tree autoscaler

It seems both Ant and MSFT use the out-of-tree autoscaler, while Bytedance still uses the in-tree one. The changes on the operator side are:

  • in-tree: automatically append --autoscaling-config=$PATH to the head start command. The operator also needs to convert the custom resource into an autoscaling config, store it in a ConfigMap and mount it into the head pod.
  • out-of-tree: explicitly append --no-monitor to the head start command.

It would be good to add a switch to cluster.spec to control this logic (a sketch follows the field below). Do you have different opinions?

AutoscalingEnabled *bool `json:"autoscalingEnabled,omitempty"`
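
A sketch of how the controller could branch on that switch; it assumes autoscalingEnabled selects the in-tree path, and the helper name and config mount path are made up for illustration:

package main

import "fmt"

// headStartArgs builds the flags the operator appends to the head start
// command, based on the proposed autoscalingEnabled switch.
func headStartArgs(autoscalingEnabled *bool) []string {
	args := []string{"ray", "start", "--head"}
	if autoscalingEnabled != nil && *autoscalingEnabled {
		// In-tree: the operator converts the custom resource into an
		// autoscaling config, stores it in a ConfigMap and mounts it here.
		args = append(args, "--autoscaling-config=/etc/ray/autoscaling.yaml")
	} else {
		// Out-of-tree: disable the built-in monitor.
		args = append(args, "--no-monitor")
	}
	return args
}

func main() {
	enabled := false
	fmt.Println(headStartArgs(&enabled)) // [ray start --head --no-monitor]
}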
  5. Don't use webhook

Kubebuilder uses webhooks for validation and defaulting. I feel that's a little bit heavy and not performance-efficient.
I suggest adding basic CRD validation via validation markers and using defaulters.go instead (see the sketch below).
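
A minimal sketch of that combination: a kubebuilder validation marker for basic CRD validation plus a plain defaulting helper that would live in defaulters.go; the field shown is illustrative:

package v1alpha1

// WorkerGroupSpec shows only the pieces relevant to this point.
type WorkerGroupSpec struct {
	// Replicas is validated by the generated CRD schema instead of a webhook.
	// +kubebuilder:validation:Minimum=0
	Replicas *int32 `json:"replicas,omitempty"`
}

// SetDefaults fills in defaults at reconcile time instead of relying on a
// mutating webhook; the controller would call it before using the spec.
func (w *WorkerGroupSpec) SetDefaults() {
	if w.Replicas == nil {
		defaultReplicas := int32(1)
		w.Replicas = &defaultReplicas
	}
}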

Management

  1. Do we want to manage the ray-operator code at the root level, or have a ray-operator umbrella folder? It looks like ray-contrib aims to hold different community code, but currently there are only a few ray-operator implementations.
├── ray-contrib
│   ├── BUILD.bazel
│   ├── Dockerfile
│   ├── LICENSE
│   ├── Makefile
│   ├── README.md
│   ├── api
│   ├── config
│   ├── controllers
│   ├── go.mod
│   ├── go.sum
│   ├── hack
│   ├── main
│   └── main.go

or

└── ray-contrib
    ├── README.md
    └── ray-operator
  2. Code review
    What review style do we want to follow? Should every PR be approved by reviewers from two different companies?

  3. Release
    Even though every company may have a downstream version, we'd like to effectively track and control changes. I suggest following semantic versioning.

  • For each minor version, let's cut a release branch to track bug fixes and patches.
  • Cut a tag for every release:
# Branches
release-0.1
release-0.2

# Tags
v0.1.0
v0.1.1
v0.2.0
v0.2.1
v0.2.2

I will prepare some release tooling to help manage this easily.

Testing and Tools

  1. Presubmit testing
    MSFT already enables GitHub workflows and triggers basic testing against submitted PRs. I think we should keep doing this and add e2e testing to guarantee code quality.

  2. Host container image

This would be super helpful for community members to achieve one-click deployment.
I am not sure if Anyscale can create a robot token with access to DockerHub. GitHub supports encrypted secrets, and we can use a GitHub Action to retrieve the token and push to the docker registry.

  3. Bazel build

I notice both Ant Group and MSFT add Bazel here and here. I feel this is complicated and not the native way to manage golang dependencies. The Kubernetes community has already removed Bazel, and I would suggest not using it in the new project.

  4. Kubebuilder 2.0 vs 3.0 and Kubernetes version

Kubebuilder has already released 3.0; I think it would be good to upgrade to the latest version.
Let's also pin the major stack versions for the first release: Kubernetes 1.19.x and golang 1.15.

Please check whether you have any concerns with the above plans and whether any features are missing.

@chenk008
Contributor

chenk008 commented Sep 6, 2021

LGTM.
A unified golang ray operator is required; it can help users start Ray clusters.

@chenk008
Contributor

chenk008 commented Sep 8, 2021

I prefer to add a ray-operator directory in ray-contrib. We can split it out into its own repository in the future.

└── ray-contrib
    ├── README.md
    └── ray-operator

@akanso
Collaborator

akanso commented Sep 9, 2021

I think so far we have converged on

└── ray-contrib
    ├── README.md
    └── ray-operator

removing the HeadService definition.
keeping serviceType and isIngress, where isIngress is optional and defaults to false.

in the status, rather than MinWorkerReplicas, use the term "workers" since we are counting the workers. There is always a head, so it is implicitly known that the cluster has one extra pod as the head.

make sure PRs are reviewed by at least one other company, to avoid having contributors from the same company both submit and approve PRs.

enhance testing, use release tags, and split the README file into multiple ones in different directories (e.g. examples, ...).

The ./msft-operator/ray-operator will be used as the starting point, and the changes will be made on that implementation. We will remove any company name from this repo to make it more user friendly.

additional tools that are not part of the operator (e.g. kubectl extensions to show status, dashboards, etc.) can be stored under

└── ray-contrib
    ├── README.md
    └── ray-operator
    └── ray-operator-tools

Jeffwan added a commit to Jeffwan/kuberay that referenced this issue Sep 10, 2021
Community has decided to move to unified golang ray operator and different companies won't maintain their own implementation later. See issue ray-project#17 for more details.

Signed-off-by: Jiaxin Shan <[email protected]>
@Jeffwan
Collaborator Author

Jeffwan commented Sep 10, 2021

To keep the commit and change history clean, we can file separate PRs, which will be easier to review:

  1. Deprecate the existing implementations and move the reusable code to /ray-contrib/ray-operator without modification
  2. Apply new changes on top of the above step; this gives us the real code changes

I created a stale branch https://github.com/ray-project/ray-contrib/tree/stale as a backup, just in case.

Jeffwan added a commit that referenced this issue Sep 10, 2021
Community has decided to move to unified golang ray operator and different companies won't maintain their own implementation later. See issue #17 for more details.

Signed-off-by: Jiaxin Shan <[email protected]>
Jeffwan added a commit to Jeffwan/kuberay that referenced this issue Sep 10, 2021
Rename msft-operator to ray-operator. See ray-project#17 for more details

Signed-off-by: Jiaxin Shan <[email protected]>
Jeffwan added a commit that referenced this issue Sep 10, 2021
Rename msft-operator to ray-operator. See #17 for more details

Signed-off-by: Jiaxin Shan <[email protected]>
@Jeffwan Jeffwan added the enhancement New feature or request label Sep 12, 2021
@Jeffwan Jeffwan self-assigned this Sep 13, 2021
@DmitriGekhtman
Collaborator

I haven't had a chance to look at things carefully.
I prefer the out-of-tree autoscaler:
the user or an external autoscaler can specify workers to delete and desired scales for a finite collection of Ray worker pod types.
(scale == numWorkers == numReplicas, whatever the choice of terminology)
I think that covers most of the requirements we'd like to express from the Anyscale side.
cc @sasha-s @yiranwang52 for anything you'd like to add

@akanso
Collaborator

akanso commented Sep 16, 2021

@DmitriGekhtman with the approach followed today, an external autoscaler can update the RayCluster custom resource to scale up/down.

Scaling up is straightforward: the autoscaler changes the number of replicas.

Scaling down requires: 1 - changing the number of replicas, 2 - (optionally) specifying which specific pods to delete first.

If (currentNumber - desiredNumber) == length(listOfPodsToDelete), then we don't delete random pods; otherwise, after deleting the specific pods in the list, we remove random ones until the desiredNumber is met.

This allows for the most flexibility when scaling down, allowing deletion of either random pods or specific pods.
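
A rough Go sketch of that rule, selecting pods by name and simplifying "random" removal to taking the remaining pods in order:

package main

import "fmt"

// podsToRemove applies the scale-down rule above: explicitly listed pods go
// first; if the list does not cover the whole gap, arbitrary extra pods are
// removed until the desired count is met.
func podsToRemove(current []string, desired int, workersToDelete []string) []string {
	gap := len(current) - desired
	if gap <= 0 {
		return nil
	}
	listed := make(map[string]bool, len(workersToDelete))
	removed := make([]string, 0, gap)
	for _, name := range workersToDelete {
		if len(removed) < gap {
			listed[name] = true
			removed = append(removed, name)
		}
	}
	for _, name := range current {
		if len(removed) >= gap {
			break
		}
		if !listed[name] {
			removed = append(removed, name)
		}
	}
	return removed
}

func main() {
	// Four workers, desired two, with w-1 explicitly listed: w-1 goes first,
	// then one arbitrary worker. Prints [w-1 w-0].
	fmt.Println(podsToRemove([]string{"w-0", "w-1", "w-2", "w-3"}, 2, []string{"w-1"}))
}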

@DmitriGekhtman
Collaborator

@akanso
that sounds good to me

Is there currently support for adjusting the scale of different worker pod types independently?

@Jeffwan Jeffwan closed this as completed Jan 19, 2022
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this issue Sep 24, 2023
Community has decided to move to unified golang ray operator and different companies won't maintain their own implementation later. See issue ray-project#17 for more details.

Signed-off-by: Jiaxin Shan <[email protected]>
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this issue Sep 24, 2023
Rename msft-operator to ray-operator. See ray-project#17 for more details

Signed-off-by: Jiaxin Shan <[email protected]>
sutaakar pushed a commit to sutaakar/kuberay that referenced this issue Nov 15, 2024