
Consolidate different operators into a unified golang ray operator #17

Closed
Jeffwan opened this issue Sep 5, 2021 · 7 comments
Labels: enhancement (New feature or request)

@Jeffwan
Collaborator

Jeffwan commented Sep 5, 2021

After talking with @chenk008 @caitengwei @akanso, I think we already have some consensus on this plan. Before we move forward, let's discuss some items that are still not clear. Please have a review and leave your comments.

/cc @chenk008 @caitengwei @akanso @chaomengyuan @wilsonwang371
/cc @DmitriGekhtman @yiranwang52

API

It might be easiest to follow one of the three styles and make the required changes to meet everyone's needs. Ant Group uses some low-level objects in its API to gain granular control and reusability. MSFT and Bytedance both use PodTemplate for strong consistency with Pod specs. The distinctions between the two are mostly in the details.

Let's use the MSFT API as the template here and see if there's anything else that needs to be added.

  1. Scale in arbitrary pods
    Ant and MSFT use different ways to support this feature in the API. I personally think there are tradeoffs. IdList can exactly reflect the accurate state of the cluster, but the IdList could be long for large clusters. workersToDelete overcomes the large-cluster pain point, although the information on RayClusterSpec could be outdated, which is still acceptable. I think at this moment we both agree to support workersToDelete first. If we have further requirements, we can extend ScaleStrategy for richer configurations (a sketch follows the links below).

Ant Group
https://github.com/ray-project/ray-contrib/blob/881b59aa44d184a7bdf177b8782d4e1be358fcdf/antgroup/deploy/ray-operator/api/v1alpha1/raycluster_types.go#L89-L93

MSFT
https://github.com/ray-project/ray-contrib/blob/881b59aa44d184a7bdf177b8782d4e1be358fcdf/msft-operator/ray-operator/api/v1alpha1/raycluster_types.go#L56-L60
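
For concreteness, a minimal sketch of that direction; the workersToDelete field follows the MSFT API linked above, but the exact shape is illustrative:

package v1alpha1

// ScaleStrategy controls how a cluster scales down.
type ScaleStrategy struct {
	// WorkersToDelete lists the names of worker pods the controller should
	// remove first when scaling down; richer options can be added later.
	WorkersToDelete []string `json:"workersToDelete,omitempty"`
}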

  2. Remove HeadService
    https://github.com/ray-project/ray-contrib/blob/881b59aa44d184a7bdf177b8782d4e1be358fcdf/msft-operator/ray-operator/api/v1alpha1/raycluster_types.go#L17-L19

I feel the ideal way to do this is to add one more field to HeadGroupSpec and ask the controller to generate the Service. The only thing the user wants to control is what kind of service (including ingress) to create. It could be ClusterIP, NodePort, LoadBalancer or Ingress depending on the network environment. Ports should be exposed based on the head pod settings (a sketch is shown below).
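
A rough sketch of that idea, with illustrative field names (serviceType plus an optional ingress flag), not a final API:

package v1alpha1

import corev1 "k8s.io/api/core/v1"

// HeadGroupSpec sketches the head group after dropping the user-provided
// HeadService; the controller generates the Service from these hints.
type HeadGroupSpec struct {
	// ServiceType selects the kind of Service the controller creates for the
	// head pod: ClusterIP, NodePort or LoadBalancer.
	ServiceType corev1.ServiceType `json:"serviceType,omitempty"`
	// EnableIngress optionally asks the controller to also create an Ingress.
	EnableIngress *bool `json:"enableIngress,omitempty"`
	// Template is the head pod template; exposed Service ports are derived
	// from the container ports declared here.
	Template corev1.PodTemplateSpec `json:"template"`
}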

  3. Add more status in the RayClusterStatus

https://github.com/ray-project/ray-contrib/blob/881b59aa44d184a7bdf177b8782d4e1be358fcdf/msft-operator/ray-operator/api/v1alpha1/raycluster_types.go#L63-L71

This is not enough to reflect the cluster status. I feel it's better to extend the status to concentrate more on cluster-level status, as Bytedance did here: https://github.com/ray-project/ray-contrib/blob/881b59aa44d184a7bdf177b8782d4e1be358fcdf/bytedance/pkg/api/v1alpha1/raycluster_types.go#L87-L102 (a rough sketch follows).
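
As a starting point for that discussion, an illustrative cluster-level status in the spirit of the Bytedance API above; the exact fields are open:

package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// RayClusterStatus sketches a cluster-level status.
type RayClusterStatus struct {
	// State is a coarse cluster phase such as creating, ready or failed.
	State string `json:"state,omitempty"`
	// DesiredWorkerReplicas is the number of worker pods requested in the spec.
	DesiredWorkerReplicas int32 `json:"desiredWorkerReplicas,omitempty"`
	// AvailableWorkerReplicas is the number of worker pods that are ready.
	AvailableWorkerReplicas int32 `json:"availableWorkerReplicas,omitempty"`
	// LastUpdateTime records when the status was last reconciled.
	LastUpdateTime *metav1.Time `json:"lastUpdateTime,omitempty"`
}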

  4. Support in-tree and out-of-tree autoscaler

It seems both Ant and MSFT use the out-of-tree autoscaler, while Bytedance still uses the in-tree one. The changes on the operator side are:

  • in-tree: automatically append --autoscaling-config=$PATH to the head start command. The operator also needs to convert the custom resource into an autoscaling config, store it in a ConfigMap and mount it into the head pod.
  • out-of-tree: explicitly append --no-monitor to the head start command.

It would be good to add a switch to cluster.spec to control this logic (a sketch follows the field below). Do you have different opinions?

AutoscalingEnabled *bool `json:"autoscalingEnabled,omitempty"`
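
A sketch of how the controller could branch on that switch; it assumes autoscalingEnabled selects the in-tree path, and the helper name and config mount path are made up for illustration:

package main

import "fmt"

// headStartArgs builds the flags the operator appends to the head start
// command, based on the proposed autoscalingEnabled switch.
func headStartArgs(autoscalingEnabled *bool) []string {
	args := []string{"ray", "start", "--head"}
	if autoscalingEnabled != nil && *autoscalingEnabled {
		// In-tree: the operator converts the custom resource into an
		// autoscaling config, stores it in a ConfigMap and mounts it here.
		args = append(args, "--autoscaling-config=/etc/ray/autoscaling.yaml")
	} else {
		// Out-of-tree: disable the built-in monitor.
		args = append(args, "--no-monitor")
	}
	return args
}

func main() {
	enabled := false
	fmt.Println(headStartArgs(&enabled)) // [ray start --head --no-monitor]
}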
  5. Don't use webhook

Kubebuilder uses webhooks for validation and defaulting. I feel that's a little bit heavy and not performance-efficient.
I suggest adding basic CRD validation via validation markers and using defaulters.go instead (see the sketch below).
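
A minimal sketch of that combination: a kubebuilder validation marker for basic CRD validation plus a plain defaulting helper that would live in defaulters.go; the field shown is illustrative:

package v1alpha1

// WorkerGroupSpec shows only the pieces relevant to this point.
type WorkerGroupSpec struct {
	// Replicas is validated by the generated CRD schema instead of a webhook.
	// +kubebuilder:validation:Minimum=0
	Replicas *int32 `json:"replicas,omitempty"`
}

// SetDefaults fills in defaults at reconcile time instead of relying on a
// mutating webhook; the controller would call it before using the spec.
func (w *WorkerGroupSpec) SetDefaults() {
	if w.Replicas == nil {
		defaultReplicas := int32(1)
		w.Replicas = &defaultReplicas
	}
}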

Management

  1. Do we want to manage the ray-operator code at the root level, or have a ray-operator umbrella folder? It looks like ray-contrib aims to hold different community code, but currently there are only a few ray-operator implementations.
├── ray-contrib
│   ├── BUILD.bazel
│   ├── Dockerfile
│   ├── LICENSE
│   ├── Makefile
│   ├── README.md
│   ├── api
│   ├── config
│   ├── controllers
│   ├── go.mod
│   ├── go.sum
│   ├── hack
│   ├── main
│   └── main.go

or

└── ray-contrib
    ├── README.md
    └── ray-operator
  2. Code review
    What review style do we want to follow? Should every PR be approved by reviewers from two different companies?

  3. Release
    Even though every company may have a downstream version, we'd like to effectively track and control changes. I suggest following semantic versioning.

  • For each minor version, let's cut a release branch to track bug fixes and patches.
  • Cut a tag for every release:
# Branches
release-0.1
release-0.2

# Tags
v0.1.0
v0.1.1
v0.2.0
v0.2.1
v0.2.2

I will prepare some release tooling to help manage this easily.

Testing and Tools

  1. Presubmit testing
    MSFT already enables GitHub workflows and triggers basic testing against submitted PRs. I think we should keep doing this and add e2e testing to guarantee code quality.

  2. Host container image

This would be super helpful for community members to achieve one-click deployment.
I am not sure if Anyscale can create a robot token with access to DockerHub. GitHub supports encrypted secrets, and we can use a GitHub Action to retrieve the token and push to the docker registry.

  3. Bazel build

I notice both Ant Group and MSFT add Bazel here and here. I feel this is complicated and not the native way to manage golang dependencies. The Kubernetes community has already removed Bazel, and I would suggest not using it in the new project.

  4. Kubebuilder 2.0 vs 3.0 and Kubernetes version

Kubebuilder has already released 3.0; I think it would be good to upgrade to the latest version.
Let's also pin the major stack versions for the first release: Kubernetes 1.19.x and golang 1.15.

Please check whether you have any concerns with the above plans and whether any features are missing.

@chenk008
Contributor

chenk008 commented Sep 6, 2021

LGTM.
A unified golang ray operator is required; it can help users start Ray clusters.

@chenk008
Contributor

chenk008 commented Sep 8, 2021

I prefer to add a ray-operator directory in ray-contrib. We can split it out into its own repository in the future.

└── ray-contrib
    ├── README.md
    └── ray-operator

@akanso
Collaborator

akanso commented Sep 9, 2021

I think so far we have converged on

└── ray-contrib
    ├── README.md
    └── ray-operator

removing the HeadService definition.
keeping serviceType and isIngress, where isIngress is optional and defaults to false.

in the status, rather than MinWorkerReplicas, use the term "workers" since we are counting the workers. There is always a head, so it is implicitly known that the cluster has one extra pod as the head.

make sure PRs are reviewed by at least one other company, to avoid having contributors from the same company both submit and approve PRs.

enhance testing, use release tags, and split the README file into multiple ones in different directories (e.g. examples, ...).

The ./msft-operator/ray-operator will be used as the starting point, and the changes will be made on that implementation. We will remove any company name from this repo to make it more user friendly.

additional tools that are not part of the operator (e.g. kubectl extensions to show status, dashboards, etc.) can be stored under

└── ray-contrib
    ├── README.md
    └── ray-operator
    └── ray-operator-tools

Jeffwan added a commit to Jeffwan/kuberay that referenced this issue Sep 10, 2021
Community has decided to move to unified golang ray operator and different companies won't maintain their own implementation later. See issue ray-project#17 for more details.

Signed-off-by: Jiaxin Shan <[email protected]>
@Jeffwan
Collaborator Author

Jeffwan commented Sep 10, 2021

To keep the commit and change history clean, we can file separate PRs, which will be easier to review:

  1. Deprecate the existing implementations and move the reusable code to /ray-contrib/ray-operator without modification
  2. Apply new changes on top of the above step; this gives us the real code changes

I created a stale branch https://github.com/ray-project/ray-contrib/tree/stale as a backup, just in case.

Jeffwan added a commit that referenced this issue Sep 10, 2021
Community has decided to move to unified golang ray operator and different companies won't maintain their own implementation later. See issue #17 for more details.

Signed-off-by: Jiaxin Shan <[email protected]>
Jeffwan added a commit to Jeffwan/kuberay that referenced this issue Sep 10, 2021
Rename msft-operator to ray-operator. See ray-project#17 for more details

Signed-off-by: Jiaxin Shan <[email protected]>
Jeffwan added a commit that referenced this issue Sep 10, 2021
Rename msft-operator to ray-operator. See #17 for more details

Signed-off-by: Jiaxin Shan <[email protected]>
@Jeffwan Jeffwan added the enhancement New feature or request label Sep 12, 2021
@Jeffwan Jeffwan self-assigned this Sep 13, 2021
@DmitriGekhtman
Collaborator

I haven't had a chance to look at things carefully.
I prefer the out-of-tree autoscaler:
the user or an external autoscaler can specify workers to delete and desired scales for a finite collection of Ray worker pod types.
(scale == numWorkers == numReplicas, whatever the choice of terminology)
I think that covers most of the requirements we'd like to express from the Anyscale side.
cc @sasha-s @yiranwang52 for anything you'd like to add

@akanso
Collaborator

akanso commented Sep 16, 2021

@DmitriGekhtman with the approach followed today, an external autoscaler can update the RayCluster custom resource to scale up/down.

Scaling up is straightforward: the autoscaler changes the number of replicas.

Scaling down requires: 1 - changing the number of replicas, 2 - (optionally) specifying which specific pods to delete first.

If (currentNumber - desiredNumber) == length(listOfPodsToDelete), then we don't delete random pods; otherwise, after deleting the specific pods in the list, we remove random ones until the desiredNumber is met.

This allows for the most flexibility when scaling down, allowing deletion of either random pods or specific pods.
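
A rough Go sketch of that rule, selecting pods by name and simplifying "random" removal to taking the remaining pods in order:

package main

import "fmt"

// podsToRemove applies the scale-down rule above: explicitly listed pods go
// first; if the list does not cover the whole gap, arbitrary extra pods are
// removed until the desired count is met.
func podsToRemove(current []string, desired int, workersToDelete []string) []string {
	gap := len(current) - desired
	if gap <= 0 {
		return nil
	}
	listed := make(map[string]bool, len(workersToDelete))
	removed := make([]string, 0, gap)
	for _, name := range workersToDelete {
		if len(removed) < gap {
			listed[name] = true
			removed = append(removed, name)
		}
	}
	for _, name := range current {
		if len(removed) >= gap {
			break
		}
		if !listed[name] {
			removed = append(removed, name)
		}
	}
	return removed
}

func main() {
	// Four workers, desired two, with w-1 explicitly listed: w-1 goes first,
	// then one arbitrary worker. Prints [w-1 w-0].
	fmt.Println(podsToRemove([]string{"w-0", "w-1", "w-2", "w-3"}, 2, []string{"w-1"}))
}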

@DmitriGekhtman
Collaborator

@akanso
that sounds good to me

Is there currently support for adjusting the scale of different worker pod types independently?

@Jeffwan Jeffwan closed this as completed Jan 19, 2022
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this issue Sep 24, 2023
Community has decided to move to unified golang ray operator and different companies won't maintain their own implementation later. See issue ray-project#17 for more details.

Signed-off-by: Jiaxin Shan <[email protected]>
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this issue Sep 24, 2023
Rename msft-operator to ray-operator. See ray-project#17 for more details

Signed-off-by: Jiaxin Shan <[email protected]>
sutaakar pushed a commit to sutaakar/kuberay that referenced this issue Nov 15, 2024