Consolidate different operators into a unified golang ray operator #17
Comments
LGTM.
I prefer to add
I think so far we have converged on:
- removing it in the status
- making sure PRs are reviewed by at least 1 extra company, to avoid having contributors from the same company submit PRs and approve them
- enhancing testing
- using release tags
- splitting the Readme file into multiple ones in different directories (e.g. examples, ...)

The ./msft-operator/ray-operator will be used as the starting point, and the changes will be done on that implementation. We will remove any company name from this repo to make it more user friendly. Additional tools that are not part of the operator (e.g. kubectl extensions to show status, dashboards, etc.) can be stored under a separate directory.
To keep the commit and change history clean, we can file separate PRs, which will be easier to review.
I created a stale branch https://github.com/ray-project/ray-contrib/tree/stale as a backup, just in case.
Community has decided to move to unified golang ray operator and different companies won't maintain their own implementation later. See issue #17 for more details. Signed-off-by: Jiaxin Shan <[email protected]>
Rename msft-operator to ray-operator. See #17 for more details Signed-off-by: Jiaxin Shan <[email protected]>
I haven't had a chance to look at things carefully yet.
@DmitriGekhtman With the approach followed today, an external auto-scaler can update the RayCluster custom resource. Scaling up is straightforward: the auto-scaler changes the number of replicas. Scaling down requires: 1) changing the number of replicas, and 2) optionally specifying which specific pods to delete first. If (currentNumber - desiredNumber) == length(listOfPodToDelete), then we don't delete random pods; otherwise, after deleting the specific pods in the list, we remove random ones until the desiredNumber is met. This allows the most flexibility when scaling down: we can delete either random pods or specific pods.
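A minimal Go sketch of that scale-down rule, using hypothetical names (`selectPodsToRemove`, `workersToDelete`) rather than the actual operator code, just to illustrate the selection logic:

```go
package main

import "fmt"

// selectPodsToRemove applies the rule described above: delete the
// explicitly listed pods first, then remove arbitrary pods until the
// desired replica count is reached.
func selectPodsToRemove(currentPods, workersToDelete []string, desiredReplicas int) []string {
	toRemove := len(currentPods) - desiredReplicas
	if toRemove <= 0 {
		return nil
	}

	listed := make(map[string]bool, len(workersToDelete))
	for _, name := range workersToDelete {
		listed[name] = true
	}

	var victims []string
	// First, honor the explicit deletion list.
	for _, pod := range currentPods {
		if len(victims) == toRemove {
			return victims
		}
		if listed[pod] {
			victims = append(victims, pod)
		}
	}
	// If the list did not cover the whole difference, remove arbitrary pods.
	for _, pod := range currentPods {
		if len(victims) == toRemove {
			break
		}
		if !listed[pod] {
			victims = append(victims, pod)
		}
	}
	return victims
}

func main() {
	pods := []string{"worker-a", "worker-b", "worker-c", "worker-d"}
	fmt.Println(selectPodsToRemove(pods, []string{"worker-c"}, 2))
	// Output: [worker-c worker-a] — the listed pod first, then an arbitrary one.
}
```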
@akanso Is there currently support for adjusting the scale of different worker pod types independently? |
After talking with @chenk008 @caitengwei @akanso, I think we already have some consensus on this plan. Before we move forward, let's discuss some detail items that are still not clear. Please have a review and leave your comments.
/cc @chenk008 @caitengwei @akanso @chaomengyuan @wilsonwang371
/cc @DmitriGekhtman @yiranwang52
API
It might be easiest to follow one of the three styles and make the required changes to meet everyone's needs. Ant Group uses some low-level objects in its API to gain granular control and reusability. MSFT and Bytedance both use PodTemplate for strong consistency with Pod specs; the distinctions between the two come down to some details.
Let's use the MSFT API as the template here and see if there's anything else that needs to be added.
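As a rough illustration of the PodTemplate style (hypothetical field names, not the final API), a worker group definition might look like this:

```go
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
)

// WorkerGroupSpec sketches a PodTemplate-based group definition: each
// worker group carries its own replica count and a full pod template,
// so different worker types can be sized independently.
type WorkerGroupSpec struct {
	// GroupName identifies the worker group (e.g. "small-workers").
	GroupName string `json:"groupName"`
	// Replicas is the desired number of worker pods in this group.
	Replicas *int32 `json:"replicas,omitempty"`
	// Template is a standard Kubernetes pod template, reused verbatim
	// for consistency with Pod specs.
	Template corev1.PodTemplateSpec `json:"template"`
}
```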
Ant and MSFT use different ways to support scale-down in the API, and I personally think there are tradeoffs. `IdList` can exactly reflect the accurate state of the cluster, but the list could get long for large clusters. `workersToDelete` overcomes that large-cluster pain point, but the information in RayClusterSpec could be outdated, which is still acceptable. I think at this moment we both agree to support `workersToDelete` first. If we have further requirements, we can extend `ScaleStrategy` for richer configurations.

Ant Group
https://github.com/ray-project/ray-contrib/blob/881b59aa44d184a7bdf177b8782d4e1be358fcdf/antgroup/deploy/ray-operator/api/v1alpha1/raycluster_types.go#L89-L93
MSFT
https://github.com/ray-project/ray-contrib/blob/881b59aa44d184a7bdf177b8782d4e1be358fcdf/msft-operator/ray-operator/api/v1alpha1/raycluster_types.go#L56-L60
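For reference, a sketch of the workersToDelete-first shape (loosely modeled on the linked definitions, not copied from them):

```go
package v1alpha1

// ScaleStrategy sketches the scale-down contract discussed above: an
// external autoscaler lowers Replicas on the group spec and may name the
// exact pods it wants removed first.
type ScaleStrategy struct {
	// WorkersToDelete lists pod names the controller should remove first
	// when scaling down; any remaining difference is removed arbitrarily.
	WorkersToDelete []string `json:"workersToDelete,omitempty"`
}
```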
HeadService
https://github.com/ray-project/ray-contrib/blob/881b59aa44d184a7bdf177b8782d4e1be358fcdf/msft-operator/ray-operator/api/v1alpha1/raycluster_types.go#L17-L19
I feel the ideal way to handle this is to add one more field to HeadGroupSpec and ask the controller to generate the service. The only thing the user wants to control is what kind of service (including ingress) to create. It could be `ClusterIP`, `NodePort`, `LoadBalancer`, or `Ingress`, depending on the network environment. Ports should be exposed based on the head pod settings.
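For example (a sketch only, field names assumed), the head group spec could expose just the service type and let the controller derive the ports from the head pod:

```go
package v1alpha1

import corev1 "k8s.io/api/core/v1"

// HeadGroupSpec sketch: the user only picks how the head service is
// exposed; the controller generates the Service (or Ingress) itself and
// derives the ports from the head pod's container spec.
type HeadGroupSpec struct {
	// ServiceType is one of ClusterIP, NodePort or LoadBalancer; an
	// Ingress could be requested through an additional field.
	ServiceType corev1.ServiceType `json:"serviceType,omitempty"`
	// Template is the head pod template.
	Template corev1.PodTemplateSpec `json:"template"`
}
```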
RayClusterStatus

https://github.com/ray-project/ray-contrib/blob/881b59aa44d184a7bdf177b8782d4e1be358fcdf/msft-operator/ray-operator/api/v1alpha1/raycluster_types.go#L63-L71

This is not enough to reflect the cluster status. I feel it's better to extend the status to concentrate more on cluster-level status, like Bytedance did here: https://github.com/ray-project/ray-contrib/blob/881b59aa44d184a7bdf177b8782d4e1be358fcdf/bytedance/pkg/api/v1alpha1/raycluster_types.go#L87-L102
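A cluster-level status in that spirit might look roughly like this (a sketch with assumed field names, not the Bytedance definition verbatim):

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// RayClusterStatus sketch: concentrate on cluster-level facts that users
// and autoscalers care about, rather than only per-pod bookkeeping.
type RayClusterStatus struct {
	// State is a coarse cluster phase, e.g. "creating", "ready", "failed".
	State string `json:"state,omitempty"`
	// AvailableWorkerReplicas is the number of worker pods that are ready.
	AvailableWorkerReplicas int32 `json:"availableWorkerReplicas,omitempty"`
	// DesiredWorkerReplicas is the total replicas requested across groups.
	DesiredWorkerReplicas int32 `json:"desiredWorkerReplicas,omitempty"`
	// LastUpdateTime records when the status was last reconciled.
	LastUpdateTime *metav1.Time `json:"lastUpdateTime,omitempty"`
}
```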
It seems both Ant and MSFT use an out-of-tree autoscaler, while Bytedance still uses the in-tree one. The changes on the operator side are:
- Add `--autoscaling-config=$PATH` to the head commands. The operator also needs to convert the custom resource into an autoscaling config, store it in a ConfigMap, and mount it into the head pod.
- Add `--no-monitor` to the head start commands.

It would be good to add a switch to cluster.spec to control this logic. Do you have different opinions?
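A sketch of how the controller might assemble the head start command, assuming a hypothetical enable-autoscaling switch on the cluster spec:

```go
package controllers

import "strings"

// buildHeadStartArgs sketches the flag handling described above: when the
// (hypothetical) autoscaling switch is on, the head is started with
// --no-monitor and pointed at an autoscaling config that the operator has
// rendered into a ConfigMap and mounted into the head pod.
func buildHeadStartArgs(enableAutoscaling bool, autoscalingConfigPath string) string {
	args := []string{"ray", "start", "--head"}
	if enableAutoscaling {
		args = append(args,
			"--no-monitor",
			"--autoscaling-config="+autoscalingConfigPath,
		)
	}
	return strings.Join(args, " ")
}
```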
Kubebuilder uses webhooks for validation and default setting. I feel that is a little heavy and not a performance-efficient approach.
I suggest adding basic CRD validation and using defaulters.go instead.
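For instance, basic validation can be expressed with kubebuilder CRD markers, and defaults can live in a plain Go defaulting helper; the fields below are illustrative, not the agreed API:

```go
package v1alpha1

// Illustrative only: validation via CRD markers instead of a webhook.
type RayClusterSpec struct {
	// +kubebuilder:validation:Enum=ClusterIP;NodePort;LoadBalancer
	ServiceType string `json:"serviceType,omitempty"`

	// +kubebuilder:validation:Minimum=0
	Replicas *int32 `json:"replicas,omitempty"`
}

// setDefaultsRayCluster is a defaulters.go-style helper the controller can
// apply at the start of reconciliation, avoiding a defaulting webhook.
func setDefaultsRayCluster(spec *RayClusterSpec) {
	if spec.ServiceType == "" {
		spec.ServiceType = "ClusterIP"
	}
	if spec.Replicas == nil {
		one := int32(1)
		spec.Replicas = &one
	}
}
```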
Management
`ray-contrib` aims to hold different community code, but currently there are only a few ray-operator implementations.
Code review
What review style do we want to follow? Should every PR be approved by reviewers from two different companies?
Release
Even though every company may have a downstream version, we'd like to effectively track and control changes. I suggest following semantic versioning.
I will prepare some release tools to help manage this easily.
Testing and Tools
Presubmit Testing.
MSFT already enables GitHub workflows that trigger basic testing against submitted PRs. I think we should keep doing this and add e2e testing to guarantee code quality.
Host container image
This would be super helpful for community members to achieve one-click deployment.
I am not sure if Anyscale can create a robot token to get access to DockerHub. GitHub supports encrypted secrets, and we can use a GitHub Action to retrieve the token and push to the Docker registry.
I notice both Ant Group and MSFT add Bazel here and here. I feel this is complicated and not the native way to manage Golang dependencies. The Kubernetes community has already removed Bazel, and I would suggest not using it in the new project.
Kubebuilder has already released 3.0; I think it would be good to upgrade to the latest version.
Let's also pin major stack versions for the first release: Kubernetes 1.19.x and Golang 1.15.
Please check whether you have any concerns about the plans above and whether any features are missing.