[release blocker][Feature] Only Autoscaler can make decisions to delete Pods #1253

Merged
merged 3 commits into from
Jul 20, 2023

@kevin85421 kevin85421 commented Jul 20, 2023

Why are these changes needed?

Without this PR, there are two mechanisms to delete Pods.

  1. Autoscaler: To scale down the cluster, the Autoscaler sends a patch that updates the replicas and workersToDelete fields in the RayCluster CR. The KubeRay operator then deletes every Pod listed in workersToDelete.
  2. Random Pod deletion: If every Pod in workersToDelete has been deleted and more deletions are still required to reach the target number of replicas, the KubeRay operator deletes worker Pods at random.
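
To make the interaction between the two mechanisms concrete, here is a minimal sketch of the pre-PR behavior. It is not the operator's actual code: `planDeletions` and the Pod names are hypothetical, and the "random" choice is simplified to taking the first surplus Pods.

```go
package main

import "fmt"

// planDeletions is a hypothetical helper, not part of KubeRay.
// It splits deletions into the two mechanisms described above:
//   - explicit: running Pods that are listed in workersToDelete
//   - random:   surplus Pods removed only to reach the target replica count
func planDeletions(running, workersToDelete []string, replicas int) (explicit, random []string) {
	marked := make(map[string]bool, len(workersToDelete))
	for _, name := range workersToDelete {
		marked[name] = true
	}

	var remaining []string
	for _, name := range running {
		if marked[name] {
			explicit = append(explicit, name) // mechanism 1: chosen by the Autoscaler
		} else {
			remaining = append(remaining, name)
		}
	}

	// Mechanism 2: if there are still more Pods than desired replicas,
	// pick the surplus arbitrarily (simplified here to the first ones).
	if surplus := len(remaining) - replicas; surplus > 0 {
		random = remaining[:surplus]
	}
	return explicit, random
}

func main() {
	running := []string{"worker-a", "worker-b", "worker-c", "worker-d"}
	explicit, random := planDeletions(running, []string{"worker-b"}, 2)
	fmt.Println("deleted via workersToDelete:", explicit)
	fmt.Println("deleted arbitrarily:", random)
}
```

In this sketch the second list is the problematic one: the operator, not the Autoscaler, decides which Pods in it are removed.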

This PR changes the Pod deletion behavior:

  • Case 1: If the Autoscaler is disabled, random Pod deletion is always enabled.
  • Case 2: If the Autoscaler is enabled, random Pod deletion is disabled by default. A feature flag, ENABLE_RANDOM_POD_DELETE, lets users go back to the old behavior in case this change overlooks some edge cases.

Random Pod deletion is undesirable for numerous reasons, so we should avoid it as much as possible. The new behavior also avoids Case 4 described in #1238 (comment).
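
The two cases reduce to a single boolean check, which matches the condition `!enableInTreeAutoscaling || enableRandomPodDelete` in the diff. The sketch below is illustrative only: `randomPodDeleteAllowed` is a hypothetical name, and reading ENABLE_RANDOM_POD_DELETE from the environment and treating only "true" as enabled are assumptions, not confirmed details of the operator.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Name of the feature flag from the PR description.
const enableRandomPodDeleteEnv = "ENABLE_RANDOM_POD_DELETE"

// randomPodDeleteAllowed is a hypothetical helper, not KubeRay's API.
// Case 1: without the Autoscaler, random deletion is always allowed.
// Case 2: with the Autoscaler, it is allowed only if the flag opts in.
// Assumption: the flag counts as set only when its value is "true".
func randomPodDeleteAllowed(autoscalingEnabled bool, flagValue string) bool {
	if !autoscalingEnabled {
		return true
	}
	return strings.ToLower(strings.TrimSpace(flagValue)) == "true"
}

func main() {
	// Assumption: the flag is supplied as an environment variable.
	flag := os.Getenv(enableRandomPodDeleteEnv)
	for _, autoscaling := range []bool{false, true} {
		fmt.Printf("autoscaling=%t flag=%q -> random deletion allowed: %t\n",
			autoscaling, flag, randomPodDeleteAllowed(autoscaling, flag))
	}
}
```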

Related issue number

#1238

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Test 1: No Autoscaler

helm install kuberay-operator kuberay/kuberay-operator --version 0.6.0-rc.0 --set image.repository=controller,image.tag=latest

# Create a RayCluster with 1 head and 1 worker
helm install raycluster kuberay/ray-cluster --version 0.6.0-rc.0

# Edit the worker's `replicas` to 2
# [Expected result]: 1 new worker will be created
kubectl edit rayclusters.ray.io raycluster-kuberay

# Edit the worker's `replicas` to 1
# [Expected result]: 1 random worker Pod will be deleted => random Pod deletion is enabled.
kubectl edit rayclusters.ray.io raycluster-kuberay

Test 2: Autoscaler

helm install kuberay-operator kuberay/kuberay-operator --version 0.6.0-rc.0 --set image.repository=controller,image.tag=latest

# Create an autoscaling-enabled RayCluster with 1 head and 1 worker
kubectl apply -f ray-cluster.autoscaler.yaml

# Execute a script in the head Pod.
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -it $HEAD_POD -- bash

# https://gist.github.com/kevin85421/c35b69ff99cd30c55abbe4d5ccf2a18e
# Create 3 actors and each one requires a CPU, so Autoscaler scales up to 2 workers. (1 head = 1 CPU, 2 workers = 2 * 1 CPU)
python script.py 

# After `script.py` finishes, it will scale down to 1 worker because `minReplicas` is 1.
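
The expected worker counts in Test 2 follow from simple CPU arithmetic. The sketch below only reproduces that arithmetic under the test's assumptions (1 CPU on the head, 1 CPU per worker, `minReplicas` of 1); `workersNeeded` is a hypothetical function and is not how the Ray Autoscaler actually computes its target.

```go
package main

import "fmt"

// workersNeeded is a hypothetical helper that mirrors the arithmetic in Test 2.
// It returns how many workers cover the CPU demand the head cannot serve,
// never going below minReplicas.
func workersNeeded(cpuDemand, headCPUs, cpusPerWorker, minReplicas int) int {
	workers := 0
	if shortfall := cpuDemand - headCPUs; shortfall > 0 {
		// Ceiling division: partial demand still needs a whole worker.
		workers = (shortfall + cpusPerWorker - 1) / cpusPerWorker
	}
	if workers < minReplicas {
		workers = minReplicas
	}
	return workers
}

func main() {
	// While script.py runs: 3 actors, each requiring 1 CPU.
	fmt.Println("workers while actors run:", workersNeeded(3, 1, 1, 1))
	// After script.py finishes: no demand, so only minReplicas remains.
	fmt.Println("workers after script ends:", workersNeeded(0, 1, 1, 1))
}
```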

@kevin85421 kevin85421 changed the title WIP [release blocker][Feature] Only Autoscaler can make decisions to delete Pods Jul 20, 2023
@kevin85421 kevin85421 marked this pull request as ready for review July 20, 2023 07:16
@wjzhou-ep

Thank you very much for this new feature. I do think this is a really good idea!!!

One source of truth is always preferable.

@kevin85421 kevin85421 requested a review from gvspraveen July 20, 2023 16:35
@kevin85421 kevin85421 merged commit a163a2e into ray-project:master Jul 20, 2023
if !enableInTreeAutoscaling || enableRandomPodDelete {
// diff < 0 means that we need to delete some Pods to meet the desired number of replicas.
randomlyRemovedWorkers := -diff
r.Log.Info("reconcilePods", "Number workers to delete randomly", randomlyRemovedWorkers, "Worker group", worker.GroupName)
Contributor

nit: rephrase the log statement.
"Randomly pick xx workers to delete from worker group xxx."

}
r.Recorder.Eventf(instance, corev1.EventTypeNormal, "Deleted", "Deleted Pod %s", randomPodToDelete.Name)
} else {
r.Log.Info(fmt.Sprintf("Random Pod deletion is disabled for cluster %s. The only decision-maker for Pod deletions is Autoscaler.", instance.Name))
Contributor

nit: Ray Autoscaler. (if that is what you mean :) )

kevin85421 added a commit to kevin85421/kuberay that referenced this pull request Jul 20, 2023
…te Pods (ray-project#1253)

Only Autoscaler can make decisions to delete Pods
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…te Pods (ray-project#1253)

Only Autoscaler can make decisions to delete Pods