[Feature] Make replicas optional for WorkerGroupSpec #1443

Merged: 10 commits, Sep 25, 2023. Changes shown from 6 commits.
13 changes: 9 additions & 4 deletions helm-chart/kuberay-operator/crds/ray.io_rayclusters.yaml
@@ -6087,11 +6087,15 @@ spec:
them by name
type: string
maxReplicas:
description: MaxReplicas defaults to maxInt32
default: 2147483647
description: MaxReplicas denotes the maximum number of desired
Pods for this worker group.
format: int32
type: integer
minReplicas:
description: MinReplicas defaults to 1
default: 0
description: MinReplicas denotes the minimum number of desired
Pods for this worker group.
format: int32
type: integer
rayStartParams:
@@ -6101,7 +6105,9 @@
address, object-store-memory, ...'
type: object
replicas:
description: Replicas Number of desired pods in this pod group.
default: 0
description: Replicas is the number of desired Pods for this
worker group.
format: int32
type: integer
scaleStrategy:
@@ -11568,7 +11574,6 @@ spec:
- maxReplicas
- minReplicas
- rayStartParams
- replicas
- template
type: object
type: array
14 changes: 9 additions & 5 deletions helm-chart/kuberay-operator/crds/ray.io_rayjobs.yaml
@@ -6355,11 +6355,15 @@ spec:
them by name
type: string
maxReplicas:
description: MaxReplicas defaults to maxInt32
default: 2147483647
description: MaxReplicas denotes the maximum number of desired
Pods for this worker group.
format: int32
type: integer
minReplicas:
description: MinReplicas defaults to 1
default: 0
description: MinReplicas denotes the minimum number of desired
Pods for this worker group.
format: int32
type: integer
rayStartParams:
@@ -6369,8 +6373,9 @@
command: address, object-store-memory, ...'
type: object
replicas:
description: Replicas Number of desired pods in this pod
group.
default: 0
description: Replicas is the number of desired Pods for
this worker group.
format: int32
type: integer
scaleStrategy:
@@ -12094,7 +12099,6 @@ spec:
- maxReplicas
- minReplicas
- rayStartParams
- replicas
- template
type: object
type: array
14 changes: 9 additions & 5 deletions helm-chart/kuberay-operator/crds/ray.io_rayservices.yaml
@@ -6329,11 +6329,15 @@ spec:
them by name
type: string
maxReplicas:
description: MaxReplicas defaults to maxInt32
default: 2147483647
description: MaxReplicas denotes the maximum number of desired
Pods for this worker group.
format: int32
type: integer
minReplicas:
description: MinReplicas defaults to 1
default: 0
description: MinReplicas denotes the minimum number of desired
Pods for this worker group.
format: int32
type: integer
rayStartParams:
@@ -6343,8 +6347,9 @@
command: address, object-store-memory, ...'
type: object
replicas:
description: Replicas Number of desired pods in this pod
group.
default: 0
description: Replicas is the number of desired Pods for
this worker group.
format: int32
type: integer
scaleStrategy:
@@ -12068,7 +12073,6 @@ spec:
- maxReplicas
- minReplicas
- rayStartParams
- replicas
- template
type: object
type: array
12 changes: 7 additions & 5 deletions ray-operator/apis/ray/v1alpha1/raycluster_types.go
@@ -45,12 +45,14 @@ type HeadGroupSpec struct {
type WorkerGroupSpec struct {
// we can have multiple worker groups, we distinguish them by name
GroupName string `json:"groupName"`
// Replicas Number of desired pods in this pod group. This is a pointer to distinguish between explicit
// zero and not specified. Defaults to 1.
Replicas *int32 `json:"replicas"`
// MinReplicas defaults to 1
// Replicas is the number of desired Pods for this worker group.
// +kubebuilder:default:=0
Replicas *int32 `json:"replicas,omitempty"`
// MinReplicas denotes the minimum number of desired Pods for this worker group.
// +kubebuilder:default:=0
MinReplicas *int32 `json:"minReplicas"`
// MaxReplicas defaults to maxInt32
// MaxReplicas denotes the maximum number of desired Pods for this worker group.
// +kubebuilder:default:=2147483647
MaxReplicas *int32 `json:"maxReplicas"`
// RayStartParams are the params of the start command: address, object-store-memory, ...
RayStartParams map[string]string `json:"rayStartParams"`
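With this change, Replicas becomes a pointer with omitempty and a CRD default, so a worker group can omit it entirely. A minimal sketch of what that looks like from Go (the group name and replica bounds below are illustrative, not taken from this PR):

package main

import (
	"fmt"

	rayv1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
	"k8s.io/utils/pointer"
)

func main() {
	// Replicas is intentionally left nil: after this PR the field is optional.
	// When a manifest omits it, the CRD default applies on the API server; if it
	// is still nil (or below MinReplicas), the controller falls back to MinReplicas.
	group := rayv1alpha1.WorkerGroupSpec{
		GroupName:   "small-group", // hypothetical name, for illustration only
		MinReplicas: pointer.Int32Ptr(1),
		MaxReplicas: pointer.Int32Ptr(5),
	}
	fmt.Println(group.Replicas == nil) // true
}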
13 changes: 9 additions & 4 deletions ray-operator/config/crd/bases/ray.io_rayclusters.yaml
@@ -6087,11 +6087,15 @@ spec:
them by name
type: string
maxReplicas:
description: MaxReplicas defaults to maxInt32
default: 2147483647
description: MaxReplicas denotes the maximum number of desired
Pods for this worker group.
format: int32
type: integer
minReplicas:
description: MinReplicas defaults to 1
default: 0
description: MinReplicas denotes the minimum number of desired
Pods for this worker group.
format: int32
type: integer
rayStartParams:
@@ -6101,7 +6105,9 @@
address, object-store-memory, ...'
type: object
replicas:
description: Replicas Number of desired pods in this pod group.
default: 0
description: Replicas is the number of desired Pods for this
worker group.
format: int32
type: integer
scaleStrategy:
@@ -11568,7 +11574,6 @@ spec:
- maxReplicas
- minReplicas
- rayStartParams
- replicas
- template
type: object
type: array
14 changes: 9 additions & 5 deletions ray-operator/config/crd/bases/ray.io_rayjobs.yaml
@@ -6355,11 +6355,15 @@ spec:
them by name
type: string
maxReplicas:
description: MaxReplicas defaults to maxInt32
default: 2147483647
description: MaxReplicas denotes the maximum number of desired
Pods for this worker group.
format: int32
type: integer
minReplicas:
description: MinReplicas defaults to 1
default: 0
description: MinReplicas denotes the minimum number of desired
Pods for this worker group.
format: int32
type: integer
rayStartParams:
@@ -6369,8 +6373,9 @@
command: address, object-store-memory, ...'
type: object
replicas:
description: Replicas Number of desired pods in this pod
group.
default: 0
description: Replicas is the number of desired Pods for
this worker group.
format: int32
type: integer
scaleStrategy:
@@ -12094,7 +12099,6 @@ spec:
- maxReplicas
- minReplicas
- rayStartParams
- replicas
- template
type: object
type: array
14 changes: 9 additions & 5 deletions ray-operator/config/crd/bases/ray.io_rayservices.yaml
@@ -6329,11 +6329,15 @@ spec:
them by name
type: string
maxReplicas:
description: MaxReplicas defaults to maxInt32
default: 2147483647
description: MaxReplicas denotes the maximum number of desired
Pods for this worker group.
format: int32
type: integer
minReplicas:
description: MinReplicas defaults to 1
default: 0
description: MinReplicas denotes the minimum number of desired
Pods for this worker group.
format: int32
type: integer
rayStartParams:
@@ -6343,8 +6347,9 @@
command: address, object-store-memory, ...'
type: object
replicas:
description: Replicas Number of desired pods in this pod
group.
default: 0
description: Replicas is the number of desired Pods for
this worker group.
format: int32
type: integer
scaleStrategy:
@@ -12068,7 +12073,6 @@ spec:
- maxReplicas
- minReplicas
- rayStartParams
- replicas
- template
type: object
type: array
19 changes: 3 additions & 16 deletions ray-operator/controllers/ray/raycluster_controller.go
@@ -642,22 +642,9 @@ func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv
// Reconcile worker pods now
for _, worker := range instance.Spec.WorkerGroupSpecs {
// workerReplicas will store the target number of pods for this worker group.
var workerReplicas int32
// Always honor MaxReplicas if it is set:
// If MaxReplicas is set and Replicas > MaxReplicas, use MaxReplicas as the
// effective target replica count and log the discrepancy.
// See https://github.com/ray-project/kuberay/issues/560.
if worker.MaxReplicas != nil && *worker.MaxReplicas < *worker.Replicas {
workerReplicas = *worker.MaxReplicas
r.Log.Info(
fmt.Sprintf(
"Replicas for worker group %s (%d) is greater than maxReplicas (%d). Using maxReplicas (%d) as the target replica count.",
worker.GroupName, *worker.Replicas, *worker.MaxReplicas, *worker.MaxReplicas,
),
)
} else {
workerReplicas = *worker.Replicas
}
var workerReplicas int32 = utils.GetWorkerGroupDesiredReplicas(worker)
r.Log.Info("reconcilePods", "desired workerReplicas (always adhering to minReplicas/maxReplica)", workerReplicas, "worker group", worker.GroupName, "maxReplicas", worker.MaxReplicas, "minReplicas", worker.MinReplicas, "replicas", worker.Replicas)

workerPods := corev1.PodList{}
filterLabels = client.MatchingLabels{common.RayClusterLabelKey: instance.Name, common.RayNodeGroupLabelKey: worker.GroupName}
if err := r.List(ctx, &workerPods, client.InNamespace(instance.Namespace), filterLabels); err != nil {
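The inline clamping removed above is now delegated to utils.GetWorkerGroupDesiredReplicas (added later in this diff). A dependency-free sketch of the rule it applies, with illustrative numbers only:

package main

import "fmt"

// desiredReplicas mirrors the clamping rule the reconciler now relies on:
// fall back to minReplicas when replicas is unset or below the minimum,
// and cap at maxReplicas otherwise.
func desiredReplicas(replicas *int32, minReplicas, maxReplicas int32) int32 {
	if replicas == nil || *replicas < minReplicas {
		return minReplicas
	}
	if *replicas > maxReplicas {
		return maxReplicas
	}
	return *replicas
}

func main() {
	five := int32(5)
	fmt.Println(desiredReplicas(nil, 1, 10))   // 1: unset falls back to minReplicas
	fmt.Println(desiredReplicas(&five, 1, 3))  // 3: capped at maxReplicas
	fmt.Println(desiredReplicas(&five, 1, 10)) // 5: within bounds, used as-is
}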
88 changes: 88 additions & 0 deletions ray-operator/controllers/ray/raycluster_controller_fake_test.go
@@ -2220,3 +2220,91 @@ func Test_RedisCleanup(t *testing.T) {
})
}
}

func TestReconcile_Replicas_Optional(t *testing.T) {
setupTest(t)

// This test makes some assumptions about the testRayCluster object.
// (1) 1 workerGroup (2) disable autoscaling
assert.Equal(t, 1, len(testRayCluster.Spec.WorkerGroupSpecs), "This test assumes only one worker group.")

// Disable autoscaling so that the random Pod deletion is enabled.
testRayCluster.Spec.EnableInTreeAutoscaling = pointer.BoolPtr(false)
testRayCluster.Spec.WorkerGroupSpecs[0].ScaleStrategy.WorkersToDelete = []string{}

tests := map[string]struct {
replicas *int32
minReplicas *int32
maxReplicas *int32
desiredReplicas int
}{
"Replicas is nil": {
// If `Replicas` is nil, the controller will set the desired state of the workerGroup to `MinReplicas` Pods.
// [Note]: It is not possible for `Replicas` to be nil in practice because it has a default value in the CRD.
replicas: nil,
minReplicas: pointer.Int32Ptr(1),
maxReplicas: pointer.Int32Ptr(10000),
desiredReplicas: 1,
},
"Replicas is smaller than MinReplicas": {
// If `Replicas` is smaller than `MinReplicas`, the controller will set the desired state of the workerGroup to `MinReplicas` Pods.
replicas: pointer.Int32Ptr(0),
minReplicas: pointer.Int32Ptr(1),
maxReplicas: pointer.Int32Ptr(10000),
desiredReplicas: 1,
},
"Replicas is larger than MaxReplicas": {
// If `Replicas` is larger than `MaxReplicas`, the controller will set the desired state of the workerGroup to `MaxReplicas` Pods.
replicas: pointer.Int32Ptr(4),
minReplicas: pointer.Int32Ptr(1),
maxReplicas: pointer.Int32Ptr(3),
desiredReplicas: 3,
},
}

for name, tc := range tests {
t.Run(name, func(t *testing.T) {
cluster := testRayCluster.DeepCopy()
cluster.Spec.WorkerGroupSpecs[0].Replicas = tc.replicas
cluster.Spec.WorkerGroupSpecs[0].MinReplicas = tc.minReplicas
cluster.Spec.WorkerGroupSpecs[0].MaxReplicas = tc.maxReplicas

// This test makes some assumptions about the testPods object.
// `testPods` contains 6 pods, including 1 head pod and 5 worker pods.
assert.Equal(t, 6, len(testPods), "This test assumes the testPods object contains 6 pods.")
numHeadPods := 1
oldNumWorkerPods := len(testPods) - numHeadPods

// Initialize a fake client with newScheme and runtimeObjects.
fakeClient := clientFake.NewClientBuilder().WithRuntimeObjects(testPods...).Build()
ctx := context.Background()

// Get the pod list from the fake client.
podList := corev1.PodList{}
err := fakeClient.List(ctx, &podList, client.InNamespace(namespaceStr))
assert.Nil(t, err, "Fail to get pod list")
assert.Equal(t, oldNumWorkerPods+numHeadPods, len(podList.Items), "Init pod list len is wrong")

// Initialize a new RayClusterReconciler.
testRayClusterReconciler := &RayClusterReconciler{
Client: fakeClient,
Recorder: &record.FakeRecorder{},
Scheme: scheme.Scheme,
Log: ctrl.Log.WithName("controllers").WithName("RayCluster"),
}

// The desired state of the workerGroup is tc.desiredReplicas replicas,
// so the controller will scale the worker Pods to that number.
err = testRayClusterReconciler.reconcilePods(ctx, cluster)
assert.Nil(t, err, "Fail to reconcile Pods")

err = fakeClient.List(ctx, &podList, &client.ListOptions{
LabelSelector: workerSelector,
Namespace: namespaceStr,
})
assert.Nil(t, err, "Fail to get pod list after reconcile")
assert.Equal(t, tc.desiredReplicas, len(podList.Items),
"Replica number is wrong after reconcile expect %d actual %d", tc.desiredReplicas, len(podList.Items))
})
}
}
17 changes: 16 additions & 1 deletion ray-operator/controllers/ray/utils/util.go
@@ -215,11 +215,26 @@ func GenerateIdentifier(clusterName string, nodeType rayv1alpha1.RayNodeType) st
return fmt.Sprintf("%s-%s", clusterName, nodeType)
}

func GetWorkerGroupDesiredReplicas(workerGroupSpec rayv1alpha1.WorkerGroupSpec) int32 {
// Always adhere to min/max replicas constraints. If minReplicas > maxReplicas, minReplicas will be used.
Contributor review comment: If minReplicas > maxReplicas shouldn't we raise an error instead of silently using minReplicas? At the very least should we add a warning log?
var workerReplicas int32
if workerGroupSpec.Replicas == nil || *workerGroupSpec.Replicas < *workerGroupSpec.MinReplicas {
// In practice, Replicas cannot be nil because the CRD assigns it a default value.
// This check is kept to make testing easier.
workerReplicas = *workerGroupSpec.MinReplicas
} else if *workerGroupSpec.Replicas > *workerGroupSpec.MaxReplicas {
workerReplicas = *workerGroupSpec.MaxReplicas
} else {
workerReplicas = *workerGroupSpec.Replicas
}
return workerReplicas
}

// CalculateDesiredReplicas calculate desired worker replicas at the cluster level
func CalculateDesiredReplicas(cluster *rayv1alpha1.RayCluster) int32 {
count := int32(0)
for _, nodeGroup := range cluster.Spec.WorkerGroupSpecs {
count += *nodeGroup.Replicas
count += GetWorkerGroupDesiredReplicas(nodeGroup)
}

return count
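To make the review question above concrete, here is a sketch of a direct table-driven unit test for GetWorkerGroupDesiredReplicas, including the minReplicas > maxReplicas case. Such a test is not part of this PR, and the case names and values are illustrative:

package utils

import (
	"testing"

	"github.com/stretchr/testify/assert"
	"k8s.io/utils/pointer"

	rayv1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
)

func TestGetWorkerGroupDesiredReplicas_Clamping(t *testing.T) {
	tests := map[string]struct {
		replicas    *int32
		minReplicas *int32
		maxReplicas *int32
		want        int32
	}{
		"nil Replicas falls back to MinReplicas": {
			replicas: nil, minReplicas: pointer.Int32Ptr(2), maxReplicas: pointer.Int32Ptr(10), want: 2,
		},
		"Replicas capped at MaxReplicas": {
			replicas: pointer.Int32Ptr(7), minReplicas: pointer.Int32Ptr(1), maxReplicas: pointer.Int32Ptr(3), want: 3,
		},
		"MinReplicas wins when MinReplicas > MaxReplicas": {
			replicas: pointer.Int32Ptr(1), minReplicas: pointer.Int32Ptr(5), maxReplicas: pointer.Int32Ptr(3), want: 5,
		},
	}
	for name, tc := range tests {
		t.Run(name, func(t *testing.T) {
			spec := rayv1alpha1.WorkerGroupSpec{
				Replicas:    tc.replicas,
				MinReplicas: tc.minReplicas,
				MaxReplicas: tc.maxReplicas,
			}
			assert.Equal(t, tc.want, GetWorkerGroupDesiredReplicas(spec))
		})
	}
}

If the silent MinReplicas fallback is considered surprising, the minReplicas > maxReplicas branch would be a natural place to emit a warning log, or to reject the spec in a validating webhook, as the reviewer suggests.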