Add unit tests for raycluster_controller reconcilePods function #219
You could make all tests run with both values for the flag (PrioritizeWorkersToDelete)
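A minimal sketch of what running a check under both flag values could look like. It assumes the flag is a package-level boolean; the variable declared here and the helper forBothFlagValues are hypothetical stand-ins for illustration, not the operator's actual code.

```go
package main

import "fmt"

// PrioritizeWorkersToDelete is a hypothetical stand-in for the operator flag,
// assumed here to be a package-level boolean.
var PrioritizeWorkersToDelete = false

// forBothFlagValues runs check once with the flag set to false and once with
// it set to true, then restores the value the flag had before the call.
func forBothFlagValues(check func(flag bool)) {
	original := PrioritizeWorkersToDelete
	defer func() { PrioritizeWorkersToDelete = original }()

	for _, value := range []bool{false, true} {
		PrioritizeWorkersToDelete = value
		check(value)
	}
}

func main() {
	forBothFlagValues(func(flag bool) {
		// A real test body would build the fake client and call reconcilePods here.
		fmt.Printf("running checks with PrioritizeWorkersToDelete=%v\n", flag)
	})
}
```

Inside a real test file, the callback could wrap each case in t.Run so that failures are reported per flag value.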
Labels: map[string]string{
	common.RayClusterLabelKey: instanceName,
	common.RayNodeTypeLabelKey: string(rayiov1alpha1.HeadNode),
	common.RayNodeGroupLabelKey: groupNameStr,
I think this should be "headgroup", not "small-group". The reconcilePods method does filter to get headPods and workerPods. Does your setup reproduce that correctly?
Good point. I will update the group name. This test setup just considers a part of pod resources.
assert.Equal(t, int(expectReplicaNum), len(podList.Items),
	"Replica number is wrong after reconcile expect %d actual %d", expectReplicaNum, len(podList.Items))
}
Should we also check that WorkersToDelete has been cleared at the end?
Good point. Will add it.
As I mentioned earlier, we should make sure WorkersToDelete has been deleted in the CR - not just in testRayCluster
	"Replica number is wrong after reconcile expect %d actual %d", expectReplicaNum, len(podList.Items))

for i := 0; i < len(podList.Items); i++ {
	if contains(testRayCluster.Spec.WorkerGroupSpecs[0].ScaleStrategy.WorkersToDelete, podList.Items[i].Name) {
This part was not clear to me when reading the code -- I would expect the WorkersToDelete to be cleared already. But maybe that's only the case in the PrioritizeWorkersToDelete = true codepath :)
(the same comment applies to the tests below)
yes, will update the code
assert.Nil(t, err, "Fail to get pod list")
assert.Equal(t, len(testPods), len(podList.Items), "Init pod list len is wrong")

// Simulate 2 pod container crash.
I'm not sure if that actually simulates a crash of the pod -- the pod spec wouldn't get deleted even if that happens, rather it would go into ERRORED or a similar state I think. Maybe that's better to test here. Pod specs only get deleted if somebody does that (and that's also great to test for, but it probably shouldn't be called a crash in that case).
Sure. Will rename it to the pod container removed state. Will try to add an ERRORED unit test.
Thanks a lot for putting these tests together, this is really great! I left some small comments (which might just expose my lack of knowledge about how the code actually works).
		r.Recorder.Eventf(instance, v1.EventTypeNormal, "Deleted", "Deleted pod %s", pod.Name)
	}
}
worker.ScaleStrategy.WorkersToDelete = []string{}
instance.Spec.WorkerGroupSpecs[index].ScaleStrategy.WorkersToDelete = []string{}
This is a bug fix to clear the instance.Spec.WorkerGroupSpecs[index].ScaleStrategy.WorkersToDelete
Isn't that precisely what the line above is doing? worker = instance.Spec.WorkerGroupSpecs[index] as per the outer for loop :)
Actually no. WorkerGroupSpecs []WorkerGroupSpec is not WorkerGroupSpecs []*WorkerGroupSpec, so worker is a copy of the slice element, and updating worker will not update instance.Spec.WorkerGroupSpecs[index].
This was found by the unit tests. So instance.Spec.WorkerGroupSpecs[index].ScaleStrategy.WorkersToDelete = []string{} is needed.
Check this for more info.
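The copy semantics described above can be reproduced in isolation. The sketch below uses trimmed-down, hypothetical stand-ins for the KubeRay types (only the fields needed here) and compares clearing through a value copy with clearing through a pointer to the slice element.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the KubeRay spec types.
type ScaleStrategy struct {
	WorkersToDelete []string
}

type WorkerGroupSpec struct {
	GroupName     string
	ScaleStrategy ScaleStrategy
}

// clearByValue assigns to a copy of each element, so the slice itself is
// left unchanged. It returns how many names it attempted to clear.
func clearByValue(specs []WorkerGroupSpec) int {
	attempted := 0
	for _, worker := range specs {
		attempted += len(worker.ScaleStrategy.WorkersToDelete)
		worker.ScaleStrategy.WorkersToDelete = []string{} // modifies the copy only
	}
	return attempted
}

// clearByPointer takes the address of each element, so the assignment
// reaches the slice's backing array.
func clearByPointer(specs []WorkerGroupSpec) int {
	cleared := 0
	for index := range specs {
		worker := &specs[index]
		cleared += len(worker.ScaleStrategy.WorkersToDelete)
		worker.ScaleStrategy.WorkersToDelete = []string{}
	}
	return cleared
}

// newSpecs builds one worker group with two pods marked for deletion.
func newSpecs() []WorkerGroupSpec {
	return []WorkerGroupSpec{{
		GroupName:     "small-group",
		ScaleStrategy: ScaleStrategy{WorkersToDelete: []string{"pod1", "pod2"}},
	}}
}

func main() {
	byValue := newSpecs()
	clearByValue(byValue)
	fmt.Println(len(byValue[0].ScaleStrategy.WorkersToDelete)) // prints 2

	byPointer := newSpecs()
	clearByPointer(byPointer)
	fmt.Println(len(byPointer[0].ScaleStrategy.WorkersToDelete)) // prints 0
}
```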
Thanks for catching this bug.
Could you replace line 239 with:
for index := range instance.Spec.WorkerGroupSpecs {
worker = &instance.Spec.WorkerGroupSpecs[index]
then you would not require both lines 272 and 273
However, even after you do this, does the CR actually get updated? The only place I see this happening is within updateStatus() which has:
if err := r.Status().Update(context.Background(), instance); err != nil {
Doesn't this only update the Status and not the Spec? See https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/client#StatusClient
So how does this clear WorkersToDelete in the CR?
Separately (strictly speaking) the operator should not be changing the Spec, it should only update the Status given the declarative nature of the k8s programming style. But in this case I would say it is OK given that ScaleStrategy is not really declarative.
Also would like to discuss the sync logic.
Right now, this
for index := range instance.Spec.WorkerGroupSpecs {
worker = &instance.Spec.WorkerGroupSpecs[index]
has a conflict with the diff < 0 (line 310) branch code.
After thinking about this a bit I feel we should not be clearing WorkersToDelete. The abstraction we should follow is:
Some rules to follow:
Having said all this, I argue that the operator should not update WorkersToDelete. But we should make sure that repeated iterations of the reconciler do not become less efficient as a consequence. On reading the code, we can make one small change to address this: there are 4 places in the code that iterate over the pods in WorkersToDelete, which look like this:
We should simply add the following at the beginning of each of these loops:
We can do this most efficiently perhaps by changing runningPods to be a list of pod names instead of a list of pods.
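The snippets this comment refers to are not preserved here, so the following is only one possible reading of the idea: keep the names of running pods in a set and skip any entry of WorkersToDelete that is no longer running, so that no delete call is issued for it. The helper names runningPodNames and podsStillToDelete are hypothetical.

```go
package main

import "fmt"

// runningPodNames builds a set of pod names for constant-time lookups.
func runningPodNames(names []string) map[string]struct{} {
	set := make(map[string]struct{}, len(names))
	for _, name := range names {
		set[name] = struct{}{}
	}
	return set
}

// podsStillToDelete returns the entries of workersToDelete that are still
// running. Entries that are already gone are skipped, so a caller would not
// issue a delete request for them.
func podsStillToDelete(workersToDelete []string, running map[string]struct{}) []string {
	remaining := make([]string, 0, len(workersToDelete))
	for _, name := range workersToDelete {
		if _, ok := running[name]; !ok {
			continue // pod is not running any more, nothing to delete
		}
		remaining = append(remaining, name)
	}
	return remaining
}

func main() {
	running := runningPodNames([]string{"pod1", "pod3"})
	fmt.Println(podsStillToDelete([]string{"pod1", "pod2"}, running)) // prints [pod1]
}
```

With this shape, WorkersToDelete in the spec can stay untouched across reconcile iterations while stale entries cost only a map lookup.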
Thanks for writing down your thoughts -- not having the operator update the WorkersToDelete makes sense to me and with your optimization that should be fine. We can implement this first and test it and then if it works well, we can hopefully move to a more declarative way to specify the goal state. For example if we replace the workersToDelete with a target list of pods that should do the trick :)
…project#219)
* Write raycluster_controller reconcilePods unit tests
* Improve unit tests code
* Fix lint issue and Improve unit tests code
* Run goimports
* Fix a bug of workersToDelete update and update unit tests
* Fix a bug of workersToDelete update and update unit tests
* Update unit tests log
* Update workersToDelete local var to avoid unuseful kube api server delete call
* Remove code of updating instance spec workersToDelete
Co-authored-by: Taikun Liu <[email protected]>
Why are these changes needed?
This diff adds unit tests for the KubeRay reconcilePods function to improve code reliability.
Related issue number
None
Checks