Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unit tests for raycluster_controller reconcilePods function #219

Merged

Conversation

Waynegates
Copy link
Contributor

Why are these changes needed?

This diff adds unit tests for Kuberay reconcilePods function to improve the code reliability.

Related issue number

None

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@sriram-anyscale sriram-anyscale self-requested a review April 4, 2022 02:08
@sriram-anyscale
Copy link
Collaborator

You could make all tests run with both values for the flag (PrioritizeWorkersToDelete)

Labels: map[string]string{
common.RayClusterLabelKey: instanceName,
common.RayNodeTypeLabelKey: string(rayiov1alpha1.HeadNode),
common.RayNodeGroupLabelKey: groupNameStr,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be "headgroup", not "small-group". The reconcilePods method does filter to get headPods and workerPods. Does your setup reproduce that correctly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I will update the group name. This test setup just considers a part of pod resources.


assert.Equal(t, int(expectReplicaNum), len(podList.Items),
"Replica number is wrong after reconcile expect %d actual %d", expectReplicaNum, len(podList.Items))
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also check that WorkersToDelete has been cleared at the end?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Will add it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned earlier, we should make sure WorkersToDelete has been deleted in the CR - not just in testRayCluster

"Replica number is wrong after reconcile expect %d actual %d", expectReplicaNum, len(podList.Items))

for i := 0; i < len(podList.Items); i++ {
if contains(testRayCluster.Spec.WorkerGroupSpecs[0].ScaleStrategy.WorkersToDelete, podList.Items[i].Name) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part was not clear to me when reading the code -- I would expect the WorkersToDelete to be cleared already. But maybe that's only the case in the PrioritizeWorkersToDelete = true codepath :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(the same comment applies to the tests below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, will update the code

assert.Nil(t, err, "Fail to get pod list")
assert.Equal(t, len(testPods), len(podList.Items), "Init pod list len is wrong")

// Simulate 2 pod container crash.
Copy link
Collaborator

@pcmoritz pcmoritz Apr 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if that actually simulates a crash of the pod -- the pod spec wouldn't get deleted even if that happens, rather it would to into ERRORED or similar state I think. Maybe that's better to test here. Pod specs only get deleted if somebody does that (and that's also great to test for, but it probably shouldn't be called a crash in that case).

Copy link
Contributor Author

@Waynegates Waynegates Apr 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Will rename it as pod container removed state. Will try to add ERRORED unit test

@pcmoritz
Copy link
Collaborator

pcmoritz commented Apr 4, 2022

Thanks a lot for putting these tests together, this is really great! I left some small comments (which might just expose my lack of knowledge about how the code actually works).

r.Recorder.Eventf(instance, v1.EventTypeNormal, "Deleted", "Deleted pod %s", pod.Name)
}
}
worker.ScaleStrategy.WorkersToDelete = []string{}
instance.Spec.WorkerGroupSpecs[index].ScaleStrategy.WorkersToDelete = []string{}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bug fix to clear the instance.Spec.WorkerGroupSpecs[index].ScaleStrategy.WorkersToDelete

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that precisely what the line above is doing?

worker = instance.Spec.WorkerGroupSpecs[index]

as per the outer for loop :)

Copy link
Contributor Author

@Waynegates Waynegates Apr 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually no. WorkerGroupSpecs []WorkerGroupSpec is not WorkerGroupSpecs []*WorkerGroupSpec. So update worker will not update instance.Spec.WorkerGroupSpecs[index].
This is found by the unit tests. So instance.Spec.WorkerGroupSpecs[index].ScaleStrategy.WorkersToDelete = []string{} is needed.
Check this for more info.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this bug.

Could you replace line 239 with:

for index := range instance.Spec.WorkerGroupSpecs {
        worker = &instance.Spec.WorkerGroupSpecs[index]

then you would not require both lines 272 and 273

However, even after you do this, does the CR actually get updated? The only place I see this happening is within updateStatus() which has:

if err := r.Status().Update(context.Background(), instance); err != nil {

Doesn't this only update the Status and not the Spec? See https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/client#StatusClient

So how does this clear WorkersToDelete in the CR?

Separately (strictly speaking) the operator should not be changing the Spec, it should only update the Status given the declarative nature of the k8s programming style. But in this case I would say it is OK given that ScaleStrategy is not really declarative.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also would like to discuss the sync logic.

Copy link
Contributor Author

@Waynegates Waynegates Apr 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, this

for index := range instance.Spec.WorkerGroupSpecs {
        worker = &instance.Spec.WorkerGroupSpecs[index]

has a conflict with diff < 0 (line 310) branch code.

@sriram-anyscale
Copy link
Collaborator

After thinking about this a bit I feel we should not be clearing WorkersToDelete. The abstraction we should follow is:

  • The requestor (autoscaler in this case) looks at the real world situation and decides what the goal state should be and updates the Spec part of the CR to match this goal state. This should be declarative but in our case we have some procedural components also - in WorkersToDelete. We are stating how to get to the goal state also. This is somewhat imperfect, but ok for now.
  • The operator should look at the real world situation and the goal state as described in the Spec and reconcile to make the real world be more aligned towards the goal state. It then updates the Status part of the CR to say what it did.

Some rules to follow:

  • The requestor should not depend on the current state of the Spec in what it does (then it introduces race conditions with the operator). Instead it should work in an idempotent manner. I.e., if the requestor loop runs many times without the operator running, it should not update the Spec to new values.
  • The operator should read from the Spec, but never update the Spec. The operator should not be specifying what the goal state should be, it just tries to get the real world to move towards the goal state. It should only write to the Status.
  • The only entities that read from the Status are dashboards, monitoring tools, etc. No action taken by the requestor or operator should depend on what is in the Status.

Having said all this, I argue that the operator should not update WorkersToDelete. But we should make sure that repeated iterations of the reconciler do not become less efficient as a consequence. On reading the we can make one small change to address this:

There are 4 places in the code that iterator over the pods in WorkersToDelete - that look like this:

		for _, podsToDelete := range worker.ScaleStrategy.WorkersToDelete {

We should simply add the following at the beginning of each of these loops:

   <if podsToDelete is not in runningPods> break

We can do this most efficiently perhaps by changing runningPods to be a list of pod names instead of a list of pods.

@pcmoritz
Copy link
Collaborator

pcmoritz commented Apr 5, 2022

Thanks for writing down your thoughts -- not having the operator update the WorkersToDelete makes sense to me and with your optimization that should be fine. We can implement this first and test it and then if it works well, we can hopefully move to a more declarative way to specify the goal state. For example if we replace the workersToDelete with a target list of pods that should do the trick :)

@sriram-anyscale sriram-anyscale merged commit 57b4f87 into ray-project:master Apr 5, 2022
@Waynegates Waynegates deleted the brucez/reconcileUnitTest branch April 6, 2022 05:58
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…project#219)

* Write raycluster_controller reconcilePods unit tests

* Improve unit tests code

* Fix lint issue and Improve unit tests code

* Run goimports

* Fix a bug of workersToDelete update and update unit tests

* Fix a bug of workersToDelete update and update unit tests

* Update unit tests log

* Update workersToDelete local var to avoid unuseful kube api server delete call

* Remove code of updating instance spec workersToDelete

Co-authored-by: Taikun Liu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants