Fix live-updates for kubernetes Job getting stuck occasionally #3685
Conversation
```diff
@@ -49,6 +50,12 @@ func (b *multikueueAdapter) SyncJob(ctx context.Context, localClient client.Clie
 		return err
 	}
 
+	log := ctrl.LoggerFrom(ctx)
+	if ptr.Deref(localJob.Spec.Suspend, false) { // Ensure the job is resumed before updating its status; otherwise, it will fail when patching the spec.
```
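For context, here is a minimal standalone sketch of what the guard is checking. The hunk above shows only the condition, so the helper name `shouldSkipStatusSync` and the early-skip framing are assumptions, not the merged code:

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/utils/ptr"
)

// shouldSkipStatusSync mirrors the condition in the hunk above: while the
// local job is still suspended, the adapter must not copy the remote status
// onto it yet. The helper name is hypothetical; the PR adds the check inline.
func shouldSkipStatusSync(localJob *batchv1.Job) bool {
	return ptr.Deref(localJob.Spec.Suspend, false)
}

func main() {
	suspended := &batchv1.Job{Spec: batchv1.JobSpec{Suspend: ptr.To(true)}}
	resumed := &batchv1.Job{Spec: batchv1.JobSpec{Suspend: ptr.To(false)}}

	fmt.Println(shouldSkipStatusSync(suspended)) // true: wait for the reconciler to unsuspend first
	fmt.Println(shouldSkipStatusSync(resumed))   // false: safe to copy the remote status
}
```

The idea is simply that the adapter defers copying the remote status until the local reconciler has flipped `spec.suspend` to false.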
This changes the implementation - is this actually a bug? If so, we should add a release note describing the scenario that is fixed
The problem here is that the remote job was unsuspended before the local job. That's why, after the status update in

kueue/pkg/controller/jobs/job/job_multikueue_adapter.go, lines 67 to 72 in c3abf9c:

```go
if features.Enabled(features.MultiKueueBatchJobWithManagedBy) {
	return clientutil.PatchStatus(ctx, localClient, &localJob, func() (bool, error) {
		localJob.Status = remoteJob.Status
		return true, nil
	})
}
```

the unsuspend in the local reconciler fails here:

kueue/pkg/controller/jobframework/reconciler.go, lines 478 to 480 in c3abf9c:

```go
err := r.startJob(ctx, job, object, wl)
if err != nil {
	log.Error(err, "Unsuspending job")
```
Could you explain a bit more which update to the spec is failing?
Ah, I think I know what you mean: k8s core blocks updates to the pod template if the job is running, and it determines that the job is running because status.startTime is not nil.
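To make that failure mode concrete, here is a small hedged illustration. The helper name and the sequencing comments are mine; only the startTime rule comes from the comment above:

```go
package main

import (
	"fmt"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// consideredStarted captures, as an assumption, the rule described above:
// the apiserver treats a Job as started once status.startTime is non-nil,
// and from then on rejects updates to its pod template.
func consideredStarted(j *batchv1.Job) bool {
	return j.Status.StartTime != nil
}

func main() {
	// Local job: still suspended, never started.
	local := &batchv1.Job{Spec: batchv1.JobSpec{Suspend: ptr.To(true)}}
	fmt.Println(consideredStarted(local)) // false: pod-template updates still allowed

	// Buggy ordering: the remote job's status (with startTime already set,
	// because the remote job was unsuspended first) is copied onto the local job.
	local.Status = batchv1.JobStatus{StartTime: &metav1.Time{Time: time.Now()}}

	// Now, when the local reconciler's startJob tries to patch the spec
	// before unsuspending, the apiserver would reject the update.
	fmt.Println(consideredStarted(local)) // true: the job already looks started
}
```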
Please describe the scenario in which the test fails. Could this impact end users of MultiKueue?
/assign
/cc @mszadkow
```diff
@@ -49,6 +50,12 @@ func (b *multikueueAdapter) SyncJob(ctx context.Context, localClient client.Clie
 		return err
 	}
 
+	log := ctrl.LoggerFrom(ctx)
+	if ptr.Deref(localJob.Spec.Suspend, false) { // Ensure the job is resumed before updating its status; otherwise, it will fail when patching the spec.
```
It would be great to add a test case to verify skipping.
more tests! agree
This could probably be covered in an integration test, right?
This bug seems hard to reproduce in a test case, as it requires causing a failure in the local job's reconciler while syncing it with the remote one.
However, the bug was originally caught in the MultiKueue e2e tests, and after running them several times with this fix, the error didn't occur anymore.
> This bug seems hard to reproduce in a test case, as it requires causing a failure in the local job's reconciler while syncing it with the remote one.

I see; in that case I don't have a better test idea, maybe a unit test.

> However, the bug was originally caught in the MultiKueue e2e tests, and after running them several times with this fix, the error didn't occur anymore.

Yes, this is covered, just flaky, so I'm ok to accept as is. Feel free to follow up with unit tests or extra tests when working on #3730.
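For that follow-up, a possible shape for such a unit test. This sketch exercises only the guard condition from the diff (the test name, table layout, and expectations are mine), not the full SyncJob flow with fake clients:

```go
package job

import (
	"testing"

	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/utils/ptr"
)

// TestSkipStatusSyncWhileSuspended is a sketch: it verifies the suspended/
// unsuspended distinction that drives the new check in SyncJob.
func TestSkipStatusSyncWhileSuspended(t *testing.T) {
	cases := map[string]struct {
		suspend  *bool
		wantSkip bool
	}{
		"suspended local job skips the status sync": {suspend: ptr.To(true), wantSkip: true},
		"resumed local job syncs the status":        {suspend: ptr.To(false), wantSkip: false},
		"nil suspend is treated as not suspended":   {suspend: nil, wantSkip: false},
	}
	for name, tc := range cases {
		t.Run(name, func(t *testing.T) {
			localJob := batchv1.Job{Spec: batchv1.JobSpec{Suspend: tc.suspend}}
			if got := ptr.Deref(localJob.Spec.Suspend, false); got != tc.wantSkip {
				t.Errorf("skip = %v, want %v", got, tc.wantSkip)
			}
		})
	}
}
```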
IIUC this is not just a fix for flaky tests, but an important bugfix for an issue which could affect production systems.

@IrvingMg please update the description with the proposal:
Open #3730
/lgtm
LGTM label has been added. Git tree hash: cc29b808ec1b919c9c640a1820f6ad354dfee0d6
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: IrvingMg, mimowo

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/cherry-pick release-0.9
@mimowo: new pull request created: #3731

In response to this:

> /cherry-pick release-0.9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind bug
/kind flake
What this PR does / why we need it:
Adds a check when syncing jobs in MultiKueue to ensure the local job is resumed before its status is updated; otherwise, the subsequent spec patch fails.
Which issue(s) this PR fixes:
Fixes #3581
Special notes for your reviewer:
Does this PR introduce a user-facing change?