Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix live-updates for kubernetes Job getting stuck occasionally #3685

Conversation

IrvingMg
Copy link
Contributor

@IrvingMg IrvingMg commented Nov 28, 2024

What type of PR is this?

/kind bug
/kind flake

What this PR does / why we need it:

Adds a check when syncing jobs on multikueue to avoid trying to patch the spec before resuming the job.

Which issue(s) this PR fixes:

Fixes #3581

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix a bug which occasionally prevented updates to the PodTemplate of the Job on the management cluster
when starting a Job (e.g. updating nodeSelectors), when using `MultiKueueBatchJobWithManagedBy` enabled.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. labels Nov 28, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 28, 2024
Copy link

netlify bot commented Nov 28, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 05200ea
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/67501d61592cff000815e0cf

@IrvingMg IrvingMg changed the title Fix flaky e2e test - Creating a multikueue admission check Should run a job on worker if admitted Fix flaky - Creating a multikueue admission check Should run a job on worker if admitted Nov 28, 2024
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 2, 2024
@@ -49,6 +50,12 @@ func (b *multikueueAdapter) SyncJob(ctx context.Context, localClient client.Clie
return err
}

log := ctrl.LoggerFrom(ctx)
if ptr.Deref(localJob.Spec.Suspend, false) { // Ensure the job is resumed before updating its status; otherwise, it will fail when patching the spec.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the implementation - is this actually a bug? If so, we should add a release note describing the scenario that is fixed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here is that remote job was unsuspended before local job. That's why after update status

if features.Enabled(features.MultiKueueBatchJobWithManagedBy) {
return clientutil.PatchStatus(ctx, localClient, &localJob, func() (bool, error) {
localJob.Status = remoteJob.Status
return true, nil
})
}
we can't update spec on jobframework
err := r.startJob(ctx, job, object, wl)
if err != nil {
log.Error(err, "Unsuspending job")
.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you maybe explain a bit more what is the update that is failing trying to update in the spec?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I think I know what you mean, k8s core blocks update of pod template if the job is running. And it determines that the job is running because status.startTime Is not bil

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: here and here are the relevant places in the core k8s which reject the spec update if the status is already updated (status.startTime != nil).

@mimowo
Copy link
Contributor

mimowo commented Dec 2, 2024

Please describe what is the scenario when the test fails, could this impact end-to-end users of MultiKueue?

@IrvingMg IrvingMg marked this pull request as draft December 2, 2024 09:56
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 2, 2024
@mbobrovskyi
Copy link
Contributor

/assign

Copy link
Contributor

@mbobrovskyi mbobrovskyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @mszadkow

@@ -49,6 +50,12 @@ func (b *multikueueAdapter) SyncJob(ctx context.Context, localClient client.Clie
return err
}

log := ctrl.LoggerFrom(ctx)
if ptr.Deref(localJob.Spec.Suspend, false) { // Ensure the job is resumed before updating its status; otherwise, it will fail when patching the spec.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to add a test case to verify skipping.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more tests! agree

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be covered in integration test probably. right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bug seems hard to reproduce in a test case as it requires causing a failure in the reconciler of the local job while sync it with the remote one.

However, this bug was originally caught in multikueue e2e tests and after running them several times with this fix, the error didn't occur anymore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bug seems hard to reproduce in a test case as it requires causing a failure in the reconciler of the local job while sync it with the remote one.

I see, in that case I don't have a better test idea, maybe a unit test.

However, this bug was originally caught in multikueue e2e tests and after running them several times with this fix, the error didn't occur anymore.

Yes, this is covered, just flaky, so I'm ok to accept as is. Feel free to follow up with unit tests or extra tests when working on #3730

@k8s-ci-robot k8s-ci-robot requested a review from mszadkow December 3, 2024 08:40
@mimowo
Copy link
Contributor

mimowo commented Dec 3, 2024

Proposal:
/release-note-edit

Fix a bug which occasionally prevented updates to the PodTemplate of the Job on the management cluster
when starting a Job (e.g. updating nodeSelectors), when using `MultiKueueBatchJobWithManagedBy` enabled.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Dec 3, 2024
@mimowo
Copy link
Contributor

mimowo commented Dec 3, 2024

IIUC this is not just a fix for flaky tests, but an important bugfix for the issue which could affect production systems.
I agree it would be great to also apply the same approach for other Job CRDs - they may or may not have similar
validation, but we also care about consistency. For other Jobs I'm ok with a follow up, if this makes this PR land before 0.9.2.

@mbobrovskyi
Copy link
Contributor

@IrvingMg please update description with

Fixes #3581

@IrvingMg
Copy link
Contributor Author

IrvingMg commented Dec 4, 2024

IIUC this is not just a fix for flaky tests, but an important bugfix for the issue which could affect production systems. I agree it would be great to also apply the same approach for other Job CRDs - they may or may not have similar validation, but we also care about consistency. For other Jobs I'm ok with a follow up, if this makes this PR land before 0.9.2.

Open #3730

@IrvingMg IrvingMg marked this pull request as ready for review December 4, 2024 09:32
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 4, 2024
@mimowo
Copy link
Contributor

mimowo commented Dec 4, 2024

/lgtm
/approve
/retitle Fix live-updates for kubernetes Job getting stuck occasionally

@k8s-ci-robot k8s-ci-robot changed the title Fix flaky - Creating a multikueue admission check Should run a job on worker if admitted Fix live-updates for kubernetes Job getting stuck occasionally Dec 4, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 4, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: cc29b808ec1b919c9c640a1820f6ad354dfee0d6

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: IrvingMg, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 4, 2024
@k8s-ci-robot k8s-ci-robot merged commit 8cc3157 into kubernetes-sigs:main Dec 4, 2024
16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.10 milestone Dec 4, 2024
@mbobrovskyi mbobrovskyi deleted the fix/flaky-creating-multikueue-admission-check branch December 4, 2024 10:19
@mimowo
Copy link
Contributor

mimowo commented Dec 4, 2024

/cherry-pick release-0.9

@k8s-infra-cherrypick-robot
Copy link
Contributor

@mimowo: new pull request created: #3731

In response to this:

/cherry-pick release-0.9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
6 participants