-
Notifications
You must be signed in to change notification settings - Fork 300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update wl/job handling for the case, when the job is finished #1383
Update wl/job handling for the case, when the job is finished #1383
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
@achernevskii Why do we need this change? I couldn't see what is bug. |
It's not a bug. The behavior has changed since After this discussion I've decided to go back to the original reconciler logic: |
if err := r.stopJob(ctx, job, w, StopReasonNoMatchingWorkload, "No matching Workload"); err != nil { | ||
return nil, fmt.Errorf("stopping job with no matching workload: %w", err) | ||
} | ||
if err := r.stopJob(ctx, job, w, StopReasonNoMatchingWorkload, "No matching Workload"); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should only call stop if it's not already finished. Otherwise it's just a waste of API calls.
Is there a case where we need this for pod groups? It doesn't look like it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hypothetically, a job could be finished but unsuspended. I don't think it's possible for any of the existing integrations, but it could be possible from the GeneticJob
interface perspective.
If we're trying to minimize the number of API calls in this case, shouldn't we call IsSuspended
here instead of Finished
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IsSuspended is just about the suspend field. It would still return false for the job if it's finished but the suspend field is false. That's why we were still checking for IsFinished as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So Finished==true
will always mean that no pods related to the job are active?
For the current reconciler, workload will be recreated. So if a job is finished, but not suspended, and new workload is not admitted, the job will be stopped:
https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/jobframework/reconciler.go#L367
With this implementation we have to be sure that the job is stopped when we find out that workload is not found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So Finished==true will always mean that no pods related to the job are active?
IIUC, Finished==true
means that the job has succeeded or failed regardless of being suspended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So
Finished==true
will always mean that no pods related to the job are active?
It should, unless the job implemented IsFinished wrong.
That makes sense. Thanks for the clarification. /remove-kind-bug |
/remove-kind bug |
if err != nil { | ||
log.Error(err, "Updating workload status") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 this might have escaped previous reviews.
If the Workload update fails, we should return the error, so it can be reconciled again (except for NotFound errors).
Another thing that escaped:
Finalizing the job should happen after we update the Workload. Otherwise, the job and the Workload (which is a dependent object) could potentially be deleted before it was transitioned to Finished. This could be fine, but it's better to give complete information before removing the Workload, in case a user is watching for the changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the job finalization/wl status update order: #1177
pkg/controller/jobs/kubeflow/jobs/mxjob/mxjob_controller_test.go
Outdated
Show resolved
Hide resolved
@@ -1408,6 +1408,47 @@ func TestReconciler(t *testing.T) { | |||
}, | |||
workloadCmpOpts: defaultWorkloadCmpOpts, | |||
}, | |||
"if a pod group is finished and the wl is deleted, new workload shouldn't be created": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add another one (if it isn't there)?
If a pod group is not finished (for example, it has a non-terminated gated Pod), a workload should be recreated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're indirectly checking for this scenario here:
https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/jobs/pod/pod_controller_test.go#L1280
But I've added a test with gated pod anyway.
8b89668
to
b11d1dc
Compare
* Workload is not being recreated, when the job is finished. * If related workload is not found, the job will be stopped only if it's not finished.
b11d1dc
to
9683b08
Compare
if wl != nil && !apimeta.IsStatusConditionTrue(wl.Status.Conditions, kueue.WorkloadFinished) { | ||
err := workload.UpdateStatus(ctx, r.client, wl, condition.Type, condition.Status, condition.Reason, condition.Message, constants.JobControllerName) | ||
if err != nil && !apierrors.IsNotFound(err) { | ||
log.Error(err, "Updating workload status") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can remove this line, as returned errors are logged as well.
5a62b5c
to
9eb11c5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: achernevskii, alculquicondor The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
LGTM label has been added. Git tree hash: e01b5e0a14ad32dc03cc44e76754e9a23505bec4
|
/kind bug |
/remove-kind cleanup |
/release-note-edit
|
…etes-sigs#1383) * Update wl/job handling for the case, when the job is finished * Workload is not being recreated, when the job is finished. * If related workload is not found, the job will be stopped only if it's not finished. * Add job finalization if wl has a finished condition
What type of PR is this?
/kind bug
What this PR does / why we need it:
Workload is not being recreated, when the job is finished.
Job will be stopped if it's not suspended and matching workload is not found. This action doesn't depend on the job.Finished condition anymore.
Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?