WaitForPodsReady: Reset .status.requeueState.count once the workload is deactivated #2219
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: tenzen-y. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
if tc.wantWorkload != nil {
	if requeueState := tc.wantWorkload.Status.RequeueState; requeueState != nil && requeueState.RequeueAt != nil {
		gotRequeueState := gotWorkload.Status.RequeueState
		if gotRequeueState != nil && gotRequeueState.RequeueAt != nil {
			if !gotRequeueState.RequeueAt.Equal(requeueState.RequeueAt) {
				t.Errorf("Unexpected requeueState.requeueAt; gotRequeueAt %v needs to be after requeueAt %v", requeueState.RequeueAt, gotRequeueState.RequeueAt)
			}
		} else {
			t.Errorf("Unexpected nil requeueState.requeuAt; requeueState.requeueAt shouldn't be nil")
		}
	}
}
This verification has been invalid since we introduced the fake clock.
cc @mimowo
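As a side note, a minimal sketch of why an exact comparison is enough once a fake clock drives the reconciler, assuming k8s.io/utils/clock/testing is the fake clock in use (the helper and test names are illustrative):

package workload_test

import (
	"testing"
	"time"

	"k8s.io/utils/clock"
	clocktesting "k8s.io/utils/clock/testing"
)

// nextRequeueAt mimics computing .status.requeueState.requeueAt from "now".
func nextRequeueAt(c clock.Clock, backoff time.Duration) time.Time {
	return c.Now().Add(backoff)
}

func TestRequeueAtIsDeterministicWithFakeClock(t *testing.T) {
	// A fake clock never advances on its own, so the expected RequeueAt can be
	// computed up front and compared for equality instead of only checking
	// that the observed value lies after some lower bound.
	fakeClock := clocktesting.NewFakeClock(time.Now())
	backoff := 10 * time.Second

	want := fakeClock.Now().Add(backoff)
	got := nextRequeueAt(fakeClock, backoff)

	if !got.Equal(want) {
		t.Errorf("unexpected RequeueAt: got %v, want %v", got, want)
	}
}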
if wl.Status.Admission != nil {
	metrics.ReportEvictedWorkloads(string(wl.Status.Admission.ClusterQueue), kueue.WorkloadEvictedByDeactivation)
if !apimeta.IsStatusConditionTrue(wl.Status.Conditions, kueue.WorkloadEvicted) {
	workload.SetEvictedCondition(&wl, kueue.WorkloadEvictedByDeactivation, "The workload is deactivated")
What I don't like about this is that we are losing context as to why the workload was deactivated.
In addition to adding the evicted condition here, could we add it inside reconcileNotReadyTimeout, so that, even if the Kueue controller restarts or fails to send the status update, we keep that information in the Evicted reason and message?
Or maybe we can just deduce that the deactivation is because of the backoffLimit by looking at the requeueState.
IIUC, we have the status.requeueState in this function, just before we clean it just above. Can we use it in the message, for example to say "The workload is deactivated due to 5 failed attempts to run all pods within the PodsReady timeout"? We probably don't want to set the "WorkloadInactive" reason prior to setting active=false.
I think long-term the cleanest would be if active was in status; then we could do the following in one request:
active: false
conditions:
  Evicted: true
    Reason: PodsReadyTimeoutExceeded
    Message: The PodsReady timeout of 5min is exceeded
  Deactivated: true
    Reason: PodsReadyBackoffExceeded
    Message: Deactivated the workload after 5 attempts due to exceeding the PodsReady timeout.
What I don't like about this is that we are losing context as to why the workload was deactivated.
In addition to adding the evicted condition here, could we add it inside reconcileNotReadyTimeout so that, even if the Kueue controller restarts or fails to send the status update, we keep that information in the Evicted reason and message?
Or maybe we can just deduce that the deactivation is because of the backoffLimit by looking at the requeueState.
I think it would be better to deduce the reason here to avoid 2 API calls (spec and status) in reconcileNotReadyTimeout. I think we can say the workload is deactivated by exceeding the limit when .spec.active=false AND .status.requeueState.requeueAt == nil AND .status.requeueState.count != nil.
@alculquicondor WDYT?
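A rough sketch of that deduction, using the Workload API fields referenced in this thread (illustrative only, not the code added by this PR):

package example

import (
	"k8s.io/utils/ptr"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// deactivatedByBackoffLimit reports whether a workload looks like it was
// deactivated for exceeding the requeuing backoff limit: it is inactive,
// still carries a requeue count, and has no pending requeueAt timestamp.
func deactivatedByBackoffLimit(wl *kueue.Workload) bool {
	return !ptr.Deref(wl.Spec.Active, true) &&
		wl.Status.RequeueState != nil &&
		wl.Status.RequeueState.RequeueAt == nil &&
		wl.Status.RequeueState.Count != nil
}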
IIUC, we have the status.requeueState in this function, just before we clean it just above, can we use it in the message, for example to say "The workload is deactivated due to 5 failed attempts to run all pods within the PodsReady timeout"? We probably don't want to set "WorkloadInactive" reason prior to setting active=false.
@mimowo Actually, we had a similar discussion here: #2175
So, could we work on enhancing the reason as a separate feature, so that this PR can focus on fixing the bug?
SGTM. I suspect you want to have #2175 in 0.7?
So, could we work on enhancing the reason as a separate feature, so that this PR can focus on fixing the bug?
sgtm, I was actually suggesting that the current place is ok, since we have the data at hand if we want to build the message. I would be fine with building the message like "The workload is deactivated due to exceeding the PodsReady timeout 5 times in a row". I'm also ok to leave it for #2175.
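A small sketch of building such a message from the requeue count that is still available at that point (the function name and wording are illustrative):

package example

import (
	"fmt"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// deactivationMessage folds the requeue count, read just before the
// requeueState is cleared, into the eviction message.
func deactivationMessage(wl *kueue.Workload) string {
	msg := "The workload is deactivated"
	if rs := wl.Status.RequeueState; rs != nil && rs.Count != nil {
		msg = fmt.Sprintf("%s due to %d failed attempts to run all pods within the PodsReady timeout", msg, *rs.Count)
	}
	return msg
}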
ginkgo.By("evicted re-admitted workload should have 2 in the re-queue count") | ||
util.ExpectWorkloadToHaveRequeueCount(ctx, k8sClient, client.ObjectKeyFromObject(prodWl), ptr.To[int32](2)) | ||
time.Sleep(podsReadyTimeout) |
is this a real sleep?
can it be made a few milliseconds just for this test?
ping @tenzen-y
@alculquicondor Thank you for the reminder. Does it mean that decreasing the timeout would be better?
podsReadyTimeout = 3 * time.Second
yes. 3s is too long
Let me try it.
yeah, I'm even wondering if we could reduce 3s for other places, maybe we should review that
LGTM overall
I found that the current podsready integration test is not correct. So, I will fix the test in another PR first.
((!workload.HasRequeueState(&wl) && ptr.Equal(r.waitForPodsReady.requeuingBackoffLimitCount, ptr.To[int32](0))) ||
	(workload.HasRequeueState(&wl) && wl.Status.RequeueState.RequeueAt == nil &&
		ptr.Equal(wl.Status.RequeueState.Count, r.waitForPodsReady.requeuingBackoffLimitCount))) {
	message = fmt.Sprintf("%s by exceeded the maximum number of re-queuing retries", message)
@mimowo @alculquicondor I added the dedicated message like this based on your recommendation. WDYT?
Also, let me try to add a dedicated reason in another enhancement PR.
I think the dedicated message is useful
I found that the test issues depend on resetting the whole of the requeueState.
@@ -193,7 +193,7 @@ var _ = ginkgo.Describe("SchedulerWithWaitForPodsReady", func() {

var _ = ginkgo.Context("Short PodsReady timeout", func() {
	ginkgo.BeforeEach(func() {
-		podsReadyTimeout = 3 * time.Second
+		podsReadyTimeout = 250 * time.Millisecond
I decreased the timeout to 250 ms in all short podsready timeout cases.
WDYT? @mimowo @alculquicondor
nice win! This will speed up the tests. Can we even go below that, say 10ms?
I tried 10ms, but the test case below failed:
kueue/test/integration/scheduler/podsready/scheduler_test.go
Lines 261 to 263 in ff4ca7f
util.ExpectWorkloadToHaveRequeueState(ctx, k8sClient, client.ObjectKeyFromObject(prodWl), &kueue.RequeueState{
	Count: ptr.To[int32](1),
}, false)
[FAILED] Timed out after 25.000s.
The function passed to Eventually failed at /Users/tenzen/go/src/sigs.k8s.io/kueue/test/util/util.go:354 with:
Expected object to be comparable, diff: &v1beta1.RequeueState{
- Count: &2,
+ Count: &1,
... // 1 ignored field
}
In [It] at: /Users/tenzen/go/src/sigs.k8s.io/kueue/test/integration/scheduler/podsready/scheduler_test.go:261 @ 05/27/24 16:59:35.799
I guess the workload controller could not observe the state transition.
Hm, maybe this is because the scheduler admits the workload before we can set active: false. This might be the reason the time was set to 3s, to make sure such race conditions don't happen. Probably reducing it to 1s might be safer.
Also, for the new tests, maybe we could consider moving them here: https://github.com/kubernetes-sigs/kueue/blob/ff4ca7f566e4a12e13ffc3db62cbe896e7993685/test/integration/controller/jobs/job/job_controller_test.go#L1743C39-L1743C92. WDYT?
Hm, maybe this is because the scheduler admits the workload before we can set active: false.
Yeah, I agree with you.
This might be the reason the time was set to 3s to make sure such race conditions don't happen.
This guess doesn't seem to be correct since you added this case when you implemented the podsready feature here: 9d9336b
When we added the podsready feature, we didn't have the deactivation feature.
Probably reducing to 1s might be safer.
SGTM
Also, for the new tests, maybe we could consider moving them here: https://github.com/kubernetes-sigs/kueue/blob/ff4ca7f566e4a12e13ffc3db62cbe896e7993685/test/integration/controller/jobs/job/job_controller_test.go#L1743C39-L1743C92.
I think this case is still valid to keep here (scheduler+controller), since it is better to keep verifying that no unexpected race conditions happen between the scheduler and the controllers.
This guess doesn't seem to be correct since you added this case when you implemented the podsready feature here: 9d9336b
makes sense, I don't remember why it was 3s; I'm ok to reduce it to as low a value as possible, provided the tests continue to pass.
I think this case is still valid to keep here (scheduler+controller), since it is better to keep verifying that no unexpected race conditions happen between the scheduler and the controllers.
Makes sense; on the other hand, it exposes us to the race conditions. If you still think the tests are valuable with the scheduler, I'm ok to keep them here. Maybe we can consider two types of short timeouts: 10ms for tests which are not vulnerable to the race conditions, and a longer 250ms or 1s for tests which are vulnerable?
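A sketch of how the two tiers could look as test constants (the names and the package are hypothetical):

package podsready

import "time"

const (
	// For tests that only exercise the workload controller and are not
	// vulnerable to scheduler races, a very short timeout keeps them fast.
	podsReadyTimeoutFast = 10 * time.Millisecond
	// For tests that involve the scheduler, a longer timeout avoids the
	// workload being admitted or evicted before the test can react,
	// e.g. before it can set spec.active=false.
	podsReadyTimeoutRaceSafe = time.Second
)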
Makes sense; on the other hand, it exposes us to the race conditions. If you still think the tests are valuable with the scheduler, I'm ok to keep them here. Maybe we can consider two types of short timeouts: 10ms for tests which are not vulnerable to the race conditions, and a longer 250ms or 1s for tests which are vulnerable?
That sounds reasonable, but we would need to restructure the podsready tests. So, could we just increase 250ms to 1s in this PR, and then try to restructure the podsready integration tests in a follow-up PR?
@mimowo WDYT?
That sounds reasonable, but we would need to restructure the podsready tests. So, could we just increase 250ms to 1s in this PR, and then try to restructure the podsready integration tests in a follow-up PR?
sgtm, thanks
Done.
Also I opened an issue: #2285
LGTM overall, just nits
((!workload.HasRequeueState(&wl) && ptr.Equal(r.waitForPodsReady.requeuingBackoffLimitCount, ptr.To[int32](0))) ||
	(workload.HasRequeueState(&wl) && wl.Status.RequeueState.RequeueAt == nil &&
		ptr.Equal(wl.Status.RequeueState.Count, r.waitForPodsReady.requeuingBackoffLimitCount))) {
	message = fmt.Sprintf("%s by exceeded the maximum number of re-queuing retries", message)
-	message = fmt.Sprintf("%s by exceeded the maximum number of re-queuing retries", message)
+	message = fmt.Sprintf("%s due to exceeding the maximum number of re-queuing retries", message)
nit, I think it will read better
SGTM, thank you for this recommendation!
Done.
@@ -1892,6 +1892,7 @@ var _ = ginkgo.Describe("Job controller interacting with scheduler", ginkgo.Orde
var _ = ginkgo.Describe("Job controller interacting with Workload controller when waitForPodsReady with requeuing strategy is enabled", ginkgo.Ordered, ginkgo.ContinueOnFailure, func() {
	var (
		backoffBaseSeconds int32
		backLimitCount *int32
-		backLimitCount *int32
+		backoffLimitCount *int32
nit
Done.
ginkgo.By("checking the workload is evicted by deactivated due to PodsReadyTimeout") | ||
gomega.Eventually(func(g gomega.Gomega) { | ||
g.Expect(k8sClient.Get(ctx, wlKey, wl)).Should(gomega.Succeed()) | ||
g.Expect(wl.Status.RequeueState).Should(gomega.BeNil()) |
nit: let's also assert spec.active: false.
Oh, good point! Thanks!
Done.
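For illustration, the assertion discussed above could extend the quoted Eventually block roughly like this (a sketch reusing the surrounding test variables and helpers, not the exact final code):

gomega.Eventually(func(g gomega.Gomega) {
	g.Expect(k8sClient.Get(ctx, wlKey, wl)).Should(gomega.Succeed())
	g.Expect(wl.Status.RequeueState).Should(gomega.BeNil())
	// Also make sure the workload was actually deactivated.
	g.Expect(ptr.Deref(wl.Spec.Active, true)).Should(gomega.BeFalse())
}, util.Timeout, util.Interval).Should(gomega.Succeed())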
ginkgo.By("checking the 'prod' workload is admitted") | ||
util.ExpectWorkloadsToHaveQuotaReservation(ctx, k8sClient, prodClusterQ.Name, prodWl) | ||
util.ExpectQuotaReservedWorkloadsTotalMetric(prodClusterQ, 1) | ||
util.ExpectAdmittedWorkloadsTotalMetric(prodClusterQ, 1) | ||
ginkgo.By("exceed the timeout for the 'prod' workload") | ||
time.Sleep(podsReadyTimeout) | ||
util.AwaitWorkloadEvictionByPodsReadyTimeoutAndSetRequeuedCondition(ctx, k8sClient, client.ObjectKeyFromObject(prodWl), podsReadyTimeout) |
No strong view here, but I prefer not to mix too many responsibilities into a single function, and the name looks long. Maybe we can have two functions: AwaitWorkloadEvictionByPodsReadyTimeout and SetRequeuedCondition?
That sounds reasonable. Let me split this helper into two.
Thanks!
Done.
if err := workload.ApplyAdmissionStatus(ctx, r.client, &wl, true); err != nil {
	return ctrl.Result{}, fmt.Errorf("setting eviction: %w", err)
} else {
	var updated, evicted bool
Maybe it is better to declare evicted as *string, as it is used to bump the metric. I worry that if we have a bool, then someone can forget to update line 219 when adding a new scenario for deactivation. WDYT?
@mimowo Uhm, do you assume that evicted holds the reason for the eviction (here, InactiveWorkload)?
Oh, never mind, I was thinking of another scenario where the Pods are ready, but in that case it would still be WorkloadInactive.
Yeah, after we introduce a dedicated reason, we may refactor this somehow, though.
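For reference, a minimal sketch of the *string idea floated above (variable names are illustrative; the metric and helper calls are the ones already visible in this diff, so this is not the final code):

// With a *string reason, bumping the metric and recording the eviction share
// one signal, so a new deactivation scenario cannot silently skip the metric.
var evictionReason *string
if !ptr.Deref(wl.Spec.Active, true) {
	evictionReason = ptr.To(string(kueue.WorkloadEvictedByDeactivation))
}
if evictionReason != nil && wl.Status.Admission != nil {
	metrics.ReportEvictedWorkloads(string(wl.Status.Admission.ClusterQueue), *evictionReason)
}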
ginkgo.By("the reactivated workload should not be deactivated by the scheduler unless exceeding the backoffLimitCount") | ||
gomega.Eventually(func(g gomega.Gomega) { | ||
g.Expect(k8sClient.Get(ctx, client.ObjectKeyFromObject(prodWl), prodWl)).Should(gomega.Succeed()) | ||
prodWl.Spec.Active = ptr.To(true) | ||
g.Expect(k8sClient.Update(ctx, prodWl)).Should(gomega.Succeed()) | ||
}, util.Timeout, util.Interval).Should(gomega.Succeed(), "Reactivate inactive Workload") | ||
gomega.Eventually(func(g gomega.Gomega) { | ||
g.Expect(k8sClient.Get(ctx, client.ObjectKeyFromObject(prodWl), prodWl)).Should(gomega.Succeed()) | ||
g.Expect(prodWl.Status.RequeueState).Should(gomega.BeNil()) | ||
}, util.Timeout, util.Interval).Should(gomega.Succeed()) | ||
util.FinishEvictionForWorkloads(ctx, k8sClient, prodWl) | ||
util.ExpectWorkloadsToHaveQuotaReservation(ctx, k8sClient, prodClusterQ.Name, prodWl) | ||
util.ExpectQuotaReservedWorkloadsTotalMetric(prodClusterQ, 4) | ||
util.ExpectAdmittedWorkloadsTotalMetric(prodClusterQ, 4) | ||
util.AwaitWorkloadEvictionByPodsReadyTimeout(ctx, k8sClient, client.ObjectKeyFromObject(prodWl), podsReadyTimeout) | ||
util.SetRequeuedConditionWithPodsReadyTimeout(ctx, k8sClient, client.ObjectKeyFromObject(prodWl)) | ||
util.FinishEvictionForWorkloads(ctx, k8sClient, prodWl) | ||
util.ExpectWorkloadToHaveRequeueState(ctx, k8sClient, client.ObjectKeyFromObject(prodWl), &kueue.RequeueState{ | ||
Count: ptr.To[int32](1), | ||
}, false) |
No strong view, but it feels there is quite a potential for race conditions and flakes here. Could we just end the test on the previous check, that the RequeueState is nil and the metric is incremented?
No strong view, but it feels there is quite a potential for race conditions and flakes here.
I understand your concerns, but I think we need a reactivation test to catch bugs like #2174.
So, let's replace lines 306-308 with a verification of spec.active=true.
WDYT?
I understand your concerns, but I think we need a reactivation test to catch bugs like #2174.
That is true, but since we now check that requeueState is cleaned, I'm not sure how a similar bug could happen; maybe if this was stored in-memory, but I suppose the risk is minimal.
So, let's replace lines 306-308 with a verification of spec.active=true.
Testing for spec.active=true would be better for sure.
But ExpectWorkloadsToHaveQuotaReservation is also time-sensitive and has a potential for raciness, because the workload gets evicted soon after AwaitWorkloadEvictionByPodsReadyTimeout. Maybe just check that the metrics were bumped - this approach would not be time-sensitive.
@mimowo I think there is no such race condition, since this integration test doesn't run the jobframework reconciler, which means the Job is never readmitted unless we call FinishEvictionForWorkloads, right?
Ah, got it. So if the PodsReady timeout happens quickly, you would still pass the ExpectWorkloadsToHaveQuotaReservation (L300) check, because the reservation is not removed until you call FinishEvictionForWorkloads. This convinces me there is indeed no race condition.
Yes, exactly. I agree that a comment would be helpful.
}, util.Timeout, util.Interval).Should(gomega.Succeed())
util.FinishEvictionForWorkloads(ctx, k8sClient, prodWl)
util.ExpectWorkloadsToHaveQuotaReservation(ctx, k8sClient, prodClusterQ.Name, prodWl)
I would suggest clarifying the flow in this test with a comment along the lines of:
// We await for re-admission. Then, the workload keeps the QuotaReserved
// condition even after timeout until FinishEvictionForWorkloads is called.
Thank you for the suggestion!
Done.
}, util.Timeout, util.Interval).Should(gomega.Succeed())
// We await for re-admission. Then, the workload keeps the QuotaReserved condition
// even after timeout until FinishEvictionForWorkloads is called in line 307.
-	// even after timeout until FinishEvictionForWorkloads is called in line 307.
+	// even after timeout until FinishEvictionForWorkloads is called below.
nit
Done.
/lgtm
LGTM label has been added. Git tree hash: 2e93d6ec4b714b30d4180273d66424659345fd0b
/release-note-edit
…is deactivated (kubernetes-sigs#2219)
* WaitForPodsReady: Reset .status.requeueState.count once the workload is deactivated
* Add a test case if the Workload is deactivated in the Job integration test
* Improve the message readability when the workload is deactivated due to exceeding the backoffLimitCount
* Split AwaitWorkloadEvictionByPodsReadyTimeoutAndSetRequeuedCondition into AwaitWorkloadEvictionByPodsReadyTimeout and SetRequeuedConditionWithPodsReadyTimeout
* Increase the PodsReadyTimeout from 250ms to 1s
* Add a comment to clarify the test flow
Signed-off-by: Yuki Iwai <[email protected]>
What type of PR is this?
/kind bug
What this PR does / why we need it:
If the Workload is deactivated, the requeueState is reset to avoid a race condition between the kueue-scheduler and the workload controller for deactivated workloads.
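A simplified sketch of that reset (condensed for illustration; the helper calls are the ones that appear in the diff above, while the surrounding structure is not the exact final code):

// If the workload was deactivated, clear the accumulated requeue state so a
// later reactivation starts with a fresh backoff counter, and record the
// eviction if it has not been recorded yet.
if !ptr.Deref(wl.Spec.Active, true) {
	wl.Status.RequeueState = nil
	if !apimeta.IsStatusConditionTrue(wl.Status.Conditions, kueue.WorkloadEvicted) {
		// The PR additionally appends "... due to exceeding the maximum number
		// of re-queuing retries" when the backoff limit was exhausted.
		workload.SetEvictedCondition(&wl, kueue.WorkloadEvictedByDeactivation, "The workload is deactivated")
	}
}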
When the workload is deactivated by exceeding the backoffLimitCount, the message "The workload is deactivated due to exceeding the maximum number of re-queuing retries" is added to the workload. I will introduce a dedicated reason in another PR to keep this one focused on fixing the bug.
Which issue(s) this PR fixes:
Fixes #2174
Special notes for your reviewer:
The cherry-pick was manually created here: #2220, because there are significant changes in the re-queuing mechanism between v0.6 and the main branch.
Does this PR introduce a user-facing change?