WaitForPodsReady: Reset .status.requeueState.count once the workload is deactivated #2219
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: tenzen-y. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
if tc.wantWorkload != nil {
	if requeueState := tc.wantWorkload.Status.RequeueState; requeueState != nil && requeueState.RequeueAt != nil {
		gotRequeueState := gotWorkload.Status.RequeueState
		if gotRequeueState != nil && gotRequeueState.RequeueAt != nil {
			if !gotRequeueState.RequeueAt.Equal(requeueState.RequeueAt) {
				t.Errorf("Unexpected requeueState.requeueAt; gotRequeueAt %v needs to be after requeueAt %v", requeueState.RequeueAt, gotRequeueState.RequeueAt)
			}
		} else {
			t.Errorf("Unexpected nil requeueState.requeuAt; requeueState.requeueAt shouldn't be nil")
		}
	}
}
This verification has been invalid since we introduced the fake clock.
cc @mimowo
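As a side note, a minimal sketch of why an exact comparison is enough once a fake clock drives the reconciler, assuming k8s.io/utils/clock/testing is the fake clock in use (the helper and test names are illustrative):

package workload_test

import (
	"testing"
	"time"

	"k8s.io/utils/clock"
	clocktesting "k8s.io/utils/clock/testing"
)

// nextRequeueAt mimics computing .status.requeueState.requeueAt from "now".
func nextRequeueAt(c clock.Clock, backoff time.Duration) time.Time {
	return c.Now().Add(backoff)
}

func TestRequeueAtIsDeterministicWithFakeClock(t *testing.T) {
	// A fake clock never advances on its own, so the expected RequeueAt can be
	// computed up front and compared for equality instead of only checking
	// that the observed value lies after some lower bound.
	fakeClock := clocktesting.NewFakeClock(time.Now())
	backoff := 10 * time.Second

	want := fakeClock.Now().Add(backoff)
	got := nextRequeueAt(fakeClock, backoff)

	if !got.Equal(want) {
		t.Errorf("unexpected RequeueAt: got %v, want %v", got, want)
	}
}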
if wl.Status.Admission != nil {
	metrics.ReportEvictedWorkloads(string(wl.Status.Admission.ClusterQueue), kueue.WorkloadEvictedByDeactivation)
if !apimeta.IsStatusConditionTrue(wl.Status.Conditions, kueue.WorkloadEvicted) {
	workload.SetEvictedCondition(&wl, kueue.WorkloadEvictedByDeactivation, "The workload is deactivated")
What I don't like about this is that we are losing context as to why the workload was deactivated.
In addition to adding the evicted condition here, could we add it inside reconcileNotReadyTimeout, so that, even if the Kueue controller restarts or fails to send the status update, we keep that information in the Evicted reason and message?
Or maybe we can just deduce that the deactivation is because of the backoffLimit by looking at the requeueState.
IIUC, we have the status.requeueState in this function, just before we clean it just above. Can we use it in the message, for example to say "The workload is deactivated due to 5 failed attempts to run all pods within the PodsReady timeout"? We probably don't want to set the "WorkloadInactive" reason prior to setting active=false.
I think long-term the cleanest would be if active was in status; then we could do the following in one request:
active: false
conditions:
  Evicted: true
    Reason: PodsReadyTimeoutExceeded
    Message: The PodsReady timeout of 5min is exceeded
  Deactivated: true
    Reason: PodsReadyBackoffExceeded
    Message: Deactivated the workload after 5 attempts due to exceeding the PodsReady timeout.
What I don't like about this is that we are losing context as to why the workload was deactivated.
In addition to adding the evicted condition here, could we add it inside reconcileNotReadyTimeout so that, even if the Kueue controller restarts or fails to send the status update, we keep that information in the Evicted reason and message?
Or maybe we can just deduce that the deactivation is because of the backoffLimit by looking at the requeueState.
I think it would be better to deduce the reason here to avoid 2 API calls (spec and status) in reconcileNotReadyTimeout. I think we can say the workload is deactivated by exceeding the limit when .spec.active=false AND .status.requeueState.requeueAt == nil AND .status.requeueState.count != nil.
@alculquicondor WDYT?
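A rough sketch of that deduction, using the Workload API fields referenced in this thread (illustrative only, not the code added by this PR):

package example

import (
	"k8s.io/utils/ptr"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// deactivatedByBackoffLimit reports whether a workload looks like it was
// deactivated for exceeding the requeuing backoff limit: it is inactive,
// still carries a requeue count, and has no pending requeueAt timestamp.
func deactivatedByBackoffLimit(wl *kueue.Workload) bool {
	return !ptr.Deref(wl.Spec.Active, true) &&
		wl.Status.RequeueState != nil &&
		wl.Status.RequeueState.RequeueAt == nil &&
		wl.Status.RequeueState.Count != nil
}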
IIUC, we have the status.requeueState in this function, just before we clean it just above, can we use it in the message, for example to say "The workload is deactivated due to 5 failed attempts to run all pods within the PodsReady timeout"? We probably don't want to set "WorkloadInactive" reason prior to setting active=false.
@mimowo Actually, we had a similar discussion here: #2175
So, could we work on enhancing the reason as a separate feature, so that this PR can focus on fixing the bug?
SGTM. I suspect you want to have #2175 in 0.7?
So, could we work on enhancing the reason as a separate feature, so that this PR can focus on fixing the bug?
sgtm, I was actually suggesting that the current place is ok, since we have the data at hand if we want to build the message. I would be fine with building the message like "The workload is deactivated due to exceeding the PodsReady timeout 5 times in a row". I'm also ok to leave it for #2175.
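A small sketch of building such a message from the requeue count that is still available at that point (the function name and wording are illustrative):

package example

import (
	"fmt"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// deactivationMessage folds the requeue count, read just before the
// requeueState is cleared, into the eviction message.
func deactivationMessage(wl *kueue.Workload) string {
	msg := "The workload is deactivated"
	if rs := wl.Status.RequeueState; rs != nil && rs.Count != nil {
		msg = fmt.Sprintf("%s due to %d failed attempts to run all pods within the PodsReady timeout", msg, *rs.Count)
	}
	return msg
}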
ginkgo.By("evicted re-admitted workload should have 2 in the re-queue count") | ||
util.ExpectWorkloadToHaveRequeueCount(ctx, k8sClient, client.ObjectKeyFromObject(prodWl), ptr.To[int32](2)) | ||
time.Sleep(podsReadyTimeout) |
is this a real sleep?
can it be made a few milliseconds just for this test?
ping @tenzen-y
@alculquicondor Thank you for the reminder. Does it mean that decreasing the timeout would be better?
podsReadyTimeout = 3 * time.Second
yes. 3s is too long
Let me try it.
yeah, I'm even wondering if we could reduce 3s for other places, maybe we should review that
LGTM overall
I found that the current podsready integration test is not correct. So, I will fix the test in another PR first.
((!workload.HasRequeueState(&wl) && ptr.Equal(r.waitForPodsReady.requeuingBackoffLimitCount, ptr.To[int32](0))) ||
	(workload.HasRequeueState(&wl) && wl.Status.RequeueState.RequeueAt == nil &&
		ptr.Equal(wl.Status.RequeueState.Count, r.waitForPodsReady.requeuingBackoffLimitCount))) {
	message = fmt.Sprintf("%s by exceeded the maximum number of re-queuing retries", message)
@mimowo @alculquicondor I added the dedicated message like this based on your recommendation. WDYT?
Also, let me try to add a dedicated reason in another enhancement PR.
I think the dedicated message is useful
I found that the test issues depend on resetting the whole of the requeueState.
@@ -193,7 +193,7 @@ var _ = ginkgo.Describe("SchedulerWithWaitForPodsReady", func() {

var _ = ginkgo.Context("Short PodsReady timeout", func() {
	ginkgo.BeforeEach(func() {
-		podsReadyTimeout = 3 * time.Second
+		podsReadyTimeout = 250 * time.Millisecond
I decreased the timeout to 250 ms in all short podsready timeout cases.
WDYT? @mimowo @alculquicondor
nice win! This will speed up the tests. Can we even go below that, say 10ms?
I tried 10ms, but the test case below failed:
kueue/test/integration/scheduler/podsready/scheduler_test.go
Lines 261 to 263 in ff4ca7f
util.ExpectWorkloadToHaveRequeueState(ctx, k8sClient, client.ObjectKeyFromObject(prodWl), &kueue.RequeueState{
	Count: ptr.To[int32](1),
}, false)
[FAILED] Timed out after 25.000s.
The function passed to Eventually failed at /Users/tenzen/go/src/sigs.k8s.io/kueue/test/util/util.go:354 with:
Expected object to be comparable, diff: &v1beta1.RequeueState{
- Count: &2,
+ Count: &1,
... // 1 ignored field
}
In [It] at: /Users/tenzen/go/src/sigs.k8s.io/kueue/test/integration/scheduler/podsready/scheduler_test.go:261 @ 05/27/24 16:59:35.799
I guess the workload controller could not observe the state transition.
Hm, maybe this is because the scheduler admits the workload before we can set active: false. This might be the reason the time was set to 3s, to make sure such race conditions don't happen. Probably reducing it to 1s might be safer.
Also, for the new tests, maybe we could consider moving them here: https://github.com/kubernetes-sigs/kueue/blob/ff4ca7f566e4a12e13ffc3db62cbe896e7993685/test/integration/controller/jobs/job/job_controller_test.go#L1743C39-L1743C92. WDYT?
Hm, maybe this is because the scheduler admits the workload before we can set active: false.
Yeah, I agree with you.
This might be the reason the time was set to 3s to make sure such race conditions don't happen.
This guess doesn't seem to be correct since you added this case when you implemented the podsready feature here: 9d9336b
When we added the podsready feature, we didn't have the deactivation feature.
Probably reducing to 1s might be safer.
SGTM
Also, for the new tests, maybe we could consider moving them here: https://github.com/kubernetes-sigs/kueue/blob/ff4ca7f566e4a12e13ffc3db62cbe896e7993685/test/integration/controller/jobs/job/job_controller_test.go#L1743C39-L1743C92.
I think this case is still valid to keep here (scheduler+controller), since it is better to keep verifying that no unexpected race conditions happen between the scheduler and the controllers.
This guess doesn't seem to be correct since you added this case when you implemented the podsready feature here: 9d9336b
makes sense, I don't remember why it was 3s; I'm ok to reduce it to as low a value as possible, provided the tests continue to pass.
I think this case is still valid to keep here (scheduler+controller), since it is better to keep verifying that no unexpected race conditions happen between the scheduler and the controllers.
Makes sense; on the other hand, it exposes us to the race conditions. If you still think the tests are valuable with the scheduler, I'm ok to keep them here. Maybe we can consider two types of short timeouts: 10ms for tests which are not vulnerable to the race conditions, and a longer 250ms or 1s for tests which are vulnerable?
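A sketch of how the two tiers could look as test constants (the names and the package are hypothetical):

package podsready

import "time"

const (
	// For tests that only exercise the workload controller and are not
	// vulnerable to scheduler races, a very short timeout keeps them fast.
	podsReadyTimeoutFast = 10 * time.Millisecond
	// For tests that involve the scheduler, a longer timeout avoids the
	// workload being admitted or evicted before the test can react,
	// e.g. before it can set spec.active=false.
	podsReadyTimeoutRaceSafe = time.Second
)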
Makes sense; on the other hand, it exposes us to the race conditions. If you still think the tests are valuable with the scheduler, I'm ok to keep them here. Maybe we can consider two types of short timeouts: 10ms for tests which are not vulnerable to the race conditions, and a longer 250ms or 1s for tests which are vulnerable?
That sounds reasonable, but we would need to restructure the podsready tests. So, could we just increase 250ms to 1s in this PR, and then try to restructure the podsready integration tests in a follow-up PR?
@mimowo WDYT?
That sounds reasonable, but we would need to restructure the podsready tests. So, could we just increase 250ms to 1s in this PR, and then try to restructure the podsready integration tests in a follow-up PR?
sgtm, thanks
Done.
Also I opened an issue: #2285
LGTM overall, just nits
((!workload.HasRequeueState(&wl) && ptr.Equal(r.waitForPodsReady.requeuingBackoffLimitCount, ptr.To[int32](0))) ||
	(workload.HasRequeueState(&wl) && wl.Status.RequeueState.RequeueAt == nil &&
		ptr.Equal(wl.Status.RequeueState.Count, r.waitForPodsReady.requeuingBackoffLimitCount))) {
	message = fmt.Sprintf("%s by exceeded the maximum number of re-queuing retries", message)
-	message = fmt.Sprintf("%s by exceeded the maximum number of re-queuing retries", message)
+	message = fmt.Sprintf("%s due to exceeding the maximum number of re-queuing retries", message)
nit, I think it will read better
SGTM, thank you for this recommendation!
Done.
@@ -1892,6 +1892,7 @@ var _ = ginkgo.Describe("Job controller interacting with scheduler", ginkgo.Orde
var _ = ginkgo.Describe("Job controller interacting with Workload controller when waitForPodsReady with requeuing strategy is enabled", ginkgo.Ordered, ginkgo.ContinueOnFailure, func() {
	var (
		backoffBaseSeconds int32
		backLimitCount *int32
-		backLimitCount *int32
+		backoffLimitCount *int32
nit
Done.
ginkgo.By("checking the workload is evicted by deactivated due to PodsReadyTimeout") | ||
gomega.Eventually(func(g gomega.Gomega) { | ||
g.Expect(k8sClient.Get(ctx, wlKey, wl)).Should(gomega.Succeed()) | ||
g.Expect(wl.Status.RequeueState).Should(gomega.BeNil()) |
nit: let's also assert spec.active: false.
Oh, good point! Thanks!
Done.
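For illustration, the assertion discussed above could extend the quoted Eventually block roughly like this (a sketch reusing the surrounding test variables and helpers, not the exact final code):

gomega.Eventually(func(g gomega.Gomega) {
	g.Expect(k8sClient.Get(ctx, wlKey, wl)).Should(gomega.Succeed())
	g.Expect(wl.Status.RequeueState).Should(gomega.BeNil())
	// Also make sure the workload was actually deactivated.
	g.Expect(ptr.Deref(wl.Spec.Active, true)).Should(gomega.BeFalse())
}, util.Timeout, util.Interval).Should(gomega.Succeed())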
ginkgo.By("checking the 'prod' workload is admitted") | ||
util.ExpectWorkloadsToHaveQuotaReservation(ctx, k8sClient, prodClusterQ.Name, prodWl) | ||
util.ExpectQuotaReservedWorkloadsTotalMetric(prodClusterQ, 1) | ||
util.ExpectAdmittedWorkloadsTotalMetric(prodClusterQ, 1) | ||
ginkgo.By("exceed the timeout for the 'prod' workload") | ||
time.Sleep(podsReadyTimeout) | ||
util.AwaitWorkloadEvictionByPodsReadyTimeoutAndSetRequeuedCondition(ctx, k8sClient, client.ObjectKeyFromObject(prodWl), podsReadyTimeout) |
No strong view here, but I prefer not to mix too many responsibilities into a single function, and the name looks long. Maybe we can have two functions: AwaitWorkloadEvictionByPodsReadyTimeout and SetRequeuedCondition?
That sounds reasonable. Let me split this helper into two.
Thanks!
Done.
if err := workload.ApplyAdmissionStatus(ctx, r.client, &wl, true); err != nil {
	return ctrl.Result{}, fmt.Errorf("setting eviction: %w", err)
} else {
	var updated, evicted bool
Maybe it is better to declare evicted as *string, as it is used to bump the metric. I worry that if we have a bool, then someone can forget to update line 219 when adding a new scenario for deactivation. WDYT?
@mimowo Uhm, do you assume that evicted holds the reason for the eviction (here, InactiveWorkload)?
Oh, never mind, I was thinking of another scenario where the Pods are ready, but in that case it would still be WorkloadInactive.
Yeah, after we introduce a dedicated reason, we may refactor this somehow, though.
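For reference, a minimal sketch of the *string idea floated above (variable names are illustrative; the metric and helper calls are the ones already visible in this diff, so this is not the final code):

// With a *string reason, bumping the metric and recording the eviction share
// one signal, so a new deactivation scenario cannot silently skip the metric.
var evictionReason *string
if !ptr.Deref(wl.Spec.Active, true) {
	evictionReason = ptr.To(string(kueue.WorkloadEvictedByDeactivation))
}
if evictionReason != nil && wl.Status.Admission != nil {
	metrics.ReportEvictedWorkloads(string(wl.Status.Admission.ClusterQueue), *evictionReason)
}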
ginkgo.By("the reactivated workload should not be deactivated by the scheduler unless exceeding the backoffLimitCount") | ||
gomega.Eventually(func(g gomega.Gomega) { | ||
g.Expect(k8sClient.Get(ctx, client.ObjectKeyFromObject(prodWl), prodWl)).Should(gomega.Succeed()) | ||
prodWl.Spec.Active = ptr.To(true) | ||
g.Expect(k8sClient.Update(ctx, prodWl)).Should(gomega.Succeed()) | ||
}, util.Timeout, util.Interval).Should(gomega.Succeed(), "Reactivate inactive Workload") | ||
gomega.Eventually(func(g gomega.Gomega) { | ||
g.Expect(k8sClient.Get(ctx, client.ObjectKeyFromObject(prodWl), prodWl)).Should(gomega.Succeed()) | ||
g.Expect(prodWl.Status.RequeueState).Should(gomega.BeNil()) | ||
}, util.Timeout, util.Interval).Should(gomega.Succeed()) | ||
util.FinishEvictionForWorkloads(ctx, k8sClient, prodWl) | ||
util.ExpectWorkloadsToHaveQuotaReservation(ctx, k8sClient, prodClusterQ.Name, prodWl) | ||
util.ExpectQuotaReservedWorkloadsTotalMetric(prodClusterQ, 4) | ||
util.ExpectAdmittedWorkloadsTotalMetric(prodClusterQ, 4) | ||
util.AwaitWorkloadEvictionByPodsReadyTimeout(ctx, k8sClient, client.ObjectKeyFromObject(prodWl), podsReadyTimeout) | ||
util.SetRequeuedConditionWithPodsReadyTimeout(ctx, k8sClient, client.ObjectKeyFromObject(prodWl)) | ||
util.FinishEvictionForWorkloads(ctx, k8sClient, prodWl) | ||
util.ExpectWorkloadToHaveRequeueState(ctx, k8sClient, client.ObjectKeyFromObject(prodWl), &kueue.RequeueState{ | ||
Count: ptr.To[int32](1), | ||
}, false) |
No strong view, but it feels there is quite a potential for race conditions and flakes here. Could we just end the test on the previous check, that the RequeueState is nil and the metric is incremented?
No strong view, but it feels there is quite a potential for race conditions and flakes here.
I understand your concerns, but I think we need a reactivation test to catch bugs like #2174.
So, let's replace lines 306-308 with a verification of spec.active=true.
WDYT?
I understand your concerns, but I think we need a reactivation test to catch bugs like #2174.
That is true, but since we now check that requeueState is cleaned, I'm not sure how a similar bug could happen; maybe if this was stored in-memory, but I suppose the risk is minimal.
So, let's replace lines 306-308 with a verification of spec.active=true.
Testing for spec.active=true would be better for sure.
But ExpectWorkloadsToHaveQuotaReservation is also time-sensitive and has a potential for raciness, because the workload gets evicted soon after AwaitWorkloadEvictionByPodsReadyTimeout. Maybe just check that the metrics were bumped - this approach would not be time-sensitive.
@mimowo I think there is no such race condition, since this integration test doesn't run the jobframework reconciler, which means the Job is never readmitted unless we call FinishEvictionForWorkloads, right?
Ah, got it. So if the PodsReady timeout happens quickly, you would still pass the ExpectWorkloadsToHaveQuotaReservation (L300) check, because the reservation is not removed until you call FinishEvictionForWorkloads. This convinces me there is indeed no race condition.
Yes, exactly. I agree that a comment would be helpful.
}, util.Timeout, util.Interval).Should(gomega.Succeed())
util.FinishEvictionForWorkloads(ctx, k8sClient, prodWl)
util.ExpectWorkloadsToHaveQuotaReservation(ctx, k8sClient, prodClusterQ.Name, prodWl)
I would suggest clarifying the flow in this test with a comment along the lines of:
// We await for re-admission. Then, the workload keeps the QuotaReserved
// condition even after timeout until FinishEvictionForWorkloads is called.
Thank you for the suggestion!
Done.
}, util.Timeout, util.Interval).Should(gomega.Succeed())
// We await for re-admission. Then, the workload keeps the QuotaReserved condition
// even after timeout until FinishEvictionForWorkloads is called in line 307.
-	// even after timeout until FinishEvictionForWorkloads is called in line 307.
+	// even after timeout until FinishEvictionForWorkloads is called below.
nit
Done.
/lgtm
LGTM label has been added. Git tree hash: 2e93d6ec4b714b30d4180273d66424659345fd0b
/release-note-edit
…is deactivated (kubernetes-sigs#2219)
* WaitForPodsReady: Reset .status.requeueState.count once the workload is deactivated
* Add a test case if the Workload is deactivated in the Job integration test
* Improve the message readability when the workload is deactivated due to exceeding the backoffLimitCount
* Split AwaitWorkloadEvictionByPodsReadyTimeoutAndSetRequeuedCondition into AwaitWorkloadEvictionByPodsReadyTimeout and SetRequeuedConditionWithPodsReadyTimeout
* Increase the PodsReadyTimeout from 250ms to 1s
* Add a comment to clarify the test flow
Signed-off-by: Yuki Iwai <[email protected]>
What type of PR is this?
/kind bug
What this PR does / why we need it:
If the Workload is deactivated, the requeueState is reset to avoid a race condition between the kueue-scheduler and the workload controller for deactivated workloads.
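A simplified sketch of that reset (condensed for illustration; the helper calls are the ones that appear in the diff above, while the surrounding structure is not the exact final code):

// If the workload was deactivated, clear the accumulated requeue state so a
// later reactivation starts with a fresh backoff counter, and record the
// eviction if it has not been recorded yet.
if !ptr.Deref(wl.Spec.Active, true) {
	wl.Status.RequeueState = nil
	if !apimeta.IsStatusConditionTrue(wl.Status.Conditions, kueue.WorkloadEvicted) {
		// The PR additionally appends "... due to exceeding the maximum number
		// of re-queuing retries" when the backoff limit was exhausted.
		workload.SetEvictedCondition(&wl, kueue.WorkloadEvictedByDeactivation, "The workload is deactivated")
	}
}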
When the workload is deactivated by exceeding the backoffLimitCount, the message "The workload is deactivated due to exceeding the maximum number of re-queuing retries" is added to the workload. I will introduce a dedicated reason in another PR to keep this one focused on fixing the bug.
Which issue(s) this PR fixes:
Fixes #2174
Special notes for your reviewer:
The cherry-pick was manually created here: #2220, because there are significant changes in the re-queuing mechanism between v0.6 and the main branch.
Does this PR introduce a user-facing change?