Use exponential backoff for failing migrations #8530
Conversation
/uncc @maiqueb
/cc @vladikr
Can you guys take a look and share your thoughts about the approach, default values, etc.? I still have to implement unit and e2e tests.
Force-pushed from b8605a4 to efad57d.
Force-pushed from efad57d to 3cce3f1.
I will take a look at this today.
/retest
Force-pushed from 3cce3f1 to 544e68e.
Force-pushed from 544e68e to 4ac36d0.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jean-edouard
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from 4ac36d0 to 05e8997.
@acardace Hi, PR looks great! I have a few questions, please see below
@@ -1113,6 +1170,12 @@ func (c *MigrationController) sync(key string, migration *virtv1.VirtualMachineI
		return c.handlePreHandoffMigrationCancel(migration, vmi, pod)
	}

	if err = c.handleMigrationBackoff(key, vmi, migration); err != nil {
We can have an err returned if the `c.listMigrationsMatchingVMI(vmi.Namespace, vmi.Name)` call fails; in that case the error type is not `MigrationBackoffReason`, so we will need to check the returned error type before we generate an event.
Good catch!
fixed, thank you for spotting that!
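A minimal standalone sketch of the sentinel-error pattern the fix converged on (the name `migrationBackoffError` appears in the later diff; `handleMigrationBackoff` here is a stub, and the "retry in 40s" figure is purely illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel error: callers can recognize an intentional backoff with errors.Is
// instead of treating every non-nil error the same way.
var migrationBackoffError = errors.New("migration is in backoff")

// handleMigrationBackoff is a stub standing in for the controller method: it
// can return the sentinel (still backing off) or an unrelated error, e.g.
// from a failed listMigrationsMatchingVMI call.
func handleMigrationBackoff(listFails bool) error {
	if listFails {
		return fmt.Errorf("listing migrations matching VMI: %w", errors.New("cache error"))
	}
	return fmt.Errorf("%w, retry in 40s", migrationBackoffError)
}

func main() {
	for _, listFails := range []bool{false, true} {
		if err := handleMigrationBackoff(listFails); errors.Is(err, migrationBackoffError) {
			fmt.Println("emit MigrationBackoff event:", err) // sentinel matched
		} else if err != nil {
			fmt.Println("propagate error, no event:", err) // unrelated failure
		}
	}
}
```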
	return metav1.Time{}
}

outOffBackoffTS := getFailedTS(migrations[1]).Add(backoff)
@xpivarc Hi, do you mean to use `status.migrationState.endTimestamp` instead of `PhaseTransitionTimestamp`?
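A standalone sketch of the check behind `outOffBackoffTS` in the diff. The backoff window is anchored on the last failed migration's timestamp; whether that anchor should come from `PhaseTransitionTimestamp` or from `status.migrationState.endTimestamp` is exactly the open question above, so here it is just a plain `time.Time`:

```go
package main

import (
	"fmt"
	"time"
)

// inBackoff reports whether a new migration must still wait: the window ends
// at lastFailedTS + backoff, mirroring outOffBackoffTS in the diff.
func inBackoff(lastFailedTS time.Time, backoff time.Duration, now time.Time) bool {
	outOfBackoffTS := lastFailedTS.Add(backoff)
	return now.Before(outOfBackoffTS)
}

func main() {
	failedAt := time.Now().Add(-30 * time.Second) // last migration failed 30s ago
	fmt.Println(inBackoff(failedAt, 40*time.Second, time.Now())) // true: still in backoff
	fmt.Println(inBackoff(failedAt, 20*time.Second, time.Now())) // false: window has passed
}
```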
	deleteEvents(eventListOpts)
})

It("[Serial] after a successful migration backoff should be cleared", func() {
Why wouldn't we run it in parallel?
We are deleting events. With some effort we could make this parallel, though.
I'm just sticking to the existing functions.
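For context, a sketch of how such a serial spec might be shaped in Ginkgo. Only `deleteEvents`, `eventListOpts`, and the spec title come from the quoted diff; the structure around them is an assumption and the helpers are left undefined:

```go
package tests

import (
	. "github.com/onsi/ginkgo/v2"
)

var _ = Describe("Migration backoff", func() {
	AfterEach(func() {
		// Deleting events is what forces [Serial]: events are shared
		// cluster state, and a parallel spec could be asserting on the
		// very events this cleanup removes.
		deleteEvents(eventListOpts)
	})

	It("[Serial] after a successful migration backoff should be cleared", func() {
		// 1. Run a migration that fails and expect a MigrationBackoff event.
		// 2. Run a migration that succeeds.
		// 3. Migrate again and assert no MigrationBackoff event is emitted.
	})
})
```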
Force-pushed from 50a96a0 to c0e4a6c.
	return metav1.Time{}
}

outOffBackoffTS := getFailedTS(migrations[1]).Add(backoff)
@acardace Can we address the case of a cancelled migration?
@@ -1112,6 +1172,11 @@ func (c *MigrationController) sync(key string, migration *virtv1.VirtualMachineI
	if migration.DeletionTimestamp != nil {
		return c.handlePreHandoffMigrationCancel(migration, vmi, pod)
	}
	if err = c.handleMigrationBackoff(key, vmi, migration); errors.Is(err, migrationBackoffError) {
+1
@enp0s3 isn't the canceled migration case handled at kubevirt/pkg/virt-controller/watch/migration.go line 1172 in c0e4a6c?
Force-pushed from c0e4a6c to 65eece2.
/retest-required
Force-pushed from 65eece2 to 12c8c50.
With this patch, when migrating a VM fails, an exponentially increasing backoff will be applied before retrying. Signed-off-by: Antonio Cardace <[email protected]>
Force-pushed from 12c8c50 to 58ad400.
/lgtm
@acardace Awesome, thanks for addressing the comments!
@acardace: The following tests failed, say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/cherrypick release-0.58
@acardace: new pull request created: #8784
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Following the bugzilla discussion, I am curious why we did not solve the pod accumulation and instead went for a back-off. Here are some thoughts; I would be happy to hear your reasoning:
Stating for others looking into this:
Maximum backoff is implicit; in practice it is 320 seconds (see below). On the second part: right now the backoff would be applied even if you had already fixed the problem. Here we could only recommend a workaround: remove the last VMIM object.
No. As of now we can't even know if the migration is targeted to another node.
Eviction is often retried continuously. We can look into suggesting a backoff to clients, but they could still ignore the suggestion. Even with the pod GC the problem would remain. It seems reasonable to constrain ourselves to VMIMs created by eviction for now and revisit the overall backoff later.
Yes, it probably makes sense to make it explicit in code. The limit, as Lubo said, is implicit and governed by the migration object GC; it is 320 seconds anyway.
No, but what if it's the source node that has issues? I'm not sure it makes sense to tweak it like this.
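A standalone sketch of why the cap comes out implicit under stated assumptions: if the backoff doubles once per failed VMIM that the migration GC still retains, and starts at 20s with 5 retained objects (both the initial value and the retention count are assumptions here; only the 320s figure comes from the thread), the worst case is 20s · 2⁴ = 320s:

```go
package main

import (
	"fmt"
	"time"
)

// backoffFromRetainedFailures doubles the backoff once per retained failed
// VMIM, which is what makes the cap implicit: once GC trims old objects,
// the backoff can no longer grow.
func backoffFromRetainedFailures(retainedFailures int) time.Duration {
	const initial = 20 * time.Second // assumed starting value
	if retainedFailures <= 0 {
		return 0
	}
	return initial << (retainedFailures - 1)
}

func main() {
	for n := 0; n <= 5; n++ {
		fmt.Printf("%d retained failed VMIMs -> backoff %v\n", n, backoffFromRetainedFailures(n))
	}
}
```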
/cherry-pick release-0.53
@stu-gott: #8530 failed to apply on top of branch "release-0.53":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What this PR does / why we need it:
When migrating a VM fails, an exponentially increasing backoff will be applied before retrying.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes # https://bugzilla.redhat.com/show_bug.cgi?id=2124528
Special notes for your reviewer:
Release note: