Velero didn't retry on failed Restore CR status update, causing the CR to remain stuck in "InProgress" #7207
Comments
One solution we can try here is to replace the restore update call with retry-with-backoff logic, so that the reconcile won't exit until it succeeds or reaches the maximum number of retries. We already made similar changes for create calls in #6830.
@shubham-pampattiwar I don't think this expectation is realistic with the current Velero controller design: "Velero controller to reconcile on the InProgress CR and then appropriately update its status to Complete." I think your retry-with-backoff for the post-reconcile status update is the right approach.
The backup controller also has similar logic around the log
This may happen to any CR when it goes to its final state, so we need to apply the retry with backoff to all CRs.
At this moment, we are not very convinced that "In Progress" -> "Complete" is more error-prone than other API calls, and we probably don't want to retry on every API call. Therefore, let's keep this open.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.
unstale
I think a retry should be attempted at least for the CR's final status update, so that other components relying on the backup/restore CR status can react appropriately.
Fix posted here. Can you confirm how long it took for the connection to come back? I want to double-check that the retries would allow enough time for this case.
Is there a limit to where we want to retry? Do we want to retry on every status phase change across backup/restore?
The default retry backoff is too short (based on testing, it seems to last only a few seconds). A two-minute retry is reasonable when there is an API outage due to cert rotation.
@Missxiaoguo please check again. Thanks!
After today's meeting we reached some agreements on the retry approach, and I will write a design covering the UX changes as well as implementation notes. Targeting v1.15.
Design PR opened: #8063
Another user with a transient API server issue: #8116 (comment)
What steps did you take and what happened:
Velero was performing a restore when the API server was rolling out to a new version. It had trouble connecting to the API server, but eventually, the restore was successful. However, since the API server was still in the middle of rolling out, Velero failed to update the restore CR status and gave up. After the connection was restored, it didn't attempt to update, causing the restore CR to be stuck at "In progress" indefinitely. This can lead to incorrect decisions for other components that rely on the backup/restore CR status to determine completion.
What did you expect to happen:
Velero controller to reconcile on the InProgress CR and then appropriately update its status to Complete.
The following information will help us better understand what's going on:
If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue. For more options, please refer to velero debug --help
If you are using earlier versions:
Please provide the output of the following commands (pasting long output into a GitHub gist or other pastebin is fine):
kubectl logs deployment/velero -n velero
velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
velero backup logs <backupname>
velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
velero restore logs <restorename>
Anything else you would like to add:
Velero logs:
Restore CR stuck at InProgress status:
Environment:
velero version:
velero client config get features:
kubectl version:
/etc/os-release: