Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make in-progress backup/restore as failed when doing the reconcile #4833

Merged
merged 1 commit into from
Apr 27, 2022

Conversation

ywk253100
Copy link
Contributor

Make in-progress backup/restore as failed when doing the reconcile to avoid hanging in in-progress status

Signed-off-by: Wenkai Yin(尹文开) [email protected]

Thank you for contributing to Velero!

Please add a summary of your change

Does your change fix a particular issue?

Fixes #(issue)

Please indicate you've done the following:

  • Accepted the DCO. Commits without the DCO will delay acceptance.
  • Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.

sseago
sseago previously approved these changes Apr 14, 2022
@shawn-hurley
Copy link
Contributor

Can someone explain to me why we can't recover here?

@kaovilai
Copy link
Member

If I have to guess (without looking at the code), the plugin system only report success/failures of backupitemaction and velero does not have an in-progress queue saved, only a counter. A pod restart results in lost of in-progress queue and would requires a start from scratch which is equivalent to recreating another backup.

@sseago
Copy link
Collaborator

sseago commented Apr 14, 2022

@shawn-hurley I believe this came up on a prior discussion and as I understood it, there's a long-term fix needed -- Velero needs to be able to continue a backup or restore in-progress. That will be a significant refactoring. In the meantime, this PR provides a short-term fix/workaround -- this status update provides a more accurate status reporting on the status quo -- when the pod is bounced while a backup or restore is in progress, Velero will never finish it. It is, effectively, failed -- this PR makes that explicit, so that when a user looks at the current set of backups or restores, they won't see more than one in-progress item when one of them is not in progress (and never will be).

@kaovilai
Copy link
Member

kaovilai commented Apr 14, 2022

Adding a note that this is related to #4757 and #4728 and that further discussions on this topic may also consider adding garbage collection failure to a similar status field. Syncing failures such as #4483 may be another status candidate.

@sseago
Copy link
Collaborator

sseago commented Apr 14, 2022

@kaovilai It's not so much the plugin system as the controller itself hasn't been designed to resume a previously-started backup or restore. The whole backup/restore processing will need to be refactored to support this. Plugin actions only occur at a specific point prior to backing up or restoring a single item, so I don't think that's nearly as relevant as being able to pick up a backup/restore where we were -- a way of identifying backed-up or restored items on restart would be part of this. Actually, since we don't persist the backup to object storage until the end, for backups, adding resuming behavior may well be a matter of having to just restart the backup. For restores, we could probably check the backup/restore annotations on any existing resources prior to attempting to restore -- but since we already have logic around that (i.e. not restoring what's there), even for restores, "resuming" may just mean "starting over". That's probably something that needs its own design document to be drafted and reviewed apart from (and after) this PR is merged.

@ywk253100
Copy link
Contributor Author

@shawn-hurley @kaovilai Just like @sseago explained, this is a short-term fix.
A long-term fix should be well-considered and designed. We'll keep the issue open until the long-term fix introduced

@ywk253100 ywk253100 added this to the 1.9.0 milestone Apr 15, 2022
@ywk253100
Copy link
Contributor Author

@reasonerjt The upload progress PR introduces a new phase Uploading, so it doesn't impact the handling logic for InProgress phase. And the uploading progress PR also handles the restarting for Uploading CRs.

@reasonerjt
Copy link
Contributor

This is better than keeping the backup/restore in "inprogress" state after an accidental restart of the velero process.

I'll approve.

@ywk253100
Copy link
Contributor Author

@sseago @reasonerjt I have rebased the PR to fix the conflict, please take a look again

sseago
sseago previously approved these changes Apr 25, 2022
Make in-progress backup/restore as failed when doing the reconcile to avoid hanging in in-progress status

Signed-off-by: Wenkai Yin(尹文开) <[email protected]>
@ywk253100
Copy link
Contributor Author

@sseago I made a tiny change from your last review: set the status.failureReason as well when the backup failed
Please take a look at this PR again if you have a chance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants