Velero didn't retry on failed Restore CR status update, causing the CR to remain stuck in "InProgress" #7207

shubham-pampattiwar · 2023-12-13T16:38:14Z

What steps did you take and what happened:

Velero was performing a restore while the API server was rolling out to a new version. It had trouble connecting to the API server, but the restore eventually succeeded. However, because the API server was still in the middle of rolling out, Velero failed to update the restore CR status and gave up. After the connection was restored, it did not attempt the update again, leaving the restore CR stuck at "InProgress" indefinitely. This can cause other components that rely on the backup/restore CR status to determine completion to make incorrect decisions.

What did you expect to happen:

Velero controller to reconcile on the InProgress CR and then appropriately update its status to Complete.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle, and attach to this issue, more options please refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

kubectl logs deployment/velero -n velero
velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
velero backup logs <backupname>
velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
velero restore logs <restorename>

Anything else you would like to add:

Velero logs:

time="2023-12-08T04:02:21Z" level=warning msg="Cluster resource restore warning: could not restore, CustomResourceDefinition \"klusterlets.operator.open-cluster-management.io\" already exists. Warning: the in-cluster version is different than the backed-up version."  logSource="/remotesource/velero/app/pkg/controller/restore_controller.go:506" restore=openshift-adp/acm-klusterlet
time="2023-12-08T04:02:21Z" level=warning msg="Cluster resource restore warning: refresh discovery after restoring CRDs: Get \"https://172.30.0.1:443/api?timeout=32s\": dial tcp 172.30.0.1:443: connect: connection refused" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:506" restore=openshift-adp/acm-klusterlet

################Restore completed
time="2023-12-08T04:02:21Z" level=info msg="restore completed" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:513" restore=openshift-adp/acm-klusterlet

time="2023-12-08T04:02:21Z" level=error msg="Get \"https://172.30.0.1:443/api?timeout=32s\": dial tcp 172.30.0.1:443: connect: connection refused" logSource="/remote-source/velero/app/pkg/datamover/datamover.go:143"
time="2023-12-08T04:02:21Z" level=error msg="Error removing VSRs after partially failed restore" error="Get \"https://172.30.0.1:443/api?timeout=32s\": dial tcp 172.30.0.1:443: connect: connection refused" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:571"

################FAIL to update restore CR status
time="2023-12-08T04:02:21Z" level=info msg="Error updating restore's final status" Restore=openshift-adp/acm-klusterlet error="Patch \"https://172.30.0.1:443/apis/velero.io/v1/namespaces/openshift-adp/restores/acm-klusterlet\": dial tcp 172.30.0.1:443: connect: connection refused" error.file="/remote-source/velero/app/pkg/controller/restore_controller.go:216" error.function="github.com/vmware-tanzu/velero/pkg/controller.(*restoreReconciler).Reconcile" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:216"
...
################connection is back and it's stable
time="2023-12-08T04:02:58Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=openshift-adp/oadp-2 controller=backup-storage-location logSource="/remote-source/velero/app/pkg/controller/backup_storage_location_controller.go:152"
time="2023-12-08T04:02:58Z" level=info msg="BackupStorageLocations is validc, marking as available" backup-storage-location=openshift-adp/oadp-2 controller=backup-storage-location logSource="/remote-source/velero/app/pkg/controller/backup_storage_location_controller.go:137"

Restore CR stuck at InProgress status:

apiVersion: velero.io/v1
kind: Restore
metadata:
  annotations:
    lca.openshift.io/apply-wave: "1"
  creationTimestamp: "2023-12-08T04:01:59Z"
  generation: 4
  labels:
    velero.io/storage-location: default
  name: acm-klusterlet
  namespace: openshift-adp
  resourceVersion: "45514"
  uid: 392ad1c8-2a1a-4228-9eaa-d3cb28101de3
spec:
  backupName: acm-klusterlet
  excludedResources:
  - nodes
  - events
  - events.events.k8s.io
  - backups.velero.io
  - restores.velero.io
  - resticrepositories.velero.io
  - csinodes.storage.k8s.io
  - volumeattachments.storage.k8s.io
  - backuprepositories.velero.io
  hooks: {}
  itemOperationTimeout: 1h0m0s
status:
  phase: InProgress
  progress:
    itemsRestored: 1
    totalItems: 63
  startTimestamp: "2023-12-08T04:01:59Z"

shubham-pampattiwar · 2023-12-13T16:40:12Z

One of the solutions we can try here is to replace the restore update call with retry-with-backoff logic, so that the reconcile won't exit until it succeeds or reaches the maximum number of retries. We already made similar changes for create calls in #6830.

sseago · 2023-12-13T17:27:36Z

@shubham-pampattiwar I don't think this expectation is realistic with the current Velero controller design: "Velero controller to reconcile on the InProgress CR and then appropriately update its status to Complete."

I think your retry-with-backoff for the post-reconcile status update is the right approach.

qiuming-best · 2023-12-18T06:00:07Z

The backup controller has similar logic around the "error updating backup's final status" log, which should also be handled.

reasonerjt · 2024-02-02T08:47:32Z

This may happen to any CR when it goes to its final state; we need to apply the retry with backoff to all CRs.

reasonerjt · 2024-03-13T08:46:36Z

At this moment, we are not very convinced that "In Progress" -> "Complete" is more error-prone than other API calls, and we probably don't want to retry on every API call. Therefore, let's keep this open.

github-actions · 2024-05-14T01:48:23Z

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

kaovilai · 2024-05-14T04:16:19Z

unstale

Missxiaoguo · 2024-05-22T18:33:40Z

I think a retry should be attempted at least for the CR's final status update, so that other components relying on the backup/restore CR status could react appropriately

kaovilai · 2024-05-31T02:17:39Z

Fix posted here

Can you confirm how long it took for the connection to come back? I want to double-check whether the retries would allow enough time for this case.

kaovilai · 2024-05-31T15:00:03Z

Is there a limit to where we want to retry? Do we want to retry on every status phase change across backup/restore?

Missxiaoguo · 2024-05-31T16:56:25Z

The default retry backoff is too short (it seems to last only a few seconds, based on testing). A two-minute retry is reasonable when there is an API outage due to cert rotation.
Another note: the backup status update should be handled as well. We have seen the same issue for the backup CR (stuck InProgress) in our scale lab.

kaovilai · 2024-05-31T23:10:39Z

@Missxiaoguo please check again. Thanks!

kaovilai · 2024-07-17T00:40:14Z

After today's meeting we reached agreement on the retry approach, and I will write a design covering the UX changes as well as implementation notes.

Targeting v1.15

kaovilai · 2024-07-30T23:29:21Z

Design PR opened #8063

kaovilai · 2024-08-29T15:46:10Z

Another user with a transient API server issue: #8116 (comment)

qiuming-best added the Restore label Dec 18, 2023

reasonerjt added area/resilience Help wanted labels Dec 18, 2023

qiuming-best self-assigned this Jan 8, 2024

qiuming-best added the 1.14-candidate label Jan 8, 2024

weshayutin added this to OADP Jan 8, 2024

reasonerjt added 2024 Q1 reviewed defer-candidate labels Feb 2, 2024

reasonerjt removed defer-candidate 1.14-candidate labels Mar 13, 2024

github-actions bot added the staled label May 14, 2024

shubham-pampattiwar assigned kaovilai May 14, 2024

github-actions bot removed the staled label May 15, 2024

Lyndon-Li added the backlog label May 20, 2024

Lyndon-Li unassigned qiuming-best May 20, 2024

kaovilai mentioned this issue May 30, 2024

Retry backup/restore completion/finalizing status patching to unstuck inprogress backups/restores #7845

Closed

3 tasks

kaovilai mentioned this issue Jun 5, 2024

Mark InProgress backup/restore as failed upon requeuing #7863

Closed

3 tasks

Lyndon-Li added the 1.15-candidate label Jul 17, 2024

reasonerjt removed the 1.15-candidate label Jul 23, 2024

reasonerjt added this to the v1.15 milestone Jul 23, 2024

kaovilai mentioned this issue Jul 30, 2024

Add status patching retry configuration design. #8063

Merged

3 tasks

This was referenced Jul 31, 2024

The statuses of the CRs should be updated when got errors during the reconciling #7799

Closed

Retry completion status patch for backup and restore resources #8068

Merged

ywk253100 closed this as completed in #8068 Sep 11, 2024
