Persistent Volume Claims Remain Pending and Timeout During Restore #7890

yuanqijing · 2024-06-14T03:38:18Z

What steps did you take and what happened:

I attempted to restore only the PVCs using Velero. However, the restored PVCs remained in the Pending state and did not bind until the restore operation timed out.

Step 1: Back up the default namespace, which contains only a bound PVC object.
velero backup create vv2 --snapshot-move-data --include-namespaces default

Step 2: Delete the PVC.

Step 3: Restore the PVC from backup vv2.
velero restore create --from-backup vv2

The restore phase is FinalizingPartiallyFailed.

What did you expect to happen:

The PVC should be restored.

The following information will help us better understand what's going on:

velero restore describe <restorename> result

Anything else you would like to add:

The code removes the volume.kubernetes.io/selected-node annotation, but the Velero agent blocks while checking for this annotation.
https://github.com/vmware-tanzu/velero/blob/main/pkg/restore/actions/csi/pvc_action.go#L156-L160
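
The effect described above can be sketched in a few lines. This is a minimal Python illustration, not Velero's actual Go code: the function name strip_selected_node and the sample manifest are hypothetical, and only the annotation key comes from this report.

```python
import copy

# Annotation key named in this report.
SELECTED_NODE_ANNOTATION = "volume.kubernetes.io/selected-node"


def strip_selected_node(pvc: dict) -> dict:
    """Return a copy of a PVC manifest without the selected-node annotation.

    Hypothetical helper: it only mimics the observable effect of the
    restore action, which drops the annotation before the PVC is created.
    """
    restored = copy.deepcopy(pvc)
    annotations = restored.get("metadata", {}).get("annotations") or {}
    annotations.pop(SELECTED_NODE_ANNOTATION, None)
    return restored


# Assumed sample manifest, trimmed to the fields that matter here.
backed_up_pvc = {
    "metadata": {
        "name": "data",
        "annotations": {SELECTED_NODE_ANNOTATION: "node1", "team": "storage"},
    }
}
restored_pvc = strip_selected_node(backed_up_pvc)
print(restored_pvc["metadata"]["annotations"])
```

After this step the restored PVC no longer carries the annotation, so anything that waits for the annotation to be present has nothing to find.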

Environment:

Velero version (use velero version): 1.14
Velero features (use velero client config get features): features:
Kubernetes version (use kubectl version): 1.22
Kubernetes installer & version:
Cloud provider or hardware configuration:
OS (e.g. from /etc/os-release):



yuanqijing · 2024-06-14T03:51:41Z

/assign

Lyndon-Li · 2024-06-14T06:42:34Z

I guess the volumeBindingMode for the PVC being restored is WaitForFirstConsumer. If so, this is the expected behavior.

This deliberately honors the WaitForFirstConsumer behavior: when a PVC is created, the PV won't be provisioned and bound until a pod consumes the PVC and is scheduled. The AnnSelectedNode annotation is added by the scheduler at that time.
Preserving the annotation from the backup only cheats the check here, but it doesn't change the fact that a PV won't be provisioned without a pod.

That is to say, the current data movement restore cannot restore volume data alone for volumes in WaitForFirstConsumer mode.
Another design, #7481, will cover this.
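
A toy model may make the point above easier to follow. It is a sketch of the behavior as described in this comment, not of the real Kubernetes controllers: the Claim class and the reconcile function are hypothetical, and the only rule it encodes is that a WaitForFirstConsumer claim stays Pending until a consuming pod has been scheduled.

```python
from dataclasses import dataclass, field
from typing import Optional

SELECTED_NODE_ANNOTATION = "volume.kubernetes.io/selected-node"


@dataclass
class Claim:
    """Hypothetical, heavily simplified stand-in for a PVC."""
    name: str
    binding_mode: str  # "Immediate" or "WaitForFirstConsumer"
    annotations: dict = field(default_factory=dict)
    phase: str = "Pending"


def reconcile(claim: Claim, scheduled_pod_node: Optional[str]) -> Claim:
    """Apply one step of the simplified binding rule described above."""
    if claim.binding_mode == "Immediate":
        # No consumer is needed: provision and bind right away.
        claim.phase = "Bound"
    elif scheduled_pod_node is not None:
        # The scheduler records its decision, then the volume is provisioned.
        claim.annotations[SELECTED_NODE_ANNOTATION] = scheduled_pod_node
        claim.phase = "Bound"
    # Otherwise nothing happens, even if an old annotation was preserved.
    return claim


# A PVC restored with the annotation kept from the backup, but with no pod.
restored = Claim("data", "WaitForFirstConsumer", {SELECTED_NODE_ANNOTATION: "node1"})
print(reconcile(restored, None).phase)     # Pending
print(reconcile(restored, "node2").phase)  # Bound
```

In this model, keeping the old annotation changes nothing on its own; only the arrival of a scheduled pod moves the claim out of Pending.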

Lyndon-Li · 2024-06-14T06:45:25Z

@yuanqijing Please share some details about your requirement to restore volume data only (or why you don't want to restore the entire workload, including pods), so that we can prioritize our work on #7481.

yuanqijing · 2024-06-14T08:38:18Z

@Lyndon-Li Understood. The primary question is whether to follow the storage class's WaitForFirstConsumer design. The AnnSelectedNode annotation is a potential conflict in Velero's logic when restoring PVCs: the code at csi/pvc_action.go#L156-L160, which clears the AnnSelectedNode annotation, seems to conflict with the logic that allows users to select a node, as found in change_pvc_node_selector.go#L115-L117.

Upon reviewing the documentation on changing the PVC selected node, it appears that changing the node selector isn't feasible under the current implementation, due to this conflicting logic.

Could we consider revising the design of the change-pvc-selected-node functionality? Specifically, might it be feasible to allow direct manual binding to a selected node if a user has explicitly chosen one, rather than waiting for a pod to consume the PVC?

Lyndon-Li · 2024-06-14T08:46:32Z

I don't think it is related to node selection; it is all about the behavior of WaitForFirstConsumer: without a scheduled pod mounting the PVC, the PV won't be provisioned. So we are actually not waiting for the annotation to appear, but for the PV to become ready.

yuanqijing · 2024-06-14T09:01:24Z

I think that if we specify a node for the PVC (through node selection), we don't need to follow the WaitForFirstConsumer policy. We can manually bind a (restored) PV to this PVC, which is exactly what the Velero agent is supposed to do. However, the current Velero logic still waits on the annotation check, so the node selection does not take effect.

Lyndon-Li · 2024-06-14T09:15:20Z

I think you are trying to bypass the constraint that WaitForFirstConsumer imposes, but keep in mind that this constraint is meaningful. In some non-flat topologies, a pod may not be schedulable to every node. WaitForFirstConsumer guarantees that, after a PV is provisioned, the pod that wants to consume it is able to do so. That is, it always waits for the pod to be ready: once the pod runs on a node, it is known which node the PV should attach to.
Otherwise, if a PV is provisioned and attached ahead of pod scheduling, say on node1, but the pod cannot run on node1, the pod will not be able to consume the PV. This is one of the problems that WaitForFirstConsumer tries to solve.
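
The node1 scenario can be written out as a small example. This is only an illustration under assumed names: the node names, the set of nodes the pod may run on, and both helper functions are hypothetical, and the "scheduler" is just a stand-in that picks the first allowed node.

```python
from typing import Set


def provision_on(node: str) -> Set[str]:
    """Return the set of nodes from which the provisioned volume is reachable.

    Simplification: the volume is reachable only from the node it was
    provisioned for.
    """
    return {node}


def pod_can_consume(volume_nodes: Set[str], pod_allowed_nodes: Set[str]) -> bool:
    """A pod can use the volume only if it may run on a node that reaches it."""
    return bool(volume_nodes & pod_allowed_nodes)


# Assumed topology: the pod is restricted to node2 and node3.
pod_allowed_nodes = {"node2", "node3"}

# Provisioning ahead of the pod, on node1: the pod can never reach the volume.
early_volume = provision_on("node1")
print(pod_can_consume(early_volume, pod_allowed_nodes))  # False

# Waiting for the consumer: the node is chosen from where the pod can run.
scheduled_node = sorted(pod_allowed_nodes)[0]  # stand-in for the scheduler
late_volume = provision_on(scheduled_node)
print(pod_can_consume(late_volume, pod_allowed_nodes))  # True
```

Choosing the node after scheduling is what makes the second case succeed by construction.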

yuanqijing · 2024-06-14T09:23:00Z

If all restored PVCs must adhere to the WaitForFirstConsumer constraints, does that mean users cannot specify nodes for the PVCs? If so, what is the purpose of the feature described in changing the PVC selected node, which allows users to select a node for a PVC?

Lyndon-Li · 2024-06-14T10:28:15Z

This is a good question. Changing the PVC selected node is an old feature from PR #2377.

As you can see from this comment, the fix is not required with newer versions of Kubernetes: the PVC can be re-provisioned if the node indicated by the annotation doesn't exist, and the annotation will also be removed.
Moreover, as you can see from this Kubernetes doc, the annotation is an output of the scheduler, not an instruction to the scheduler.

Therefore, I think this RIA plugin is neither required nor behaving correctly in newer versions of Kubernetes.
Let me discuss with the community how to deal with it, and I will follow up if there is anything to report.

yuanqijing · 2024-06-14T10:51:44Z

Thank you for your response. I appreciate the clarification and am looking forward to the community's resolution on this.

Lyndon-Li · 2024-06-19T02:21:16Z

We've decided to deprecate the Changing PVC selected-node feature starting in 1.15, and opened issue #7904 to start the deprecation process.
We also opened issue #7903 to document the limitation that it does not work with WaitForFirstConsumer volumes in 1.14.1.

yuanqijing mentioned this issue Jun 14, 2024

bugfix: default keep pvc node annotations #7891 (Closed, 3 tasks)

github-actions bot assigned yuanqijing Jun 14, 2024

Lyndon-Li added the area/datamover label Jun 14, 2024

yuanqijing removed their assignment Jun 14, 2024

Lyndon-Li added Restore and removed area/datamover labels Jun 14, 2024

Lyndon-Li closed this as completed Jun 19, 2024
