Persistent Volume Claims Remain Pending and Timeout During Restore #7890

Closed
yuanqijing opened this issue Jun 14, 2024 · 11 comments

@yuanqijing

What steps did you take and what happened:

I attempted to restore only the PVCs using Velero. However, the restored PVCs remained in the Pending state and did not bind until the restore operation timed out.

Step 1: Back up the namespace default, which contains only a bound PVC object: velero backup create vv2 --snapshot-move-data --include-namespaces default

Step 2: Delete the PVC.

Step 3: Restore the PVC from backup vv2: velero restore create --from-backup vv2

The restore phase is FinalizingPartiallyFailed.

What did you expect to happen:

The PVC should be restored.

The following information will help us better understand what's going on:

velero restore describe <restorename> result

Anything else you would like to add:

The code removes the volume.kubernetes.io/selected-node annotation, but the Velero agent blocks while checking for this annotation.
https://github.com/vmware-tanzu/velero/blob/main/pkg/restore/actions/csi/pvc_action.go#L156-L160
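For readers who do not want to open the link, the effect described above can be sketched roughly as follows. This is a minimal illustration, not Velero's actual code: the helper name `stripAnnotations` and the `example.com/owner` annotation are made up, and only the annotation key `volume.kubernetes.io/selected-node` comes from this issue.

```go
package main

import "fmt"

// Annotation key quoted in this issue.
const annSelectedNode = "volume.kubernetes.io/selected-node"

// stripAnnotations returns a copy of annotations without the given keys.
// Hypothetical helper for illustration only; it is not Velero's function.
func stripAnnotations(annotations map[string]string, keys ...string) map[string]string {
	out := make(map[string]string, len(annotations))
	for k, v := range annotations {
		out[k] = v
	}
	for _, k := range keys {
		delete(out, k)
	}
	return out
}

func main() {
	// Annotations as they might look on the backed-up PVC (made-up values).
	backedUp := map[string]string{
		annSelectedNode:     "node1",
		"example.com/owner": "team-a",
	}
	restored := stripAnnotations(backedUp, annSelectedNode)
	_, present := restored[annSelectedNode]
	fmt.Println("selected-node present after restore:", present)
}
```

In this sketch the restored PVC no longer carries the selected-node annotation, while unrelated annotations are kept.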

Environment:

  • Velero version (use velero version): 1.14
  • Velero features (use velero client config get features): features:
  • Kubernetes version (use kubectl version): 1.22
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@yuanqijing
Author

/assign

@Lyndon-Li
Contributor

Lyndon-Li commented Jun 14, 2024

I guess the volumeBindingMode for the PVC to be restored is WaitForFirstConsumer; if so, this is the expected behavior.

This deliberately honors the WaitForFirstConsumer behavior: when a PVC is created, the PV won't be provisioned and bound until a pod consumes the PVC and is scheduled. The AnnSelectedNode annotation is added by the scheduler at that time.
Preserving the annotation from the backup only cheats the check here, but it doesn't change the fact that a PV won't be provisioned without a pod.

That is to say, the current data movement restore cannot restore volume data alone (without pods) for volumes in WaitForFirstConsumer mode.
Another design #7481 will cover this.
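The behavior described above can be pictured with a toy model. This is a simplified sketch, not Kubernetes or Velero code: the `pvc` struct and `willProvision` function are hypothetical, and only the two binding mode names come from the Kubernetes StorageClass API.

```go
package main

import "fmt"

// Binding modes as named in the Kubernetes StorageClass API.
const (
	Immediate            = "Immediate"
	WaitForFirstConsumer = "WaitForFirstConsumer"
)

// pvc is a simplified stand-in for a PersistentVolumeClaim
// (hypothetical type, not the real API object).
type pvc struct {
	bindingMode  string // volumeBindingMode of the PVC's storage class
	consumerNode string // node of a scheduled pod using the PVC; "" if none
}

// willProvision reports whether a volume would be provisioned for the
// claim in this toy model.
func willProvision(c pvc) bool {
	if c.bindingMode == WaitForFirstConsumer {
		// Provisioning waits until a consuming pod has been scheduled.
		return c.consumerNode != ""
	}
	return true
}

func main() {
	pvcOnly := pvc{bindingMode: WaitForFirstConsumer}
	withPod := pvc{bindingMode: WaitForFirstConsumer, consumerNode: "node1"}
	fmt.Println("PVC restored alone:", willProvision(pvcOnly))
	fmt.Println("PVC restored with a scheduled pod:", willProvision(withPod))
}
```

In this model, a PVC restored on its own stays unprovisioned no matter which annotations it carries, which matches the Pending state reported in this issue.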

@Lyndon-Li
Contributor

@yuanqijing Please share some details about your requirement to restore volume data only (or why you don't want to restore the entire workload, including pods), so that we can prioritize our work on #7481.

@yuanqijing
Author

@Lyndon-Li Understood. The primary question is whether to follow the storage class's WaitForFirstConsumer design. The AnnSelectedNode annotation is a potential point of conflict in Velero's PVC restore logic: the code at csi/pvc_action.go#L156-L160, which clears the AnnSelectedNode annotation, seems to conflict with the logic that lets users select a node, found in change_pvc_node_selector.go#L115-L117.

Upon reviewing the documentation on changing the PVC selected node, it appears that changing the node selector isn't feasible under the current implementation, due to this conflicting logic.

Could we consider revising the design of the change-pvc-selected-node functionality? Specifically, might it be feasible to allow direct manual binding to a selected node if a user has explicitly chosen one, rather than waiting for a pod to consume the PVC?

@Lyndon-Li
Contributor

I don't think it is related to node selection; it is all about the behavior of WaitForFirstConsumer: without a scheduled pod mounting the PVC, the PV won't be provisioned. So we are actually not waiting for the annotation to appear, but for the PV to become ready.

@yuanqijing
Author

yuanqijing commented Jun 14, 2024

I think that if we specify a node for the PVC (through node selection), we don't need to follow the WaitForFirstConsumer policy. We can manually bind a (restored) PV to this PVC, which is exactly what the Velero agent is supposed to do. However, the current Velero logic still waits on the annotation check, so the node selection does not take effect.

@Lyndon-Li
Contributor

I think you are trying to bypass what WaitForFirstConsumer constrains, but keep in mind that this constraint is meaningful. In some non-flat topologies, a pod may not be schedulable to every node. WaitForFirstConsumer guarantees that after a PV is provisioned, the pod that wants to consume it is able to do so. That is, it always waits for the pod to be scheduled; once the pod is assigned to a node, it knows which node the PV should attach to.
Otherwise, if a PV is provisioned and attached ahead of pod scheduling, say on node1, but the pod cannot run on node1, the pod will not be able to consume the PV. This is one of the problems that WaitForFirstConsumer tries to solve.
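The node1 scenario above can be sketched as a small check. This is only a toy model under stated assumptions: `canConsume` and the node names are hypothetical, and real scheduling involves far more than a list of node names.

```go
package main

import "fmt"

// canConsume reports whether a pod that may only run on schedulableNodes
// can use a volume reachable only from pvNode. Hypothetical helper for
// illustration; it does not mirror any Kubernetes API.
func canConsume(pvNode string, schedulableNodes []string) bool {
	for _, n := range schedulableNodes {
		if n == pvNode {
			return true
		}
	}
	return false
}

func main() {
	// Assumption for this example: the pod cannot run on node1.
	podNodes := []string{"node2", "node3"}

	// Volume provisioned on node1 before the pod was scheduled.
	fmt.Println("provisioned ahead on node1:", canConsume("node1", podNodes))

	// Volume provisioned after the pod landed on node2.
	fmt.Println("provisioned after scheduling on node2:", canConsume("node2", podNodes))
}
```

Provisioning ahead of scheduling can strand the volume on a node the pod can never reach; waiting for the first consumer avoids that by construction.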

@yuanqijing
Author

If all restored PVCs must adhere to the WaitForFirstConsumer constraints, does that mean users cannot specify nodes for the PVCs? If so, what is the purpose of the feature described at changing the PVC selected node, which allows users to select a node for a PVC?

@Lyndon-Li
Contributor

This is a good question. Changing the PVC selected node is an old feature from PR #2377.

As you can see from this comment, with newer versions of Kubernetes the fix is not required: the PVC can be re-provisioned if the node indicated by the annotation doesn't exist, and the annotation is also removed.
Moreover, as you can see from this Kubernetes doc, the annotation is an output of the scheduler, not an instruction to the scheduler.

Therefore, I think this RIA plugin is neither required nor behaving correctly in newer versions of Kubernetes.
Let me discuss with the community how to deal with it, and I will follow up if there is any update.

@yuanqijing
Author

Thank you for your response. I appreciate the clarification and am looking forward to the community's resolution on this.

yuanqijing removed their assignment Jun 14, 2024
@Lyndon-Li
Contributor

Lyndon-Li commented Jun 19, 2024

We've decided to deprecate the Changing PVC selected-node feature starting from 1.15 and opened issue #7904 to start the deprecation process.
Opened issue #7903 to document, in 1.14.1, the limitation that it does not work with WaitForFirstConsumer volumes.
