
Persistent Volume with Reclaim Policy Delete gets recreated #2739

Closed
joschi36 opened this issue Jul 21, 2020 · 7 comments
Labels: Needs info (Waiting for information), Volumes (Relating to volume backup and restore)

Comments

@joschi36

joschi36 commented Jul 21, 2020

What steps did you take and what happened:
We took a backup with Velero without any form of volume backup (restic/volumesnapshot).
When restoring, the PersistentVolume got recreated.

What did you expect to happen:
We expected the PersistentVolume manifest to be applied as-is, with the same metadata.

The output of the following commands will help us better understand what's going on:

  • velero restore logs joschi-disaster-test
Getting client for /v1, Kind=PersistentVolume" logSource="pkg/restore/restore.go:746" restore=velero/joschi-disaster-test-2020072117012
Dynamically re-provisioning persistent volume because it doesn't have a snapshot and its reclaim policy is Delete." logSource="pkg/restore/restore.go:933" restore=velero/joschi-disaster-test-2020072117012
Dynamically re-provisioning persistent volume because it doesn't have a snapshot and its reclaim policy is Delete." logSource="pkg/restore/restore.go:933" restore=velero/joschi-disaster-test-2020072117012
Dynamically re-provisioning persistent volume because it doesn't have a snapshot and its reclaim policy is Delete." logSource="pkg/restore/restore.go:933" restore=velero/joschi-disaster-test-2020072117012
Dynamically re-provisioning persistent volume because it doesn't have a snapshot and its reclaim policy is Delete." logSource="pkg/restore/restore.go:933" restore=velero/joschi-disaster-test-2020072117012
Dynamically re-provisioning persistent volume because it doesn't have a snapshot and its reclaim policy is Delete." logSource="pkg/restore/restore.go:933" restore=velero/joschi-disaster-test-2020072117012
Restoring resource 'persistentvolumeclaims' into namespace 'loki'" logSource="pkg/restore/restore.go:702" restore=velero/joschi-disaster-test-2020072117012
Getting client for /v1, Kind=PersistentVolumeClaim" logSource="pkg/restore/restore.go:746" restore=velero/joschi-disaster-test-2020072117012
Executing item action for persistentvolumeclaims" logSource="pkg/restore/restore.go:964" restore=velero/joschi-disaster-test-2020072117012
Executing AddPVFromPVCAction" cmd=/velero logSource="pkg/restore/add_pv_from_pvc_action.go:44" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Adding PV csi-dcb6d759bd as an additional item to restore" cmd=/velero logSource="pkg/restore/add_pv_from_pvc_action.go:66" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Skipping persistentvolumes/csi-dcb6d759bd because it's already been restored." logSource="pkg/restore/restore.go:844" restore=velero/joschi-disaster-test-2020072117012
Executing item action for persistentvolumeclaims" logSource="pkg/restore/restore.go:964" restore=velero/joschi-disaster-test-2020072117012
Executing ChangePVCNodeSelectorAction" cmd=/velero logSource="pkg/restore/change_pvc_node_selector.go:63" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Done executing ChangePVCNodeSelectorAction" cmd=/velero logSource="pkg/restore/change_pvc_node_selector.go:91" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Executing item action for persistentvolumeclaims" logSource="pkg/restore/restore.go:964" restore=velero/joschi-disaster-test-2020072117012
Executing ChangeStorageClassAction" cmd=/velero logSource="pkg/restore/change_storageclass_action.go:63" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Done executing ChangeStorageClassAction" cmd=/velero logSource="pkg/restore/change_storageclass_action.go:74" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Resetting PersistentVolumeClaim loki/storage-loki-0 for dynamic provisioning" logSource="pkg/restore/restore.go:1037" restore=velero/joschi-disaster-test-2020072117012
Attempting to restore PersistentVolumeClaim: storage-loki-0" logSource="pkg/restore/restore.go:1070" restore=velero/joschi-disaster-test-2020072117012
Restoring resource 'persistentvolumeclaims' into namespace 'monitoring'" logSource="pkg/restore/restore.go:702" restore=velero/joschi-disaster-test-2020072117012
Getting client for /v1, Kind=PersistentVolumeClaim" logSource="pkg/restore/restore.go:746" restore=velero/joschi-disaster-test-2020072117012
Executing item action for persistentvolumeclaims" logSource="pkg/restore/restore.go:964" restore=velero/joschi-disaster-test-2020072117012
Executing AddPVFromPVCAction" cmd=/velero logSource="pkg/restore/add_pv_from_pvc_action.go:44" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Adding PV csi-7aa81280b9 as an additional item to restore" cmd=/velero logSource="pkg/restore/add_pv_from_pvc_action.go:66" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Skipping persistentvolumes/csi-7aa81280b9 because it's already been restored." logSource="pkg/restore/restore.go:844" restore=velero/joschi-disaster-test-2020072117012
Executing item action for persistentvolumeclaims" logSource="pkg/restore/restore.go:964" restore=velero/joschi-disaster-test-2020072117012
Executing ChangePVCNodeSelectorAction" cmd=/velero logSource="pkg/restore/change_pvc_node_selector.go:63" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Done executing ChangePVCNodeSelectorAction" cmd=/velero logSource="pkg/restore/change_pvc_node_selector.go:91" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Executing item action for persistentvolumeclaims" logSource="pkg/restore/restore.go:964" restore=velero/joschi-disaster-test-2020072117012
Executing ChangeStorageClassAction" cmd=/velero logSource="pkg/restore/change_storageclass_action.go:63" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Done executing ChangeStorageClassAction" cmd=/velero logSource="pkg/restore/change_storageclass_action.go:74" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Resetting PersistentVolumeClaim monitoring/prometheus-grafana-storage for dynamic provisioning" logSource="pkg/restore/restore.go:1037" restore=velero/joschi-disaster-test-2020072117012
Attempting to restore PersistentVolumeClaim: prometheus-grafana-storage" logSource="pkg/restore/restore.go:1070" restore=velero/joschi-disaster-test-2020072117012
Executing item action for persistentvolumeclaims" logSource="pkg/restore/restore.go:964" restore=velero/joschi-disaster-test-2020072117012
Executing AddPVFromPVCAction" cmd=/velero logSource="pkg/restore/add_pv_from_pvc_action.go:44" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Adding PV csi-602da2215a as an additional item to restore" cmd=/velero logSource="pkg/restore/add_pv_from_pvc_action.go:66" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Skipping persistentvolumes/csi-602da2215a because it's already been restored." logSource="pkg/restore/restore.go:844" restore=velero/joschi-disaster-test-2020072117012
Executing item action for persistentvolumeclaims" logSource="pkg/restore/restore.go:964" restore=velero/joschi-disaster-test-2020072117012
Executing ChangePVCNodeSelectorAction" cmd=/velero logSource="pkg/restore/change_pvc_node_selector.go:63" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Done executing ChangePVCNodeSelectorAction" cmd=/velero logSource="pkg/restore/change_pvc_node_selector.go:91" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Executing item action for persistentvolumeclaims" logSource="pkg/restore/restore.go:964" restore=velero/joschi-disaster-test-2020072117012
Executing ChangeStorageClassAction" cmd=/velero logSource="pkg/restore/change_storageclass_action.go:63" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Done executing ChangeStorageClassAction" cmd=/velero logSource="pkg/restore/change_storageclass_action.go:74" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Resetting PersistentVolumeClaim monitoring/prometheus-k8s-db-prometheus-k8s-0 for dynamic provisioning" logSource="pkg/restore/restore.go:1037" restore=velero/joschi-disaster-test-2020072117012
Attempting to restore PersistentVolumeClaim: prometheus-k8s-db-prometheus-k8s-0" logSource="pkg/restore/restore.go:1070" restore=velero/joschi-disaster-test-2020072117012
Executing item action for persistentvolumeclaims" logSource="pkg/restore/restore.go:964" restore=velero/joschi-disaster-test-2020072117012
Executing AddPVFromPVCAction" cmd=/velero logSource="pkg/restore/add_pv_from_pvc_action.go:44" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Adding PV csi-b53b65bd89 as an additional item to restore" cmd=/velero logSource="pkg/restore/add_pv_from_pvc_action.go:66" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Skipping persistentvolumes/csi-b53b65bd89 because it's already been restored." logSource="pkg/restore/restore.go:844" restore=velero/joschi-disaster-test-2020072117012
Executing item action for persistentvolumeclaims" logSource="pkg/restore/restore.go:964" restore=velero/joschi-disaster-test-2020072117012
Executing ChangePVCNodeSelectorAction" cmd=/velero logSource="pkg/restore/change_pvc_node_selector.go:63" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Done executing ChangePVCNodeSelectorAction" cmd=/velero logSource="pkg/restore/change_pvc_node_selector.go:91" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Executing item action for persistentvolumeclaims" logSource="pkg/restore/restore.go:964" restore=velero/joschi-disaster-test-2020072117012
Executing ChangeStorageClassAction" cmd=/velero logSource="pkg/restore/change_storageclass_action.go:63" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Resetting PersistentVolumeClaim monitoring/prometheus-k8s-db-prometheus-k8s-1 for dynamic provisioning" logSource="pkg/restore/restore.go:1037" restore=velero/joschi-disaster-test-2020072117012
Attempting to restore PersistentVolumeClaim: prometheus-k8s-db-prometheus-k8s-1" logSource="pkg/restore/restore.go:1070" restore=velero/joschi-disaster-test-2020072117012
Done executing ChangeStorageClassAction" cmd=/velero logSource="pkg/restore/change_storageclass_action.go:74" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Restoring resource 'persistentvolumeclaims' into namespace 'robot-shop'" logSource="pkg/restore/restore.go:702" restore=velero/joschi-disaster-test-2020072117012
Getting client for /v1, Kind=PersistentVolumeClaim" logSource="pkg/restore/restore.go:746" restore=velero/joschi-disaster-test-2020072117012
Executing item action for persistentvolumeclaims" logSource="pkg/restore/restore.go:964" restore=velero/joschi-disaster-test-2020072117012
Executing AddPVFromPVCAction" cmd=/velero logSource="pkg/restore/add_pv_from_pvc_action.go:44" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Adding PV csi-7b017a83e6 as an additional item to restore" cmd=/velero logSource="pkg/restore/add_pv_from_pvc_action.go:66" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Skipping persistentvolumes/csi-7b017a83e6 because it's already been restored." logSource="pkg/restore/restore.go:844" restore=velero/joschi-disaster-test-2020072117012
Executing item action for persistentvolumeclaims" logSource="pkg/restore/restore.go:964" restore=velero/joschi-disaster-test-2020072117012
Executing ChangePVCNodeSelectorAction" cmd=/velero logSource="pkg/restore/change_pvc_node_selector.go:63" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Done executing ChangePVCNodeSelectorAction" cmd=/velero logSource="pkg/restore/change_pvc_node_selector.go:91" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Executing item action for persistentvolumeclaims" logSource="pkg/restore/restore.go:964" restore=velero/joschi-disaster-test-2020072117012
Executing ChangeStorageClassAction" cmd=/velero logSource="pkg/restore/change_storageclass_action.go:63" pluginName=velero restore=velero/joschi-disaster-test-2020072117012
Resetting PersistentVolumeClaim robot-shop/data-redis-0 for dynamic provisioning" logSource="pkg/restore/restore.go:1037" restore=velero/joschi-disaster-test-2020072117012
Attempting to restore PersistentVolumeClaim: data-redis-0" logSource="pkg/restore/restore.go:1070" restore=velero/joschi-disaster-test-2020072117012
Done executing ChangeStorageClassAction" cmd=/velero logSource="pkg/restore/change_storageclass_action.go:74" pluginName=velero restore=velero/joschi-disaster-test-2020072117012

Anything else you would like to add:
We can't use the Reclaim Policy Retain, as our storage integration (Dell EMC CSI Isilon) would then not delete the volumes on the storage system.
My idea would be to define a flag that makes it configurable whether to use the current logic in

switch {
case hasSnapshot(name, ctx.volumeSnapshots):
	oldName := obj.GetName()
	shouldRenamePV, err := shouldRenamePV(ctx, obj, resourceClient)
	if err != nil {
		errs.Add(namespace, err)
		return warnings, errs
	}

	var shouldRestoreSnapshot bool
	if !shouldRenamePV {
		// Check if the PV exists in the cluster before attempting to create
		// a volume from the snapshot, in order to avoid orphaned volumes (GH #609)
		shouldRestoreSnapshot, err = ctx.shouldRestore(name, resourceClient)
		if err != nil {
			errs.Add(namespace, errors.Wrapf(err, "error waiting on in-cluster persistentvolume %s", name))
			return warnings, errs
		}
	} else {
		// if we're renaming the PV, we're going to give it a new random name,
		// so we can assume it doesn't already exist in the cluster and therefore
		// we should proceed with restoring from snapshot.
		shouldRestoreSnapshot = true
	}

	if shouldRestoreSnapshot {
		// even if we're renaming the PV, obj still has the old name here, because the pvRestorer
		// uses the original name to look up metadata about the snapshot.
		ctx.log.Infof("Restoring persistent volume from snapshot.")
		updatedObj, err := ctx.pvRestorer.executePVAction(obj)
		if err != nil {
			errs.Add(namespace, fmt.Errorf("error executing PVAction for %s: %v", resourceID, err))
			return warnings, errs
		}
		obj = updatedObj
	}

	if shouldRenamePV {
		var pvName string
		if oldName == obj.GetName() {
			// pvRestorer hasn't modified the PV name, we need to rename the PV
			pvName, err = ctx.pvRenamer(oldName)
			if err != nil {
				errs.Add(namespace, errors.Wrapf(err, "error renaming PV"))
				return warnings, errs
			}
		} else {
			// VolumeSnapshotter could have modified the PV name through function `SetVolumeID`,
			pvName = obj.GetName()
		}

		ctx.renamedPVs[oldName] = pvName
		obj.SetName(pvName)

		// add the original PV name as an annotation
		annotations := obj.GetAnnotations()
		if annotations == nil {
			annotations = map[string]string{}
		}
		annotations["velero.io/original-pv-name"] = oldName
		obj.SetAnnotations(annotations)
	}

case hasResticBackup(obj, ctx):
	ctx.log.Infof("Dynamically re-provisioning persistent volume because it has a restic backup to be restored.")
	ctx.pvsToProvision.Insert(name)

	// return early because we don't want to restore the PV itself, we want to dynamically re-provision it.
	return warnings, errs

case hasDeleteReclaimPolicy(obj.Object):
	ctx.log.Infof("Dynamically re-provisioning persistent volume because it doesn't have a snapshot and its reclaim policy is Delete.")
	ctx.pvsToProvision.Insert(name)

	// return early because we don't want to restore the PV itself, we want to dynamically re-provision it.
	return warnings, errs

default:
	ctx.log.Infof("Restoring persistent volume as-is because it doesn't have a snapshot and its reclaim policy is not Delete.")

	// we call the pvRestorer here to clear out the PV's claimRef, so it can be re-claimed
	// when its PVC is restored.
	updatedObj, err := ctx.pvRestorer.executePVAction(obj)
	if err != nil {
		errs.Add(namespace, fmt.Errorf("error executing PVAction for %s: %v", resourceID, err))
		return warnings, errs
	}
	obj = updatedObj
}
or just to create the PersistentVolume manifest as-is, as in the default section:

default:
	ctx.log.Infof("Restoring persistent volume as-is because it doesn't have a snapshot and its reclaim policy is not Delete.")
	// we call the pvRestorer here to clear out the PV's claimRef, so it can be re-claimed
	// when its PVC is restored.
	updatedObj, err := ctx.pvRestorer.executePVAction(obj)
	if err != nil {
		errs.Add(namespace, fmt.Errorf("error executing PVAction for %s: %v", resourceID, err))
		return warnings, errs
	}
	obj = updatedObj
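
For illustration only, a minimal sketch of how the Delete-reclaim-policy case could be gated behind such a flag; the ctx.disableReprovisioning field is purely hypothetical and stands in for whatever option would be plumbed through from the restore spec:

case hasDeleteReclaimPolicy(obj.Object):
	if ctx.disableReprovisioning {
		// Hypothetical option: restore the PV manifest as-is instead of
		// re-provisioning it, mirroring the default case.
		ctx.log.Infof("Restoring persistent volume as-is because re-provisioning is disabled for this restore.")
		updatedObj, err := ctx.pvRestorer.executePVAction(obj)
		if err != nil {
			errs.Add(namespace, fmt.Errorf("error executing PVAction for %s: %v", resourceID, err))
			return warnings, errs
		}
		obj = updatedObj
		break
	}
	ctx.log.Infof("Dynamically re-provisioning persistent volume because it doesn't have a snapshot and its reclaim policy is Delete.")
	ctx.pvsToProvision.Insert(name)
	return warnings, errs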

Environment:

  • Velero version (use velero version): v1.4.0
  • Velero features (use velero client config get features): none
  • Kubernetes version (use kubectl version): v1.16.3-2+f1f3428e8468b6-dirty
  • Kubernetes installer & version: Own installation
  • Cloud provider or hardware configuration: Own implementation in CloudStack
  • OS (e.g. from /etc/os-release): CentOS Linux 7

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@joschi36
Author

joschi36 commented Jul 29, 2020

Maybe @skriss could help here, as he wrote the current logic in

default:
	ctx.log.Infof("Restoring persistent volume as-is because it doesn't have a snapshot and its reclaim policy is not Delete.")
	// we call the pvRestorer here to clear out the PV's claimRef, so it can be re-claimed
	// when its PVC is restored.
	updatedObj, err := ctx.pvRestorer.executePVAction(obj)
	if err != nil {
		errs.Add(namespace, fmt.Errorf("error executing PVAction for %s: %v", resourceID, err))
		return warnings, errs
	}
	obj = updatedObj

But as far as I can see, he isn't a maintainer anymore. @nrb, do you know about the current logic?

@nrb
Contributor

nrb commented Jul 29, 2020

@joschi36 Sorry about the delay in response.

I don't think this use case is one we've encountered before, so I need to learn a little more about it.

You're saying you have a PV manifest with no snapshot and a reclaim policy of Delete... is it actually deleted from Kubernetes at restore time? It appears so, since Velero is recreating it.

Is this volume also deleted from the Isilon storage system, or is the intention to re-associate with an existing volume in storage at restore time?

nrb added the Needs info (Waiting for information) label Jul 29, 2020
@joschi36
Author

joschi36 commented Jul 29, 2020

Background:
Our K8s backup/restore plans are decoupled from our volume backup/restore plans.
The storage system we are using is also completely decoupled from K8s.
Also, we need to use the Reclaim Policy Delete.

Case:
The idea is that when an incident happens (for instance, deletion of all server nodes), we want to recreate the cluster and reapply all manifests, including the PV manifests as they were.

What happens:
But because Velero does not reapply PVs with the Reclaim Policy Delete and instead dynamically recreates PVs from the PVC manifests, we no longer mount the same backend volumes as before.

Expectation:
What we want is the same behavior as with the Reclaim Policy Retain: the PV manifest gets reapplied unchanged instead of being recreated, so the same volume is mounted as before the incident.

Clarifications:
The volumes didn't get deleted on the storage backend; new volumes were simply created instead of the already existing ones being reused.

Solution:
One solution would be a flag to control this behavior instead of relying on the reclaim policy, which would make sense to me. Maybe I'm misunderstanding something about the current behavior or the reasoning behind it.

Thanks for your help @nrb
// updated the layout a bit

@joschi36
Author

Hey @nrb, can you explain the behavior? We would be happy to create a PR that extends the current switch with a flag on velero restore create, something like --disable-reprovisioning.
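
For what it's worth, here is a rough sketch of what wiring up such a flag on the CLI side could look like, assuming a CreateOptions struct along the lines of the existing restore create command; the field and flag names below are only illustrative:

import "github.com/spf13/pflag"

// CreateOptions shows only the hypothetical new field; existing fields are elided.
type CreateOptions struct {
	DisableReprovisioning bool
}

// BindFlags registers the hypothetical flag alongside the existing ones.
func (o *CreateOptions) BindFlags(flags *pflag.FlagSet) {
	flags.BoolVar(&o.DisableReprovisioning, "disable-reprovisioning", false,
		"restore PV manifests as-is instead of dynamically re-provisioning volumes whose reclaim policy is Delete")
}

The option would then have to be carried on the Restore spec and checked in the restore switch above.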

@nrb
Contributor

nrb commented Aug 26, 2020

Sorry for the delay in response.

I think I see the disconnect here - it's these lines:

Our K8s backup/restore plans are decoupled from our volume backup/restore plans.
The storage system we are using is also completely decoupled from K8s.

So in most cases where folks are using Velero, the storage is not so completely decoupled from Kubernetes that Velero should ignore the PV's reclaim policy.

Velero does not re-apply a PV with a reclaim policy of Delete on the assumption that most installations don't want to keep those volumes; otherwise they would have a different reclaim policy, such as Retain. Instead, it completely re-provisions a fresh PV.

The assumption comes from these docs:

For dynamically provisioned PersistentVolumes, the default reclaim policy is "Delete". This means that a dynamically provisioned volume is automatically deleted when a user deletes the corresponding PersistentVolumeClaim. This automatic behavior might be inappropriate if the volume contains precious data. In that case, it is more appropriate to use the "Retain" policy. With the "Retain" policy, if a user deletes a PersistentVolumeClaim, the corresponding PersistentVolume is not deleted. Instead, it is moved to the Released phase, where all of its data can be manually recovered.
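
For context, the branch that triggers the re-provisioning is driven by a simple check of spec.persistentVolumeReclaimPolicy on the unstructured PV object; a minimal sketch (the exact helper in Velero may differ):

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// hasDeleteReclaimPolicy reports whether the unstructured PV's
// spec.persistentVolumeReclaimPolicy is set to "Delete".
func hasDeleteReclaimPolicy(obj map[string]interface{}) bool {
	policy, _, _ := unstructured.NestedString(obj, "spec", "persistentVolumeReclaimPolicy")
	return policy == string(corev1.PersistentVolumeReclaimDelete)
}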

I'll need to think more about PR #2869. I don't know that completely removing this logic makes sense for all cases, but I can see why it does in yours; a flag disabling it could be a good option for systems that don't follow the expected behavior exactly as defined.

@nrb
Contributor

nrb commented Sep 2, 2020

@joschi36 I've closed out the PR, but I want to link the Kubernetes documentation for the Delete policy here.

The Delete policy is intended to remove the volume from the underlying storage system entirely. If the current storage system isn't doing that, then it should be using Retain.

Velero is recreating volumes here because:

  1. It expects that PVs with a Delete policy are completely gone from the storage system on a restore, so there is no valid volumeID for the PV, and when restoring the manifest, it would be in a Lost state most of the time (this seems to be the key difference in your use case).
  2. There is no snapshot, so there is nothing to restore data from.

If you're using Velero only for the Kubernetes manifests and managing storage backup completely separately, then this is a use case we'll need to look into separately rather than completely removing the code, because Velero was intended to manage both in concert.

I think the --disable-reprovisioning flag can make sense here for this use case as a companion to the --disable-snapshots flag, as it would accommodate the de-coupled nature of your setup, which we hadn't really considered before.

@dsu-igeek
Contributor

Closing this; please re-open if there are continuing issues.

carlisia added the Volumes (Relating to volume backup and restore) label Oct 22, 2020