Can't delete PV #266

Closed
philipp1992 opened this issue Jul 10, 2020 · 18 comments · Fixed by #403
Assignees
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. v2.0.1 Candidate for v2.0.1

Comments

@philipp1992

Hi

using vanilla Kubernetes 1.18 with vSphere 7.0 and CSI 2.0

Everything works fine, but when I delete a PVC or PV, the following happens:

kubectl describe pv

Warning VolumeFailedDelete 85s csi.vsphere.vmware.com_vsphere-csi-controller-5c4d7b6ffc-sxtxv_90ec2c56-c50c-4acf-9c03-b506726b5800 rpc error: code = Internal desc = failed to delete volume: "9c310d6c-dea0-488c-a3bd-c91d86fc00c2". Error: failed to delete volume: "9c310d6c-dea0-488c-a3bd-c91d86fc00c2", fault: "(*types.LocalizedMethodFault)(0xc00047ebe0)({\n DynamicData: (types.DynamicData) {\n },\n Fault: (types.CnsFault) {\n BaseMethodFault: (types.BaseMethodFault) ,\n Reason: (string) (len=63) "CNS: Failed to delete disk:Fault cause: vim.fault.InvalidState\n"\n },\n LocalizedMessage: (string) (len=79) "CnsFault error: CNS: Failed to delete disk:Fault cause: vim.fault.InvalidState\n"\n})\n", opID: "7ae7c7f7"

@divyenpatel
Member

@philipp1992 you are hitting this issue

https://vsphere-csi-driver.sigs.k8s.io/known_issues.html#issue_5

Issue 5: CSI DeleteVolume can get called before detach.

    Impact: CSI may receive a DeleteVolume call before ControllerUnpublishVolume.
    Upstream issue is tracked at: https://github.com/kubernetes/kubernetes/issues/84226

    Workaround:
        Delete the Pod with force: kubectl delete pods <pod> --grace-period=0 --force
        Find VolumeAttachment for the volume that remained undeleted. Get Node from this VolumeAttachment.
        Manually detach the disk from the Node VM.
        Edit this VolumeAttachment and remove the finalizer. It will get deleted.
        Use govc to manually delete the FCD.
        Edit Pending PV and remove the finalizer. It will get deleted.
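Roughly, the first few steps above translate to the following (object names like <pv-name> and <attachment-name> are placeholders; adjust them for your cluster):

    # Force-delete the Pod that is still using the volume.
    kubectl delete pods <pod> --grace-period=0 --force

    # Find the VolumeAttachment for the volume that remained undeleted and note its node.
    kubectl get volumeattachments | grep <pv-name>

    # After manually detaching the disk from that node VM in vSphere,
    # remove the finalizer so the VolumeAttachment gets deleted.
    kubectl patch volumeattachment <attachment-name> --type=merge -p '{"metadata":{"finalizers":null}}'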

@divyenpatel
Member

divyenpatel commented Jul 10, 2020

The issue is fixed in the external-provisioner. If this is a test environment, can you upgrade the provisioner image to quay.io/k8scsi/csi-provisioner:v2.0.0-rc2 and confirm?
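For example, assuming a default install where the provisioner runs as a sidecar named csi-provisioner in the vsphere-csi-controller Deployment in kube-system (names may differ in your setup), something like:

    kubectl -n kube-system set image deployment/vsphere-csi-controller csi-provisioner=quay.io/k8scsi/csi-provisioner:v2.0.0-rc2
    kubectl -n kube-system rollout status deployment/vsphere-csi-controller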

@misterikkit

@divyenpatel I think I hit the same thing when testing the csi driver v2.0.0 with csi-provisioner v1.6.0. But I don't fully understand the fix.

Looking deeper into the logs, it is true that csi-provisioner calls Delete before csi-attacher calls ControllerUnpublishVolume, but both operations are retried until they succeed. The vim.fault.InvalidState error seems to indicate, "you may not delete a volume while it is attached." But I don't understand why the detach requests continue to fail.

When I examined vsphere-csi-controller logs, it seems like detach fails with this error:

{"level":"info","time":"2020-08-13T22:32:55.844382723Z","caller":"vanilla/controller.go:538","msg":"ControllerUnpublishVolume: called with args {VolumeId:f2adab51-9381-43df-ba18-4f27b9865b15 NodeId:ci-misterikkit-08-11-2020-86c96b9fb4-jn7zl Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"9c4fe752-e943-4a65-b928-4c508ded2569"}
{"level":"error","time":"2020-08-13T22:32:55.883404964Z","caller":"vanilla/controller.go:562","msg":"volumeID \"f2adab51-9381-43df-ba18-4f27b9865b15\" not found in QueryVolume","TraceId":"9c4fe752-e943-4a65-b928-4c508ded2569","stacktrace":"sigs.k8s.io/vsphere-csi-driver/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:562\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler.func1\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5200\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).controllerUnpublishVolume\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:141\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:88\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer.func1\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:178\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:218\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:177\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi.(*StoragePlugin).injectContext\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware.go:231\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:106\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5202\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1024\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1313\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:722"}

@divyenpatel
Member

@misterikkit this is because of a backend vSphere API bug.

In this issue, when the Delete API is called, it first untags the container volume and then attempts to delete the volume; since the volume is still attached to the VM, the delete fails, and the API does not tag the volume back as a container volume. During detach, we first query vCenter to determine whether the volume is block or file, and if it is a file volume we skip the detach. Since the volume has already been untagged as a container volume, vSphere does not return it, so ControllerUnpublishVolume fails with "volume not found".

This issue is fixed in the upcoming vSphere 7.0u1 release.

For prior vSphere releases, we recommend customers use the latest version of csi-provisioner, where a fix is already in place to prevent the detach/delete race.
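A quick way to check which csi-provisioner image a cluster is currently running (again assuming a default vanilla deployment in kube-system):

    kubectl -n kube-system get deployment vsphere-csi-controller -o jsonpath='{.spec.template.spec.containers[?(@.name=="csi-provisioner")].image}'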

@longwa

longwa commented Aug 18, 2020

I'm running TKG 1.1.2 on 6.7U3. Is there any fix for the 1.x version of the CSI driver?

We are running csi-provisioner version 1.4.0 at the moment (whatever is the default for TKG).

@misterikkit

misterikkit commented Aug 19, 2020 via email

@longwa

longwa commented Aug 20, 2020 via email

@RaunakShah
Contributor

@longwa this situation only happens when there's a race between trying to delete a PVC and a Pod (for example, when deleting a namespace). Can you run the steps @divyenpatel has mentioned in #266 (comment) and see if that helps?

    We are also seeing a recurring attempt to delete something every 5 minutes in the vCenter logs, presumably related to this same issue.

Yes, Kubernetes attempts to delete the backend volume in a loop until it succeeds.

@longwa

longwa commented Aug 20, 2020

@RaunakShah So the workaround steps are for deleting a Pod without the volume getting into this state? I don't see any VolumeAttachments for the failed PV in this case, so I'm assuming the workaround isn't useful once you have already hit the error?

Also, what is causing the retry, and how can I stop it? It will never succeed from what I can tell, and it is spamming our event logs with failures every 5 minutes. I'll have to see about using govc to delete the FCD. I don't have any visibility into the IVD volumes in the CNS UI or anywhere else, so I'm really pretty blind about what's going on here.

@misterikkit

@longwa Kubernetes is designed to retry operations it sees as temporary failures. The PV object is still present with a deletionTimestamp, so k8s keeps trying to delete it. To fully delete the volume, make sure you follow each step in the workaround. I think you still need to do these ones (rough commands follow the list):

  • Manually detach the disk from the Node VM.
  • Use govc to manually delete the FCD.
  • Edit Pending PV and remove the finalizer. It will get deleted.
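A rough sketch of those remaining steps (the FCD ID is the PV's spec.csi.volumeHandle; placeholders and the govc invocation are illustrative, so double-check them against your govc version):

    # Find the FCD ID backing the stuck PV.
    kubectl get pv <pv-name> -o jsonpath='{.spec.csi.volumeHandle}'

    # After detaching the disk from the node VM, delete the FCD with govc.
    govc disk.rm <fcd-id>

    # Remove the finalizer so the Pending PV object goes away.
    kubectl patch pv <pv-name> --type=merge -p '{"metadata":{"finalizers":null}}'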

@SandeepPissay SandeepPissay added the v2.0.1 Candidate for v2.0.1 label Sep 17, 2020
@SandeepPissay
Contributor

/assign @divyenpatel

@jingxu97

We still hit this issue. csi-provisioner is at version 2.2.0, which should already have the fix. I am not sure whether some other logic might cause the container volume to be untagged too?

The issue only happens on 6.7u3. As @divyenpatel mentioned, vCenter 7.0.1 fixed this issue. But since csi-provisioner checks that the volume is still attached and should not issue a delete (verified in the logs), what caused this to happen?

@jingxu97

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Oct 13, 2021
@k8s-ci-robot
Contributor

@jingxu97: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 11, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 10, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
