Can't delete PV #266

Closed
philipp1992 opened this issue Jul 10, 2020 · 18 comments · Fixed by #403
Assignees
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. v2.0.1 Candidate for v2.0.1

Comments

@philipp1992

Hi

using vanilla Kubernetes 1.18 with vSphere 7.0 and CSI 2.0

Everything works fine, but when I delete a PVC or PV, the following happens:

kubectl describe pv

Warning VolumeFailedDelete 85s csi.vsphere.vmware.com_vsphere-csi-controller-5c4d7b6ffc-sxtxv_90ec2c56-c50c-4acf-9c03-b506726b5800 rpc error: code = Internal desc = failed to delete volume: "9c310d6c-dea0-488c-a3bd-c91d86fc00c2". Error: failed to delete volume: "9c310d6c-dea0-488c-a3bd-c91d86fc00c2", fault: "(*types.LocalizedMethodFault)(0xc00047ebe0)({\n DynamicData: (types.DynamicData) {\n },\n Fault: (types.CnsFault) {\n BaseMethodFault: (types.BaseMethodFault) ,\n Reason: (string) (len=63) "CNS: Failed to delete disk:Fault cause: vim.fault.InvalidState\n"\n },\n LocalizedMessage: (string) (len=79) "CnsFault error: CNS: Failed to delete disk:Fault cause: vim.fault.InvalidState\n"\n})\n", opID: "7ae7c7f7"

@divyenpatel
Member

@philipp1992 you are hitting this issue

https://vsphere-csi-driver.sigs.k8s.io/known_issues.html#issue_5

Issue 5: CSI DeleteVolume can get called before detach.

    Impact: CSI may receive a DeleteVolume call before ControllerUnpublishVolume.
    Upstream issue is tracked at: https://github.com/kubernetes/kubernetes/issues/84226

    Workaround:
        Delete the Pod with force: kubectl delete pods <pod> --grace-period=0 --force
        Find VolumeAttachment for the volume that remained undeleted. Get Node from this VolumeAttachment.
        Manually detach the disk from the Node VM.
        Edit this VolumeAttachment and remove the finalizer. It will get deleted.
        Use govc to manually delete the FCD.
        Edit Pending PV and remove the finalizer. It will get deleted.
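Roughly, the first few steps above translate to the following (object names like <pv-name> and <attachment-name> are placeholders; adjust them for your cluster):

    # Force-delete the Pod that is still using the volume.
    kubectl delete pods <pod> --grace-period=0 --force

    # Find the VolumeAttachment for the volume that remained undeleted and note its node.
    kubectl get volumeattachments | grep <pv-name>

    # After manually detaching the disk from that node VM in vSphere,
    # remove the finalizer so the VolumeAttachment gets deleted.
    kubectl patch volumeattachment <attachment-name> --type=merge -p '{"metadata":{"finalizers":null}}'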

@divyenpatel
Member

divyenpatel commented Jul 10, 2020

The issue is fixed in the external-provisioner. If this is a test environment, can you upgrade the provisioner image to quay.io/k8scsi/csi-provisioner:v2.0.0-rc2 and confirm?
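For example, assuming a default install where the provisioner runs as a sidecar named csi-provisioner in the vsphere-csi-controller Deployment in kube-system (names may differ in your setup), something like:

    kubectl -n kube-system set image deployment/vsphere-csi-controller csi-provisioner=quay.io/k8scsi/csi-provisioner:v2.0.0-rc2
    kubectl -n kube-system rollout status deployment/vsphere-csi-controller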

@misterikkit

@divyenpatel I think I hit the same thing when testing the csi driver v2.0.0 with csi-provisioner v1.6.0. But I don't fully understand the fix.

Looking deeper into the logs, it is true that csi-provisioner calls Delete before csi-attacher calls ControllerUnpublishVolume, but both operations are retried until they succeed. The vim.fault.InvalidState error seems to indicate, "you may not delete a volume while it is attached." But I don't understand why the detach requests continue to fail.

When I examined vsphere-csi-controller logs, it seems like detach fails with this error:

{"level":"info","time":"2020-08-13T22:32:55.844382723Z","caller":"vanilla/controller.go:538","msg":"ControllerUnpublishVolume: called with args {VolumeId:f2adab51-9381-43df-ba18-4f27b9865b15 NodeId:ci-misterikkit-08-11-2020-86c96b9fb4-jn7zl Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"9c4fe752-e943-4a65-b928-4c508ded2569"}
{"level":"error","time":"2020-08-13T22:32:55.883404964Z","caller":"vanilla/controller.go:562","msg":"volumeID \"f2adab51-9381-43df-ba18-4f27b9865b15\" not found in QueryVolume","TraceId":"9c4fe752-e943-4a65-b928-4c508ded2569","stacktrace":"sigs.k8s.io/vsphere-csi-driver/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:562\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler.func1\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5200\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).controllerUnpublishVolume\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:141\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:88\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer.func1\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:178\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:218\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:177\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi.(*StoragePlugin).injectContext\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware.go:231\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:106\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5202\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1024\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1313\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:722"}

@divyenpatel
Member

@misterikkit this is because of a backend vSphere API bug.

In this issue, when the Delete API is called, it first untags the container volume and then attempts to delete the volume; since the volume is still attached to the VM, the delete fails, and the API does not tag the volume back as a container volume. During detach, we first query vCenter to determine whether the volume is block or file, and if it is a file volume we skip the detach. Since the volume has already been untagged as a container volume, vSphere does not return it, so ControllerUnpublishVolume fails with "volume not found".

This issue is fixed in the upcoming vSphere 7.0u1 release.

For prior vSphere releases, we recommend customers use the latest version of csi-provisioner, where a fix is already in place to prevent the detach/delete race.
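A quick way to check which csi-provisioner image a cluster is currently running (again assuming a default vanilla deployment in kube-system):

    kubectl -n kube-system get deployment vsphere-csi-controller -o jsonpath='{.spec.template.spec.containers[?(@.name=="csi-provisioner")].image}'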

@longwa

longwa commented Aug 18, 2020

I'm running TKG 1.1.2 on 6.7U3. Is there any fix for the 1.x version of the CSI driver?

We are running csi-provisioner version 1.4.0 at the moment (whatever is the default for TKG).

@misterikkit

misterikkit commented Aug 19, 2020 via email

@longwa

longwa commented Aug 20, 2020 via email

@RaunakShah
Contributor

@longwa this situation only happens when there's a race between trying to delete a PVC and a Pod (for example, when deleting a namespace). Can you run the steps @divyenpatel has mentioned in #266 (comment) and see if that helps?

    We are also seeing a recurring attempt to delete something every 5 minutes in the vCenter logs, presumably related to this same issue.

Yes, Kubernetes attempts to delete the backend volume in a loop until it succeeds.

@longwa

longwa commented Aug 20, 2020

@RaunakShah So the workaround steps are for deleting a Pod without the volume getting into this state? I don't see any VolumeAttachments for the failed PV in this case, so I'm assuming the workaround isn't useful once you have already hit the error?

Also, what is causing the retry, and how can I stop it? It will never succeed from what I can tell, and it is spamming our event logs with failures every 5 minutes. I'll have to see about using govc to delete the FCD. I don't have any visibility into the IVD volumes in the CNS UI or anywhere else, so I'm really pretty blind about what's going on here.

@misterikkit

@longwa Kubernetes is designed to retry operations it sees as temporary failures. The PV object is still present with a deletionTimestamp, so k8s keeps trying to delete it. To fully delete the volume, make sure you follow each step in the workaround. I think you still need to do these ones (rough commands follow the list):

  • Manually detach the disk from the Node VM.
  • Use govc to manually delete the FCD.
  • Edit Pending PV and remove the finalizer. It will get deleted.
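A rough sketch of those remaining steps (the FCD ID is the PV's spec.csi.volumeHandle; placeholders and the govc invocation are illustrative, so double-check them against your govc version):

    # Find the FCD ID backing the stuck PV.
    kubectl get pv <pv-name> -o jsonpath='{.spec.csi.volumeHandle}'

    # After detaching the disk from the node VM, delete the FCD with govc.
    govc disk.rm <fcd-id>

    # Remove the finalizer so the Pending PV object goes away.
    kubectl patch pv <pv-name> --type=merge -p '{"metadata":{"finalizers":null}}'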

@SandeepPissay SandeepPissay added the v2.0.1 Candidate for v2.0.1 label Sep 17, 2020
@SandeepPissay
Contributor

/assign @divyenpatel

@jingxu97

We still hit this issue. csi-provisioner is at version 2.2.0, which should already have the fix. I am not sure whether some other logic might cause the container volume to be untagged too?

The issue only happens on 6.7u3. As @divyenpatel mentioned, vCenter 7.0.1 fixed this issue. But since csi-provisioner checks that the volume is still attached and should not issue a delete (verified in the logs), what caused this to happen?

@jingxu97

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Oct 13, 2021
@k8s-ci-robot
Contributor

@jingxu97: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 11, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 10, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
