This repository has been archived by the owner on Sep 7, 2022. It is now read-only.

Attaching and deleting volumes seems fuzzy #500

Closed
dgabrysch opened this issue Jul 20, 2018 · 15 comments · Fixed by kubernetes/kubernetes#67825

@dgabrysch

/kind bug

What happened:

Using vSphere as the cloud provider, we created a StorageClass and use PersistentVolumeClaims.

VMDKs get created and attached to machines, but when a pod dies it takes 5-15 minutes to attach the disk to another VM. The Kubernetes API, or the volume plugin, tries to attach a disk while it is still attached to another machine. The same goes for deletion: the volume plugin retries deleting a disk what feels like a million times, which causes errors in the vSphere console.

Is this normal?

[screenshot: errors in the vSphere console]

What you expected to happen:

No errors in vSphere

How to reproduce it (as minimally and precisely as possible):

Kill a pod that uses a volume, or delete a deployment (see the command sketch below).
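
As a minimal sketch, with placeholder pod names:

$ kubectl delete pod <pod-using-pvc>     # the Deployment recreates the pod, possibly on another node
$ kubectl get pods -o wide -w            # watch where the replacement pod lands
$ kubectl describe pod <new-pod>         # look for AttachVolume.Attach failures in Events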

Anything else we need to know?:

Environment:

  • Kubernetes version:
    Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.9", GitCommit:"57729ea3d9a1b75f3fc7bbbadc597ba707d47c8a", GitTreeState:"clean", BuildDate:"2018-06-29T01:14:35Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.9", GitCommit:"57729ea3d9a1b75f3fc7bbbadc597ba707d47c8a", GitTreeState:"clean", BuildDate:"2018-06-29T01:07:01Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration:
vSphere

  • OS:
    NAME="Red Hat Enterprise Linux Server"
    VERSION="7.4 (Maipo)"
    ID="rhel"
    ID_LIKE="fedora"
    VARIANT="Server"
    VARIANT_ID="server"
    VERSION_ID="7.4"
    PRETTY_NAME="Red Hat Enterprise Linux Server 7.4 (Maipo)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:redhat:enterprise_linux:7.4:GA:server"
    HOME_URL="https://www.redhat.com/"
    BUG_REPORT_URL="https://bugzilla.redhat.com/"

    REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
    REDHAT_BUGZILLA_PRODUCT_VERSION=7.4
    REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
    REDHAT_SUPPORT_PRODUCT_VERSION="7.4"

@frapposelli

/sig vmware
/assign @divyenpatel
PTAL

@divyenpatel

@dgabrysch

When a Pod is deleted, we should not see disk deletion attempts in the VC tasks. The disk should be detached and re-attached to another node if the new pod is provisioned on a different node VM.

If the pod is in the Running state and we delete the PVC, then this is expected: Kubernetes will keep retrying the PV deletion, but since the PV is still attached to the node VM and in use, the disk cannot be deleted.
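
To narrow this down, a sketch of what to check on the Kubernetes side (names are placeholders; if kube-controller-manager runs as a pod, use kubectl logs on it instead of grepping a file):

$ kubectl describe pod <pod>             # look at AttachVolume / DetachVolume events
$ kubectl get events -w                  # watch attach/detach activity live
$ grep -E 'AttachVolume|DetachVolume' /var/log/kube-controller-manager.log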

How is the pod provisioned?
Deployment, StatefulSet, DaemonSet or individual pod?

If you can share the YAMLs and the sequence of steps, we can try to reproduce it here to help debug this issue.

@dgabrysch
Author

dgabrysch commented Aug 2, 2018

Hi @divyenpatel,

of course I can :)

Btw, the YAMLs work; I had to remove sensitive data and so on.

So, this is my example pod:

kind: Deployment
apiVersion: apps/v1beta2
metadata:
  labels:
    k8s-app: nfs-server-for-testpods
  name: nfs-server-for-testpods
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: nfs-server-for-testpods
  template:
    metadata:
      labels:
        k8s-app: nfs-server-for-testpods
    spec:
      containers:
      - name: nfs-server-for-testpods
        image: /repo:name:v.0.8
        command: [/nfs-server.sh]
        args: ['/exports/nfs-export']
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
        ports:
        - containerPort: 2049
          protocol: TCP
        volumeMounts:
        - name: nfs-server-volumemount
          mountPath: "/exports/nfs-export"
      volumes:
      - name: nfs-server-volumemount
        persistentVolumeClaim:
          claimName: nfs-server-test-for-testpods

This is my StorageClass (kubectl describe output):

Name:            vmdk-on-vmfs
IsDefaultClass:  No
Annotations:
Provisioner:     kubernetes.io/vsphere-volume
Parameters:      datastore=,diskformat=zeroedthick,fstype=ext3
ReclaimPolicy:   Retain
Events:
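
For reference, the describe output above corresponds roughly to a StorageClass like this (a sketch reconstructed from the fields shown; the datastore value was redacted):

$ kubectl apply -f - <<EOF
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: vmdk-on-vmfs
provisioner: kubernetes.io/vsphere-volume
parameters:
  datastore: ""          # redacted in the output above
  diskformat: zeroedthick
  fstype: ext3
reclaimPolicy: Retain
EOF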

This is the PersistentVolumeClaim:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs-server-test-for-testpods
  namespace: default
  annotations:
    volume.beta.kubernetes.io/storage-class: vmdk-on-vmfs
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi

This creates this PVC:

NAME                           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
nfs-server-test-for-testpods   Bound    pvc-75103492-9662-11e8-8dc8-005056a561b8   3Gi        RWO            vmdk-on-vmfs   19s

Here is a view of the vSphere logs:
[screenshot: vSphere logs]

Here is the PV:

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                   STORAGECLASS   REASON   AGE
pvc-75103492-9662-11e8-8dc8-005056a561b8   3Gi        RWO            Retain           Bound    default/nfs-server-test-for-testpods    vmdk-on-vmfs            3m

So far, so good.

Now I delete the pod with "kubectl delete pod", and get this:
[screenshot]

When I do a "describe pod":

Events:
  Type     Reason                 Age  From                     Message
  ----     ------                 ---  ----                     -------
  Normal   Scheduled              1m   default-scheduler        Successfully assigned nfs-server-for-testpods-57d466bbbc-xqp7m to "the other machine's host name"
  Normal   SuccessfulMountVolume  1m   kubelet, lx-k8s-x03      MountVolume.SetUp succeeded for volume "default-token-p5wc4"
  Warning  FailedMount            1m   attachdetach-controller  AttachVolume.Attach failed for volume "pvc-75103492-9662-11e8-8dc8-005056a561b8" : Failed to add disk scsi0:2.

And then we wait...like forever :)

Kind regards,

David

@divyenpatel

Is this disk already attached to some other VM, which is powered on?

Based on the error message (Failed to add disk scsiX:Y), the error looks like https://kb.vmware.com/s/article/2001217
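
If you want to confirm which node VM still holds the disk, one option is the govc CLI (a sketch, assuming govc is configured against your vCenter; the VM name is a placeholder):

$ govc device.info -vm <old-node-vm>     # lists the devices attached to that VM, including disks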

@dgabrysch
Author

@divyenpatel - sure, but isn’t that the point? A pod breaks and gets rescheduled, and then the disk should be reattached. The bug you mentioned in the KB is happening, but shouldn’t the vSphere storage provider code handle this case?

We checked this with a colleague; it actually looks like the Kubernetes API is making calls to vSphere to reattach the disk while the disk is still in use, because it hasn’t been detached correctly. This is what I mean by fuzzy :)

@nikopen

nikopen commented Aug 21, 2018

@frapposelli @divyenpatel we're also hitting this in prod.

  1. pod X with attached volume X lives in node X
  2. pod X dies and is re-scheduled to another node Y
  3. volume X is stuck on node X and cannot be attached to node Y for 5-15 minutes

What should happen is:

  1. Kubernetes (the vSphere cloud controller?) detects the re-scheduling of a pod whose volume needs to move to another node
  2. talks to vSphere to release the volume from the old node in a timely manner
  3. attaches the volume to the node where the pod is re-scheduled

Is this easily fixable / could you show us the relevant code?
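
For anyone who wants to dig in, the moving parts in the kubernetes/kubernetes tree appear to be the following (a pointer from memory, not verified against this exact release):

pkg/controller/volume/attachdetach     # attach-detach controller (the reconciler seen in controller-manager logs)
pkg/volume/vsphere_volume              # vSphere volume plugin (attacher/detacher)
pkg/cloudprovider/providers/vsphere    # vSphere cloud provider (AttachDisk / DetachDisk)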

@dgabrysch
Author

I don’t know if I mentioned this already, but we also wait around 10-15 minutes until a pod finally gets its disk.

@divyenpatel

@nikopen @dgabrysch Here is my observation.

I created a StorageClass, a PVC, and a Pod using the following YAMLs.

Tasks observed on vCenter:

[screenshot: vCenter tasks showing disk creation and attach]

We can see the disk got created, the pod was scheduled on kubernetes-node3, and the disk is attached to that node VM.

$ kubectl get deployments
NAME              DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
wordpress-mysql   1         1         1            1           8m
$ kubectl get pods
NAME                               READY     STATUS    RESTARTS   AGE
wordpress-mysql-556d6bd778-xktqp   1/1       Running   0          8m

After the Pod was up and running, I deleted it.

$ kubectl delete pod wordpress-mysql-556d6bd778-xktqp
pod "wordpress-mysql-556d6bd778-xktqp" deleted

A new Pod is initiated by the Deployment, and this time it got provisioned on kubernetes-node1.

$ kubectl get pods
NAME                               READY     STATUS              RESTARTS   AGE
wordpress-mysql-556d6bd778-62cnv   0/1       ContainerCreating   0          6s

I see a couple of failed attempts to attach the volume to the kubernetes-node1 VM.

[screenshot: failed attach attempts in vCenter]

This is because the disk is still attached to the node VM kubernetes-node3, and that VM is powered on. As soon as the disk is detached from kubernetes-node3, we can see it successfully attached to the kubernetes-node1 VM.
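
The retries are also visible from the Kubernetes side; a sketch to watch them live (names are placeholders):

$ kubectl get events -w | grep -i volume     # FailedMount repeats until the detach from the old node completes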

Events on the new pod wordpress-mysql-556d6bd778-62cnv:

Events:
  Type     Reason                 Age              From                       Message
  ----     ------                 ----             ----                       -------
  Normal   Scheduled              1m               default-scheduler          Successfully assigned wordpress-mysql-556d6bd778-62cnv to kubernetes-node1
  Normal   SuccessfulMountVolume  1m               kubelet, kubernetes-node1  MountVolume.SetUp succeeded for volume "default-token-2x9xl"
  Warning  FailedMount            1m (x2 over 1m)  attachdetach-controller    AttachVolume.Attach failed for volume "pvc-d81997fe-a664-11e8-8448-00505691bc49" : Failed to add disk 'scsi1:0'.
  Normal   SuccessfulMountVolume  1m               kubelet, kubernetes-node1  MountVolume.SetUp succeeded for volume "pvc-d81997fe-a664-11e8-8448-00505691bc49"
  Normal   Pulling                1m               kubelet, kubernetes-node1  pulling image "mysql:5.6"
  Normal   Pulled                 1m               kubelet, kubernetes-node1  Successfully pulled image "mysql:5.6"
  Normal   Created                1m               kubelet, kubernetes-node1  Created container
  Normal   Started                1m               kubelet, kubernetes-node1  Started container

I am observing a few seconds of delay, not 10 to 15 minutes.
I tested the above scenario with version v1.9.11-beta.0.13+089473511b6775.

@nikopen

nikopen commented Aug 23, 2018

Thank you for the detailed repro attempt, @divyenpatel!

Would you say that the delay we encounter is because the initial VM is generally unresponsive and therefore unable to release the VMDK? And that the cloud controller code already handles volume movements the way other cloud controllers do?

@dgabrysch can you also attempt to repro the issue?

@nikopen

nikopen commented Aug 23, 2018

@divyenpatel @dgabrysch What I tried:

For any normal pod with a volume attached:

  1. cordon the node which the pod lives in
  2. do a sudo docker kill of the container itself within the node
  3. at the same time, do a kubectl delete po/[pod] --force --grace-period=0

This should mirror a hard eviction event by force-killing the container and rescheduling the pod to another node.
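
In commands, the test is roughly this sketch (node, pod, and container names are placeholders):

$ kubectl cordon <node-x>
$ ssh <node-x> 'sudo docker kill <container-id>'
$ kubectl delete po/<pod> --force --grace-period=0
$ kubectl get pods -o wide -w     # watch the pod get rescheduled to another node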

The events from kube-controller-manager about 20 seconds after the eviction:

I0823 09:38:34.870346       6 reconciler.go:231] attacherDetacher.DetachVolume started for volume "pvc-80ab51f0-3354-11e8-932b-005056b771b4"
I0823 09:38:34.875144       6 operation_generator.go:1165] Verified volume is safe to detach for volume "pvc-80ab51f0-3354-11e8-932b-005056b771b4"
I0823 09:38:35.228887       6 operation_generator.go:387] DetachVolume.Detach succeeded for volume "pvc-80ab51f0-3354-11e8-932b-005056b771b4"

I0823 09:38:54.228531       6 reconciler.go:287] attacherDetacher.AttachVolume started for volume "pvc-80ab51f0-3354-11e8-932b-005056b771b4"
I0823 09:38:55.491572       6 operation_generator.go:309] AttachVolume.Attach succeeded for volume "pvc-80ab51f0-3354-11e8-932b-005056b771b4"

It essentially took around 1 minute to fully reschedule, with the initial node being responsive (just cordoned).

@nikopen

nikopen commented Aug 23, 2018

However, the bug is the following:

  1. cordon the node which the pod is in
  2. kubectl delete po/[pod] --force --grace-period=0
  3. the pod is immediately rescheduled to a new node; grab the new node name from kubectl describe [pod] and try to ping it or SSH into it
  4. you can see that ping/SSH fail to reach the new node, and kubectl get node shows it as 'NotReady'

That node will 'unfreeze' only when the volume becomes available and is attached to it. The rescheduled pod might get stuck in a 'scheduling -> eviction -> scheduling' loop while the new node is NotReady/frozen. It depends on how much load the cluster has, but even in an idle cluster it causes at least 1 minute of freezing.
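
A quick sketch to observe the freeze (placeholder names):

$ kubectl describe po/<new-pod> | grep -i node     # find the new node
$ ping <new-node>                                  # times out while the node is frozen
$ kubectl get node <new-node> -w                   # stays NotReady until the volume attaches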

This is the bug we're hitting. Can you confirm it @divyenpatel ?

@bradbeam

Is #502 related/dupe?

@nikopen

nikopen commented Aug 23, 2018

seems like it :)

@nikopen

nikopen commented Aug 30, 2018

Fixed by kubernetes#67825; this can be closed.

@redbaron

redbaron commented Oct 9, 2018

@divyenpatel, in #500 (comment) you showed that vSphere quickly gives up on reattaching the disk; the whole operation takes just 1 second. In our case (and others'), the operation was taking 25 seconds before failing, and during that time the original VM is frozen. In some cases that is long enough for the kernel to start taking CPUs offline, which triggers a VM restart by vSphere.

Although the main bug has been found and fixed, there is clearly a difference between your vSphere setup and the rest of ours, which probably let the bug slip through in the first place. Do you have any information on why, in your case, the attach operation fails fast without causing collateral damage?
