This repository has been archived by the owner on Sep 7, 2022. It is now read-only.

Attaching and deleting volumes seems fuzzy #500

Closed
dgabrysch opened this issue Jul 20, 2018 · 15 comments · Fixed by kubernetes/kubernetes#67825

@dgabrysch

/kind bug

What happened:

Using vSphere as the cloud provider, we created a StorageClass and use PersistentVolumeClaims.

VMDKs get created and attached to machines, but when a pod dies it takes 5-15 minutes to attach the disk to another VM. The Kubernetes API, or the volume plugin, tries to attach a disk while it is still attached to another machine. The same goes for deletion: the volume plugin retries deleting a disk what feels like a million times, which causes errors in the vSphere console.

Is this normal?

[screenshot: errors in the vSphere console]

What you expected to happen:

No errors in vSphere

How to reproduce it (as minimally and precisely as possible):

Kill a pod that uses a volume, or delete a deployment (see the command sketch below).
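
As a minimal sketch, with placeholder pod names:

$ kubectl delete pod <pod-using-pvc>     # the Deployment recreates the pod, possibly on another node
$ kubectl get pods -o wide -w            # watch where the replacement pod lands
$ kubectl describe pod <new-pod>         # look for AttachVolume.Attach failures in Events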

Anything else we need to know?:

Environment:

  • Kubernetes version:
    Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.9", GitCommit:"57729ea3d9a1b75f3fc7bbbadc597ba707d47c8a", GitTreeState:"clean", BuildDate:"2018-06-29T01:14:35Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.9", GitCommit:"57729ea3d9a1b75f3fc7bbbadc597ba707d47c8a", GitTreeState:"clean", BuildDate:"2018-06-29T01:07:01Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration:
vSphere

  • OS:
    NAME="Red Hat Enterprise Linux Server"
    VERSION="7.4 (Maipo)"
    ID="rhel"
    ID_LIKE="fedora"
    VARIANT="Server"
    VARIANT_ID="server"
    VERSION_ID="7.4"
    PRETTY_NAME="Red Hat Enterprise Linux Server 7.4 (Maipo)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:redhat:enterprise_linux:7.4:GA:server"
    HOME_URL="https://www.redhat.com/"
    BUG_REPORT_URL="https://bugzilla.redhat.com/"

    REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
    REDHAT_BUGZILLA_PRODUCT_VERSION=7.4
    REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
    REDHAT_SUPPORT_PRODUCT_VERSION="7.4"

@frapposelli

/sig vmware
/assign @divyenpatel
PTAL

@divyenpatel

@dgabrysch

When a Pod is deleted, we should not see disk deletion attempts in the VC tasks. The disk should be detached and re-attached to another node if the new pod is provisioned on a different node VM.

If the pod is in the Running state and we delete the PVC, then this is expected: Kubernetes will keep retrying the PV deletion, but since the PV is still attached to the node VM and in use, the disk cannot be deleted.
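
To narrow this down, a sketch of what to check on the Kubernetes side (names are placeholders; if kube-controller-manager runs as a pod, use kubectl logs on it instead of grepping a file):

$ kubectl describe pod <pod>             # look at AttachVolume / DetachVolume events
$ kubectl get events -w                  # watch attach/detach activity live
$ grep -E 'AttachVolume|DetachVolume' /var/log/kube-controller-manager.log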

How is the pod provisioned?
Deployment, StatefulSet, DaemonSet or individual pod?

If you can share the YAMLs and the sequence of steps, we can try to reproduce it here to help debug this issue.

@dgabrysch
Author

dgabrysch commented Aug 2, 2018

Hi @divyenpatel,

of course I can :)

Btw, the YAMLs work; I had to remove sensitive data and so on.

So, this is my example pod:

kind: Deployment
apiVersion: apps/v1beta2
metadata:
  labels:
    k8s-app: nfs-server-for-testpods
  name: nfs-server-for-testpods
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: nfs-server-for-testpods
  template:
    metadata:
      labels:
        k8s-app: nfs-server-for-testpods
    spec:
      containers:
      - name: nfs-server-for-testpods
        image: /repo:name:v.0.8
        command: [/nfs-server.sh]
        args: ['/exports/nfs-export']
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
        ports:
        - containerPort: 2049
          protocol: TCP
        volumeMounts:
        - name: nfs-server-volumemount
          mountPath: "/exports/nfs-export"
      volumes:
      - name: nfs-server-volumemount
        persistentVolumeClaim:
          claimName: nfs-server-test-for-testpods

This is my StorageClass (kubectl describe output):

Name:            vmdk-on-vmfs
IsDefaultClass:  No
Annotations:
Provisioner:     kubernetes.io/vsphere-volume
Parameters:      datastore=,diskformat=zeroedthick,fstype=ext3
ReclaimPolicy:   Retain
Events:
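
For reference, the describe output above corresponds roughly to a StorageClass like this (a sketch reconstructed from the fields shown; the datastore value was redacted):

$ kubectl apply -f - <<EOF
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: vmdk-on-vmfs
provisioner: kubernetes.io/vsphere-volume
parameters:
  datastore: ""          # redacted in the output above
  diskformat: zeroedthick
  fstype: ext3
reclaimPolicy: Retain
EOF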

This is the PersistentVolumeClaim:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs-server-test-for-testpods
  namespace: default
  annotations:
    volume.beta.kubernetes.io/storage-class: vmdk-on-vmfs
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi

This creates this PVC:

NAME                           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
nfs-server-test-for-testpods   Bound    pvc-75103492-9662-11e8-8dc8-005056a561b8   3Gi        RWO            vmdk-on-vmfs   19s

Here is a view of the vSphere logs:
[screenshot: vSphere logs]

Here is the PV:

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                   STORAGECLASS   REASON   AGE
pvc-75103492-9662-11e8-8dc8-005056a561b8   3Gi        RWO            Retain           Bound    default/nfs-server-test-for-testpods    vmdk-on-vmfs            3m

So far, so good.

Now I delete the pod with "kubectl delete pod", and get this:
[screenshot]

When I do a "describe pod":

Events:
  Type     Reason                 Age  From                     Message
  ----     ------                 ---  ----                     -------
  Normal   Scheduled              1m   default-scheduler        Successfully assigned nfs-server-for-testpods-57d466bbbc-xqp7m to "the other machine's host name"
  Normal   SuccessfulMountVolume  1m   kubelet, lx-k8s-x03      MountVolume.SetUp succeeded for volume "default-token-p5wc4"
  Warning  FailedMount            1m   attachdetach-controller  AttachVolume.Attach failed for volume "pvc-75103492-9662-11e8-8dc8-005056a561b8" : Failed to add disk scsi0:2.

And then we wait...like forever :)

Kind regards,

David

@divyenpatel

Is this disk already attached to some other VM, which is powered on?

Based on the error message (Failed to add disk scsiX:Y), the error looks like https://kb.vmware.com/s/article/2001217
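
If you want to confirm which node VM still holds the disk, one option is the govc CLI (a sketch, assuming govc is configured against your vCenter; the VM name is a placeholder):

$ govc device.info -vm <old-node-vm>     # lists the devices attached to that VM, including disks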

@dgabrysch
Author

@divyenpatel - sure, but isn’t that the point? A pod breaks and gets rescheduled, and then the disk should be reattached. The bug you mentioned in the KB is happening, but shouldn’t the vSphere storage provider code handle this case?

We checked this with a colleague; it actually looks like the Kubernetes API is making calls to vSphere to reattach the disk while the disk is still in use, because it hasn’t been detached correctly. This is what I mean by fuzzy :)

@nikopen

nikopen commented Aug 21, 2018

@frapposelli @divyenpatel we're also hitting this in prod.

  1. pod X with attached volume X lives in node X
  2. pod X dies and is re-scheduled to another node Y
  3. volume X is stuck on node X and cannot be attached to node Y for 5-15 minutes

What should happen is:

  1. Kubernetes (the vSphere cloud controller?) detects the re-scheduling of a pod whose volume needs to move to another node
  2. talks to vSphere to release the volume from the old node in a timely manner
  3. attaches the volume to the node where the pod is re-scheduled

Is this easily fixable / could you show us the relevant code?
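
For anyone who wants to dig in, the moving parts in the kubernetes/kubernetes tree appear to be the following (a pointer from memory, not verified against this exact release):

pkg/controller/volume/attachdetach     # attach-detach controller (the reconciler seen in controller-manager logs)
pkg/volume/vsphere_volume              # vSphere volume plugin (attacher/detacher)
pkg/cloudprovider/providers/vsphere    # vSphere cloud provider (AttachDisk / DetachDisk)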

@dgabrysch
Author

I don’t know if I mentioned this already, but we also wait around 10-15 minutes until a pod finally gets its disk.

@divyenpatel

@nikopen @dgabrysch Here is my observation.

I created a StorageClass, a PVC, and a Pod using the following YAMLs.

Tasks observed on vCenter:

[screenshot: vCenter tasks showing disk creation and attach]

We can see the disk got created, the pod was scheduled on kubernetes-node3, and the disk is attached to that node VM.

$ kubectl get deployments
NAME              DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
wordpress-mysql   1         1         1            1           8m
$ kubectl get pods
NAME                               READY     STATUS    RESTARTS   AGE
wordpress-mysql-556d6bd778-xktqp   1/1       Running   0          8m

After the Pod was up and running, I deleted it.

$ kubectl delete pod wordpress-mysql-556d6bd778-xktqp
pod "wordpress-mysql-556d6bd778-xktqp" deleted

A new Pod is initiated by the Deployment, and this time it got provisioned on kubernetes-node1.

$ kubectl get pods
NAME                               READY     STATUS              RESTARTS   AGE
wordpress-mysql-556d6bd778-62cnv   0/1       ContainerCreating   0          6s

I see a couple of failed attempts to attach the volume to the kubernetes-node1 VM.

[screenshot: failed attach attempts in vCenter]

This is because the disk is still attached to the node VM kubernetes-node3, and that VM is powered on. As soon as the disk is detached from kubernetes-node3, we can see it successfully attached to the kubernetes-node1 VM.
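
The retries are also visible from the Kubernetes side; a sketch to watch them live (names are placeholders):

$ kubectl get events -w | grep -i volume     # FailedMount repeats until the detach from the old node completes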

Events on the new pod wordpress-mysql-556d6bd778-62cnv:

Events:
  Type     Reason                 Age              From                       Message
  ----     ------                 ----             ----                       -------
  Normal   Scheduled              1m               default-scheduler          Successfully assigned wordpress-mysql-556d6bd778-62cnv to kubernetes-node1
  Normal   SuccessfulMountVolume  1m               kubelet, kubernetes-node1  MountVolume.SetUp succeeded for volume "default-token-2x9xl"
  Warning  FailedMount            1m (x2 over 1m)  attachdetach-controller    AttachVolume.Attach failed for volume "pvc-d81997fe-a664-11e8-8448-00505691bc49" : Failed to add disk 'scsi1:0'.
  Normal   SuccessfulMountVolume  1m               kubelet, kubernetes-node1  MountVolume.SetUp succeeded for volume "pvc-d81997fe-a664-11e8-8448-00505691bc49"
  Normal   Pulling                1m               kubelet, kubernetes-node1  pulling image "mysql:5.6"
  Normal   Pulled                 1m               kubelet, kubernetes-node1  Successfully pulled image "mysql:5.6"
  Normal   Created                1m               kubelet, kubernetes-node1  Created container
  Normal   Started                1m               kubelet, kubernetes-node1  Started container

I am observing a few seconds of delay, not 10 to 15 minutes.
I tested the above scenario with version v1.9.11-beta.0.13+089473511b6775.

@nikopen

nikopen commented Aug 23, 2018

Thank you for the detailed repro attempt, @divyenpatel!

Would you say that the delay we encounter is because the initial VM is generally unresponsive and therefore unable to release the VMDK? And that the cloud controller code already handles volume movements the way other cloud controllers do?

@dgabrysch can you also attempt to repro the issue?

@nikopen

nikopen commented Aug 23, 2018

@divyenpatel @dgabrysch What I tried:

For any normal pod with a volume attached:

  1. cordon the node which the pod lives in
  2. do a sudo docker kill of the container itself within the node
  3. at the same time, do a kubectl delete po/[pod] --force --grace-period=0

This should mirror a hard eviction event by force-killing the container and rescheduling the pod to another node.
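
In commands, the test is roughly this sketch (node, pod, and container names are placeholders):

$ kubectl cordon <node-x>
$ ssh <node-x> 'sudo docker kill <container-id>'
$ kubectl delete po/<pod> --force --grace-period=0
$ kubectl get pods -o wide -w     # watch the pod get rescheduled to another node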

The events from kube-controller-manager about 20 seconds after the eviction:

I0823 09:38:34.870346       6 reconciler.go:231] attacherDetacher.DetachVolume started for volume "pvc-80ab51f0-3354-11e8-932b-005056b771b4"
I0823 09:38:34.875144       6 operation_generator.go:1165] Verified volume is safe to detach for volume "pvc-80ab51f0-3354-11e8-932b-005056b771b4"
I0823 09:38:35.228887       6 operation_generator.go:387] DetachVolume.Detach succeeded for volume "pvc-80ab51f0-3354-11e8-932b-005056b771b4"

I0823 09:38:54.228531       6 reconciler.go:287] attacherDetacher.AttachVolume started for volume "pvc-80ab51f0-3354-11e8-932b-005056b771b4"
I0823 09:38:55.491572       6 operation_generator.go:309] AttachVolume.Attach succeeded for volume "pvc-80ab51f0-3354-11e8-932b-005056b771b4"

It essentially took around 1 minute to fully reschedule, with the initial node being responsive (just cordoned).

@nikopen

nikopen commented Aug 23, 2018

However, the bug is the following:

  1. cordon the node which the pod is in
  2. kubectl delete po/[pod] --force --grace-period=0
  3. the pod is immediately rescheduled to a new node; grab the new node name from kubectl describe [pod] and try to ping it or SSH into it
  4. you can see that ping/SSH fail to reach the new node, and kubectl get node shows it as 'NotReady'

That node will 'unfreeze' only when the volume becomes available and is attached to it. The rescheduled pod might get stuck in a 'scheduling -> eviction -> scheduling' loop while the new node is NotReady/frozen. It depends on how much load the cluster has, but even in an idle cluster it causes at least 1 minute of freezing.
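
A quick sketch to observe the freeze (placeholder names):

$ kubectl describe po/<new-pod> | grep -i node     # find the new node
$ ping <new-node>                                  # times out while the node is frozen
$ kubectl get node <new-node> -w                   # stays NotReady until the volume attaches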

This is the bug we're hitting. Can you confirm it @divyenpatel ?

@bradbeam

Is #502 related/dupe?

@nikopen

nikopen commented Aug 23, 2018

seems like it :)

@nikopen

nikopen commented Aug 30, 2018

Fixed by kubernetes#67825; this can be closed.

@redbaron

redbaron commented Oct 9, 2018

@divyenpatel, in #500 (comment) you showed that vSphere quickly gives up on reattaching the disk; the whole operation takes just 1 second. In our case (and others'), the operation was taking 25 seconds before failing, and during that time the original VM is frozen. In some cases that is long enough for the kernel to start taking CPUs offline, which triggers a VM restart by vSphere.

Although the main bug has been found and fixed, there is clearly a difference between your vSphere setup and the rest of ours, which probably let the bug slip through in the first place. Do you have any information on why, in your case, the attach operation fails fast without causing collateral damage?
