
EBS volumes cannot reattach to PetSet after unexpected detachment #37662

Closed
sam-myers opened this issue Nov 29, 2016 · 8 comments
Assignees
Labels
sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@sam-myers

sam-myers commented Nov 29, 2016

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):

Yes

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):

I am aware of the similar issue #29166, which was fixed by #36616 in v1.4.6. However, I can still reproduce the problem as of v1.4.6.


Is this a BUG REPORT or FEATURE REQUEST? (choose one):

Bug Report

Kubernetes version (use kubectl version):

v1.4.6

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): CoreOS
  • Kernel (e.g. uname -a): 4.7.3-coreos-r2
  • Install tools: kube-aws v0.9.1

What happened:

Periodically, petsets will drop below the number of desired replicas
and be unable to restore themselves.

The petset shows the following error:

Unable to mount volumes for pod "petset-min-repro-0_default(xxx...)": timeout expired waiting for volumes to attach/mount for pod "petset-min-repro-0"/"default". list of unattached/unmounted volumes=[storage]
Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "petset-min-repro-0"/"default". list of unattached/unmounted volumes=[storage]

What you expected to happen:

I expect EBS volumes to reattach to the correct pod automatically
following node failure.

How to reproduce it (as minimally and precisely as possible):

  1. Apply the YAML below to bring up the test PetSet:
# Define storage class first so it can be used later
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: ebs-encrypted-storage

# Launch in AWS
provisioner: kubernetes.io/aws-ebs

# EBS-specific settings
# http://kubernetes.io/docs/user-guide/persistent-volumes/#aws
parameters:
  type: io1
  encrypted: "true"
  zone: us-west-1b
  iopsPerGB: "10"


---

# PetSet boilerplate
apiVersion: apps/v1alpha1
kind: PetSet
metadata:
  name: petset-min-repro
  labels:
    component: test
    role: reproduce

spec:
  serviceName: petset-min-repro
  replicas: 2

  template:
    metadata:
      labels:
        component: test
        role: reproduce
      annotations:
        pod.alpha.kubernetes.io/initialized: "true"

    spec:

      # One container with an image that does nothing
      containers:
      - name: es-data
        image: alpine:latest
        command:
        - tail
        - -f
        - /dev/null

        # Attach persistent storage
        volumeMounts:
        - name: storage
          mountPath: /data

  volumeClaimTemplates:
  - metadata:
      name: storage
      annotations:
        # The storage should use the below defined storage class
        # Use both alpha and beta annotations for compatibility
        # http://blog.kubernetes.io/2016/10/dynamic-provisioning-and-storage-in-kubernetes.html
        volume.alpha.kubernetes.io/storage-class: ebs-encrypted-storage
        volume.beta.kubernetes.io/storage-class: ebs-encrypted-storage

    spec:
      # Volume should only be mountable to one pod at a time
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          # Smallest allowed io1 volume size
          storage: 4Gi

  2. Identify the EBS volume bound to one of the pods:
PVC=$(kubectl describe pvc storage-petset-min-repro-0 | grep Volume | awk '{ print $2 }')
VOLUME_ID=$(kubectl describe pv $PVC | grep VolumeID | awk -F '/' '{ print $NF }')
  3. Detach the EBS volume from the node. One may reasonably ask why this
    would happen at all, but it is the most reliable way I have found to
    reproduce the problem; it has also occurred randomly.
aws ec2 detach-volume --volume-id=$VOLUME_ID
  4. Observe that the pod does not successfully restart and
    remains stuck in state ContainerCreating.
kubectl get pod petset-min-repro-0
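While reproducing, it helps to watch the pod phase and the AWS-side attachment state side by side. A sketch of the loop I use (the `status_line`/`poll_attachment` helper names are mine; `VOLUME_ID` comes from step 2, and a fully detached volume has no `Attachments` entry, so the CLI prints `None` there):

```shell
#!/bin/sh
# Watch the EBS attachment state alongside the pod phase while the
# pod is stuck in ContainerCreating.

status_line() {
  # Pure formatting helper, split out so the loop body stays readable.
  printf 'volume=%s pod=%s\n' "$1" "$2"
}

poll_attachment() {
  vol=$1
  i=0
  while [ "$i" -lt 30 ]; do
    # "attached", "detaching", ... or "None" once fully detached.
    state=$(aws ec2 describe-volumes --volume-ids "$vol" \
      --query 'Volumes[0].Attachments[0].State' --output text)
    phase=$(kubectl get pod petset-min-repro-0 \
      -o jsonpath='{.status.phase}')
    status_line "$state" "$phase"
    sleep 10
    i=$((i + 1))
  done
}

# Run only when VOLUME_ID was exported from the previous step.
if [ -n "${VOLUME_ID:-}" ]; then
  poll_attachment "$VOLUME_ID"
fi
```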

Anything else we need to know:

So far I have been able to work around this issue by terminating the node
that the pod is trying to attach the volume on.
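The workaround can be scripted roughly as follows. This is a sketch, not my exact procedure: the helper names are mine, and it assumes the node's `.spec.providerID` is populated in the AWS format `aws:///us-west-1b/i-0abc123` (it may not be on every cluster/version).

```shell
#!/bin/sh
# Terminate the EC2 instance backing the node that the stuck pod is
# scheduled on, so the volume can attach cleanly elsewhere.

# Assumed providerID format: "aws:///us-west-1b/i-0abc123";
# the last path segment is the EC2 instance ID.
instance_id_from_provider_id() {
  echo "$1" | awk -F '/' '{ print $NF }'
}

terminate_node_for_pod() {
  pod=$1
  node=$(kubectl get pod "$pod" -o jsonpath='{.spec.nodeName}')
  provider_id=$(kubectl get node "$node" -o jsonpath='{.spec.providerID}')
  instance_id=$(instance_id_from_provider_id "$provider_id")
  aws ec2 terminate-instances --instance-ids "$instance_id"
}

# Run only when a pod name is supplied, e.g.:
#   ./terminate-node.sh petset-min-repro-0
if [ -n "${1:-}" ]; then
  terminate_node_for_pod "$1"
fi
```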

@patzeltjonas

I can confirm this issue. From what I have read across the EBS volume issues, it should be fixed by the open PR #37302.

@sam-myers
Author

@patzeltjonas Thanks, that is excellent news!

@jingxu97 jingxu97 added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Nov 30, 2016
@jingxu97
Contributor

The fix #36840 is merged in master. It should be backported to release-1.4 soon. Please let me know if you have any issues after upgrading. Thanks!

@jingxu97 jingxu97 self-assigned this Nov 30, 2016
@sam-myers
Author

I see that #37867 has been merged into the release-1.4 branch. Any idea where I can find out when the next 1.4 release is planned?

@patzeltjonas

patzeltjonas commented Dec 6, 2016

I built a quick-release of the release-1.4 branch yesterday containing the bugfix. I set up a test cluster, but after 6 hours some of the petset volumes got stuck again. There are also other issues reporting that v1.5.0-beta.2, which contains the bugfix, still has volume problems: #37854, #37844.

@sam-myers
Author

Updated to v1.5.0 (and simultaneously migrated from PetSet to StatefulSet). I have experimented with automated tests that bring these pods up and down rapidly, in circumstances very similar to those that would quickly break v1.4.6, and I have yet to see this issue since the update. It certainly appears to be resolved!

As for the linked issues, I have not seen #37844. I can confirm that I do occasionally see the VolumeInUse issue from #37854, but it is a much lower severity for us.

@jingxu97
Contributor

@demotivated, thank you for your update. You mentioned you occasionally see the VolumeInUse issue; could you please share some more details, or some logs from when it happened? Thanks a lot!

@sam-myers
Author

@jingxu97 I have not seen the issue in several days and unfortunately have no logs to share. The sequence typically looks like this:

  1. Pod running on Node 1
  2. Terminate Node 1
  3. Pod attempts to run on Node 2
  4. Pod fails because volume is attached to Node 1
  5. Automated retries...
  6. Pod successfully runs on Node 2
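If it recurs, I'll try to capture the context with something like the sketch below (helper names are mine; the jsonpath fields are standard, and a PV's `awsElasticBlockStore.volumeID` looks like `aws://us-west-1b/vol-0abc123`):

```shell
#!/bin/sh
# Capture the pod's events plus the AWS view of which instance still
# holds the volume during the VolumeInUse retry loop.

# Keep only the "vol-..." suffix of the PV's volumeID.
ebs_volume_id() {
  echo "$1" | awk -F '/' '{ print $NF }'
}

capture_volume_state() {
  pod=$1
  kubectl describe pod "$pod" | sed -n '/Events:/,$p'

  pvc=$(kubectl get pod "$pod" \
    -o jsonpath='{.spec.volumes[0].persistentVolumeClaim.claimName}')
  pv=$(kubectl get pvc "$pvc" -o jsonpath='{.spec.volumeName}')
  vol=$(ebs_volume_id "$(kubectl get pv "$pv" \
    -o jsonpath='{.spec.awsElasticBlockStore.volumeID}')")

  # Which instance AWS believes still holds the volume, and in what
  # state ("attached", "detaching", ...).
  aws ec2 describe-volumes --volume-ids "$vol" \
    --query 'Volumes[0].Attachments[0].[InstanceId,State]' --output text
}

# Run only when a pod name is supplied, e.g.:
#   ./capture-volume-state.sh petset-min-repro-0
if [ -n "${1:-}" ]; then
  capture_volume_state "$1"
fi
```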


3 participants