
EBS volumes cannot reattach to PetSet after unexpected detachment #37662

Closed
sam-myers opened this issue Nov 29, 2016 · 8 comments
Assignees
Labels
sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@sam-myers

sam-myers commented Nov 29, 2016

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):

Yes

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):

I am aware of the similar issue #29166, which was fixed by #36616 in v1.4.6. However, I can still reproduce the problem as of v1.4.6.


Is this a BUG REPORT or FEATURE REQUEST? (choose one):

Bug Report

Kubernetes version (use kubectl version):

v1.4.6

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): CoreOS
  • Kernel (e.g. uname -a): 4.7.3-coreos-r2
  • Install tools: kube-aws v0.9.1

What happened:

Periodically, petsets will drop below the number of desired replicas
and be unable to restore themselves.

The petset shows the following error:

Unable to mount volumes for pod "petset-min-repro-0_default(xxx...)": timeout expired waiting for volumes to attach/mount for pod "petset-min-repro-0"/"default". list of unattached/unmounted volumes=[storage]
Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "petset-min-repro-0"/"default". list of unattached/unmounted volumes=[storage]

What you expected to happen:

I expect EBS volumes to reattach to the correct pod automatically
following node failure.

How to reproduce it (as minimally and precisely as possible):

  1. Apply the YAML below to bring up the test PetSet:
# Define storage class first so it can be used later
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: ebs-encrypted-storage

# Launch in AWS
provisioner: kubernetes.io/aws-ebs

# EBS-specific settings
# http://kubernetes.io/docs/user-guide/persistent-volumes/#aws
parameters:
  type: io1
  encrypted: "true"
  zone: us-west-1b
  iopsPerGB: "10"


---

# PetSet boilerplate
apiVersion: apps/v1alpha1
kind: PetSet
metadata:
  name: petset-min-repro
  labels:
    component: test
    role: reproduce

spec:
  serviceName: petset-min-repro
  replicas: 2

  template:
    metadata:
      labels:
        component: test
        role: reproduce
      annotations:
        pod.alpha.kubernetes.io/initialized: "true"

    spec:

      # One container with an image that does nothing
      containers:
      - name: es-data
        image: alpine:latest
        command:
        - tail
        - -f
        - /dev/null

        # Attach persistent storage
        volumeMounts:
        - name: storage
          mountPath: /data

  volumeClaimTemplates:
  - metadata:
      name: storage
      annotations:
        # The storage should use the below defined storage class
        # Use both alpha and beta annotations for compatibility
        # http://blog.kubernetes.io/2016/10/dynamic-provisioning-and-storage-in-kubernetes.html
        volume.alpha.kubernetes.io/storage-class: ebs-encrypted-storage
        volume.beta.kubernetes.io/storage-class: ebs-encrypted-storage

    spec:
      # Volume should only be mountable to one pod at a time
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          # Smallest allowed io1 volume size
          storage: 4Gi

  2. Identify the EBS volume bound to one of the pods:
PVC=$(kubectl describe pvc storage-petset-min-repro-0 | grep Volume | awk '{ print $2 }')
VOLUME_ID=$(kubectl describe pv $PVC | grep VolumeID | awk -F '/' '{ print $NF }')
  3. Detach the EBS volume from the node. One may reasonably ask why this
    would happen at all, but it is the most reliable way I have found to
    reproduce the problem; it has also occurred randomly.
aws ec2 detach-volume --volume-id=$VOLUME_ID
  4. Observe that the pod does not successfully restart and
    remains stuck in state ContainerCreating.
kubectl get pod petset-min-repro-0
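While reproducing, it helps to watch the pod phase and the AWS-side attachment state side by side. A sketch of the loop I use (the `status_line`/`poll_attachment` helper names are mine; `VOLUME_ID` comes from step 2, and a fully detached volume has no `Attachments` entry, so the CLI prints `None` there):

```shell
#!/bin/sh
# Watch the EBS attachment state alongside the pod phase while the
# pod is stuck in ContainerCreating.

status_line() {
  # Pure formatting helper, split out so the loop body stays readable.
  printf 'volume=%s pod=%s\n' "$1" "$2"
}

poll_attachment() {
  vol=$1
  i=0
  while [ "$i" -lt 30 ]; do
    # "attached", "detaching", ... or "None" once fully detached.
    state=$(aws ec2 describe-volumes --volume-ids "$vol" \
      --query 'Volumes[0].Attachments[0].State' --output text)
    phase=$(kubectl get pod petset-min-repro-0 \
      -o jsonpath='{.status.phase}')
    status_line "$state" "$phase"
    sleep 10
    i=$((i + 1))
  done
}

# Run only when VOLUME_ID was exported from the previous step.
if [ -n "${VOLUME_ID:-}" ]; then
  poll_attachment "$VOLUME_ID"
fi
```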

Anything else we need to know:

So far I have been able to work around this issue by terminating the node
that the pod is trying to attach the volume on.
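The workaround can be scripted roughly as follows. This is a sketch, not my exact procedure: the helper names are mine, and it assumes the node's `.spec.providerID` is populated in the AWS format `aws:///us-west-1b/i-0abc123` (it may not be on every cluster/version).

```shell
#!/bin/sh
# Terminate the EC2 instance backing the node that the stuck pod is
# scheduled on, so the volume can attach cleanly elsewhere.

# Assumed providerID format: "aws:///us-west-1b/i-0abc123";
# the last path segment is the EC2 instance ID.
instance_id_from_provider_id() {
  echo "$1" | awk -F '/' '{ print $NF }'
}

terminate_node_for_pod() {
  pod=$1
  node=$(kubectl get pod "$pod" -o jsonpath='{.spec.nodeName}')
  provider_id=$(kubectl get node "$node" -o jsonpath='{.spec.providerID}')
  instance_id=$(instance_id_from_provider_id "$provider_id")
  aws ec2 terminate-instances --instance-ids "$instance_id"
}

# Run only when a pod name is supplied, e.g.:
#   ./terminate-node.sh petset-min-repro-0
if [ -n "${1:-}" ]; then
  terminate_node_for_pod "$1"
fi
```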

@patzeltjonas

I can confirm this issue. From what I have read across the EBS volume issues, it should be fixed by the open PR #37302.

@sam-myers
Author

@patzeltjonas Thanks, that is excellent news!

@jingxu97 jingxu97 added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Nov 30, 2016
@jingxu97
Contributor

The fix #36840 is merged in master. It should be backported to release-1.4 soon. Please let me know if you have any issues after upgrading. Thanks!

@jingxu97 jingxu97 self-assigned this Nov 30, 2016
@sam-myers
Author

I see that #37867 has been merged into the release-1.4 branch. Any idea where I can find out when the next 1.4 release is planned?

@patzeltjonas

patzeltjonas commented Dec 6, 2016

I built a quick-release of the release-1.4 branch yesterday containing the bugfix. I set up a test cluster, but after 6 hours some of the petset volumes got stuck again. There are also other issues reporting that v1.5.0-beta.2, which contains the bugfix, still has volume problems: #37854, #37844.

@sam-myers
Author

Updated to v1.5.0 (and simultaneously migrated from PetSet to StatefulSet). I have experimented with automated tests that bring these pods up and down rapidly, in circumstances very similar to those that would quickly break v1.4.6, and I have yet to see this issue since the update. It certainly appears to be resolved!

As for the linked issues, I have not seen #37844. I can confirm that I do occasionally see the VolumeInUse issue from #37854, but it is a much lower severity for us.

@jingxu97
Contributor

@demotivated, thank you for your update. You mentioned you occasionally see the VolumeInUse issue; could you please share some more details, or some logs from when it happened? Thanks a lot!

@sam-myers
Author

@jingxu97 I have not seen the issue in several days and unfortunately have no logs to share. The sequence typically looks like this:

  1. Pod running on Node 1
  2. Terminate Node 1
  3. Pod attempts to run on Node 2
  4. Pod fails because volume is attached to Node 1
  5. Automated retries...
  6. Pod successfully runs on Node 2
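If it recurs, I'll try to capture the context with something like the sketch below (helper names are mine; the jsonpath fields are standard, and a PV's `awsElasticBlockStore.volumeID` looks like `aws://us-west-1b/vol-0abc123`):

```shell
#!/bin/sh
# Capture the pod's events plus the AWS view of which instance still
# holds the volume during the VolumeInUse retry loop.

# Keep only the "vol-..." suffix of the PV's volumeID.
ebs_volume_id() {
  echo "$1" | awk -F '/' '{ print $NF }'
}

capture_volume_state() {
  pod=$1
  kubectl describe pod "$pod" | sed -n '/Events:/,$p'

  pvc=$(kubectl get pod "$pod" \
    -o jsonpath='{.spec.volumes[0].persistentVolumeClaim.claimName}')
  pv=$(kubectl get pvc "$pvc" -o jsonpath='{.spec.volumeName}')
  vol=$(ebs_volume_id "$(kubectl get pv "$pv" \
    -o jsonpath='{.spec.awsElasticBlockStore.volumeID}')")

  # Which instance AWS believes still holds the volume, and in what
  # state ("attached", "detaching", ...).
  aws ec2 describe-volumes --volume-ids "$vol" \
    --query 'Volumes[0].Attachments[0].[InstanceId,State]' --output text
}

# Run only when a pod name is supplied, e.g.:
#   ./capture-volume-state.sh petset-min-repro-0
if [ -n "${1:-}" ]; then
  capture_volume_state "$1"
fi
```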


3 participants