Startup Dependency No CSINode Data #1030

Closed
tsunamishaun opened this issue Aug 20, 2021 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@tsunamishaun

tsunamishaun commented Aug 20, 2021

/kind bug

What happened?
We recently started using Karpenter to provision nodes. Since Karpenter overrides the built-in scheduler, it is able to front-load and start the pods that were pending scheduling first. Other necessary pods, like ebs-csi-node and aws-node, arrive about a minute later. We believe it is during this gap that the controller tries to read the CSINode object for topology properties, which it doesn't have at that moment in time because the startup order is delayed. I don't believe it retries, even though the information is eventually there; looking for some validation of that.

What you expected to happen?
The controller should retry provisioning the volume if CSINode information is not available on the first attempt.

How to reproduce it (as minimally and precisely as possible)?
Remove csiNode properties that belong to a node and launch a pod that needs a volume.
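
A rough sketch of how to check or trigger this window (assuming the default driver name ebs.csi.aws.com, the default pod label app=ebs-csi-node, and the kube-system namespace; <node-name> is a placeholder):

# Check whether the node's CSINode object lists the EBS driver yet;
# during the startup gap described above, the driver is missing here
kubectl get csinode <node-name> -o jsonpath='{.spec.drivers[*].name}'

# One way to approximate the window: delete the ebs-csi-node pod on that
# node so the driver unregisters, then immediately launch a pod that
# mounts an EBS-backed PVC and watch it stall
kubectl -n kube-system delete pod -l app=ebs-csi-node \
  --field-selector spec.nodeName=<node-name>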

Anything else we need to know?:
We limit IMDSv2 access to hostNetworking pods only. As far as we can tell, the driver uses the kube-api instead, and it appears to set the appropriate information on the CSINode object, just not by the time the controller expects it.

Environment

  • Kubernetes version (use kubectl version): 1.20
  • Driver version: 1.2.0
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 20, 2021
@vdhanan
Contributor

vdhanan commented Aug 30, 2021

/assign

@ellistarn

I was able to reproduce this by following

https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html
and
https://karpenter.sh/docs/getting-started/

I'm noticing that my node has

annotations:
  csi.volume.kubernetes.io/nodeid: '{"ebs.csi.aws.com":"i-09dc0a26f4104c2d1"}'
labels:
  topology.ebs.csi.aws.com/zone: us-west-2b

I'm curious why the EBS CSI driver doesn't rely on the standard labels

topology.kubernetes.io/zone: us-west-2b

and standard instance type

providerID: aws:///us-west-2b/i-09dc0a26f4104c2d1

We should be able to avoid this race condition by using those properties of the node, which are set correctly by the kubelet on startup.
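
For reference, a rough way to read those node properties directly (a sketch; <node-name> is a placeholder):

# Standard zone label, set when the node registers
kubectl get node <node-name> -L topology.kubernetes.io/zone

# Provider ID, e.g. aws:///us-west-2b/i-09dc0a26f4104c2d1
kubectl get node <node-name> -o jsonpath='{.spec.providerID}'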

At a minimum, it should probably just retry so the pod doesn't get stuck in ContainerCreating:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-09-14T22:47:35Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-09-14T22:47:35Z"
    message: 'containers with unready status: [app]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-09-14T22:47:35Z"
    message: 'containers with unready status: [app]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-09-14T22:46:30Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: centos
    imageID: ""
    lastState: {}
    name: app
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        reason: ContainerCreating

@wongma7
Contributor

wongma7 commented Sep 15, 2021

Do you have logs from the controller (its ebs-plugin container or csi-provisioner container)? This bugfix, kubernetes-csi/external-provisioner#617, looks kind of suspect, although I'm not sure whether it would fix this issue. Can you try the more recent external-provisioner:v2.2.2, which includes that bugfix, and see if it works? (By replacing https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/charts/aws-ebs-csi-driver/values.yaml#L16.)
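
For anyone who wants to test that bump without a full chart upgrade, a rough sketch (the deployment name ebs-csi-controller, namespace kube-system, and container name csi-provisioner are the chart defaults as far as I know; adjust to your install):

# Swap only the external-provisioner sidecar image to v2.2.2
kubectl -n kube-system set image deployment/ebs-csi-controller \
  csi-provisioner=k8s.gcr.io/sig-storage/csi-provisioner:v2.2.2

# Confirm the rollout picked up the new image
kubectl -n kube-system rollout status deployment/ebs-csi-controller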

The logic for watching CSINode objects and retrying is handled by external-provisioner, so I strongly suspect the issue is that we are bundling a version of external-provisioner with a bug in it, if not the one I linked. The controller side of the driver is otherwise ignorant of Kubernetes CSI API objects like CSINode.

I'm curious why the EBS CSI driver doesn't rely on the standard labels

The topology labeling is an open issue. At the moment it's simplest for all CSI drivers to use their own labels, which predate the v1 standard labels. Otherwise things get complicated when CSI migration is toggled on and off (the translation lib expects to find CSI label A on CSI volumes and standard label B on in-tree volumes) or in driver upgrade/downgrade scenarios (a PV provisioned with driver version A has label A and then can't get scheduled anywhere after I upgrade to a driver version where nodes have standard label B). Ref: #962.

providerID: aws:///us-west-2b/i-09dc0a26f4104c2d1

Our driver uses providerID to populate the csinode nodeid in cases where instance metadata is unavailable. The csi.volume.kubernetes.io/nodeid annotation is deprecated and made redundant by the csinode nodeid field so it can be ignored: https://github.com/kubernetes/community/pull/2034/files#diff-eab491c51f66885ca6fa1f76254d53d01c39e09dca6939ba890cdfdeaac21fe0R106.
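
A quick way to see that relationship on a live node (the jsonpath filter assumes the default driver name ebs.csi.aws.com; <node-name> is a placeholder):

# Instance ID as reported by the cloud provider
kubectl get node <node-name> -o jsonpath='{.spec.providerID}'

# nodeID that the driver registered on the CSINode object
kubectl get csinode <node-name> \
  -o jsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].nodeID}'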

@ellistarn

ellistarn commented Sep 15, 2021

Thanks @wongma7. I've made some progress with v2.2.2, though I'm not sure if it resolves the issue in the original bug report (I haven't tried 1.2.0). I don't have particularly interesting logs, but I may not be looking at the right component -- I'm not deep in this code.

There are two cases:

  1. PV has already had a zone allocated (this works w/ Karpenter)
  2. PV does not have a zone allocated

Case 1:

This appears to be working as long as the pod is scheduled in the same zone as the volume.

Events:
  Type     Reason                  Age                  From                     Message
  ----     ------                  ----                 ----                     -------
  Warning  FailedScheduling        112s (x2 over 112s)  default-scheduler        0/1 nodes are available: 1 node(s) didn't match Pod's node affinity.
  Warning  FailedScheduling        107s                 default-scheduler        0/2 nodes are available: 1 node(s) didn't match Pod's node affinity, 1 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate.
  Warning  NetworkNotReady         50s (x7 over 62s)    kubelet                  network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
  Warning  FailedAttachVolume      43s (x6 over 59s)    attachdetach-controller  AttachVolume.Attach failed for volume "pvc-6b0b786a-9603-4ac9-bb6d-045ab04cae8e" : CSINode ip-192-168-14-212.us-west-2.compute.internal does not contain driver ebs.csi.aws.com
  Normal   SuccessfulAttachVolume  27s                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-6b0b786a-9603-4ac9-bb6d-045ab04cae8e"

Success!

NAME   READY   STATUS    RESTARTS   AGE
app    1/1     Running   0          2m18s

Case 2

No PV is created, and the PVC is stuck in pending:

➜  dynamic-provisioning git:(master) ✗ k describe persistentvolumeclaims ebs-claim
Name:          ebs-claim
Namespace:     default
StorageClass:  ebs-sc
Status:        Pending
Volume:
Labels:        <none>
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       app
Events:
  Type     Reason                Age                   From                         Message
  ----     ------                ----                  ----                         -------
  Warning  ProvisioningFailed    5m12s                 persistentvolume-controller  storageclass.storage.k8s.io "ebs-sc" not found
  Normal   WaitForPodScheduled   5m10s                 persistentvolume-controller  waiting for pod app to be scheduled
  Normal   WaitForFirstConsumer  10s (x20 over 4m55s)  persistentvolume-controller  waiting for first consumer to be created before binding

The pod is stuck pending

Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  7m13s (x2 over 7m13s)  default-scheduler  0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
  Warning  FailedScheduling  7m10s                  default-scheduler  0/1 nodes are available: 1 node(s) didn't match Pod's node affinity.
  Warning  FailedScheduling  7m6s                   default-scheduler  0/2 nodes are available: 1 node(s) didn't match Pod's node affinity, 1 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate.
  Warning  NetworkNotReady   5m55s (x5 over 6m3s)   kubelet            network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
  Warning  FailedMount       108s (x17 over 5m53s)  kubelet            Unable to attach or mount volumes: unmounted volumes=[persistent-storage], unattached volumes=[persistent-storage default-token-pbmk9]: error processing PVC default/ebs-claim: PVC is not bound
  Warning  FailedMount       55s (x6 over 4m34s)    kubelet            Unable to attach or mount volumes: unmounted volumes=[persistent-storage], unattached volumes=[default-token-pbmk9 persistent-storage]: error processing PVC default/ebs-claim: PVC is not bound

@wongma7
Contributor

wongma7 commented Nov 5, 2021

How does a PV get allocated a zone at creation time? Is the difference between case 1 and case 2 the WaitForFirstConsumer volumeBindingMode of the StorageClass?

Does (and should) the scheduler set annSelectedNode on PVCs after their respective pods get scheduled? https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/master/controller/controller.go#L1150-L1159

This is just from reading code, but I think CSI drivers rely on that annotation being set by the scheduler. I don't think sig-storage-lib-external-provisioner has accounted for custom schedulers. Otherwise, the first time the driver (specifically the external-provisioner sidecar running sig-storage-lib-external-provisioner) "sees" the PVC, it decides there is nothing to do with it (since it doesn't know in which zone the PV needs to be created). The PVC is then never updated again, so the driver doesn't receive any update events and the PVC sits idle. Logs from the driver won't show any of this.

@wongma7
Contributor

wongma7 commented Nov 5, 2021

Formatted for clarity, the expected flow across A) the default scheduler, B) the in-tree PV controller, and C) external-provisioner/sig-storage-lib-external-provisioner is:

  1. scheduler picks node
  2. scheduler updates PVC with annSelectedNode
  3. PV controller sees the previous update and updates PVC with annStorageProvisioner
  4. external-provisioner/sig-storage-lib-external-provisioner sees the previous update and provisions a PV for PVC

If the scheduler doesn't set annSelectedNode and the StorageClass is WaitForFirstConsumer then the external-provisioner will never provision a volume.

(This is from reading the code, so it might be wrong; this contract between A/B/C seems to be undocumented and predates KEPs, I think.)
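
For debugging, the annotation keys behind steps 2 and 3 are, as far as I can tell from the same code, volume.kubernetes.io/selected-node (annSelectedNode) and volume.beta.kubernetes.io/storage-provisioner (annStorageProvisioner), so a stuck WaitForFirstConsumer PVC can be inspected with something like:

# If selected-node never appears in the annotations, the flow stalled at
# step 2 and external-provisioner will never provision the volume
kubectl get pvc ebs-claim -o jsonpath='{.metadata.annotations}'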

@vdhanan vdhanan removed their assignment Dec 16, 2021
@ellistarn

Hey @wongma7, is the above also true of annSelectedNode when a node is deleted and the PVC is bound to the deleted node? Is the annotation simply updated to reflect the new node?

@wongma7
Contributor

wongma7 commented Dec 17, 2021

@ellistarn

ellistarn commented Dec 17, 2021

I have an implementation here that I think is working. Any chance you can review?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 17, 2022
@ellistarn

This is resolved.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 17, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
