Startup Dependency No CSINode Data #1030

Closed
tsunamishaun opened this issue Aug 20, 2021 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@tsunamishaun

tsunamishaun commented Aug 20, 2021

/kind bug

What happened?
We recently started using Karpenter to provision nodes. Since Karpenter overrides the built-in scheduler, it is able to front-load and start the pods that were pending scheduling first. Other necessary pods, like ebs-csi-node and aws-node, arrive about a minute later. We believe it is during this gap that the controller tries to read the CSINode object for topology properties, which it doesn't have at that moment in time because the startup order is delayed. I don't believe it retries, even though the information is eventually there; looking for some validation of that.

What you expected to happen?
The controller should retry provisioning the volume if CSINode information is not available on the first attempt.

How to reproduce it (as minimally and precisely as possible)?
Remove csiNode properties that belong to a node and launch a pod that needs a volume.
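
A rough sketch of how to check or trigger this window (assuming the default driver name ebs.csi.aws.com, the default pod label app=ebs-csi-node, and the kube-system namespace; <node-name> is a placeholder):

# Check whether the node's CSINode object lists the EBS driver yet;
# during the startup gap described above, the driver is missing here
kubectl get csinode <node-name> -o jsonpath='{.spec.drivers[*].name}'

# One way to approximate the window: delete the ebs-csi-node pod on that
# node so the driver unregisters, then immediately launch a pod that
# mounts an EBS-backed PVC and watch it stall
kubectl -n kube-system delete pod -l app=ebs-csi-node \
  --field-selector spec.nodeName=<node-name>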

Anything else we need to know?:
We limit IMDSv2 access to hostNetworking pods only. As far as we can tell, the driver uses the kube-api instead, and it appears to set the appropriate information on the CSINode object, just not by the time the controller expects it.

Environment

  • Kubernetes version (use kubectl version): 1.20
  • Driver version: 1.2.0
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 20, 2021
@vdhanan
Contributor

vdhanan commented Aug 30, 2021

/assign

@ellistarn

I was able to reproduce this by following

https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html
and
https://karpenter.sh/docs/getting-started/

I'm noticing that my node has

annotations:
  csi.volume.kubernetes.io/nodeid: '{"ebs.csi.aws.com":"i-09dc0a26f4104c2d1"}'
labels:
  topology.ebs.csi.aws.com/zone: us-west-2b

I'm curious why the EBS CSI driver doesn't rely on the standard labels

topology.kubernetes.io/zone: us-west-2b

and standard instance type

providerID: aws:///us-west-2b/i-09dc0a26f4104c2d1

We should be able to avoid this race condition by using those properties of the node, which are set correctly by the kubelet on startup.
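
For reference, a rough way to read those node properties directly (a sketch; <node-name> is a placeholder):

# Standard zone label, set when the node registers
kubectl get node <node-name> -L topology.kubernetes.io/zone

# Provider ID, e.g. aws:///us-west-2b/i-09dc0a26f4104c2d1
kubectl get node <node-name> -o jsonpath='{.spec.providerID}'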

At a minimum, it should probably just retry so the pod doesn't get stuck in ContainerCreating:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-09-14T22:47:35Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-09-14T22:47:35Z"
    message: 'containers with unready status: [app]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-09-14T22:47:35Z"
    message: 'containers with unready status: [app]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-09-14T22:46:30Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: centos
    imageID: ""
    lastState: {}
    name: app
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        reason: ContainerCreating

@wongma7
Contributor

wongma7 commented Sep 15, 2021

Do you have logs from the controller (its ebs-plugin container or csi-provisioner container)? This bugfix, kubernetes-csi/external-provisioner#617, looks kind of suspect, although I'm not sure whether it would fix this issue. Can you try the more recent external-provisioner:v2.2.2, which includes that bugfix, and see if it works? (By replacing https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/charts/aws-ebs-csi-driver/values.yaml#L16.)
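
For anyone who wants to test that bump without a full chart upgrade, a rough sketch (the deployment name ebs-csi-controller, namespace kube-system, and container name csi-provisioner are the chart defaults as far as I know; adjust to your install):

# Swap only the external-provisioner sidecar image to v2.2.2
kubectl -n kube-system set image deployment/ebs-csi-controller \
  csi-provisioner=k8s.gcr.io/sig-storage/csi-provisioner:v2.2.2

# Confirm the rollout picked up the new image
kubectl -n kube-system rollout status deployment/ebs-csi-controller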

The logic for watching CSINode objects and retrying is handled by external-provisioner, so I strongly suspect the issue is that we are bundling a version of external-provisioner with a bug in it, if not the one I linked. The controller side of the driver is otherwise ignorant of Kubernetes CSI API objects like CSINode.

I'm curious why the EBS CSI driver doesn't rely on the standard labels

The topology labeling is an open issue. At the moment it's simplest for all CSI drivers to use their own labels, which predate the v1 standard labels. Otherwise things get complicated when CSI migration is toggled on and off (the translation lib expects to find CSI label A on CSI volumes and standard label B on in-tree volumes) or in driver upgrade/downgrade scenarios (a PV provisioned with driver version A has label A and then can't get scheduled anywhere after I upgrade to a driver version where nodes have standard label B). Ref: #962.

providerID: aws:///us-west-2b/i-09dc0a26f4104c2d1

Our driver uses providerID to populate the csinode nodeid in cases where instance metadata is unavailable. The csi.volume.kubernetes.io/nodeid annotation is deprecated and made redundant by the csinode nodeid field so it can be ignored: https://github.com/kubernetes/community/pull/2034/files#diff-eab491c51f66885ca6fa1f76254d53d01c39e09dca6939ba890cdfdeaac21fe0R106.
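
A quick way to see that relationship on a live node (the jsonpath filter assumes the default driver name ebs.csi.aws.com; <node-name> is a placeholder):

# Instance ID as reported by the cloud provider
kubectl get node <node-name> -o jsonpath='{.spec.providerID}'

# nodeID that the driver registered on the CSINode object
kubectl get csinode <node-name> \
  -o jsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].nodeID}'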

@ellistarn

ellistarn commented Sep 15, 2021

Thanks @wongma7. I've made some progress with v2.2.2, though I'm not sure if it resolves the issue in the original bug report (I haven't tried 1.2.0). I don't have particularly interesting logs, but I may not be looking at the right component -- I'm not deep in this code.

There are two cases:

  1. PV has already had a zone allocated (this works w/ Karpenter)
  2. PV does not have a zone allocated

Case 1:

This appears to be working as long as the pod is scheduled in the same zone as the volume.

Events:
  Type     Reason                  Age                  From                     Message
  ----     ------                  ----                 ----                     -------
  Warning  FailedScheduling        112s (x2 over 112s)  default-scheduler        0/1 nodes are available: 1 node(s) didn't match Pod's node affinity.
  Warning  FailedScheduling        107s                 default-scheduler        0/2 nodes are available: 1 node(s) didn't match Pod's node affinity, 1 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate.
  Warning  NetworkNotReady         50s (x7 over 62s)    kubelet                  network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
  Warning  FailedAttachVolume      43s (x6 over 59s)    attachdetach-controller  AttachVolume.Attach failed for volume "pvc-6b0b786a-9603-4ac9-bb6d-045ab04cae8e" : CSINode ip-192-168-14-212.us-west-2.compute.internal does not contain driver ebs.csi.aws.com
  Normal   SuccessfulAttachVolume  27s                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-6b0b786a-9603-4ac9-bb6d-045ab04cae8e"

Success!

NAME   READY   STATUS    RESTARTS   AGE
app    1/1     Running   0          2m18s

Case 2

No PV is created, and the PVC is stuck in pending:

➜  dynamic-provisioning git:(master) ✗ k describe persistentvolumeclaims ebs-claim
Name:          ebs-claim
Namespace:     default
StorageClass:  ebs-sc
Status:        Pending
Volume:
Labels:        <none>
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       app
Events:
  Type     Reason                Age                   From                         Message
  ----     ------                ----                  ----                         -------
  Warning  ProvisioningFailed    5m12s                 persistentvolume-controller  storageclass.storage.k8s.io "ebs-sc" not found
  Normal   WaitForPodScheduled   5m10s                 persistentvolume-controller  waiting for pod app to be scheduled
  Normal   WaitForFirstConsumer  10s (x20 over 4m55s)  persistentvolume-controller  waiting for first consumer to be created before binding

The pod is stuck pending

Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  7m13s (x2 over 7m13s)  default-scheduler  0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
  Warning  FailedScheduling  7m10s                  default-scheduler  0/1 nodes are available: 1 node(s) didn't match Pod's node affinity.
  Warning  FailedScheduling  7m6s                   default-scheduler  0/2 nodes are available: 1 node(s) didn't match Pod's node affinity, 1 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate.
  Warning  NetworkNotReady   5m55s (x5 over 6m3s)   kubelet            network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
  Warning  FailedMount       108s (x17 over 5m53s)  kubelet            Unable to attach or mount volumes: unmounted volumes=[persistent-storage], unattached volumes=[persistent-storage default-token-pbmk9]: error processing PVC default/ebs-claim: PVC is not bound
  Warning  FailedMount       55s (x6 over 4m34s)    kubelet            Unable to attach or mount volumes: unmounted volumes=[persistent-storage], unattached volumes=[default-token-pbmk9 persistent-storage]: error processing PVC default/ebs-claim: PVC is not bound

@wongma7
Contributor

wongma7 commented Nov 5, 2021

How does a PV get allocated a zone at creation time? Is the difference between case 1 and case 2 the WaitForFirstConsumer volumeBindingMode of the StorageClass?

Does (and should) the scheduler set annSelectedNode on PVCs after their respective pods get scheduled? https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/master/controller/controller.go#L1150-L1159

This is just from reading code, but I think CSI drivers rely on that annotation being set by the scheduler. I don't think sig-storage-lib-external-provisioner has accounted for custom schedulers. Otherwise, the first time the driver (specifically the external-provisioner sidecar running sig-storage-lib-external-provisioner) "sees" the PVC, it decides there is nothing to do with it (since it doesn't know in which zone the PV needs to be created). The PVC is then never updated again, so the driver doesn't receive any update events and the PVC sits idle. Logs from the driver won't show any of this.

@wongma7
Contributor

wongma7 commented Nov 5, 2021

Formatted for clarity, the expected flow across A) the default scheduler, B) the in-tree PV controller, and C) external-provisioner/sig-storage-lib-external-provisioner is:

  1. scheduler picks node
  2. scheduler updates PVC with annSelectedNode
  3. PV controller sees the previous update and updates PVC with annStorageProvisioner
  4. external-provisioner/sig-storage-lib-external-provisioner sees the previous update and provisions a PV for PVC

If the scheduler doesn't set annSelectedNode and the StorageClass is WaitForFirstConsumer then the external-provisioner will never provision a volume.

(This is from reading the code, so it might be wrong; this contract between A/B/C seems to be undocumented and predates KEPs, I think.)
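
For debugging, the annotation keys behind steps 2 and 3 are, as far as I can tell from the same code, volume.kubernetes.io/selected-node (annSelectedNode) and volume.beta.kubernetes.io/storage-provisioner (annStorageProvisioner), so a stuck WaitForFirstConsumer PVC can be inspected with something like:

# If selected-node never appears in the annotations, the flow stalled at
# step 2 and external-provisioner will never provision the volume
kubectl get pvc ebs-claim -o jsonpath='{.metadata.annotations}'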

@vdhanan vdhanan removed their assignment Dec 16, 2021
@ellistarn

Hey @wongma7, is the above also true of annSelectedNode when a node is deleted and the PVC is bound to the deleted node? Is the annotation simply updated to reflect the new node?

@wongma7
Contributor

wongma7 commented Dec 17, 2021

@ellistarn

ellistarn commented Dec 17, 2021

I have an implementation here that I think is working. Any chance you can review?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 17, 2022
@ellistarn

This is resolved.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 17, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
