[Bug] Managed nodes unable to join the cluster #6856

Closed
boskiv opened this issue Jul 26, 2023 · 5 comments
Labels
blocked/aws kind/bug priority/important-longterm Important over the long term, but may not be currently staffed and/or may require multiple releases

Comments

@boskiv

boskiv commented Jul 26, 2023

Resource handler returned message: "[Issue(Code=NodeCreationFailure, Message=Instances failed to join the kubernetes cluster, ResourceIds=[i-06674baeb5e93d782, i-0c36741a8b0694281, i-0ef7266d7edb1860a])] (Service: null, Status Code: 0, Request ID: null)" (RequestToken: 2d9f01c5-a52b-0402-252c-e91a0046feb1, HandlerErrorCode: GeneralServiceException)

Here is the config file:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: sf-cluster
  region: ap-northeast-1
  version: "1.27"
  tags:
    karpenter.sh/discovery: sf-cluster
vpc:
  cidr: 10.10.0.0/16
  clusterEndpoints:
    publicAccess: true
    privateAccess: true
iam:
  withOIDC: true

cloudWatch:
  clusterLogging:
    enableTypes: ["*"]
    logRetentionInDays: 30

iamIdentityMappings:
  - arn: arn:aws:iam::625332060816:role/OrganizationAccountAccessRole
    username: admin
    groups:
      - system:masters
    noDuplicateARNs: true # prevents shadowing of ARNs

addons:
  - name: vpc-cni
    version: latest
  - name: kube-proxy
    version: latest
  - name: coredns
    version: latest
  - name: aws-ebs-csi-driver
    version: latest
    wellKnownPolicies:
      ebsCSIController: true
      certManager: true
      awsLoadBalancerController: true
      externalDNS: true
      imageBuilder: true

karpenter:
  version: 'v0.29.0' # Exact version must be provided
  createServiceAccount: true # default is false
  withSpotInterruptionQueue: true # adds all required policies and rules for supporting Spot Interruption Queue, default is false

managedNodeGroups:
  - name: ng-nats
    instanceTypes:
      - c6a.large
    spot: true
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        externalDNS: true
        certManager: true
        appMesh: true
        appMeshPreview: true
        ebs: true
        fsx: true
        efs: true
        awsLoadBalancerController: true
        xRay: true
        cloudWatch: true
    desiredCapacity: 3
    labels:
      node.k8s/role: nats
      node-role.kubernetes.io/nats: nats
    taints:
      - key: node.k8s/role
        value: nats
        effect: NoSchedule

  - name: ng-db
    instanceTypes:
      - c6a.large
    spot: true
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        externalDNS: true
        certManager: true
        appMesh: true
        appMeshPreview: true
        ebs: true
        fsx: true
        efs: true
        awsLoadBalancerController: true
        xRay: true
        cloudWatch: true
    desiredCapacity: 3
    labels:
      node.k8s/role: timescaledb
      node-role.kubernetes.io/nats: timescaledb
    taints:
      - key: node.k8s/role
        value: timescaledb
        effect: NoSchedule

  - name: ng-sf
    instanceTypes:
      - c6a.large
    spot: true
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        externalDNS: true
        certManager: true
        appMesh: true
        appMeshPreview: true
        ebs: true
        fsx: true
        efs: true
        awsLoadBalancerController: true
        xRay: true
        cloudWatch: true
    desiredCapacity: 3
    labels:
      node.k8s/role: sf
      node-role.kubernetes.io/sf: sf
    taints:
      - key: node.k8s/role
        value: sf
        effect: NoSchedule

  - name: ng-jobs
    minSize: 1
    maxSize: 20
    spot: true
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        externalDNS: true
        certManager: true
        appMesh: true
        appMeshPreview: true
        ebs: true
        fsx: true
        efs: true
        awsLoadBalancerController: true
        xRay: true
        cloudWatch: true
    instanceTypes:
      - c6a.large
    desiredCapacity: 1
    labels:
      node.k8s/role: jobs
      node-role.kubernetes.io/jobs: jobs
    taints:
      - key: node.k8s/role
        value: jobs
        effect: NoSchedule

  - name: ng-default
    instanceType: c6a.large
    minSize: 1
    maxSize: 10
    desiredCapacity: 2
    labels:
      node.k8s/role: default
      node-role.kubernetes.io/default: default
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        externalDNS: true
        certManager: true
        appMesh: true
        appMeshPreview: true
        ebs: true
        fsx: true
        efs: true
        awsLoadBalancerController: true
        xRay: true
        cloudWatch: true
@github-actions
Contributor

Hello boskiv 👋 Thank you for opening an issue in the eksctl project. The team will review the issue and aim to respond within 1-5 business days. In the meantime, please read the Contribution and Code of Conduct guidelines here. You can find more information about eksctl on our website.

@cPu1
Contributor

cPu1 commented Jul 27, 2023

@boskiv, which of the five nodegroups failed to join the cluster? Can you share the logs, redacting any sensitive information? Thanks for the detailed issue.

@boskiv
Author

boskiv commented Jul 27, 2023

@cPu1 None of them. All failed.

@TiberiuGC
Collaborator

TiberiuGC commented Aug 11, 2023

Hi @boskiv 👋 - what's causing your nodes to fail to join the cluster is the label node-role.kubernetes.io/default: default (the same applies to the other node-role.kubernetes.io/* labels in your config). At the moment, eksctl applies nodegroup labels via the kubelet's --node-labels flag, and the kubelet is not allowed to self-assign labels under the node-role.kubernetes.io prefix, so registration fails. Please refer to this comment to understand why this behaviour is not desirable, and check whether the suggested workaround satisfies your use case.

There's an open issue arguing that eksctl should find another means of setting these types of labels, since they are user-selected and should not be subject to the kubelet-related restriction. However, there is no clear solution yet, and it may require upstream support as well.
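
For illustration only, a minimal sketch of the workaround using the ng-default group from the config above (not official eksctl guidance, and not verified against this exact setup): keep just the custom-prefixed label in the eksctl config, since the kubelet cannot self-assign node-role.kubernetes.io/* labels, and apply the role label through the API after the nodes have joined.

    labels:
      node.k8s/role: default                       # custom prefix: the kubelet may self-assign this
      # node-role.kubernetes.io/default: default   # restricted prefix: rejected when set via kubelet --node-labels

Once the nodes are Ready, something like kubectl label nodes -l node.k8s/role=default node-role.kubernetes.io/default=default applies the role label from outside the kubelet, which the API server permits for regular users.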

@TiberiuGC added the blocked/aws, kind/bug, and priority/important-longterm (Important over the long term, but may not be currently staffed and/or may require multiple releases) labels and removed the kind/help (Request for help) label on Aug 11, 2023
@TiberiuGC
Collaborator

The open issue referenced above was initially about self-managed nodegroups; however, I found a duplicate bug report for EKS managed nodegroups. The root cause is the same in both cases.

Closing this issue; any progress will be tracked in the issue below.

#4007
