[Bug] Managed nodes unable to join the cluster #6856

Closed
boskiv opened this issue Jul 26, 2023 · 5 comments
Labels
blocked/aws kind/bug priority/important-longterm Important over the long term, but may not be currently staffed and/or may require multiple releases

Comments

@boskiv

boskiv commented Jul 26, 2023

Resource handler returned message: "[Issue(Code=NodeCreationFailure, Message=Instances failed to join the kubernetes cluster, ResourceIds=[i-06674baeb5e93d782, i-0c36741a8b0694281, i-0ef7266d7edb1860a])] (Service: null, Status Code: 0, Request ID: null)" (RequestToken: 2d9f01c5-a52b-0402-252c-e91a0046feb1, HandlerErrorCode: GeneralServiceException)

Here is the config file:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: sf-cluster
  region: ap-northeast-1
  version: "1.27"
  tags:
    karpenter.sh/discovery: sf-cluster
vpc:
  cidr: 10.10.0.0/16
  clusterEndpoints:
    publicAccess: true
    privateAccess: true
iam:
  withOIDC: true

cloudWatch:
  clusterLogging:
    enableTypes: ["*"]
    logRetentionInDays: 30

iamIdentityMappings:
  - arn: arn:aws:iam::625332060816:role/OrganizationAccountAccessRole
    username: admin
    groups:
      - system:masters
    noDuplicateARNs: true # prevents shadowing of ARNs

addons:
  - name: vpc-cni
    version: latest
  - name: kube-proxy
    version: latest
  - name: coredns
    version: latest
  - name: aws-ebs-csi-driver
    version: latest
    wellKnownPolicies:
      ebsCSIController: true
      certManager: true
      awsLoadBalancerController: true
      externalDNS: true
      imageBuilder: true

karpenter:
  version: 'v0.29.0' # Exact version must be provided
  createServiceAccount: true # default is false
  withSpotInterruptionQueue: true # adds all required policies and rules for supporting Spot Interruption Queue, default is false

managedNodeGroups:
  - name: ng-nats
    instanceTypes:
      - c6a.large
    spot: true
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        externalDNS: true
        certManager: true
        appMesh: true
        appMeshPreview: true
        ebs: true
        fsx: true
        efs: true
        awsLoadBalancerController: true
        xRay: true
        cloudWatch: true
    desiredCapacity: 3
    labels:
      node.k8s/role: nats
      node-role.kubernetes.io/nats: nats
    taints:
      - key: node.k8s/role
        value: nats
        effect: NoSchedule

  - name: ng-db
    instanceTypes:
      - c6a.large
    spot: true
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        externalDNS: true
        certManager: true
        appMesh: true
        appMeshPreview: true
        ebs: true
        fsx: true
        efs: true
        awsLoadBalancerController: true
        xRay: true
        cloudWatch: true
    desiredCapacity: 3
    labels:
      node.k8s/role: timescaledb
      node-role.kubernetes.io/nats: timescaledb
    taints:
      - key: node.k8s/role
        value: timescaledb
        effect: NoSchedule

  - name: ng-sf
    instanceTypes:
      - c6a.large
    spot: true
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        externalDNS: true
        certManager: true
        appMesh: true
        appMeshPreview: true
        ebs: true
        fsx: true
        efs: true
        awsLoadBalancerController: true
        xRay: true
        cloudWatch: true
    desiredCapacity: 3
    labels:
      node.k8s/role: sf
      node-role.kubernetes.io/sf: sf
    taints:
      - key: node.k8s/role
        value: sf
        effect: NoSchedule

  - name: ng-jobs
    minSize: 1
    maxSize: 20
    spot: true
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        externalDNS: true
        certManager: true
        appMesh: true
        appMeshPreview: true
        ebs: true
        fsx: true
        efs: true
        awsLoadBalancerController: true
        xRay: true
        cloudWatch: true
    instanceTypes:
      - c6a.large
    desiredCapacity: 1
    labels:
      node.k8s/role: jobs
      node-role.kubernetes.io/jobs: jobs
    taints:
      - key: node.k8s/role
        value: jobs
        effect: NoSchedule

  - name: ng-default
    instanceType: c6a.large
    minSize: 1
    maxSize: 10
    desiredCapacity: 2
    labels:
      node.k8s/role: default
      node-role.kubernetes.io/default: default
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        externalDNS: true
        certManager: true
        appMesh: true
        appMeshPreview: true
        ebs: true
        fsx: true
        efs: true
        awsLoadBalancerController: true
        xRay: true
        cloudWatch: true
@github-actions
Contributor

Hello boskiv 👋 Thank you for opening an issue in the eksctl project. The team will review the issue and aim to respond within 1-5 business days. In the meantime, please read the Contribution and Code of Conduct guidelines here. You can find more information about eksctl on our website.

@cPu1
Contributor

cPu1 commented Jul 27, 2023

@boskiv, which of the five nodegroups failed to join the cluster? Can you share the logs, redacting any sensitive information? Thanks for the detailed issue.

@boskiv
Author

boskiv commented Jul 27, 2023

@cPu1 None of them. All failed.

@TiberiuGC
Collaborator

TiberiuGC commented Aug 11, 2023

Hi @boskiv 👋 - what's causing your nodes to fail to join the cluster is the label node-role.kubernetes.io/default: default (the same applies to the other node-role.kubernetes.io/* labels in your config). At the moment, eksctl applies nodegroup labels via the kubelet's --node-labels flag, and the kubelet is not allowed to self-assign labels under the node-role.kubernetes.io prefix, so registration fails. Please refer to this comment to understand why this behaviour is not desirable, and check whether the suggested workaround satisfies your use case.

There's an open issue arguing that eksctl should find another means of setting these types of labels, since they are user-selected and should not be subject to the kubelet-related restriction. However, there is no clear solution yet, and it may require upstream support as well.
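
For illustration only, a minimal sketch of the workaround using the ng-default group from the config above (not official eksctl guidance, and not verified against this exact setup): keep just the custom-prefixed label in the eksctl config, since the kubelet cannot self-assign node-role.kubernetes.io/* labels, and apply the role label through the API after the nodes have joined.

    labels:
      node.k8s/role: default                       # custom prefix: the kubelet may self-assign this
      # node-role.kubernetes.io/default: default   # restricted prefix: rejected when set via kubelet --node-labels

Once the nodes are Ready, something like kubectl label nodes -l node.k8s/role=default node-role.kubernetes.io/default=default applies the role label from outside the kubelet, which the API server permits for regular users.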

@TiberiuGC added the blocked/aws, kind/bug, and priority/important-longterm (Important over the long term, but may not be currently staffed and/or may require multiple releases) labels and removed the kind/help (Request for help) label on Aug 11, 2023
@TiberiuGC
Collaborator

The open issue referenced above was initially about self-managed nodegroups; however, I found a duplicate bug report for EKS managed nodegroups. The root cause is the same in both cases.

Closing this issue; any progress will be tracked in the issue below.

#4007
