
Validation does not catch invalid capacity-type values #1141

Closed
Tasmana-banana opened this issue Jan 13, 2022 · 13 comments
Assignees
Labels
bug Something isn't working good-first-issue Good for newcomers

Comments

@Tasmana-banana

Tasmana-banana commented Jan 13, 2022

Version

Karpenter: v0.5.3

Kubernetes: v1.20.0

Hi all!
When I apply the provisioner and scale the test deployment, I get this error in the controller logs:

2022-01-13T21:43:17.290Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "5047f3c", "provisioner": "default"}
2022-01-13T21:43:19.291Z	INFO	controller.provisioning	Batched 4 pods in 1.001107034s	{"commit": "5047f3c", "provisioner": "default"}
2022-01-13T21:43:19.333Z	ERROR	controller.provisioning	Failed to find instance type option(s) for [default/inflate-6b88c9fb68-8vvs5 default/inflate-6b88c9fb68-22hlt default/inflate-6b88c9fb68-dfzgm default/inflate-6b88c9fb68-ttrhh]	{"commit": "5047f3c", "provisioner": "default"}

Use Provisioner:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["eu-west-2a", "eu-west-2b", "eu-west-2c"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["arm64", "amd64", "x86"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: [ "t2.medium", "t3a.medium", "t3.large"]
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  limits:
    resources:
      cpu: 1000
  provider:
    instanceProfile: dev-eks-20211224135100984100000004
  ttlSecondsAfterEmpty: 30

Deploy karpenter chart by ansible:

- name: Deploy karpenter chart
  kubernetes.core.helm:
    name: "{{ karpenter_namespace }}"
    create_namespace: true
    release_namespace: "{{ karpenter_namespace }}"
    chart_ref: karpenter/karpenter
    chart_version: 0.5.3
    release_values:
      serviceAccount:
        create: true
        name: karpenter
        annotations:
          eks.amazonaws.com/role-arn: "arn:aws:iam::{{ account_id }}:role/EKS-Karpenter-Role"
      controller:
        clusterName: "{{ cluster_name }}"
        clusterEndpoint: "{{ cluster_endpoint }}"

Can you help me?

@Tasmana-banana Tasmana-banana added the bug Something isn't working label Jan 13, 2022
@suket22
Contributor

suket22 commented Jan 13, 2022

Could you also share your deployment.yaml? I'd like to see what the pod was requesting.

Also if you run

kubectl patch configmap config-logging -n karpenter --patch '{"data":{"loglevel.controller":"debug"}}'

you should see more verbose logging from our controller, which might help as well.
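For reference, the patch above is equivalent to setting this key directly in the config-logging ConfigMap (a sketch, assuming Karpenter is installed in the karpenter namespace as in the command):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-logging
  namespace: karpenter   # adjust if Karpenter runs in a different namespace
data:
  # Raise controller log verbosity; set back to "info" when done debugging.
  loglevel.controller: debug
```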

@Tasmana-banana
Author

Tasmana-banana commented Jan 14, 2022

@suket22 sure
Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: 1

Also, in debug mode I get two new debug log lines:

2022-01-14T08:39:55.661Z	DEBUG	controller.provisioning	Ignoring security group sg-0ad80af170adb1b14, only one group with tag kubernetes.io/cluster/dev-eks is allowed	{"commit": "5047f3c", "provisioner": "karpenter"}
2022-01-14T08:39:55.662Z	DEBUG	controller.provisioning	Discovered caBundle, length 1066	{"commit": "5047f3c", "provisioner": "karpenter"}

@Tasmana-banana
Author

OK, I have a new error...
controller.provisioning Could not launch node, launching instances, with fleet error(s), InvalidParameterValue: Value (KarpenterNodeInstanceProfile-dev-eks) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name {"commit": "5047f3c", "provisioner": "demand"}
Does anyone know what this error relates to?
I checked several times; the role name is correct.

@ellistarn
Contributor

Keep in mind, this is an instance profile, not a role. Can you verify with aws iam get-instance-profile --instance-profile-name KarpenterNodeInstanceProfile-dev-eks?

@hscheib

hscheib commented Jan 14, 2022

I ran into this original error (ERROR controller.provisioning Failed to find instance type option(s) for) and it was because my Provisioner config was incorrect.

I had

spec:
  requirements:
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["ondemand"]

And the correct value is "on-demand", so I just needed to double-check my values.
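Spelled out, the fixed requirement (the only valid values are "spot" and "on-demand"):

```yaml
spec:
  requirements:
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["on-demand"]   # not "ondemand"
```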

FWIW, I turned on debug logs in the karpenter controller and did not see any validation errors when it starts and discovers the Provisioner to help identify which value was incorrect.

@ellistarn ellistarn changed the title Failed to find instance type option(s) Validation does not catch invalid capacity-type values Jan 14, 2022
@ellistarn ellistarn added the burning Time sensitive issues label Jan 14, 2022
@Tasmana-banana
Author

@ellistarn Thanks for the tip! Resolved by using the cluster's instance profile.
I had tried to create just a role with the necessary permissions; that was my mistake.

@ellistarn
Contributor

Reopening so we can include this value validation.

@suket22 suket22 added good-first-issue Good for newcomers and removed burning Time sensitive issues labels Jan 19, 2022
@felix-zhe-huang felix-zhe-huang self-assigned this Feb 1, 2022
@felix-zhe-huang felix-zhe-huang linked a pull request Feb 1, 2022 that will close this issue
3 tasks
@felix-zhe-huang
Contributor

We will need the new requirement implementation to print out meaningful error messages at runtime. I will submit a fix after PR #1155 is merged.

@MrBones757

MrBones757 commented Feb 2, 2022

EDIT:
thanks to ellistarn for pointing out the kubernetes.io/os tag is unsupported

Hello, I'm having a similar error where I'm seeing the following.
Can you confirm whether this is related to the above, or perhaps see if I've done anything wrong?
I've enabled debug logging and have not seen any additional info.

It seems to be an issue where I have defined one node selector, but Karpenter seems to require all of them to be defined on the workload?
My goal is to specify one node requirement (<my-org-name>-ci-node-type: windows-build)
so that it triggers this provisioner and then uses Karpenter's smarts to select the correct node type based on what I have defined.

2022-02-02T02:50:20.699Z INFO controller.provisioning Batched 1 pods in 1.000990569s {"commit": "62c4546", "provisioner": "cicd-workload-windows-20h2-build-provisioner"}
2022-02-02T02:50:20.709Z ERROR controller.provisioning Failed to find instance type option(s) for [default/inflate-5c94dfdb8c-wpxnr] {"commit": "62c4546", "provisioner": "cicd-workload-windows-20h2-build-provisioner"}

My Provisioner Config is:
(Please ignore the Jinja templating and substitute your own values; it's just here because of the deployment mechanism we use.)

---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: cicd-workload-windows-20h2-build-provisioner
spec:
  # If nil, the feature is disabled, nodes will never expire
  ttlSecondsUntilExpired: 604800 # 7 days in seconds = 7 * 24 * 60 * 60 Seconds;

  # If nil, the feature is disabled, nodes will never scale down due to low utilization
  ttlSecondsAfterEmpty: 3600

  taints:
    - key: <my-org-name>.ci-node-type/windows-build
      effect: NoSchedule

  labels:
    capacity-type: on-demand
    <my-org-name>-ci-node-type: windows-build

  # Requirements that constrain the parameters of provisioned nodes.
  # These requirements are combined with pod.spec.affinity.nodeAffinity rules.
  # Operators { In, NotIn } are supported to enable including or excluding values
  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["m5.large", "m5.xlarge", "m5.2xlarge"]
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["ap-southeast-2a", "ap-southeast-2b", "ap-southeast-2c"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "karpenter.sh/capacity-type" # If not included, the webhook for the AWS cloud provider will default to on-demand
      operator: In
      values: ["on-demand"]
    - key: "kubernetes.io/os"
      operator: In
      values: ["windows"]

  # Resource limits constrain the total size of the cluster.
  # Limits prevent Karpenter from creating new instances once the limit is exceeded.
  limits:
    resources:
      cpu: 32
      memory: 128Gi

  # These fields vary per cloud provider, see your cloud provider specific documentation
  provider:
    instanceProfile: "{{ stack_outputs['karpenter-infra']['KarpenterNodeInstanceProfileArn'] }}"
    launchTemplate: "lt-{{ cluster_name }}-windows-20H2-amd64"
    subnetSelector:
      kubernetes.io/cluster/{{ cluster_name }}: shared
    securityGroupSelector:
      kubernetes.io/cluster/{{ cluster_name }}: owned
    tags:
      Name: "ec2-{{ cluster_name }}-windows-20h2-build-node"
      technical:name: "ec2-{{ cluster_name }}-windows-20h2-build-node"
{% for key, value in common_tags.items() %}
      {{ key }}: "{{ value }}"
{% endfor %}

the workload config is:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: 1
      tolerations:
        - key: "<my-org-name>.ci-node-type/windows-build"
          operator: "Exists"
          effect: "NoSchedule"
      nodeSelector:
        <my-org-name>-ci-node-type: windows-build

@jitterjuice

Commenting to follow the thread, as this issue is of interest to me.

@ellistarn
Contributor

ellistarn commented Feb 2, 2022

https://github.com/aws/karpenter/issues/1131 👀 👀 👀 👀 👀

    - key: "kubernetes.io/os"
      operator: In
      values: ["windows"]

@MrBones757

MrBones757 commented Feb 2, 2022

I didn't even consider that this was a Windows issue; I just assumed that it would work since it uses launch templates under the hood. I'll check it out. Thank you!
Update:
This has resolved the issue; I will look at creating a PR for this, time permitting :)

@felix-zhe-huang
Contributor

PR #1155 introduces stricter requirement validation, so it will now catch typos in the provisioner spec without calling any EC2 APIs. The idea is that a typo makes the requirement values from the provisioner conflict with the values from the instanceTypes. For example,

- key: karpenter.sh/capacity-type
  operator: In
  values:
  - ondemand

will conflict with value on-demand from the instance requirement.
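In other words, for this example the provisioner offers ["ondemand"] while the instance types offer ["spot", "on-demand"]; the intersection is empty, so validation fails. The spelling that passes is:

```yaml
- key: karpenter.sh/capacity-type
  operator: In
  values:
  - on-demand   # intersects with the instance types' ["spot", "on-demand"]
```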
