
Webhook didn't catch an error in spec.blockDeviceMappings[*].ebs settings #3409

Closed
andrescaroc opened this issue Feb 16, 2023 · 17 comments · Fixed by #3898
Labels
question Issues that are support related questions

Comments


andrescaroc commented Feb 16, 2023

Version

Karpenter Version: v0.24.0

Kubernetes Version: v1.23.0

Expected Behavior

If there is an error in the settings of an AWSNodeTemplate, it should be caught by the Karpenter validating webhook.

In my case, it seems Karpenter is forcing me to define an ebs.volumeSize even though I already defined an ebs.snapshotID.

However, according to the AWS API documentation, only one of the two is required, and both may be defined under a condition:

The size of the volume, in GiBs. You must specify either a snapshot ID or a volume size. If you specify a snapshot, the default is the snapshot size. You can specify a volume size that is equal to or larger than the snapshot size.
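The rule quoted above can be sketched in Go. This is an illustrative helper, not Karpenter code; the function name and signature are assumptions made for the example:

```go
package main

import "fmt"

// effectiveVolumeSizeGiB models the EC2 rule quoted above: if no volume size
// is given, the volume defaults to the snapshot size; an explicit size must be
// equal to or larger than the snapshot size.
func effectiveVolumeSizeGiB(volumeSize *int64, snapshotSizeGiB int64) (int64, error) {
	if volumeSize == nil {
		// No explicit size: inherit the snapshot size.
		return snapshotSizeGiB, nil
	}
	if *volumeSize < snapshotSizeGiB {
		return 0, fmt.Errorf("volume size %d GiB is smaller than snapshot size %d GiB", *volumeSize, snapshotSizeGiB)
	}
	return *volumeSize, nil
}

func main() {
	size, _ := effectiveVolumeSizeGiB(nil, 20)
	fmt.Println(size) // 20: defaults to the snapshot size
}
```

In other words, a mapping with only a snapshotID is a valid EC2 request, which is why the webhook rejecting (or the controller crashing on) such a spec is surprising.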

Actual Behavior

I was trying to deploy a Provisioner referring to an AWSNodeTemplate with required fields missing (which I was not aware of) in the blockDeviceMappings section, and the validating webhook let it pass and be deployed.
Right away the Karpenter pods started to crash, rendering Karpenter useless.

karpenter pods watch:

kubectl -n karpenter get po -w
NAME                         READY   STATUS    RESTARTS   AGE
karpenter-6f4fdc54c6-fxbwk   1/1     Running   0          19s
karpenter-6f4fdc54c6-m7ffd   1/1     Running   0          19s
karpenter-6f4fdc54c6-fxbwk   0/1     Error     0          85s
karpenter-6f4fdc54c6-fxbwk   0/1     Running   1 (2s ago)   86s
karpenter-6f4fdc54c6-fxbwk   1/1     Running   1 (4s ago)   88s
karpenter-6f4fdc54c6-m7ffd   0/1     Error     0            106s
karpenter-6f4fdc54c6-m7ffd   0/1     Running   1 (3s ago)   108s
karpenter-6f4fdc54c6-m7ffd   1/1     Running   1 (5s ago)   110s
karpenter-6f4fdc54c6-fxbwk   0/1     Error     1 (40s ago)   2m4s
karpenter-6f4fdc54c6-fxbwk   0/1     CrashLoopBackOff   1 (7s ago)    2m11s
karpenter-6f4fdc54c6-m7ffd   0/1     Error              1 (38s ago)   2m23s
karpenter-6f4fdc54c6-fxbwk   0/1     Running            2 (20s ago)   2m24s
karpenter-6f4fdc54c6-fxbwk   1/1     Running            2 (23s ago)   2m27s
karpenter-6f4fdc54c6-m7ffd   0/1     CrashLoopBackOff   1 (9s ago)    2m31s
karpenter-6f4fdc54c6-m7ffd   0/1     Running            2 (23s ago)   2m45s
karpenter-6f4fdc54c6-fxbwk   0/1     Error              2 (41s ago)   2m45s
karpenter-6f4fdc54c6-m7ffd   1/1     Running            2 (25s ago)   2m47s
karpenter-6f4fdc54c6-fxbwk   0/1     CrashLoopBackOff   2 (7s ago)    2m51s
karpenter-6f4fdc54c6-m7ffd   0/1     Error              2 (45s ago)   3m7s

Steps to Reproduce the Problem

Define a provisioner that refers to an AWSNodeTemplate as follows:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: my-provisioner
spec:
  providerRef:
    name: my-node-template
...

Define an AWSNodeTemplate that will use a snapshot as follows:

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: my-node-template
spec:
  blockDeviceMappings:
    - deviceName: /dev/xvdb
      ebs:
        deleteOnTermination: true
        snapshotID: snap-0a2000008824631
...

I am going to focus on the blockDeviceMappings section since that was my case, but the same might be happening in other nested sections under spec.

Deploy both resources:

kubectl apply -f <provisioner-and-node-template.yaml>

No complaints are shown by the Karpenter webhook.

Watch the karpenter pods:

kubectl -n karpenter get po -w

Both Karpenter pods should start to restart and eventually crash.

You will not be able to do any other operation, like editing Karpenter resources (the node template) or deleting them, because the Karpenter webhook won't be available.

kubectl delete provisioners.karpenter.sh my-provisioner
Error from server (InternalError): Internal error occurred: failed calling webhook "validation.webhook.karpenter.sh": failed to call webhook: Post "https://karpenter.karpenter.svc:443/validate/karpenter.sh?timeout=10s": no endpoints available for service "karpenter"
kubectl delete awsnodetemplates.karpenter.k8s.aws my-node-template
Error from server (InternalError): Internal error occurred: failed calling webhook "validation.webhook.karpenter.k8s.aws": failed to call webhook: Post "https://karpenter.karpenter.svc:443/validate/karpenter.k8s.aws?timeout=10s": no endpoints available for service "karpenter"

To solve it, you need to rollout-restart the Karpenter deployment and patch the faulty AWSNodeTemplate right away, before the pods start to crash again.

Resource Specs and Logs

The logs were not useful for finding the issue:

2023-02-16T19:04:04.066Z	INFO	controller	Starting workers	{"commit": "8c27519-dirty", "controller": "node", "controllerGroup": "", "controllerKind": "Node", "worker count": 10}
2023-02-16T19:04:04.077Z	INFO	controller	Starting workers	{"commit": "8c27519-dirty", "controller": "inflightchecks", "controllerGroup": "", "controllerKind": "Node", "worker count": 10}
2023-02-16T19:04:04.077Z	INFO	controller	Starting workers	{"commit": "8c27519-dirty", "controller": "counter", "controllerGroup": "karpenter.sh", "controllerKind": "Provisioner", "worker count": 10}
2023-02-16T19:04:04.128Z	DEBUG	controller.awsnodetemplate	discovered security groups	{"commit": "8c27519-dirty", "awsnodetemplate": "bottlerocket", "security-groups": ["sg-0830faredacted366"]}
2023-02-16T19:04:04.953Z	DEBUG	controller.deprovisioning	discovered EC2 instance types	{"commit": "8c27519-dirty", "instance-type-count": 505}
2023-02-16T19:04:05.092Z	DEBUG	controller.deprovisioning	discovered EC2 instance types zonal offerings for subnets	{"commit": "8c27519-dirty", "subnet-selector": "{\"Name\":\"*Private*\",\"karpenter.sh/discovery\":\"my-cluster\"}"}
2023-02-16T19:04:05.118Z	INFO	controller.deprovisioning	stopping controller	{"commit": "8c27519-dirty"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1c521fb]

goroutine 750 [running]:
github.com/aws/karpenter/pkg/cloudprovider.computeCapacity({0x2ce89d8, 0xc001dd2c90}, 0xc000ce4c00, {0x2cec768, 0xc0020b0b50}, {0xc0020b0b48, 0x1, 0x1}, 0xc000056023?)
	github.com/aws/karpenter/pkg/cloudprovider/instancetype.go:134 +0x1db
github.com/aws/karpenter/pkg/cloudprovider.NewInstanceType({0x2ce89d8, 0xc001dd2c90}, 0xc000ce4c00, 0x0?, {0xc000056023, 0xc}, 0xc0016183c0, {0xc00340af00, 0x6, 0x8})
	github.com/aws/karpenter/pkg/cloudprovider/instancetype.go:58 +0x315
github.com/aws/karpenter/pkg/cloudprovider.(*InstanceTypeProvider).List(0xc000c2fc20, {0x2ce89d8, 0xc001dd2c90}, 0x0?, 0xc0016183c0)
	github.com/aws/karpenter/pkg/cloudprovider/instancetypes.go:112 +0x425
github.com/aws/karpenter/pkg/cloudprovider.(*CloudProvider).GetInstanceTypes(0xc000d6f3b0, {0x2ce89d8, 0xc001dd2c90}, 0xc0010a09c8)
	github.com/aws/karpenter/pkg/cloudprovider/cloudprovider.go:181 +0x9a
github.com/aws/karpenter-core/pkg/cloudprovider/metrics.(*decorator).GetInstanceTypes(0xc000d38190, {0x2ce89d8, 0xc001dd2c90}, 0x13?)
	github.com/aws/[email protected]/pkg/cloudprovider/metrics/cloudprovider.go:86 +0x199
github.com/aws/karpenter-core/pkg/controllers/deprovisioning.buildProvisionerMap({0x2ce89d8, 0xc001dd2c90}, {0x2cf3fe8, 0xc000d7ef00}, {0x2cec648, 0xc000d38190})
	github.com/aws/[email protected]/pkg/controllers/deprovisioning/helpers.go:252 +0x1fe
github.com/aws/karpenter-core/pkg/controllers/deprovisioning.candidateNodes({0x2ce89d8, 0xc001dd2c90}, 0xc001dd2cc0?, {0x2cf3fe8, 0xc000d7ef00}, {0x2cede30, 0x419c5d0}, {0x2cec648?, 0xc000d38190?}, 0xc0023f1d50)
	github.com/aws/[email protected]/pkg/controllers/deprovisioning/helpers.go:162 +0x7c
github.com/aws/karpenter-core/pkg/controllers/deprovisioning.(*Controller).Reconcile(0xc00042e0e0, {0x2ce89d8, 0xc001dd2c90}, {{{0x0?, 0x0?}, {0x404d6c?, 0x0?}}})
	github.com/aws/[email protected]/pkg/controllers/deprovisioning/controller.go:110 +0x179
github.com/aws/karpenter-core/pkg/operator/controller.(*Singleton).reconcile(0xc000d6f950, {0x2ce89d8, 0xc001dd2c90})
	github.com/aws/[email protected]/pkg/operator/controller/singleton.go:108 +0xb9
github.com/aws/karpenter-core/pkg/operator/controller.(*Singleton).Start(0xc000d6f950, {0x2ce8930, 0xc000127340})
	github.com/aws/[email protected]/pkg/operator/controller/singleton.go:99 +0x205
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1(0xc000d28860)
	sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:219 +0xdb
created by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile
	sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:203 +0x1ad

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@andrescaroc andrescaroc added the bug Something isn't working label Feb 16, 2023
@ellistarn (Contributor)

We should remove deletion webhooks using:

// VerbLimited defines which Verbs you want to have the webhook invoked on.
type VerbLimited interface {
	// SupportedVerbs define which operations (verbs) webhook is called on.
	SupportedVerbs() []admissionregistrationv1.OperationType
}

We should also fix this panic.
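For context, the nil-pointer panic in the reported stack trace (computeCapacity in instancetype.go) points at a missing nil check. A minimal Go sketch of the kind of guard that would prevent it; the struct and function here are hypothetical stand-ins for illustration, not the actual Karpenter types:

```go
package main

import "fmt"

// BlockDevice loosely mirrors the shape involved in the panic: an EBS mapping
// whose size pointer may be nil when only a snapshotID was set.
type BlockDevice struct {
	VolumeSizeGiB *int64
}

// ephemeralStorage dereferences VolumeSizeGiB only after a nil check, which is
// the kind of guard the stack trace suggests computeCapacity was missing.
func ephemeralStorage(devices []*BlockDevice) int64 {
	var total int64
	for _, d := range devices {
		if d == nil || d.VolumeSizeGiB == nil {
			// Size unknown (e.g. inherited from the snapshot); skip instead of panicking.
			continue
		}
		total += *d.VolumeSizeGiB
	}
	return total
}

func main() {
	forty := int64(40)
	fmt.Println(ephemeralStorage([]*BlockDevice{{VolumeSizeGiB: nil}, {VolumeSizeGiB: &forty}})) // 40
}
```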


ellistarn commented Feb 17, 2023

Trying to reproduce this. Getting a hang instead of a crash:

karpenter-785fc5c56d-qp4vp controller 2023-02-17T18:07:58.602Z	INFO	controller.provisioner	found provisionable pod(s)	{"commit": "8c27519-dirty", "pods": 2}
karpenter-785fc5c56d-qp4vp controller 2023-02-17T18:07:58.602Z	INFO	controller.provisioner	computed new node(s) to fit pod(s)	{"commit": "8c27519-dirty", "nodes": 1, "pods": 1}
karpenter-785fc5c56d-qp4vp controller 2023-02-17T18:07:58.602Z	INFO	controller.provisioner	computed 1 unready node(s) will fit 1 pod(s)	{"commit": "8c27519-dirty"}
karpenter-785fc5c56d-qp4vp controller 2023-02-17T18:07:58.602Z	INFO	controller.provisioner	launching machine with 1 pods requesting {"cpu":"1125m","pods":"4"} from types r5.12xlarge, c5.2xlarge, m6id.24xlarge, m6idn.xlarge, r5d.large and 324 other(s)	{"commit": "8c27519-dirty", "provisioner": "my-provisioner"}
karpenter-785fc5c56d-qp4vp controller 2023-02-17T18:08:22.047Z	INFO	controller.provisioner.cloudprovider	launched new instance	{"commit": "8c27519-dirty", "provisioner": "my-provisioner", "id": "i-05dcfeb6e726d4f46", "hostname": "ip-192-168-121-174.us-west-2.compute.internal", "instance-type": "c5ad.large", "zone": "us-west-2b", "capacity-type": "on-demand"}

Oddly, the instance does launch -- but why does it take so long?

@ellistarn (Contributor)

Ah -- was blocked on the pending snapshot

karpenter-785fc5c56d-qp4vp controller 2023-02-17T18:07:57.495Z	ERROR	controller.provisioner	launching machine, creating cloud provider instance, creating instance, with fleet error(s), InvalidBlockDeviceMapping: Snapshot snap-0a91e509380d5a3bc is not available for use. Its current state is pending	{"commit": "8c27519-dirty"}

@ellistarn (Contributor)

Hey @andrescaroc,
I'm not able to reproduce this crash:

k get awsnodetemplates.karpenter.k8s.aws my-node-template -oyaml
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"karpenter.k8s.aws/v1alpha1","kind":"AWSNodeTemplate","metadata":{"annotations":{},"name":"my-node-template"},"spec":{"blockDeviceMappings":[{"deviceName":"/dev/xvdb","ebs":{"deleteOnTermination":true,"snapshotID":"snap-0a2000008824631"}}],"securityGroupSelector":{"karpenter.sh/discovery":"dev"},"subnetSelector":{"karpenter.sh/discovery":"dev"}}}
  creationTimestamp: "2023-02-17T17:53:15Z"
  generation: 3
  name: my-node-template
  resourceVersion: "22439490"
  uid: 5caaee39-f334-4c10-8782-d71f3179005b
spec:
  blockDeviceMappings:
  - deviceName: /dev/xvdb
    ebs:
      deleteOnTermination: true
      snapshotID: snap-0a91e509380d5a3bc
  securityGroupSelector:
    karpenter.sh/discovery: dev
  subnetSelector:
    karpenter.sh/discovery: dev

@ellistarn (Contributor)

Noticing you're running on "8c27519-dirty". Is there a chance this bug is in a local build of Karpenter?

@ellistarn (Contributor)

Ah -- looks like this is a release issue.

@andrescaroc (Author)

Noticing you're running on "8c27519-dirty". Is there a chance this bug is in a local build of Karpenter?

No local build of Karpenter; I am using the Helm chart of v0.24 (I think it is the latest official release at the time of writing).

@ellistarn (Contributor)

I cut #3414 for this

@ellistarn (Contributor)

@andrescaroc can you reproduce this 100% of the time with your instructions? I am unable to.

@ellistarn ellistarn added question Issues that are support related questions and removed bug Something isn't working labels Mar 13, 2023

github-actions bot commented Apr 3, 2023

Labeled for closure due to inactivity in 10 days.


andrescaroc commented May 12, 2023

@ellistarn I think this is still an issue:

Today I tried again to use a snapshot without defining the volume size, and Karpenter crashed and became unrecoverable.

Karpenter version: 0.27.1

Logs:

2023-05-12T19:32:54.911Z        INFO    controller.termination  cordoned node   {"commit": "7131be2-dirty", "node": "ip-192-168-152-172.eu-central-1.compute.internal"}
2023-05-12T19:32:55.240Z        INFO    controller.termination  deleted node    {"commit": "7131be2-dirty", "node": "ip-192-168-152-172.eu-central-1.compute.internal"}
2023-05-12T19:36:12.054Z        INFO    controller.provisioner  found provisionable pod(s)      {"commit": "7131be2-dirty", "pods": 1}
2023-05-12T19:36:12.054Z        INFO    controller.provisioner  computed new machine(s) to fit pod(s)   {"commit": "7131be2-dirty", "machines": 1, "pods": 1}
2023-05-12T19:36:12.054Z        INFO    controller.provisioner  launching machine with 1 pods requesting {"cpu":"170m","memory":"10360Mi","nvidia.com/gpu":"1","pods":"6"} from types g4dn.xlarge, g4dn.2xlarge, g4dn.4xlarge        {"commit": "7131be2-dirty", "provisioner": "knative-main"}
2023-05-12T19:36:12.353Z        DEBUG   controller.provisioner.cloudprovider    created launch template {"commit": "7131be2-dirty", "provisioner": "knative-main", "launch-template-name": "Karpenter-knative-10886214137580747345", "launch-template-id": "lt-0e0bb7afbc78c4018"}
2023-05-12T19:36:14.456Z        INFO    controller.provisioner.cloudprovider    launched new instance   {"commit": "7131be2-dirty", "provisioner": "knative-main", "id": "i-04e6a4907aed68b75", "hostname": "ip-192-168-148-183.eu-central-1.compute.internal", "instance-type": "g4dn.xlarge", "zone": "eu-central-1a", "capacity-type": "spot"}
2023-05-12T19:36:22.063Z        INFO    controller.provisioner  stopping controller     {"commit": "7131be2-dirty"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1c217bb]

goroutine 722 [running]:
github.com/aws/karpenter/pkg/providers/instancetype.computeCapacity({0x2d8f3f8, 0xc001062060}, 0xc00d5fd700, {0x2d92ff0, 0xc00417d328}, {0xc00417d320, 0x1, 0x1}, 0xc00005806b?)
        github.com/aws/karpenter/pkg/providers/instancetype/types.go:140 +0x1db
github.com/aws/karpenter/pkg/providers/instancetype.NewInstanceType({0x2d8f3f8, 0xc001062060}, 0xc00d5fd700, 0x0, {0xc00005806b, 0xc}, 0xc00de1be00, {0xc0030c1500, 0x6, 0x8})
        github.com/aws/karpenter/pkg/providers/instancetype/types.go:59 +0x315
github.com/aws/karpenter/pkg/providers/instancetype.(*Provider).List.func1(0xc00d5fd700, 0x5?)
        github.com/aws/karpenter/pkg/providers/instancetype/instancetype.go:104 +0xd5
github.com/samber/lo.Map[...]({0xc002687500?, 0x202, 0x5}, 0xc00293b688?)
        github.com/samber/[email protected]/slice.go:29 +0x67
github.com/aws/karpenter/pkg/providers/instancetype.(*Provider).List(0xc0011697a0, {0x2d8f3f8, 0xc001062060}, 0x0, 0xc00de1be00)
        github.com/aws/karpenter/pkg/providers/instancetype/instancetype.go:103 +0x351
github.com/aws/karpenter/pkg/cloudprovider.(*CloudProvider).GetInstanceTypes(0xc000a105d0, {0x2d8f3f8, 0xc001062060}, 0xc00c27bd58)
        github.com/aws/karpenter/pkg/cloudprovider/cloudprovider.go:164 +0x9a
github.com/aws/karpenter-core/pkg/cloudprovider/metrics.(*decorator).GetInstanceTypes(0xc00117f070, {0x2d8f3f8, 0xc001062060}, 0xa?)
        github.com/aws/[email protected]/pkg/cloudprovider/metrics/cloudprovider.go:86 +0x1ad
github.com/aws/karpenter-core/pkg/controllers/provisioning.(*Provisioner).NewScheduler(0xc001169860, {0x2d8f3f8, 0xc001062060}, {0xc00417c290, 0x1, 0x1}, {0xc008ec4080, 0xa, 0x10}, {0x0})
        github.com/aws/[email protected]/pkg/controllers/provisioning/provisioner.go:223 +0x302
github.com/aws/karpenter-core/pkg/controllers/provisioning.(*Provisioner).Schedule(0xc001169860, {0x2d8f3f8, 0xc001062060})
        github.com/aws/[email protected]/pkg/controllers/provisioning/provisioner.go:311 +0x285
github.com/aws/karpenter-core/pkg/controllers/provisioning.(*Provisioner).Reconcile(0xc001169860, {0x2d8f3f8, 0xc001062060}, {{{0x5a?, 0x40e5c8?}, {0xc00d3bc4b0?, 0x50?}}})
        github.com/aws/[email protected]/pkg/controllers/provisioning/provisioner.go:126 +0x88
github.com/aws/karpenter-core/pkg/operator/controller.(*Singleton).reconcile(0xc0011367e0, {0x2d8f3f8, 0xc001062060})
        github.com/aws/[email protected]/pkg/operator/controller/singleton.go:94 +0x2ca
github.com/aws/karpenter-core/pkg/operator/controller.(*Singleton).Start(0xc0011367e0, {0x2d8f350, 0xc0002b1900})
        github.com/aws/[email protected]/pkg/operator/controller/singleton.go:82 +0x205
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1(0xc001136900)
        sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:219 +0xdb
created by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile
        sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:203 +0x1ad

AWSNodeTemplate:

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: knative-main
spec:
  amiFamily: Bottlerocket
  blockDeviceMappings:
  - deviceName: /dev/xvdb
    ebs:
      deleteOnTermination: true
      snapshotID: snap-00e88abc1187b5b19
      volumeType: gp3
  securityGroupSelector:
    karpenter.sh/discovery: knative
  subnetSelector:
    Name: '*Private*'
    karpenter.sh/discovery: knative
  tags:
    purpose: engineering

Karpenter gets into an unrecoverable state; trying to fix the AWSNodeTemplate resource hits this error:

Error from server (InternalError): error when applying patch:
...
for: "STDIN": error when patching "STDIN": Internal error occurred: failed calling webhook "defaulting.webhook.karpenter.k8s.aws": failed to call webhook: Post "https://karpenter.karpenter.svc:443/default/karpente
r.k8s.aws?timeout=10s": no endpoints available for service "karpenter"

I have to run kubectl -n karpenter rollout restart deployment karpenter and, almost at the same time, apply the patch to the AWSNodeTemplate providing the field volumeSize: 40Gi, which is not required by the AWS API when you provide a snapshot ID.


njtran commented May 12, 2023

@andrescaroc can you reproduce this 100% of the time with your instructions? I am unable to.

Can you confirm what @ellistarn had asked previously? Seems like he had an issue reproducing.

@andrescaroc (Author)

Can you confirm what @ellistarn had asked previously? Seems like he had an issue reproducing.

Considering this is the second time I have tried without setting the volumeSize in the AWSNodeTemplate, yes, it is 100% of the time.


njtran commented May 12, 2023

Ah yeah, looks like I'm getting the same issue. I'll re-open:

karpenter-7f4d686c8f-pr9xn controller github.com/aws/karpenter/pkg/providers/instancetype.computeCapacity({0x2e5e0d8, 0xc00137c5a0}, 0xc0012fde00, {0x2e616f0, 0xc00225de98}, {0xc00010fb28, 0x1, 0x1}, 0xc00005806b?)
karpenter-7f4d686c8f-pr9xn controller 	github.com/aws/karpenter/pkg/providers/instancetype/types.go:150 +0x1db

This is my AWSNodeTemplateSpec, trying it with the original snapshotID you had.

  spec:
    amiFamily: Bottlerocket
    blockDeviceMappings:
    - deviceName: /dev/xvdb
      ebs:
        deleteOnTermination: true
        snapshotID: snap-0a2000008824631
        volumeType: gp3
    securityGroupSelector:
      karpenter.sh/discovery: nichotr-karpenter-demo
    subnetSelector:
      karpenter.sh/discovery: nichotr-karpenter-demo

Looks like the issue is here: https://github.com/aws/karpenter/blob/main/pkg/providers/instancetype/types.go#L150 and then here: https://github.com/aws/karpenter/blob/main/pkg/providers/instancetype/types.go#L176.

When I changed my amiFamily to use AL2, I didn't see the crash, but I did still get an error:

karpenter-7f4d686c8f-pr9xn controller 2023-05-12T20:57:55.766Z	ERROR	controller	Reconciler error	{"commit": "3fb6c8e-dirty", "controller": "machine_lifecycle", "controllerGroup": "karpenter.sh", "controllerKind": "Machine", "Machine": {"name":"default-cx6w6"}, "namespace": "", "name": "default-cx6w6", "reconcileID": "43ec9326-db5d-4487-aedd-e44f4e02cfde", "error": "creating machine, creating instance, getting launch template configs, getting launch templates, creating launch template, InvalidSnapshotID.Malformed: The snapshot ID 'snap-0a2000008824631' is not valid. The expected format is snap-xxxxxxxx or snap-xxxxxxxxxxxxxxxxx.\n\tstatus code: 400, request id: 36d3628c-6ff2-499e-8afa-266b67e003e3"}

Going to re-open so we can fix this.

@njtran njtran reopened this May 12, 2023

njtran commented May 12, 2023

It looks like for EBS root volumes: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/block-device-mapping-concepts.html

For the root volume, you can only modify the following: volume size, volume type, and the Delete on Termination flag.

And our validation logic just checks that one of snapshotID and volumeSize is present: https://github.com/aws/karpenter/blob/main/pkg/apis/v1alpha1/provider_validation.go#L247. I confirmed this by setting a snapshotID but no volumeSize (with a fix to stop the panic), and I got:

karpenter-67dff4f565-rr5xn controller 2023-05-12T21:32:38.493Z	ERROR	controller	Reconciler error	{"commit": "dfe0040-dirty", "controller": "machine_lifecycle", "controllerGroup": "karpenter.sh", "controllerKind": "Machine", "Machine": {"name":"default-rbpmk"}, "namespace": "", "name": "default-rbpmk", "reconcileID": "f6c4113e-2471-48f0-b90c-e6c63a40d67a", "error": "creating machine, creating instance, with fleet error(s), InvalidBlockDeviceMapping: snapshotId cannot be modified on root device"}

@andrescaroc (Author)

Thanks @njtran, I think you caught the error.

In my case I use Bottlerocket OS, which by default comes with two volumes (root and data); I want to use a snapshot for the data volume /dev/xvdb with some cached images.

Based on the AWS documentation, I should be able to provide an EBS block device mapping without having to specify the volume size.

@andrescaroc (Author)

... trying it with the original snapshotID you had.

... I didn't see the crash, but I did still get an error:

... InvalidSnapshotID.Malformed: The snapshot ID 'snap-0a2000008824631' is not valid...

@njtran about the error you are getting for the snapshot itself: I suggest you test with a snapshot of your own, since I used a random snapshotID in my description. A snapshot of the data volume of a Bottlerocket OS instance will do the job. Otherwise you may lose time debugging something unrelated to the main issue.
