[BUG] KOPS 1.19 Seg Fault when removing an in-use instance type from a mixed instance group #10718

timothyclarke · 2021-02-03T16:30:22Z

1. What kops version are you running? The command kops version will display
this information.
Version 1.19.0 (git-04d36d7d92c72601efd918877fc180c846129ffb)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:23:52Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:15:20Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
kops edit ig <instance group name>, kops update cluster, kops update cluster --yes

Create a mixed instance group, apply it, and make sure there is at least one active instance in the group
Edit the instance group and remove an instance type that a running instance is using
Attempt to update the cluster

An example instance group:

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-02-03T09:39:25Z"
  generation: 3
  labels:
    kops.k8s.io/cluster: dev-euw2.example.com
  name: sites
spec:
  cloudLabels:
    BrandName: sites
    k8s.io/cluster-autoscaler/enabled: '""'
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1
  machineType: t3a.2xlarge
  maxSize: 11
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3a.2xlarge
    - m5a.xlarge
    - m5a.2xlarge
    - m5a.4xlarge
    - m4.2xlarge
    - m5.2xlarge
    - m5.4xlarge
    - r5a.xlarge
    - r5a.2xlarge
    - r5.xlarge
    - r5.2xlarge
    - c5a.xlarge
    - c5a.2xlarge
    - c5a.4xlarge
    - c5.2xlarge
    - c5.4xlarge
    onDemandAboveBase: 0
    spotInstancePools: 15
  nodeLabels:
    kops.k8s.io/instancegroup: example
  role: Node
  rollingUpdate:
    maxSurge: 2
  rootVolumeSize: 48
  subnets:
  - eu-west-2a
  - eu-west-2b
  - eu-west-2c

Several c5a.xlarge instances had been launched and did not have enough RAM, so I edited the instance group to remove that type.

5. What happened after the commands executed?
kops crashed with a segfault (a nil pointer dereference panic):

kops update cluster --state "s3://${S3_BUCKET}" --name ${KUBE_NAME}
I0203 16:19:42.232619   92530 executor.go:111] Tasks: 0 done / 153 total; 62 can run
I0203 16:19:42.751426   92530 executor.go:111] Tasks: 62 done / 153 total; 26 can run
I0203 16:19:43.127438   92530 executor.go:111] Tasks: 88 done / 153 total; 51 can run
I0203 16:19:43.502894   92530 executor.go:111] Tasks: 139 done / 153 total; 14 can run
I0203 16:19:43.664899   92530 dnsname.go:121] AliasTarget for "api.dev-euw2.example.com." is "api-dev-euw2-REMOVED.eu-west-2.elb.amazonaws.com."
I0203 16:19:43.862420   92530 dnsname.go:121] AliasTarget for "bastion.dev-euw2.example.com." is "bastion-dev-euw2-REMOVED.eu-west-2.elb.amazonaws.com."
I0203 16:19:43.964038   92530 executor.go:111] Tasks: 153 done / 153 total; 0 can run
Will modify resources:
  AutoscalingGroup/newgroup.dev-euw2.example.com
  	MixedInstanceOverrides	 [t3a.2xlarge, m5a.xlarge, m5a.2xlarge, m5a.4xlarge, m4.2xlarge, m5.2xlarge, m5.4xlarge, r5a.xlarge, r5a.2xlarge, r5.xlarge, r5.2xlarge, c5a.xlarge, c5a.2xlarge, c5a.4xlarge, c5.2xlarge, c5.4xlarge] -> [t3a.2xlarge, m5a.xlarge, m5a.2xlarge, m5a.4xlarge, m4.2xlarge, m5.2xlarge, m5.4xlarge, r5a.xlarge, r5a.2xlarge, r5.xlarge, r5.2xlarge, c5a.2xlarge, c5a.4xlarge, c5.2xlarge, c5.4xlarge]

Must specify --yes to apply changes
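The two MixedInstanceOverrides lists in the dry-run plan differ by a single entry, which is easy to miss in a line this long. As a reading aid only (this is not part of kops), a small Go helper with the hypothetical name diffInstanceTypes can compare the before and after lists and report which instance types were removed or added:

```go
package main

import "fmt"

// diffInstanceTypes is a hypothetical helper (not part of kops). It returns the
// entries of oldTypes missing from newTypes (removed) and the entries of
// newTypes missing from oldTypes (added), preserving the original order.
func diffInstanceTypes(oldTypes, newTypes []string) (removed, added []string) {
	inOld := make(map[string]bool, len(oldTypes))
	for _, t := range oldTypes {
		inOld[t] = true
	}
	inNew := make(map[string]bool, len(newTypes))
	for _, t := range newTypes {
		inNew[t] = true
	}
	for _, t := range oldTypes {
		if !inNew[t] {
			removed = append(removed, t)
		}
	}
	for _, t := range newTypes {
		if !inOld[t] {
			added = append(added, t)
		}
	}
	return removed, added
}

func main() {
	// Lists copied from the "MixedInstanceOverrides" line of the dry-run plan.
	before := []string{"t3a.2xlarge", "m5a.xlarge", "m5a.2xlarge", "m5a.4xlarge",
		"m4.2xlarge", "m5.2xlarge", "m5.4xlarge", "r5a.xlarge", "r5a.2xlarge",
		"r5.xlarge", "r5.2xlarge", "c5a.xlarge", "c5a.2xlarge", "c5a.4xlarge",
		"c5.2xlarge", "c5.4xlarge"}
	after := []string{"t3a.2xlarge", "m5a.xlarge", "m5a.2xlarge", "m5a.4xlarge",
		"m4.2xlarge", "m5.2xlarge", "m5.4xlarge", "r5a.xlarge", "r5a.2xlarge",
		"r5.xlarge", "r5.2xlarge", "c5a.2xlarge", "c5a.4xlarge",
		"c5.2xlarge", "c5.4xlarge"}

	removed, added := diffInstanceTypes(before, after)
	fmt.Println("removed:", removed) // removed: [c5a.xlarge]
	fmt.Println("added:", added)     // added: []
}
```

Reporting both directions matters here, since a later comment notes that adding an instance type triggers the same crash as removing one.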

kops update cluster --state "s3://${S3_BUCKET}" --name ${KUBE_NAME} --yes  
I0203 15:53:44.014771   89463 executor.go:111] Tasks: 0 done / 153 total; 62 can run
I0203 15:53:44.499571   89463 executor.go:111] Tasks: 62 done / 153 total; 26 can run
I0203 15:53:44.856034   89463 executor.go:111] Tasks: 88 done / 153 total; 51 can run
I0203 15:53:45.227500   89463 executor.go:111] Tasks: 139 done / 153 total; 14 can run
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2c451fa]

goroutine 1322 [running]:
k8s.io/kops/upup/pkg/fi/cloudup/awstasks.(*AutoscalingGroup).RenderAWS(0xc000a8f800, 0xc0013f8ee0, 0xc0000d2800, 0xc000a8f800, 0xc0000d2a00, 0x0, 0x0)
	upup/pkg/fi/cloudup/awstasks/autoscalinggroup.go:492 +0x1c9a
reflect.Value.call(0x3fefc00, 0xc000a8f800, 0x2a13, 0x412ea26, 0x4, 0xc001b5f8c0, 0x4, 0x4, 0x4760f00, 0xc00004d480, ...)
	GOROOT/src/reflect/value.go:476 +0x8c7
reflect.Value.Call(0x3fefc00, 0xc000a8f800, 0x2a13, 0xc001b5f8c0, 0x4, 0x4, 0xc000a8f800, 0x2a13, 0xc001b5f860)
	GOROOT/src/reflect/value.go:337 +0xb9
k8s.io/kops/upup/pkg/fi.(*Context).Render(0xc0005626e0, 0x4691a00, 0xc0000d2800, 0x4691a00, 0xc000a8f800, 0x4691a00, 0xc0000d2a00, 0x0, 0x0)
	upup/pkg/fi/context.go:222 +0xef5
k8s.io/kops/upup/pkg/fi.DefaultDeltaRunMethod(0x4691a00, 0xc000a8f800, 0xc0005626e0, 0x476a180, 0xc000cc2e00)
	upup/pkg/fi/default_methods.go:79 +0x54a
k8s.io/kops/upup/pkg/fi/cloudup/awstasks.(*AutoscalingGroup).Run(0xc000a8f800, 0xc0005626e0, 0xc0002ae070, 0x0)
	upup/pkg/fi/cloudup/awstasks/autoscalinggroup.go:316 +0xd7
k8s.io/kops/upup/pkg/fi.(*executor).forkJoin.func1(0xc000ad07e0, 0xe, 0xe, 0xc001982d10, 0xc000518900, 0xc000e7fb20, 0x6)
	upup/pkg/fi/executor.go:187 +0x1ae
created by k8s.io/kops/upup/pkg/fi.(*executor).forkJoin
	upup/pkg/fi/executor.go:183 +0x105

6. What did you expect to happen?
I expected the autoscaling group to be updated and the specified instance type to be removed. While it would be nice if kops drained in-use nodes the way a rolling update does, I'm equally happy to drain the instances myself.
One of the core issues is that if I try to grow the mixed instance group before removing this instance type, there is a good chance I'll just launch another instance of the type I'm trying to remove, since it's one of the smallest and cheapest.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

Adding -v8 to kops update cluster --yes results in:

I0203 16:22:08.992885   92830 request_logger.go:45] AWS request: elasticloadbalancing/DescribeLoadBalancers
I0203 16:22:09.007895   92830 changes.go:81] Field changed "MixedInstanceOverrides" actual="[t3a.2xlarge m5a.xlarge m5a.2xlarge m5a.4xlarge m4.2xlarge m5.2xlarge m5.4xlarge r5a.xlarge r5a.2xlarge r5.xlarge r5.2xlarge c5a.xlarge c5a.2xlarge c5a.4xlarge c5.2xlarge c5.4xlarge]" expected="[t3a.2xlarge m5a.xlarge m5a.2xlarge m5a.4xlarge m4.2xlarge m5.2xlarge m5.4xlarge r5a.xlarge r5a.2xlarge r5.xlarge r5.2xlarge c5a.2xlarge c5a.4xlarge c5.2xlarge c5.4xlarge]"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2c451fa]

k8s.io/kops/upup/pkg/fi/cloudup/awstasks.(*AutoscalingGroup).RenderAWS(0xc000ed6600, 0xc001581e30, 0xc002084b00, 0xc000ed6600, 0xc002084d00, 0x0, 0x0)
	upup/pkg/fi/cloudup/awstasks/autoscalinggroup.go:492 +0x1c9a
reflect.Value.call(0x3fefc00, 0xc000ed6600, 0x2a13, 0x412ea26, 0x4, 0xc00074de00, 0x4, 0x4, 0x4760f00, 0xc0010c2b80, ...)
	GOROOT/src/reflect/value.go:476 +0x8c7
reflect.Value.Call(0x3fefc00, 0xc000ed6600, 0x2a13, 0xc00074de00, 0x4, 0x4, 0xc000ed6600, 0x2a13, 0xc00074dda0)
	GOROOT/src/reflect/value.go:337 +0xb9
k8s.io/kops/upup/pkg/fi.(*Context).Render(0xc001374780, 0x4691a00, 0xc002084b00, 0x4691a00, 0xc000ed6600, 0x4691a00, 0xc002084d00, 0x0, 0x0)
	upup/pkg/fi/context.go:222 +0xef5
k8s.io/kops/upup/pkg/fi.DefaultDeltaRunMethod(0x4691a00, 0xc000ed6600, 0xc001374780, 0x476a180, 0xc000ce9980)
	upup/pkg/fi/default_methods.go:79 +0x54a
k8s.io/kops/upup/pkg/fi/cloudup/awstasks.(*AutoscalingGroup).Run(0xc000ed6600, 0xc001374780, 0x0, 0x0)
	upup/pkg/fi/cloudup/awstasks/autoscalinggroup.go:316 +0xd7
k8s.io/kops/upup/pkg/fi.(*executor).forkJoin.func1(0xc0007721c0, 0xe, 0xe, 0xc0010657f0, 0xc001403660, 0xc000fdc380, 0xd)
	upup/pkg/fi/executor.go:187 +0x1ae
created by k8s.io/kops/upup/pkg/fi.(*executor).forkJoin
	upup/pkg/fi/executor.go:183 +0x105

9. Anything else do we need to know?

Note: slight edit to remove possibly sensitive info


h3poteto · 2021-02-05T11:49:05Z

I got the same error not only when removing an instance type but also when adding one.

hakman · 2021-02-07T14:58:41Z

Thanks @h3poteto, should be all good with the next patch release.

rifelpet added the kind/regression label (categorizes issue or PR as related to a regression from a prior release) Feb 4, 2021

rifelpet added this to the v1.19 milestone Feb 4, 2021

h3poteto mentioned this issue Feb 5, 2021: Use expected LaunchTemplateId in updating ASG when MixedInstancePolicy is changed #10742 (Merged)
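The stack trace points at (*AutoscalingGroup).RenderAWS, and the title of fix #10742 suggests the update path read the launch template ID from the wrong object. A plausible reading, which is an assumption not confirmed by the kops source quoted here: RenderAWS receives a `changes` object that holds only the fields that differ between the actual and expected state, so when only the mixed-instance overrides change, the launch template field in `changes` is nil and dereferencing it panics. This Go sketch uses hypothetical, heavily simplified types and functions (AutoscalingGroup, LaunchTemplate, buggyLaunchTemplateID, launchTemplateIDForUpdate are illustrative names, not the kops API) to show that failure mode and the "use the expected value" pattern the PR title describes:

```go
package main

import (
	"errors"
	"fmt"
)

// LaunchTemplate is a hypothetical stand-in for a launch template reference.
type LaunchTemplate struct {
	ID *string
}

// AutoscalingGroup is a hypothetical, simplified stand-in for an ASG task.
// Assumption: in a `changes` value, fields that did not change are left nil.
type AutoscalingGroup struct {
	LaunchTemplate         *LaunchTemplate
	MixedInstanceOverrides []string
}

// buggyLaunchTemplateID reads the ID from `changes`, which panics when the
// launch template itself did not change. The panic is recovered here only so
// the sketch can report it instead of crashing.
func buggyLaunchTemplateID(changes *AutoscalingGroup) (id string, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered panic: %v", r)
		}
	}()
	return *changes.LaunchTemplate.ID, nil
}

// launchTemplateIDForUpdate reads the ID from the expected state instead,
// which is always populated, and returns an error rather than panicking.
func launchTemplateIDForUpdate(expected *AutoscalingGroup) (string, error) {
	if expected == nil || expected.LaunchTemplate == nil || expected.LaunchTemplate.ID == nil {
		return "", errors.New("expected state has no launch template ID")
	}
	return *expected.LaunchTemplate.ID, nil
}

func main() {
	ltID := "lt-0123456789abcdef0" // hypothetical launch template ID
	expected := &AutoscalingGroup{
		LaunchTemplate:         &LaunchTemplate{ID: &ltID},
		MixedInstanceOverrides: []string{"t3a.2xlarge", "m5a.xlarge"},
	}
	// Only the overrides changed, so the launch template is absent from changes.
	changes := &AutoscalingGroup{MixedInstanceOverrides: expected.MixedInstanceOverrides}

	if _, err := buggyLaunchTemplateID(changes); err != nil {
		// buggy: recovered panic: runtime error: invalid memory address or nil pointer dereference
		fmt.Println("buggy:", err)
	}
	if id, err := launchTemplateIDForUpdate(expected); err == nil {
		fmt.Println("fixed:", id) // fixed: lt-0123456789abcdef0
	}
}
```

Under this reading, any change that touches only the overrides, whether adding or removing an instance type, takes the same path, which would be consistent with h3poteto's report.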

hakman closed this as completed Feb 7, 2021

hakman mentioned this issue Feb 16, 2021: panic: runtime error on new or updated instance groups using mixedInstancesPolicy #10837 (Closed)
