[Bug]: SecurityGroupIngressRule / SecurityGroupEgressRules not reconciling and crashing EC2 Pod #1242

vibe · 2024-03-28T10:10:00Z

Is there an existing issue for this?

I have searched the existing issues

Affected Resource(s)

ec2.aws.upbound.io/v1beta1 - SecurityGroupIngressRule
ec2.aws.upbound.io/v1beta1 - SecurityGroupEgressRule
ec2.aws.upbound.io/v1beta1 - SecurityGroup

Resource MRs required to reproduce the bug

apiVersion: ec2.aws.upbound.io/v1beta1
kind: SecurityGroup
metadata:
  labels:
    id: some-id
    region: us-west-2
  name: sample-security-group
spec:
  deletionPolicy: Delete
  forProvider:
    vpcIdSelector:
      matchControllerRef: true
      matchLabels:
        region: us-west-2
  managementPolicies:
  - '*'
---
apiVersion: ec2.aws.upbound.io/v1beta1
kind: SecurityGroupIngressRule
metadata:
  name: some-rule
spec:
  deletionPolicy: Delete
  forProvider:
    fromPort: 8078
    ipProtocol: tcp
    referencedSecurityGroupIdSelector:
      matchControllerRef: true
      matchLabels:
        region: us-west-2
    region: us-east-1
    securityGroupId: sg-0eb2d04c577f1db47 #this gets autofilled on cluster after first resolution but never updates if parent security group changes
    securityGroupIdRef:
      name: sample-security-group ## this gets autofilled on cluster after first resolution but never updates if parent security group changes
    securityGroupIdSelector:
      matchControllerRef: true
      matchLabels:
        id: some-id
        region: us-west-2
    toPort: 8078
  initProvider: {}
  managementPolicies:
  - '*'

Steps to Reproduce

Create Security Group
Create SecurityGroupIngressRule

Recreate Security Group (so security id changes)
SecurityGroupIngressRule still points to old id
Update policy to resolve always on ingress rule
Ec2 pod crashes
no way out of this other than to delete security group ingress rules.

What happened?

Security Group Rules should get recreated.

SecurityGroupId is also missing from the securitygroupingress rule spec after changing policy to "always". (suspecting this causes the crash)

Relevant Error Output Snippet

{"level":"info","ts":"2024-03-28T10:05:47Z","logger":"provider-aws","msg":"Beta feature enabled","flag":"EnableBetaManagementPolicies"}
panic: runtime error: index out of range [0] with length 0

goroutine 144506 [running]:
github.com/hashicorp/terraform-provider-aws/internal/service/ec2.(*resourceSecurityGroupIngressRule).createSecurityGroupRule(0xc013192960, {0x148846f8, 0xc00e828600}, 0xc00856a420)
    github.com/hashicorp/[email protected]/internal/service/ec2/vpc_security_group_ingress_rule.go:66 +0x19c
github.com/hashicorp/terraform-provider-aws/internal/service/ec2.(*resourceSecurityGroupRule).Create(0xc013192960, {0x148846f8, 0xc00e828600}, {{{{0x14899350, 0xc00e5fb800}, {0x100a8a80, 0xc00e5fb230}}, {0x148a4a98, 0xc00a9f4be0}}, {{{0x14899350, ...}, ...}, ...}, ...}, ...)
    github.com/hashicorp/[email protected]/internal/service/ec2/vpc_security_group_ingress_rule.go:172 +0x298
github.com/hashicorp/terraform-provider-aws/internal/provider/fwprovider.(*wrappedResource).Create.func1({0x148846f8?, 0xc00e828600?}, {{{{0x14899350, 0xc00e5fb800}, {0x100a8a80, 0xc00e5fb230}}, {0x148a4a98, 0xc00a9f4be0}}, {{{0x14899350, 0xc00df7b620}, ...}, ...}, ...}, ...)
    github.com/hashicorp/[email protected]/internal/provider/fwprovider/intercept.go:222 +0x6b
github.com/hashicorp/terraform-provider-aws/internal/provider/fwprovider.(*wrappedResource).Create.interceptedHandler[...].func3({{{{0x14899350, 0xc00e5fb800}, {0x100a8a80, 0xc00e5fb230}}, {0x148a4a98, 0xc00a9f4be0}}, {{{0x14899350, 0xc00df7b620}, {0x100a8a80, 0xc00df7af90}}, ...}, ...}, ...)
    github.com/hashicorp/[email protected]/internal/provider/fwprovider/intercept.go:119 +0x231
github.com/hashicorp/terraform-provider-aws/internal/provider/fwprovider.(*wrappedResource).Create(0xc0077661c0, {0x14884730?, 0xc00e85d720?}, {{{{0x14899350, 0xc00e5fb800}, {0x100a8a80, 0xc00e5fb230}}, {0x148a4a98, 0xc00a9f4be0}}, {{{0x14899350, ...}, ...}, ...}, ...}, ...)
    github.com/hashicorp/[email protected]/internal/provider/fwprovider/intercept.go:226 +0x17c
github.com/hashicorp/terraform-plugin-framework/internal/fwserver.(*Server).CreateResource(0xc013b3be40, {0x14884730, 0xc00e85d720}, 0xc00decfca8, 0xc00decfc48)
    github.com/hashicorp/[email protected]/internal/fwserver/server_createresource.go:101 +0x578
github.com/hashicorp/terraform-plugin-framework/internal/fwserver.(*Server).ApplyResourceChange(0xc00decfe00?, {0x14884730, 0xc00e85d720}, 0xc00e85d770, 0xc00decfe00)
    github.com/hashicorp/[email protected]/internal/fwserver/server_applyresourcechange.go:57 +0x4a5
github.com/hashicorp/terraform-plugin-framework/internal/proto5server.(*Server).ApplyResourceChange(0xc013b3be40, {0x148847a0?, 0xc00ed65420?}, 0xc00e85d6d0)
    github.com/hashicorp/[email protected]/internal/proto5server/server_applyresourcechange.go:55 +0x3e5
github.com/crossplane/upjet/pkg/controller.(*terraformPluginFrameworkExternalClient).Create(0xc0130fca20, {0x148847a0, 0xc00ed65420}, {0x148fe8f0?, 0xc007988900})
    github.com/crossplane/[email protected]/pkg/controller/external_tfpluginfw.go:408 +0x193
github.com/crossplane/upjet/pkg/controller.(*terraformPluginFrameworkAsyncExternalClient).Create.func1()
    github.com/crossplane/[email protected]/pkg/controller/external_async_tfpluginfw.go:141 +0xb5
created by github.com/crossplane/upjet/pkg/controller.(*terraformPluginFrameworkAsyncExternalClient).Create in goroutine 6854
    github.com/crossplane/[email protected]/pkg/controller/external_async_tfpluginfw.go:137 +0x168
Stream closed EOF for crossplane-system/provider-aws-ec2-3d66ea2d7903-864d8c9f6-22hth (package-runtime)

Crossplane Version

1.15.1

Provider Version

1.2.1

Kubernetes Version

1.29

Kubernetes Distribution

EKS

Additional Info

No response

The text was updated successfully, but these errors were encountered:

vibe · 2024-03-28T16:38:45Z

I wanted to call out this behavior also shows up for ManagedPrefixLists and Route.
Causing reconciliation errors that are tedious to recover from, since they require manual intervention.

mbbush · 2024-03-29T18:13:56Z

@vibe Thanks for the bug report. Anything that causes the provider pod to crash is concerning.

Could you provide a bit more detail? In particular, it would be useful to be able to see the output of kubectl get -o yaml for the relevant resources at each step. You can replace any ip addresses, aws account ids, or anything else you deem sensitive if you feel that's necessary; we don't need that information for debugging.

I'm particularly interested to know what precisely you are doing with

Update policy to resolve always on ingress rule

and what you mean by

SecurityGroupId is also missing from the securitygroupingress rule spec after changing policy to "always". (suspecting this causes the crash)

I think I know what you mean, but a yaml manifest is the most clear way to explain that without any ambiguity.

Also, I'm curious to know what happens if you add "either wait for 10 minutes or make some edit to any annotation on the SecurityGroupIngressRule" to your STRs in between recreating the security group and setting the policy to required. I think that might trigger the Ref field to resolve to the new SecurityGroup, although that may end up producing a different invalid state. It would be good to know what happens regardless.

Finally, please put your yaml manifests in ``` triple backticks, so that github will preserve the whitespace. Otherwise they become very difficult to read.

My first impression is that the panic is a bug in the terraform aws provider, but perhaps one that we can avoid triggering through better validation logic.

haarchri · 2024-08-06T10:07:10Z

the problem here is if someone missing to specify one of

referencedSecurityGroupId
prefixListId
cidrIpv4
cidrIpv6

the provider will crash - i wonder if we can enhance upjet to add some CEL rules

an easier version of CEL validation we doing in our reference platform - it will not cover the full combinations of inputs but it would be a good start https://github.com/upbound/configuration-aws-securitygroup/blob/main/apis/definition.yaml#L49

                        x-kubernetes-validations:
                          - rule: "!(has(self.cidrBlocks) && has(self.sourceSecurityGroupName))"
                            message: "cidrBlocks cannot be set together with sourceSecurityGroupName."
                          - rule: "!(has(self.fromPort) && has(self.toPort)) || (has(self.cidrBlocks) || has(self.sourceSecurityGroupName))"
                            message: "When fromPort and toPort are specified, either cidrBlocks or sourceSecurityGroupName must be set."
                          - rule: "!has(self.isSelf) || (self.isSelf == false || (!has(self.cidrBlocks) && !has(self.sourceSecurityGroupName)))"
                            message: "When isSelf is true, neither cidrBlocks nor sourceSecurityGroupName can be specified."
                          - rule: "!(self.protocol == '-1' && (has(self.fromPort) || has(self.toPort)))"
                            message: "When protocol is -1, toPort and fromPort cannot be set."

the other issue we can see if policy: always is used for selector/reference the pod crashes

vibe added the bug Something isn't working label Mar 28, 2024

github-actions bot added the community label Mar 28, 2024

jeanduplessis removed the community label Apr 15, 2024

github-actions bot added the community label Apr 15, 2024

turkenf added needs:information needs:triage and removed community labels May 8, 2024

This was referenced Aug 22, 2024

Recover from panics in async external clients crossplane/upjet#428

Merged

Bump Upjet to commit 2e361ad3b6e6 #1468

Merged

mergenci closed this as completed in #1468 Aug 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug]: SecurityGroupIngressRule / SecurityGroupEgressRules not reconciling and crashing EC2 Pod #1242

[Bug]: SecurityGroupIngressRule / SecurityGroupEgressRules not reconciling and crashing EC2 Pod #1242

vibe commented Mar 28, 2024 •

edited by mergenci

Loading

vibe commented Mar 28, 2024

mbbush commented Mar 29, 2024 •

edited

Loading

haarchri commented Aug 6, 2024 •

edited

Loading

[Bug]: SecurityGroupIngressRule / SecurityGroupEgressRules not reconciling and crashing EC2 Pod #1242

[Bug]: SecurityGroupIngressRule / SecurityGroupEgressRules not reconciling and crashing EC2 Pod #1242

Comments

vibe commented Mar 28, 2024 • edited by mergenci Loading

Is there an existing issue for this?

Affected Resource(s)

Resource MRs required to reproduce the bug

Steps to Reproduce

What happened?

Relevant Error Output Snippet

Crossplane Version

Provider Version

Kubernetes Version

Kubernetes Distribution

Additional Info

vibe commented Mar 28, 2024

mbbush commented Mar 29, 2024 • edited Loading

haarchri commented Aug 6, 2024 • edited Loading

vibe commented Mar 28, 2024 •

edited by mergenci

Loading

mbbush commented Mar 29, 2024 •

edited

Loading

haarchri commented Aug 6, 2024 •

edited

Loading