
Unexpected Load Balancer Deletion with delete_protection=true When Using IngressGroup in AWS Load Balancer Controller #3817

Open
sergeylanzman opened this issue Aug 21, 2024 · 2 comments
Labels
triage/accepted Indicates an issue or PR is ready to be actively worked on. triage/unresolved Indicates an issue that can not or will not be resolved.

Comments


sergeylanzman commented Aug 21, 2024

Describe the bug
We are using the AWS Load Balancer Controller with a single Ingress for multiple requests. We decided to separate these into multiple Ingresses using IngressGroup. All our load balancers are marked with the annotation delete_protection=true. However, after adding the groupname annotation to the Ingress, the controller unexpectedly deleted the load balancer and created a new one. This behavior is unexpected since delete_protection=true was set.

If I try to delete the Ingress itself, the load balancer is not deleted. But when the annotations change, the controller attempts to delete the load balancer, which fails with an OperationNotPermittedException. After this error, the controller disables delete protection and then deletes the load balancer anyway.

This issue is critical: it affects production environments by causing downtime, since DNS must be updated, all targets re-registered, and so on.
The issue seems to originate from the code in load_balancer_synthesizer.go. Specifically, this section appears to be responsible for the unexpected behavior, and it may need to be removed or modified to prevent downtime and issues for customers.

Steps to reproduce

  • Mark a load balancer with the delete_protection=true annotation.
  • Add the group.name annotation to an existing Ingress.
  • Observe that the controller deletes and recreates the load balancer despite delete protection being enabled.

Expected outcome
When delete_protection=true is set, the controller should not delete the load balancer regardless of changes to annotations. Additionally, the controller should not disable delete_protection to delete the load balancer.
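
For reference, the attribute can be checked directly on the ALB (the load balancer ARN below is a placeholder):

```sh
aws elbv2 describe-load-balancer-attributes \
  --load-balancer-arn <load-balancer-arn> \
  --query "Attributes[?Key=='deletion_protection.enabled']"
```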

Environment
AWS Load Balancer Controller version: v2.8.2
Kubernetes version: 1.28
Using EKS: yes (1.28)
Additional Context
Adding an IngressGroup name should not cause downtime. Currently, however, the only way to add one appears to be to create a new Ingress and load balancer and then switch traffic over, which is complicated and introduces downtime. It might be worth providing a way to set a default IngressGroup name, or otherwise improving the process so it avoids downtime.

Relevant Issues

Issue #1: #2271
Issue #2: #3034

@shraddhabang
Collaborator

Hello @sergeylanzman, as documented here, the ALB for an IngressGroup is found by searching for an AWS tag ingress.k8s.aws/stack whose value is the name of the IngressGroup. So when you convert your Ingress to use an IngressGroup, the required tag is not found, and the controller creates a new ALB with the required tag and deletes the old one. If you want to keep the existing ALB and avoid the downtime, you can change the ingress.k8s.aws/stack tag to your new group name and then apply the group.name annotation change. That way the old ALB will not be deleted and there will be no downtime. I tried this on my setup; can you give it a try?
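
For example, roughly (the load balancer ARN, Ingress name, and group name below are placeholders):

```sh
# Re-point the existing ALB's stack tag at the new group name
aws elbv2 add-tags \
  --resource-arns <load-balancer-arn> \
  --tags Key=ingress.k8s.aws/stack,Value=my-group

# Then apply the group.name annotation on the Ingress
kubectl annotate ingress my-app \
  'alb.ingress.kubernetes.io/group.name=my-group'
```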

@shraddhabang shraddhabang added triage/accepted Indicates an issue or PR is ready to be actively worked on. triage/unresolved Indicates an issue that can not or will not be resolved. labels Aug 21, 2024
@sergeylanzman
Author

@shraddhabang
I understand why the ALB was recreated due to the missing ingress.k8s.aws/stack tag when converting to an IngressGroup. However, from my point of view, relying on manual tag updates in a production environment feels risky and somewhat hacky, especially as it's not clearly documented.

I believe that if a resource is flagged with delete_protection=true, the controller—or any other tool—should strictly adhere to that flag and not delete the resource unless the flag is manually unset. This behavior is consistent across other AWS services with delete_protection (e.g., RDS, DynamoDB, Shield, etc.). I am not aware of any AWS tools (Terraform, AWS CLI, AWS CDK, AWS SDK, etc.) that bypass this protection without manual intervention, including other AWS controllers from aws-controllers-k8s.

In my opinion, the current behavior makes the delete_protection flag effectively unusable in scenarios where dynamic changes are required. It would be more reliable, and easier, if the controller could automatically update the necessary tags or, at the very least, respect the delete_protection flag strictly enough to prevent unintentional deletions.
