
Investigate potential subnet/AZ trouble #646

Open
tlake opened this issue Oct 10, 2019 · 0 comments

tlake (Contributor) commented Oct 10, 2019

We've observed services flapping with L0 v0.11.0: it seems that a service is sometimes brought up in a subnet that isn't among the load balancer's subnets, which causes the health check to fail and the task to be terminated and restarted. This flapping continues until the service happens to be brought up in a subnet associated with the load balancer, or until a user manually adds the missing subnet to the load balancer.

  • We can't repro it with the previous L0 version v0.10.10, so it seems to be new to v0.11.0.
  • We can repro in v0.11.0 regardless of whether the load balancer was created with cross-zone load balancing enabled or disabled, so it's probably not caused by the new cross-zone load balancing feature.
  • We can repro in v0.11.0 with an environment using the same instance AMI that v0.10.10 used by default, so it's probably not caused by having updated the AMI to "latest."
  • We've confirmed that the ECS Agent version and Docker version are the same between v0.10.10 and v0.11.0, so it's not those either.
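To make the suspected failure mode concrete, here is a minimal sketch of the mismatch check described above. The `subnetCovered` helper and the subnet IDs are hypothetical, invented for illustration; the point is simply that a task placed in a subnet the load balancer isn't attached to can never pass its health check.

```go
package main

import "fmt"

// subnetCovered reports whether the subnet a task landed in is one of the
// subnets attached to the load balancer. (Hypothetical helper, not part of
// the Layer0 codebase.)
func subnetCovered(taskSubnet string, lbSubnets []string) bool {
	for _, s := range lbSubnets {
		if s == taskSubnet {
			return true
		}
	}
	return false
}

func main() {
	lbSubnets := []string{"subnet-aaa", "subnet-bbb"}

	// Task landed in an attached subnet: health checks can reach it.
	fmt.Println(subnetCovered("subnet-aaa", lbSubnets)) // true

	// Task landed in an unattached subnet: the flapping scenario.
	fmt.Println(subnetCovered("subnet-ccc", lbSubnets)) // false
}
```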

The list of subnets is generated by l0-setup and then spat out as environment variables. It's possible that there's some bug in l0-setup that's gone unnoticed until now.

There's a comment in api/backend/ecs/load_balancer_manager.go in reference to the getSubnetsAndAvailZones() function that may be worth investigating:

// this is awkward, strongly assumes that PrivateSubnets will be distributed across AZs,
// using each at most once.  We error out on bad config for now, in the future we'll
// need to do something to calculate which subnets to use based on where the instance
// got provisioned.
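As a sketch of the invariant that comment describes, the check below errors out when two private subnets share an availability zone ("using each at most once"). The `Subnet` type and `validateSubnetDistribution` function are hypothetical stand-ins, not the actual code in load_balancer_manager.go:

```go
package main

import "fmt"

// Subnet pairs a subnet ID with its availability zone.
// (Hypothetical type for illustration only.)
type Subnet struct {
	ID string
	AZ string
}

// validateSubnetDistribution enforces the invariant from the quoted comment:
// each availability zone may be used by at most one private subnet, and we
// error out on bad config rather than choosing between the duplicates.
func validateSubnetDistribution(subnets []Subnet) error {
	seen := map[string]string{}
	for _, s := range subnets {
		if prev, ok := seen[s.AZ]; ok {
			return fmt.Errorf("subnets %s and %s share availability zone %s", prev, s.ID, s.AZ)
		}
		seen[s.AZ] = s.ID
	}
	return nil
}

func main() {
	good := []Subnet{{"subnet-aaa", "us-west-2a"}, {"subnet-bbb", "us-west-2b"}}
	bad := []Subnet{{"subnet-aaa", "us-west-2a"}, {"subnet-ccc", "us-west-2a"}}

	fmt.Println(validateSubnetDistribution(good)) // no error
	fmt.Println(validateSubnetDistribution(bad))  // error: shared AZ
}
```

If the subnets coming out of l0-setup ever violate this invariant, a check like this would surface it at setup time instead of as flapping tasks at runtime.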

We're not sure what would have changed between v0.10.10 and v0.11.0 that would have started making this a problem, but we haven't ruled it out yet either.

It also might be worth investigating the AWS Terraform provider and whether it's different between Terraform v0.11.x and v0.12.x. If the underlying logic that the provider uses has changed, and if only the v0.12.x provider has those changes, it could be the source of our troubles. If so, making Layer0 compatible with Terraform v0.12.x would be required to solve the problem.

tlake added this to the v0.11.2 milestone Oct 11, 2019