
Investigate potential subnet/AZ trouble #646

Open
tlake opened this issue Oct 10, 2019 · 0 comments

tlake (Contributor) commented Oct 10, 2019

We've observed services flapping with L0 v0.11.0: it seems that a service is sometimes brought up in a subnet that isn't among the load balancer's subnets, which causes the health check to fail and the task to be terminated and restarted. This flapping continues until the service happens to be brought up in a subnet associated with the load balancer, or until a user manually adds the missing subnet to the load balancer.

  • We can't repro it with the previous L0 version v0.10.10, so it seems to be new to v0.11.0.
  • We can repro in v0.11.0 regardless of whether the load balancer was created with cross-zone load balancing enabled or disabled, so it's probably not caused by the new cross-zone load balancing feature.
  • We can repro in v0.11.0 with an environment using the same instance AMI that v0.10.10 used by default, so it's probably not caused by having updated the AMI to "latest."
  • We've confirmed that the ECS Agent version and Docker version are the same between v0.10.10 and v0.11.0, so it's not those either.
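To make the suspected failure mode concrete, here is a minimal sketch of the mismatch check described above. The `subnetCovered` helper and the subnet IDs are hypothetical, invented for illustration; the point is simply that a task placed in a subnet the load balancer isn't attached to can never pass its health check.

```go
package main

import "fmt"

// subnetCovered reports whether the subnet a task landed in is one of the
// subnets attached to the load balancer. (Hypothetical helper, not part of
// the Layer0 codebase.)
func subnetCovered(taskSubnet string, lbSubnets []string) bool {
	for _, s := range lbSubnets {
		if s == taskSubnet {
			return true
		}
	}
	return false
}

func main() {
	lbSubnets := []string{"subnet-aaa", "subnet-bbb"}

	// Task landed in an attached subnet: health checks can reach it.
	fmt.Println(subnetCovered("subnet-aaa", lbSubnets)) // true

	// Task landed in an unattached subnet: the flapping scenario.
	fmt.Println(subnetCovered("subnet-ccc", lbSubnets)) // false
}
```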

The list of subnets is generated by l0-setup and then spat out as environment variables. It's possible that there's some bug in l0-setup that's gone unnoticed until now.

There's a comment in api/backend/ecs/load_balancer_manager.go in reference to the getSubnetsAndAvailZones() function that may be worth investigating:

// this is awkward, strongly assumes that PrivateSubnets will be distributed across AZs,
// using each at most once.  We error out on bad config for now, in the future we'll
// need to do something to calculate which subnets to use based on where the instance
// got provisioned.
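As a sketch of the invariant that comment describes, the check below errors out when two private subnets share an availability zone ("using each at most once"). The `Subnet` type and `validateSubnetDistribution` function are hypothetical stand-ins, not the actual code in load_balancer_manager.go:

```go
package main

import "fmt"

// Subnet pairs a subnet ID with its availability zone.
// (Hypothetical type for illustration only.)
type Subnet struct {
	ID string
	AZ string
}

// validateSubnetDistribution enforces the invariant from the quoted comment:
// each availability zone may be used by at most one private subnet, and we
// error out on bad config rather than choosing between the duplicates.
func validateSubnetDistribution(subnets []Subnet) error {
	seen := map[string]string{}
	for _, s := range subnets {
		if prev, ok := seen[s.AZ]; ok {
			return fmt.Errorf("subnets %s and %s share availability zone %s", prev, s.ID, s.AZ)
		}
		seen[s.AZ] = s.ID
	}
	return nil
}

func main() {
	good := []Subnet{{"subnet-aaa", "us-west-2a"}, {"subnet-bbb", "us-west-2b"}}
	bad := []Subnet{{"subnet-aaa", "us-west-2a"}, {"subnet-ccc", "us-west-2a"}}

	fmt.Println(validateSubnetDistribution(good)) // no error
	fmt.Println(validateSubnetDistribution(bad))  // error: shared AZ
}
```

If the subnets coming out of l0-setup ever violate this invariant, a check like this would surface it at setup time instead of as flapping tasks at runtime.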

We're not sure what would have changed between v0.10.10 and v0.11.0 that would have started making this a problem, but we haven't ruled it out yet either.

It also might be worth investigating the AWS Terraform provider and whether it's different between Terraform v0.11.x and v0.12.x. If the underlying logic that the provider uses has changed, and if only the v0.12.x provider has those changes, it could be the source of our troubles. If so, making Layer0 compatible with Terraform v0.12.x would be required to solve the problem.

tlake added this to the v0.11.2 milestone Oct 11, 2019