We've observed services flapping with L0 v0.11.0. It seems that a service is sometimes brought up in a subnet that isn't one of the load balancer's subnets, which causes the health check to fail and the task to be terminated and restarted. The flapping continues until the service happens to be brought up in a subnet associated with the load balancer, or until a user manually adds the missing subnet to the load balancer.
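To make the failure condition concrete, here is a minimal sketch of the check involved: given the subnets where instances can land and the subnets attached to the load balancer, report the ones the load balancer is missing. The function name `missingSubnets` and the subnet IDs are hypothetical, not taken from the Layer0 code base, and how the two lists are obtained (from the l0-setup output or from AWS) is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// missingSubnets returns, sorted and de-duplicated, every subnet in
// instanceSubnets that is not attached to the load balancer.
// Hypothetical helper for illustration; not part of Layer0.
func missingSubnets(instanceSubnets, lbSubnets []string) []string {
	attached := make(map[string]bool, len(lbSubnets))
	for _, s := range lbSubnets {
		attached[s] = true
	}

	seen := make(map[string]bool)
	var missing []string
	for _, s := range instanceSubnets {
		if !attached[s] && !seen[s] {
			seen[s] = true
			missing = append(missing, s)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Made-up subnet IDs: instances may land in three subnets,
	// but the load balancer only knows about two of them.
	instanceSubnets := []string{"subnet-aaa", "subnet-bbb", "subnet-ccc"}
	lbSubnets := []string{"subnet-aaa", "subnet-bbb"}

	fmt.Println(missingSubnets(instanceSubnets, lbSubnets))
}
```

A task placed in any subnet this reports would be expected to fail its health check, which matches the flapping described above.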
We can't repro it with the previous L0 version v0.10.10, so it seems to be new to v0.11.0.
We can repro in v0.11.0 regardless of whether the load balancer was created with cross-zone enabled or disabled, so it's probably not caused by the new cross-zone load balancing feature.
We can repro in v0.11.0 with an environment using the same instance AMI that v0.10.10 used by default, so it's probably not caused by having updated the AMI to "latest."
We've confirmed that the ECS Agent version and Docker version are the same between v0.10.10 and v0.11.0, so it's not those either.
The list of subnets is generated by l0-setup and then spat out as environment variables. It's possible that there's some bug in l0-setup that's gone unnoticed until now.
There's a comment in api/backend/ecs/load_balancer_manager.go in reference to the getSubnetsAndAvailZones() function that may be worth investigating:
// this is awkward, strongly assumes that PrivateSubnets will be distributed across AZs,
// using each at most once. We error out on bad config for now, in the future we'll
// need to do something to calculate which subnets to use based on where the instance
// got provisioned.
We're not sure what would have changed between v0.10.10 and v0.11.0 that would have started making this a problem, but we haven't ruled it out yet either.
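The assumption in that comment (each availability zone is used by at most one subnet) can be checked on its own. The following is only a sketch of such a check, not the actual getSubnetsAndAvailZones() implementation; the `subnet` type, the `checkOneSubnetPerAZ` name, and the sample IDs and zones are all hypothetical.

```go
package main

import "fmt"

// subnet pairs a subnet ID with its availability zone.
// Hypothetical type for illustration; not the Layer0 data model.
type subnet struct {
	ID string
	AZ string
}

// checkOneSubnetPerAZ returns an error if two subnets share an
// availability zone, i.e. if the "each AZ at most once" assumption
// described in the comment does not hold for this configuration.
func checkOneSubnetPerAZ(subnets []subnet) error {
	byAZ := make(map[string]string, len(subnets))
	for _, s := range subnets {
		if prev, ok := byAZ[s.AZ]; ok {
			return fmt.Errorf("subnets %s and %s are both in availability zone %s", prev, s.ID, s.AZ)
		}
		byAZ[s.AZ] = s.ID
	}
	return nil
}

func main() {
	// Made-up configuration with two subnets in the same zone.
	subnets := []subnet{
		{ID: "subnet-aaa", AZ: "us-west-2a"},
		{ID: "subnet-bbb", AZ: "us-west-2b"},
		{ID: "subnet-ccc", AZ: "us-west-2a"},
	}

	if err := checkOneSubnetPerAZ(subnets); err != nil {
		fmt.Println("bad config:", err)
	}
}
```

Running a check like this against the subnet list that l0-setup produces for both versions might show whether the assumption started failing in v0.11.0.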
It also might be worth investigating the AWS Terraform provider and whether it's different between Terraform v0.11.x and v0.12.x. If the underlying logic that the provider uses has changed, and if only the v0.12.x provider has those changes, it could be the source of our troubles. If so, making Layer0 compatible with Terraform v0.12.x would be required to solve the problem.