
Add-ons management Race condition with Node Groups #1801

Closed
jmickle opened this issue Jan 20, 2022 · 33 comments · Fixed by #1840

@jmickle commented Jan 20, 2022

Description

During module creation with the "coredns" add-on and a managed node group, there is a race condition that I have hit a few times. Because the order in which resources are created is not deterministic, the coredns add-on can be created before the managed node group exists. The add-on then sits in a "wait for creation" condition that fails after roughly 20 minutes. On a subsequent plan/apply, the ordering shifts: the node group is created first, and coredns is then destroyed and recreated successfully.

Versions

  • Terraform: latest
  • Provider(s): latest
  • Module: latest

Reproduction

Steps to reproduce the behavior:

N/A

Code Snippet to Reproduce

Expected behavior

The CoreDNS add-on should probably have a depends_on block on the node group.
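
A minimal sketch of what that could look like when the add-on is managed as a standalone resource; the resource names below are illustrative, not the module's internals:

resource "aws_eks_addon" "coredns" {
  cluster_name = aws_eks_cluster.this.name
  addon_name   = "coredns"

  # Wait for worker nodes so the CoreDNS pods have somewhere to schedule
  depends_on = [aws_eks_node_group.default]
}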

Actual behavior

Terminal Output Screenshot(s)

Additional context

Happy to submit a PR; I just have not had a chance to get familiar with the actual module code yet. I will try to dig into it this weekend, but I wanted to open an issue first, since for someone more familiar with the module this could easily be a five-minute fix.

@bryantbiggs (Member)

We cannot control the behavior of CoreDNS in this module - it's simply a configuration within the addon resource.

@jmickle (Author) commented Jan 20, 2022

Thinking out loud here: at a high level, couldn't a depends_on block be added to the module's add-ons so that the node group module runs first? Of course, that would assume a node group is being created as part of the configuration.

@bryantbiggs (Member)

I believe this is only an issue when deploying everything at the same time from zero. Not great if you are trying to do ephemeral clusters and launching them from scratch all the time, but that seems like a niche use case.

@wojtekszpunar

There is an additional problem with it.
Before add-ons were moved into the module, I had a separate module that created them, and everything was quite quick: maybe 2 minutes or so.
Once we moved to the module-managed CoreDNS add-on, our cluster boot time grew by an extra 15 minutes or more, which I believe is due to this race condition (I can see the deployment stuck on add-on installation, and in the console the add-on is in a degraded state).
I believe there is a connection between node creation and the add-on going into a degraded state when add-ons are installed from the module without waiting for nodes (aws/containers-roadmap#1389).

This happens nearly every time you deploy a fresh cluster.

@daroga0002 (Contributor)

Maybe an explicit dependency from the add-ons to the node groups would help here?

@jmickle (Author) commented Jan 26, 2022

I have to admit I am stuck on a good approach to actually solving this. A fix might require the module to make assumptions about the cluster that it probably should not make, since it would effectively enforce one way of managing the node group, which is not necessarily the desired approach. Maybe it is best to split add-on management out of this module entirely?

@jjhidalgar (Contributor)

I keep the add-ons separate from the module. When I want to create a cluster, I first run terraform apply with -target=module.eks, and then a plain terraform apply for the additional things, including the add-ons. It works decently.

@bryantbiggs (Member) commented Jan 27, 2022

The examples all work, and yes, there is an added wait period for CoreDNS - what is the exact issue we are trying to solve here?

@wojtekszpunar

I think the main objective here is to reduce the time needed to install the CoreDNS add-on. The problem, I believe, is timing: when you deploy a new cluster with this add-on enabled, you can watch it go into a degraded state in the AWS EKS console.

The add-on resource waits for the EKS control plane and then starts deploying CoreDNS. The problem (I think) is that at that point there are no nodes yet (node groups or otherwise), so the internal kube-dns deployment is pending when we try to update it (this relates to the bug reported in the AWS containers roadmap). The EKS add-on implementation then marks the add-on as degraded and keeps retrying, resulting in an average 14-minute delay to install just that one add-on.

Moving it into the module might help, but might not, because depending on your choice of nodes you might need a depends_on for particular node types. I previously had an implementation (before v18) where add-ons were part of an external module that installed other Kubernetes resources, and I never saw this issue, most likely because EKS created both the control plane and the data plane (nodes) and only then proceeded to install add-ons and other resources.

I think the way to fix this is either to temporarily remove the implementation, or to document the consequences and wait until AWS fixes the issue. Another option is to add a depends_on to the add-on resource so it waits on the data plane (whichever flavor you choose), which might be tricky.

@bryantbiggs (Member)

Right, so let's break this down to specifics:

  • this only happens when spinning up a brand-new cluster and provisioning CoreDNS at the same time as cluster creation
  • the perceived issue from the statement above is that it "takes too long"

I don't know what the long-term effects of adding a depends_on would be, but it seems less than ideal, because then we are saying the add-ons are mostly node add-ons. What if an add-on is geared toward the control plane and now has to wait for nodes? That could send things into a deadlock.

I think there are two things here that work today without modification to the module:

  1. Users can opt out of the module-provided addon functionality and provide their own addon resource (see the sketch after this list)
  2. Wait for the upstream resolution identified in the link above: [EKS Add-On] [CoreDNS]: Patched Add-On never recovers from 'Degraded' State (aws/containers-roadmap#1389)
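
For option 1, a minimal sketch of opting out, assuming the v18 cluster_addons input (other required module arguments omitted for brevity):

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  cluster_name    = "example"
  cluster_version = "1.21"

  # No cluster_addons here: add-ons are managed outside the module
}

resource "aws_eks_addon" "coredns" {
  cluster_name = module.eks.cluster_id
  addon_name   = "coredns"

  # Depending on the whole module also waits on its node groups
  depends_on = [module.eks]
}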

@jmickle (Author) commented Feb 1, 2022

@bryantbiggs Apologies, I have been OOO, but the perceived issue is not that it "takes too long". It's that the creation of a new EKS cluster using the addons block may never succeed, because creating the coredns add-on requires a managed node group. If the add-on attempts to create before the node group, it will always fail, since it can never reach a "stable" state.

@jmickle (Author) commented Feb 1, 2022

Comments about the duration taking too long are scope creep on this issue report. The real issue is the dependency: creating an add-on like coredns requires, at minimum, the default node group for kube-system. If the default node group is not created, or the add-on is created in Terraform before the node group, Terraform inherently waits until the timeout and fails, and the EKS module ends up in a "failed to create cluster" state, so the next run is a full destroy and recreate. At that point you have to hope the resources happen to be created in the right order so that the apply succeeds.

@jmickle (Author) commented Feb 1, 2022

Adding this for further review: the issue is also present with the EBS CSI driver (aws-ebs-csi-driver).

I am really starting to think this module should not manage add-ons; they should be independent of this module. As AWS continues to release add-ons, primarily ones from kubernetes-sigs, this will inherently become more of a problem.

I also do not believe we should accept the idea of running targets first and then a full apply; that feels like defeat of the notion of IaC.

@bryantbiggs (Member)

Users are free to not enable add-ons from within the module and instead manage them externally

@paul-pop commented Feb 2, 2022

One easy (but manual) way to fix it is to head to the console after your node has been created and joined the cluster, while Terraform is still attempting to create the coredns add-on, and:

  1. Head over to the Add-ons tab of your cluster
  2. Select coredns and click Edit
  3. Tick "Override existing configuration for this add-on on the cluster" and press Update

This takes a few seconds; once the add-on shows as Active, Terraform stops waiting and marks it as completed.

@miseyu commented Feb 3, 2022

The same phenomenon also occurs with aws-ebs-csi-driver.

I was able to avoid this by managing them individually.

resource "aws_eks_addon" "coredns" {
  cluster_name = module.eks.cluster_id
  addon_name   = "coredns"

  resolve_conflicts = "OVERWRITE"

  depends_on = [module.eks.eks_managed_node_groups]
}

resource "aws_eks_addon" "kube_proxy" {
  cluster_name = module.eks.cluster_id
  addon_name   = "kube-proxy"

  resolve_conflicts = "OVERWRITE"
}

resource "aws_eks_addon" "aws_ebs_csi_driver" {
  cluster_name = module.eks.cluster_id
  addon_name   = "aws-ebs-csi-driver"

  resolve_conflicts = "OVERWRITE"

  depends_on = [module.eks.eks_managed_node_groups]
}

@philicious (Contributor)

Users are free to not enable add-ons from within the module and instead manage them externally
[...]
what if we have an addon that is geared towards the control plane and now it has to wait for nodes which could send things into a deadlock waiting

IMHO, if a dependency on the node group isn't desired, then at least the documentation should have a clear warning about this problem and encourage people to manage add-ons separately, as others have suggested here

@bryantbiggs (Member) commented Feb 3, 2022

imho, if a dependency on node-group isnt desired, then at least the documentation should have a clear warning about this problem and encourage people to manage addons separately like others suggested here

#1801 (comment)

@bryantbiggs (Member)

After further consideration, I think we can take on this change. It is an obscure case (only on fresh cluster creation) and really an issue with the upstream EKS service, but I think we can tolerate the change and still have ways to accommodate any future add-ons that may require creation prior to node groups.

@bryantbiggs (Member)

Slight change - give me a bit to put together a new PR, but a fix is coming. You can see more in my edited note on #1840.

@jmickle (Author) commented Feb 4, 2022

Thanks for reconsidering. I definitely think (and have thought from the start) that a depends_on block would help fix it. I do question whether this is the direction the module should be going, since it makes an implicit assumption. I think that in the future add-ons should be a separate module that takes an input for the node group they are expected to be placed on. Something like (default, my-custom-ng, etc.) might not be a bad option, as it could be expanded later to support taints/tolerations.
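
Purely as an illustration of that idea, a hypothetical interface for a standalone add-ons module might look like this (all variable names invented for the sketch):

variable "cluster_name" {
  description = "EKS cluster to install the add-ons into"
  type        = string
}

variable "addon_node_group" {
  description = "Map of add-on name to the node group it should wait for, e.g. { coredns = \"default\" }"
  type        = map(string)
}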

@jmickle (Author) commented Feb 4, 2022

Also, another interesting thought: if someone down the line wanted a Fargate-only EKS cluster, a separate add-ons module could allow passing in the required kube patches and extra variables to make those services run on Fargate.
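
For reference, the usual CoreDNS-on-Fargate pattern is a kube-system Fargate profile plus removing the eks.amazonaws.com/compute-type annotation from the deployment. A rough sketch, with the role and subnet references as placeholders, and assuming kubectl is already configured against the cluster:

resource "aws_eks_fargate_profile" "kube_system" {
  cluster_name           = module.eks.cluster_id
  fargate_profile_name   = "kube-system"
  pod_execution_role_arn = aws_iam_role.fargate.arn # placeholder
  subnet_ids             = var.private_subnet_ids   # placeholder

  selector {
    namespace = "kube-system"
  }
}

resource "null_resource" "patch_coredns" {
  # CoreDNS ships annotated to run on EC2; remove the annotation so Fargate can schedule it
  provisioner "local-exec" {
    command = <<-EOT
      kubectl patch deployment coredns -n kube-system --type json \
        -p '[{"op": "remove", "path": "/spec/template/metadata/annotations/eks.amazonaws.com~1compute-type"}]'
    EOT
  }

  depends_on = [aws_eks_fargate_profile.kube_system]
}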

@antonbabenko (Member)

This issue has been resolved in version 18.4.1 🎉

@kaykhancheckpoint

@antonbabenko I experienced the same issue today on 18.5.1

@luisamador commented Mar 9, 2022

I've just hit this now on the latest version of the module: 18.8.1

@KipsasJaujoj

Same issue with version 18.20.2. Not sure how the issue was resolved @antonbabenko

@0xdnL commented Apr 19, 2022

Same issue here, using module version 18.13.0 on a fresh cluster install.

@bryantbiggs (Member)

If you run into an issue, please create a new issue with the steps and configuration to reproduce (after ensuring you are using the latest version of the module). Thanks!

@0xdnL commented Apr 27, 2022

OK, I didn't have a race condition; I had misconfigured the vpc_cni_irsa, as seen in #2041 (comment)

@alexvaque commented Jun 2, 2022

This issue has been resolved in version 18.4.1 🎉

I updated to 18.19.0 and followed the instructions from @miseyu above (managing coredns, kube-proxy, and aws-ebs-csi-driver as individual aws_eks_addon resources with depends_on on the managed node groups), and now it is working :) thanks

EDITED: after recreating the EKS cluster I experienced the same issue again, so it is still not fixed. I tried upgrading to the latest providers and TF module versions, but that did not work.

@JonEustace

Same issue here on 18.26

@ecoupal-believe commented Aug 11, 2022

Is there any workaround to make this work for a fresh install? I tried depends_on, but it doesn't seem to work for me.

EDIT:
I finally used this dirty trick, which seems to work:

resource "time_sleep" "wait_3_minutes" {
  depends_on = [module.eks]

  create_duration = "3m"
}
resource "aws_eks_addon" "kube_proxy" {
  cluster_name      = module.eks.cluster_id
  addon_name        = "kube-proxy"
  depends_on = [time_sleep.wait_3_minutes]
}

resource "aws_eks_addon" "vpc_cni" {
  cluster_name      = module.eks.cluster_id
  addon_name        = "vpc-cni"
  resolve_conflicts = "OVERWRITE"
  depends_on = [time_sleep.wait_3_minutes]
}

resource "aws_eks_addon" "coredns" {
  cluster_name      = module.eks.cluster_id
  addon_name        = "coredns"
  resolve_conflicts = "OVERWRITE"
  depends_on = [aws_eks_addon.vpc_cni]
}

@github-actions (bot) commented Nov 9, 2022

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 9, 2022