
Add-ons management Race condition with Node Groups #1801

Closed
jmickle opened this issue Jan 20, 2022 · 33 comments · Fixed by #1840

@jmickle commented Jan 20, 2022

Description

During module creation with the "coredns" add-on and a managed node group, there is a race condition that I have hit a few times. Because the order in which resources are created is not deterministic, the coredns add-on can be created before the managed node group exists. The add-on then sits in a "wait for creation" condition that fails after roughly 20 minutes. On a subsequent plan/apply, the ordering shifts: the node group is created first, and coredns is then destroyed and recreated successfully.

Versions

  • Terraform: latest
  • Provider(s): latest
  • Module: latest

Reproduction

Steps to reproduce the behavior:

N/A

Code Snippet to Reproduce

Expected behavior

The CoreDNS add-on should probably have a depends_on block on the node group.
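
A minimal sketch of what that could look like when the add-on is managed as a standalone resource; the resource names below are illustrative, not the module's internals:

resource "aws_eks_addon" "coredns" {
  cluster_name = aws_eks_cluster.this.name
  addon_name   = "coredns"

  # Wait for worker nodes so the CoreDNS pods have somewhere to schedule
  depends_on = [aws_eks_node_group.default]
}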

Actual behavior

Terminal Output Screenshot(s)

Additional context

Happy to submit a PR; I just have not had a chance to get familiar with the actual module code yet. I will try to dig into it this weekend, but I wanted to open an issue first, since for someone more familiar with the module this could easily be a five-minute fix.

@bryantbiggs (Member)

We cannot control the behavior of CoreDNS in this module - it's simply a configuration within the addon resource.

@jmickle (Author) commented Jan 20, 2022

Thinking out loud here: at a high level, couldn't a depends_on block be added to the module's add-ons so that the node group module runs first? Of course, that would assume a node group is being created as part of the configuration.

@bryantbiggs (Member)

I believe this is only an issue when deploying everything at the same time from zero. Not great if you are trying to do ephemeral clusters and launching them from scratch all the time, but that seems like a niche use case.

@wojtekszpunar

There is an additional problem with it.
Before add-ons were moved into the module, I had a separate module that created them, and everything was quite quick: maybe 2 minutes or so.
Once we moved to the module-managed CoreDNS add-on, our cluster boot time grew by an extra 15 minutes or more, which I believe is due to this race condition (I can see the deployment stuck on add-on installation, and in the console the add-on is in a degraded state).
I believe there is a connection between node creation and the add-on going into a degraded state when add-ons are installed from the module without waiting for nodes (aws/containers-roadmap#1389).

This happens nearly every time you deploy a fresh cluster.

@daroga0002 (Contributor)

Maybe an explicit dependency from the add-ons to the node groups would help here?

@jmickle (Author) commented Jan 26, 2022

I have to admit I am stuck on a good approach to actually solving this. A fix might require the module to make assumptions about the cluster that it probably should not make, since it would effectively enforce one way of managing the node group, which is not necessarily the desired approach. Maybe it is best to split add-on management out of this module entirely?

@jjhidalgar (Contributor)

I keep the add-ons separate from the module. When I want to create a cluster, I first run terraform apply with -target=module.eks, and then a plain terraform apply for the additional things, including the add-ons. It works decently.

@bryantbiggs (Member) commented Jan 27, 2022

The examples all work, and yes, there is an added wait period for CoreDNS - what is the exact issue we are trying to solve here?

@wojtekszpunar

I think the main objective here is to reduce the time needed to install the CoreDNS add-on. The problem, I believe, is timing: when you deploy a new cluster with this add-on enabled, you can watch it go into a degraded state in the AWS EKS console.

The add-on resource waits for the EKS control plane and then starts deploying CoreDNS. The problem (I think) is that at that point there are no nodes yet (node groups or otherwise), so the internal kube-dns deployment is pending when we try to update it (this relates to the bug reported in the AWS containers roadmap). The EKS add-on implementation then marks the add-on as degraded and keeps retrying, resulting in an average 14-minute delay to install just that one add-on.

Moving it into the module might help, but might not, because depending on your choice of nodes you might need a depends_on for particular node types. I previously had an implementation (before v18) where add-ons were part of an external module that installed other Kubernetes resources, and I never saw this issue, most likely because EKS created both the control plane and the data plane (nodes) and only then proceeded to install add-ons and other resources.

I think the way to fix this is either to temporarily remove the implementation, or to document the consequences and wait until AWS fixes the issue. Another option is to add a depends_on to the add-on resource so it waits on the data plane (whichever flavor you choose), which might be tricky.

@bryantbiggs (Member)

Right, so let's break this down to specifics:

  • this only happens when spinning up a brand-new cluster and provisioning CoreDNS at the same time as cluster creation
  • the perceived issue from the statement above is that it "takes too long"

I don't know what the long-term effects of adding a depends_on would be, but it seems less than ideal, because then we are saying the add-ons are mostly node add-ons. What if an add-on is geared toward the control plane and now has to wait for nodes? That could send things into a deadlock.

I think there are two things here that work today without modification to the module:

  1. Users can opt out of the module-provided addon functionality and provide their own addon resource (see the sketch after this list)
  2. Wait for the upstream resolution identified in the link above: [EKS Add-On] [CoreDNS]: Patched Add-On never recovers from 'Degraded' State (aws/containers-roadmap#1389)
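
For option 1, a minimal sketch of opting out, assuming the v18 cluster_addons input (other required module arguments omitted for brevity):

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  cluster_name    = "example"
  cluster_version = "1.21"

  # No cluster_addons here: add-ons are managed outside the module
}

resource "aws_eks_addon" "coredns" {
  cluster_name = module.eks.cluster_id
  addon_name   = "coredns"

  # Depending on the whole module also waits on its node groups
  depends_on = [module.eks]
}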

@jmickle (Author) commented Feb 1, 2022

@bryantbiggs Apologies, I have been OOO, but the perceived issue is not that it "takes too long". It's that the creation of a new EKS cluster using the addons block may never succeed, because creating the coredns add-on requires a managed node group. If the add-on attempts to create before the node group, it will always fail, since it can never reach a "stable" state.

@jmickle (Author) commented Feb 1, 2022

Comments about the duration taking too long are scope creep on this issue report. The real issue is the dependency: creating an add-on like coredns requires, at minimum, the default node group for kube-system. If the default node group is not created, or the add-on is created in Terraform before the node group, Terraform inherently waits until the timeout and fails, and the EKS module ends up in a "failed to create cluster" state, so the next run is a full destroy and recreate. At that point you have to hope the resources happen to be created in the right order so that the apply succeeds.

@jmickle (Author) commented Feb 1, 2022

Adding this for further review: the issue is also present with the EBS CSI driver (aws-ebs-csi-driver).

I am really starting to think this module should not manage add-ons; they should be independent of this module. As AWS continues to release add-ons, primarily ones from kubernetes-sigs, this will inherently become more of a problem.

I also do not believe we should accept the idea of running targets first and then a full apply; that feels like defeat of the notion of IaC.

@bryantbiggs (Member)

Users are free to not enable add-ons from within the module and instead manage them externally

@paul-pop commented Feb 2, 2022

One easy (but manual) way to fix it is to head to the console after your node has been created and joined the cluster, while Terraform is still attempting to create the coredns add-on, and:

  1. Head over to the Add-ons tab of your cluster
  2. Select coredns and click Edit
  3. Tick "Override existing configuration for this add-on on the cluster" and press Update

This takes a few seconds; once the add-on shows as Active, Terraform stops waiting and marks it as completed.

@miseyu commented Feb 3, 2022

The same phenomenon also occurs with aws-ebs-csi-driver.

I was able to avoid this by managing them individually.

resource "aws_eks_addon" "coredns" {
  cluster_name = module.eks.cluster_id
  addon_name   = "coredns"

  resolve_conflicts = "OVERWRITE"

  depends_on = [module.eks.eks_managed_node_groups]
}

resource "aws_eks_addon" "kube_proxy" {
  cluster_name = module.eks.cluster_id
  addon_name   = "kube-proxy"

  resolve_conflicts = "OVERWRITE"
}

resource "aws_eks_addon" "aws_ebs_csi_driver" {
  cluster_name = module.eks.cluster_id
  addon_name   = "aws-ebs-csi-driver"

  resolve_conflicts = "OVERWRITE"

  depends_on = [module.eks.eks_managed_node_groups]
}

@philicious (Contributor)

Users are free to not enable add-ons from within the module and instead manage them externally
[...]
what if we have an addon that is geared towards the control plane and now it has to wait for nodes which could send things into a deadlock waiting

IMHO, if a dependency on the node group isn't desired, then at least the documentation should have a clear warning about this problem and encourage people to manage add-ons separately, as others have suggested here

@bryantbiggs (Member) commented Feb 3, 2022

imho, if a dependency on node-group isnt desired, then at least the documentation should have a clear warning about this problem and encourage people to manage addons separately like others suggested here

#1801 (comment)

@bryantbiggs (Member)

After further consideration, I think we can take on this change. It is an obscure case (only on fresh cluster creation) and really an issue with the upstream EKS service, but I think we can tolerate the change and still have ways to accommodate any future add-ons that may require creation prior to node groups.

@bryantbiggs (Member)

Slight change - give me a bit to put together a new PR, but a fix is coming. You can see more in my edited note on #1840.

@jmickle (Author) commented Feb 4, 2022

Thanks for reconsidering. I definitely think (and have thought from the start) that a depends_on block would help fix it. I do question whether this is the direction the module should be going, since it makes an implicit assumption. I think that in the future add-ons should be a separate module that takes an input for the node group they are expected to be placed on. Something like (default, my-custom-ng, etc.) might not be a bad option, as it could be expanded later to support taints/tolerations.
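
Purely as an illustration of that idea, a hypothetical interface for a standalone add-ons module might look like this (all variable names invented for the sketch):

variable "cluster_name" {
  description = "EKS cluster to install the add-ons into"
  type        = string
}

variable "addon_node_group" {
  description = "Map of add-on name to the node group it should wait for, e.g. { coredns = \"default\" }"
  type        = map(string)
}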

@jmickle (Author) commented Feb 4, 2022

Also, another interesting thought: if someone down the line wanted a Fargate-only EKS cluster, a separate add-ons module could allow passing in the required kube patches and extra variables to make those services run on Fargate.
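
For reference, the usual CoreDNS-on-Fargate pattern is a kube-system Fargate profile plus removing the eks.amazonaws.com/compute-type annotation from the deployment. A rough sketch, with the role and subnet references as placeholders, and assuming kubectl is already configured against the cluster:

resource "aws_eks_fargate_profile" "kube_system" {
  cluster_name           = module.eks.cluster_id
  fargate_profile_name   = "kube-system"
  pod_execution_role_arn = aws_iam_role.fargate.arn # placeholder
  subnet_ids             = var.private_subnet_ids   # placeholder

  selector {
    namespace = "kube-system"
  }
}

resource "null_resource" "patch_coredns" {
  # CoreDNS ships annotated to run on EC2; remove the annotation so Fargate can schedule it
  provisioner "local-exec" {
    command = <<-EOT
      kubectl patch deployment coredns -n kube-system --type json \
        -p '[{"op": "remove", "path": "/spec/template/metadata/annotations/eks.amazonaws.com~1compute-type"}]'
    EOT
  }

  depends_on = [aws_eks_fargate_profile.kube_system]
}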

@antonbabenko (Member)

This issue has been resolved in version 18.4.1 🎉

@kaykhancheckpoint

@antonbabenko I experienced the same issue today on 18.5.1

@luisamador commented Mar 9, 2022

I've just hit this now on the latest version of the module: 18.8.1

@KipsasJaujoj

Same issue with version 18.20.2. Not sure how the issue was resolved @antonbabenko

@0xdnL commented Apr 19, 2022

Same issue here, using module version 18.13.0 on a fresh cluster install.

@bryantbiggs (Member)

If you run into an issue, please create a new issue with the steps and configuration to reproduce (after ensuring you are using the latest version of the module). Thanks!

@0xdnL commented Apr 27, 2022

OK, I didn't have a race condition; I had misconfigured the vpc_cni_irsa, as seen in #2041 (comment)

@alexvaque commented Jun 2, 2022

This issue has been resolved in version 18.4.1 🎉

I updated to 18.19.0 and followed the instructions from @miseyu above (managing coredns, kube-proxy, and aws-ebs-csi-driver as individual aws_eks_addon resources with depends_on on the managed node groups), and now it is working :) thanks

EDITED: after recreating the EKS cluster I experienced the same issue again, so it is still not fixed. I tried upgrading to the latest providers and TF module versions, but that did not work.

@JonEustace

Same issue here on 18.26

@ecoupal-believe commented Aug 11, 2022

Is there any workaround to make this work for a fresh install? I tried depends_on, but it doesn't seem to work for me.

EDIT:
I finally used this dirty trick, which seems to work:

resource "time_sleep" "wait_3_minutes" {
  depends_on = [module.eks]

  create_duration = "3m"
}
resource "aws_eks_addon" "kube_proxy" {
  cluster_name      = module.eks.cluster_id
  addon_name        = "kube-proxy"
  depends_on = [time_sleep.wait_3_minutes]
}

resource "aws_eks_addon" "vpc_cni" {
  cluster_name      = module.eks.cluster_id
  addon_name        = "vpc-cni"
  resolve_conflicts = "OVERWRITE"
  depends_on = [time_sleep.wait_3_minutes]
}

resource "aws_eks_addon" "coredns" {
  cluster_name      = module.eks.cluster_id
  addon_name        = "coredns"
  resolve_conflicts = "OVERWRITE"
  depends_on = [aws_eks_addon.vpc_cni]
}

@github-actions (bot) commented Nov 9, 2022

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 9, 2022