Add-ons management Race condition with Node Groups #1801
We cannot control the behavior of CoreDNS in this module - it's simply a configuration within the addon resource.
Thinking out loud here, from a high-glance perspective: can't a depends_on block be added so the addon resources depend on the node_group module being created first? Of course, that would assume a node group is being constructed as part of the configuration.
I believe this is only an issue when deploying everything at the same time from zero. Not great if you are trying to do ephemeral clusters and launch them from scratch all the time, but that seems like a niche use case.
There is an additional problem with it: this happens nearly every time you deploy a fresh cluster.
Maybe an explicit dependency for addons that rely on node groups would help here?
Have to admit, I am stuck on a good approach to actually solve this. Solving it might require the module to make assumptions about the cluster that this module probably should not make, since it relates to the idea of "enforcing" a way for the node group to be managed, which might not necessarily be the desired approach. Maybe it's best to strip addon management out of this module and make that an entirely different module?!
I have the addons separate from the module, so when I want to create a cluster I first run terraform apply with -target=module.eks and then a plain terraform apply for the additional resources, including the addons. It works decently.
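For context, a minimal sketch of that split (the cluster name, version, and VPC/subnet references are placeholders, and this assumes the addon resources live in the same root module as the eks module):
# Phase 1: create the cluster and node groups only
#   terraform apply -target=module.eks
# Phase 2: create everything else, including the addons
#   terraform apply

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  cluster_name    = "example"      # placeholder
  cluster_version = "1.21"         # placeholder
  vpc_id          = var.vpc_id     # placeholder
  subnet_ids      = var.subnet_ids # placeholder

  # No cluster_addons block here - addons are defined outside the module
  eks_managed_node_groups = {
    default = {
      desired_size   = 2
      instance_types = ["t3.medium"]
    }
  }
}

resource "aws_eks_addon" "coredns" {
  cluster_name      = module.eks.cluster_id
  addon_name        = "coredns"
  resolve_conflicts = "OVERWRITE"
}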
The examples all work, and yes, there is an added wait period for CoreDNS - what is the exact issue we are trying to solve here?
I think the main objective here is to reduce the time needed to install the CoreDNS add-on. The problem, I believe, is with timing: when you start deploying a new cluster with this add-on enabled, you can observe it going into a degraded state in the AWS EKS console. The add-on resource waits for the EKS cluster (control plane) and then starts deploying CoreDNS. The problem (I think) is that at that point we do not yet have any nodes (node groups or otherwise), so the internal kube-dns deployment is in a pending state when we try to update it (this relates to a bug reported in the AWS containers roadmap). Moving it into the module might help, but might not, because depending on your choice of nodes you might have to depend_on certain node types. I think the way to fix it is to temporarily remove this implementation, or document the consequences, and wait until AWS fixes the issue.
Right, so let's break this down to specifics though:
I don't know what the long-term effects are if we add a depends_on. I think there are two things here that work today without modification to the module:
@bryantbiggs Apologies, I have been OOO, but the perceived issue is not that it "takes too long". It's that you could potentially never be able to successfully create a new EKS cluster using the addons block, because a managed node group is required for the coredns addon to become healthy. If the addon attempts to create before the node group, the addon will always fail, as it will never reach a "stable" state.
Comments about the duration taking too long are scope creep on this issue report. The real issue is the dependency when creating an addon like coredns that requires, at minimum, the default node group for kube-system. If the default node group is not created, or the addon attempts to be created in Terraform before the node group, Terraform will inherently wait until the timeout and then fail, and the EKS module will end up in a "failed to create cluster" state, so the next run requires a full destroy and recreate. At that point you have to hope the resources happen to be created in an order that allows the apply to succeed.
Adding in here for further review: the issue is also present with the EBS CSI driver addon. I am really starting to think addons should be independent of this module. As AWS continues to release addons, primarily ones from kubernetes-sigs, this will inherently become more of a problem. I also do not believe we should accept the idea of running targets first and then a full apply; that feels like defeat of the notion of IaC.
Users are free to not enable addons from within the module and instead manage them externally.
One easy (but manual) way to fix this is to go to the console after your node has been created and joined the cluster, while Terraform is still attempting to create the coredns addon, and:
This will take a few seconds and the addon will show as Active, at which moment Terraform will stop waiting and mark it as completed.
The same phenomenon also occurs with other addons. I was able to avoid this by managing them individually:
resource "aws_eks_addon" "coredns" {
  cluster_name      = module.eks.cluster_id
  addon_name        = "coredns"
  resolve_conflicts = "OVERWRITE"

  depends_on = [module.eks.eks_managed_node_groups]
}

resource "aws_eks_addon" "kube_proxy" {
  cluster_name      = module.eks.cluster_id
  addon_name        = "kube-proxy"
  resolve_conflicts = "OVERWRITE"
}

resource "aws_eks_addon" "aws_ebs_csi_driver" {
  cluster_name      = module.eks.cluster_id
  addon_name        = "aws-ebs-csi-driver"
  resolve_conflicts = "OVERWRITE"

  depends_on = [module.eks.eks_managed_node_groups]
}
IMHO, if a dependency on the node group isn't desired, then at least the documentation should have a clear warning about this problem and encourage people to manage addons separately, as others have suggested here.
After further consideration, I think we can take on this change. It is obscure (only on a fresh cluster creation) and an issue for the upstream EKS service, but I think we can tolerate this change and have ways to accommodate any future addons that may require creation prior to node groups.
Slight change - give me a bit to put together a new PR, but a fix is coming. You can see more in my edited note on #1840.
Thanks for reconsidering it. I definitely think (and have thought from the start) that a depends_on block would certainly help fix it. I do question whether this is the direction the module should be going, as it makes a sort of implicit assumption. I do think that in the future addons should be a different module that takes an input for which node group they are expected to be placed on. Something like (default, my-custom-ng, etc.) might not be a bad option, as it could be expanded later on down the line to support taints/tolerations.
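Purely as an illustration of that idea - the module path, its inputs, and the node_group placement key below are hypothetical, not an existing module:
module "eks_addons" {
  source = "./modules/eks-addons" # hypothetical separate addons module

  cluster_name = module.eks.cluster_id

  addons = {
    coredns = {
      resolve_conflicts = "OVERWRITE"
      node_group        = "default" # hypothetical: which node group the addon should wait for / target
    }
    aws-ebs-csi-driver = {
      resolve_conflicts = "OVERWRITE"
      node_group        = "default"
    }
  }
}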
Also, another interesting thought: in the event that someone down the line wanted a Fargate-only EKS cluster, a separate module for addons could potentially allow passing in the required kube patches and extra variables to make those services run on Fargate.
This issue has been resolved in version 18.4.1 🎉
@antonbabenko I experienced the same issue today on 18.5.1
I've just hit this now on the latest version of the module: 18.8.1
Same issue with the 18.20.2 version. Not sure how the issue was resolved @antonbabenko
Same issue here, using module version:
If you run into an issue, please create a new issue with the steps and configuration to reproduce (after ensuring you are using the latest version of the module). Thanks!
OK, I didn't have a race condition, but had misconfigured the
I updated to 18.19.0 and followed these instructions:
and now it is working :) Thanks. EDITED: after a new EKS recreation I experienced the same issue as before, so it is still not fixed. I tried upgrading to the latest provider and TF module versions, but that did not work.
Same issue here on 18.26
Is there any workaround to make this work for a fresh install? I tried depends_on but it doesn't seem to work for me. EDIT:
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Description
During module creation with the "coredns" add-on and a managed node group, there is an unavoidable possibility of a race condition that I have hit a few times. Based on the random order in which resources are created when spinning up a new cluster, the coredns add-on can be attempted before the managed node group, which puts it into a "wait for creation" state that fails out after 20 minutes. On a subsequent plan/apply, given that the ordering of resource creation has shifted, the node group is created first and then coredns is destroyed and recreated successfully.
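For illustration, a minimal sketch of the configuration pattern described above (not the reporter's actual configuration; the cluster name, version, and VPC/subnet references are placeholders):
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  cluster_name    = "example"      # placeholder
  cluster_version = "1.21"         # placeholder
  vpc_id          = var.vpc_id     # placeholder
  subnet_ids      = var.subnet_ids # placeholder

  # The addon is created as soon as the control plane is up...
  cluster_addons = {
    coredns = {
      resolve_conflicts = "OVERWRITE"
    }
  }

  # ...but its pods cannot be scheduled until this node group exists,
  # and Terraform may attempt the addon before the node group
  eks_managed_node_groups = {
    default = {
      min_size       = 1
      max_size       = 3
      desired_size   = 2
      instance_types = ["t3.medium"]
    }
  }
}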
Versions
Reproduction
Steps to reproduce the behavior:
N/A N/A N/A
Code Snippet to Reproduce
Expected behavior
The coredns addon should probably have a depends_on block for the node group.
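For illustration, the general shape of that dependency, sketched here with plain provider resources and placeholder role/subnet variables rather than the module's internals:
resource "aws_eks_cluster" "example" {
  name     = "example"
  role_arn = var.cluster_role_arn # placeholder

  vpc_config {
    subnet_ids = var.subnet_ids # placeholder
  }
}

resource "aws_eks_node_group" "default" {
  cluster_name    = aws_eks_cluster.example.name
  node_group_name = "default"
  node_role_arn   = var.node_role_arn # placeholder
  subnet_ids      = var.subnet_ids

  scaling_config {
    desired_size = 2
    max_size     = 3
    min_size     = 1
  }
}

resource "aws_eks_addon" "coredns" {
  cluster_name      = aws_eks_cluster.example.name
  addon_name        = "coredns"
  resolve_conflicts = "OVERWRITE"

  # Explicit dependency so CoreDNS is only created once nodes exist to schedule onto
  depends_on = [aws_eks_node_group.default]
}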
Actual behavior
Terminal Output Screenshot(s)
Additional context
Happy to submit a PR; I just have not had an opportunity to get familiar with the actual module code yet. I will happily try to get familiar with the module and PR it this weekend. However, I wanted to open an issue first, since someone more familiar with the code may know this could easily be a five-minute fix.