
Lack of network connectivity between Fargate pods and self-managed workers #1196

Closed
TBeijen opened this issue Jan 25, 2021 · 6 comments


TBeijen commented Jan 25, 2021

I have issues

I'm adding a Fargate profile to a cluster consisting of self-managed launch template workers. I notice there's no network connectivity between the Fargate pods and the pods running on the EC2 nodes. This is due to the cluster_security_group not being set on the autoscaling EC2 workers.

I'm submitting a...

  • [x] bug report
  • [ ] feature request
  • [ ] support request - read the FAQ first!
  • [ ] kudos, thank you, warm fuzzy

What is the current behavior?

Hybrid clusters consisting of Fargate pods and self-managed autoscaling groups lack network connectivity. Pods can interact with the kubelet, and all kubelets can interact with the control plane, so from a Kubernetes perspective all pods seem healthy. However, network connectivity between pods on Fargate and pods on regular EC2 instances is impossible, e.g.:

  • Ingress daemonset on regular EC2 nodes can't route traffic to Fargate pods.
  • Fargate pods can't query CoreDNS running on regular EC2 nodes.
  • Prometheus (operator) discovers pods and Fargate nodes just fine. However, scraping times out.
  • Etc.

If this is a bug, how to reproduce? Please include a code sample if relevant.

  • Create a cluster with a launch-template-based node group. Deploy an application consisting of multiple services. Spread the services over Fargate and regular deployments.
  • Observe the network problems described above.
  • Add the cluster_security_group to the launch template, similar to the code fragment below, and cycle the workers.
  • Observe the network problems resolve and the pods connecting as expected.

Adding the security group to the workers: TBeijen@c949473

    security_groups = flatten([
      local.worker_security_group_id,
      var.worker_additional_security_group_ids,
      # Added this line
      aws_eks_cluster.this[0].vpc_config[0].cluster_security_group_id,
      lookup(
        var.worker_groups_launch_template[count.index],
        "additional_security_group_ids",
        # Falls back to the module's worker group defaults
        local.workers_group_defaults["additional_security_group_ids"],
      ),
    ])

What's the expected behavior?

Full network connectivity between self-managed workers, managed node groups and Fargate pods

Are you able to fix this problem and submit a PR? Link here if you have already.

Yes

Environment details

  • Affected module version: v13.2.1
  • OS: N/A
  • Terraform version: 0.12.30
  • Terraform AWS provider version: 3.24.1

Any other relevant info

Things to consider:

  • Adding a security group to the launch template reduces the number of security groups users can still attach themselves. This will affect the module upgrade path for some users.
  • At first glance, the cluster_security_group set up by EKS appears to cover all the cases that the additional security group added by this module facilitates (full communication between the control plane, self-managed workers, managed nodes and Fargate nodes). This also seems to be hinted at by the comments in #828 (fix: Add vpc_config.cluster_security_group output as primary cluster security group id).

(Totally out of scope of just this issue) What is the status of any refactoring plans for the launch template worker groups? Also considering hard-to-fix problems like #737, which seem to originate from an over-use of random_pet.

barryib (Member) commented Jan 29, 2021

@TBeijen The current security group management is quite messy since we still support the legacy setup. Today we don't need to create the cluster security group anymore, but it sounds like we're still doing it. There is a need for code cleanup and refactoring.

That said, I think there is a variable for your use case, var.worker_create_cluster_primary_security_group_rules, even if its name is quite confusing. This switch will create a security group rule allowing pods running on the workers to receive traffic from the cluster primary security group (e.g. from Fargate pods).
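
Roughly, it boils down to security group rules like the sketch below (not the module's exact code; local.worker_security_group_id and the aws_eks_cluster.this reference are just illustrative of what the module holds internally):

    # Sketch: ingress on the worker security group from the cluster primary
    # security group, so Fargate pods can reach pods on self-managed workers.
    resource "aws_security_group_rule" "workers_ingress_cluster_primary" {
      description              = "Allow pods on workers to receive traffic from the cluster primary security group (e.g. Fargate pods)"
      type                     = "ingress"
      protocol                 = "-1"
      from_port                = 0
      to_port                  = 0
      security_group_id        = local.worker_security_group_id
      source_security_group_id = aws_eks_cluster.this[0].vpc_config[0].cluster_security_group_id
    }

    # Sketch: the mirror rule on the cluster primary security group, so traffic
    # from workers (e.g. ingress controllers, Prometheus) can reach Fargate pods.
    resource "aws_security_group_rule" "cluster_primary_ingress_workers" {
      description              = "Allow pods using the cluster primary security group (e.g. Fargate pods) to receive traffic from workers"
      type                     = "ingress"
      protocol                 = "-1"
      from_port                = 0
      to_port                  = 0
      security_group_id        = aws_eks_cluster.this[0].vpc_config[0].cluster_security_group_id
      source_security_group_id = local.worker_security_group_id
    }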

TBeijen (Author) commented Jan 29, 2021

@barryib Yup, just this morning I found worker_create_cluster_primary_security_group_rules, and indeed it accomplishes the same goal.
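
For reference, enabling it is just a matter of setting the switch on the module (minimal sketch; the VPC references, version and worker group values here are illustrative, not our real config):

    module "eks" {
      source  = "terraform-aws-modules/eks/aws"
      version = "~> 13.2"

      cluster_name    = "example"
      cluster_version = "1.18"
      vpc_id          = module.vpc.vpc_id          # illustrative VPC module reference
      subnets         = module.vpc.private_subnets # illustrative VPC module reference

      # Creates the rules between the worker SG and the cluster primary SG,
      # restoring connectivity between Fargate pods and pods on the workers.
      worker_create_cluster_primary_security_group_rules = true

      worker_groups_launch_template = [
        {
          name          = "workers"
          instance_type = "m5.large"
        }
      ]
    }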

Looking at https://github.com/terraform-aws-modules/terraform-aws-eks/pull/858/files#diff-2fdb488192d2afd49fb090fcc8bd32fd3af72bcb789420915e78d6406ef9e2e4L4, the current legacy-compatible security groups are still there. Moving workers into a submodule has great potential for cleanup. Things that spring to mind:

  • Removing the EKS <1.14 bits
  • Sticking to the (primary) cluster_security_group
  • Not creating the random_pet unless necessary (asg_recreate_on_change), to reduce proposed changes during terraform plan wherever possible (rough sketch of that idea below).
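
Roughly what I have in mind for that last point (sketch only; the variable and resource names are made up, not what the module uses today):

    # Only create the pet when recreating the ASG on launch template changes is desired.
    resource "random_pet" "workers" {
      count = var.recreate_asg_when_lt_changes ? 1 : 0

      separator = "-"
      length    = 2

      keepers = {
        # A new launch template version yields a new pet name, which in turn
        # forces the autoscaling group below to be replaced.
        lt_version = aws_launch_template.workers.latest_version
      }
    }

    resource "aws_autoscaling_group" "workers" {
      # Only include the pet name when it exists.
      name_prefix         = join("-", compact(["workers", try(random_pet.workers[0].id, "")]))
      min_size            = 1
      max_size            = 3
      vpc_zone_identifier = var.subnets

      launch_template {
        id      = aws_launch_template.workers.id
        version = aws_launch_template.workers.latest_version
      }
    }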

Is there a sort of high-level roadmap for this type of progress? I'd gladly help out (given time, which differs greatly per week).

aloisbarreras commented:

I actually had this same problem today and eventually found worker_create_cluster_primary_security_group_rules as well.

I am happy to write the code to make this simpler if the maintainers want to point me in a high level direction that will integrate nicely with the current roadmap.


stale bot commented Apr 30, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Apr 30, 2021

stale bot commented May 30, 2021

This issue has been automatically closed because it has not had recent activity since being marked as stale.

stale bot closed this as completed May 30, 2021
github-actions bot commented:

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 21, 2022