
Lack of network connectivity between Fargate pods and self-managed workers #1196

Closed
TBeijen opened this issue Jan 25, 2021 · 6 comments


TBeijen commented Jan 25, 2021

I have issues

I'm adding a Fargate profile to a cluster consisting of self-managed launch template workers. I notice there's no network connectivity between the Fargate pods and the pods running on the EC2 nodes. This is due to the cluster_security_group not being set on the autoscaling EC2 workers.

I'm submitting a...

  • [x] bug report
  • [ ] feature request
  • [ ] support request - read the FAQ first!
  • [ ] kudos, thank you, warm fuzzy

What is the current behavior?

Hybrid clusters consisting of Fargate pods and self-managed autoscaling groups lack network connectivity. Pods can interact with the kubelet, and all kubelets can interact with the control plane, so from a Kubernetes perspective all pods seem healthy. However, network connectivity between pods on Fargate and pods on regular EC2 instances is impossible, e.g.:

  • Ingress daemonset on regular EC2 nodes can't route traffic to Fargate pods.
  • Fargate pods can't query CoreDNS running on regular EC2 nodes.
  • Prometheus (operator) discovers pods and Fargate nodes just fine. However, scraping times out.
  • Etc.

If this is a bug, how to reproduce? Please include a code sample if relevant.

  • Create a cluster with a launch-template-based node group. Deploy an application consisting of multiple services. Spread the services over Fargate and regular deployments.
  • Observe the network problems described above.
  • Add the cluster_security_group to the launch template, similar to the code fragment below, and cycle the workers.
  • Observe the network problems resolve and the pods connecting as expected.

Adding the security group to the workers: TBeijen@c949473

    security_groups = flatten([
      local.worker_security_group_id,
      var.worker_additional_security_group_ids,
      # Added this line
      aws_eks_cluster.this[0].vpc_config[0].cluster_security_group_id,
      lookup(
        var.worker_groups_launch_template[count.index],
        "additional_security_group_ids",
        # Falls back to the module's worker group defaults
        local.workers_group_defaults["additional_security_group_ids"],
      ),
    ])

What's the expected behavior?

Full network connectivity between self-managed workers, managed node groups and Fargate pods

Are you able to fix this problem and submit a PR? Link here if you have already.

Yes

Environment details

  • Affected module version: v13.2.1
  • OS: N/A
  • Terraform version: 0.12.30
  • Terraform AWS provider version: 3.24.1

Any other relevant info

Things to consider:

  • Adding a security group to the launch template reduces the number of security groups users can still attach themselves. This will affect the module upgrade path for some users.
  • At first glance, the cluster_security_group set up by EKS appears to cover all the cases that the additional security group added by this module facilitates (full communication between the control plane, self-managed workers, managed nodes and Fargate nodes). This also seems to be hinted at by the comments in #828 (fix: Add vpc_config.cluster_security_group output as primary cluster security group id).

(Totally out of scope of just this issue) What is the status of any refactoring plans for the launch template worker groups? Also considering hard-to-fix problems like #737, which seem to originate from an over-use of random_pet.

barryib (Member) commented Jan 29, 2021

@TBeijen The current security group management is quite messy since we still support the legacy setup. Today we don't need to create the cluster security group anymore, but it sounds like we're still doing it. There is a need for code cleanup and refactoring.

That said, I think there is a variable for your use case, var.worker_create_cluster_primary_security_group_rules, even if its name is quite confusing. This switch will create a security group rule allowing pods running on the workers to receive traffic from the cluster primary security group (e.g. from Fargate pods).
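
Roughly, it boils down to security group rules like the sketch below (not the module's exact code; local.worker_security_group_id and the aws_eks_cluster.this reference are just illustrative of what the module holds internally):

    # Sketch: ingress on the worker security group from the cluster primary
    # security group, so Fargate pods can reach pods on self-managed workers.
    resource "aws_security_group_rule" "workers_ingress_cluster_primary" {
      description              = "Allow pods on workers to receive traffic from the cluster primary security group (e.g. Fargate pods)"
      type                     = "ingress"
      protocol                 = "-1"
      from_port                = 0
      to_port                  = 0
      security_group_id        = local.worker_security_group_id
      source_security_group_id = aws_eks_cluster.this[0].vpc_config[0].cluster_security_group_id
    }

    # Sketch: the mirror rule on the cluster primary security group, so traffic
    # from workers (e.g. ingress controllers, Prometheus) can reach Fargate pods.
    resource "aws_security_group_rule" "cluster_primary_ingress_workers" {
      description              = "Allow pods using the cluster primary security group (e.g. Fargate pods) to receive traffic from workers"
      type                     = "ingress"
      protocol                 = "-1"
      from_port                = 0
      to_port                  = 0
      security_group_id        = aws_eks_cluster.this[0].vpc_config[0].cluster_security_group_id
      source_security_group_id = local.worker_security_group_id
    }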

TBeijen (Author) commented Jan 29, 2021

@barryib Yup, just this morning I found worker_create_cluster_primary_security_group_rules, and indeed it accomplishes the same goal.
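
For reference, enabling it is just a matter of setting the switch on the module (minimal sketch; the VPC references, version and worker group values here are illustrative, not our real config):

    module "eks" {
      source  = "terraform-aws-modules/eks/aws"
      version = "~> 13.2"

      cluster_name    = "example"
      cluster_version = "1.18"
      vpc_id          = module.vpc.vpc_id          # illustrative VPC module reference
      subnets         = module.vpc.private_subnets # illustrative VPC module reference

      # Creates the rules between the worker SG and the cluster primary SG,
      # restoring connectivity between Fargate pods and pods on the workers.
      worker_create_cluster_primary_security_group_rules = true

      worker_groups_launch_template = [
        {
          name          = "workers"
          instance_type = "m5.large"
        }
      ]
    }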

Looking at https://github.com/terraform-aws-modules/terraform-aws-eks/pull/858/files#diff-2fdb488192d2afd49fb090fcc8bd32fd3af72bcb789420915e78d6406ef9e2e4L4, the current legacy-compatible security groups are still there. Moving workers into a submodule has great potential for cleanup. Things that spring to mind:

  • Removing the EKS <1.14 bits
  • Sticking to the (primary) cluster_security_group
  • Not creating the random_pet unless necessary (asg_recreate_on_change), to reduce proposed changes during terraform plan wherever possible (rough sketch of that idea below).
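
Roughly what I have in mind for that last point (sketch only; the variable and resource names are made up, not what the module uses today):

    # Only create the pet when recreating the ASG on launch template changes is desired.
    resource "random_pet" "workers" {
      count = var.recreate_asg_when_lt_changes ? 1 : 0

      separator = "-"
      length    = 2

      keepers = {
        # A new launch template version yields a new pet name, which in turn
        # forces the autoscaling group below to be replaced.
        lt_version = aws_launch_template.workers.latest_version
      }
    }

    resource "aws_autoscaling_group" "workers" {
      # Only include the pet name when it exists.
      name_prefix         = join("-", compact(["workers", try(random_pet.workers[0].id, "")]))
      min_size            = 1
      max_size            = 3
      vpc_zone_identifier = var.subnets

      launch_template {
        id      = aws_launch_template.workers.id
        version = aws_launch_template.workers.latest_version
      }
    }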

Is there a sort of high-level roadmap for this type of progress? I'd gladly help out (given time, which differs greatly per week).

aloisbarreras commented:

I actually had this same problem today and eventually found worker_create_cluster_primary_security_group_rules as well.

I am happy to write the code to make this simpler if the maintainers want to point me in a high level direction that will integrate nicely with the current roadmap.


stale bot commented Apr 30, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Apr 30, 2021

stale bot commented May 30, 2021

This issue has been automatically closed because it has not had recent activity since being marked as stale.

stale bot closed this as completed May 30, 2021
github-actions bot commented:

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 21, 2022