
AKS Advanced Networking model leads to frequent port exhaustion issues #637

Closed

strtdusty opened this issue Sep 5, 2018 · 7 comments

@strtdusty

What happened:
We experience latency on egress connections, and refused ingress connections, due to SNAT port exhaustion on our single public IP.

What you expected to happen:
Communication to be unhindered by port exhaustion issues.

How to reproduce it (as minimally and precisely as possible):
Use advanced networking with egress through a single public IP (PIP). We currently run a 6-node cluster with ~30 pods per node. Each pod has ~8 outbound connections (to things like Service Bus, Azure Storage, the management API, etc.).

Anything else we need to know?:
We use AKS with advanced networking, with egress traffic going through a single basic load balancer/PIP. This model allows us to evaluate all traffic using a next-gen firewall. I know that AKS has recently made some changes around azureproxy to help limit traffic to the master nodes, but this has only slightly helped the issue.

Environment:

  • Kubernetes version (use kubectl version): 1.10.2
  • Size of cluster (how many worker nodes are in the cluster?) 6 nodes
  • General description of workloads in the cluster: Service Bus-based microservices with differing backing storage (Cosmos DB, Storage, etc.)
  • Others:
@bremnes

bremnes commented Sep 5, 2018

@strtdusty, could this have something to do with the maximum-pods-per-node limitation for advanced networking? I noticed that you list ~30 pods per node, which is that limit.

(I haven't tried this myself, but I know of this limit because we want to move to AKS with advanced networking and are a bit skeptical of the 30-pod cap. Update: I see now that this can be increased, so I guess it's no problem.)

@strtdusty
Author

@bremnes If it is due to the 30-pod limit, then we are going to be in real trouble when we rebuild the cluster with a 100-pod limit. But yes, my understanding of the SNAT issue is that it will get worse the more pods we load onto a node. If my reading is correct, with a 101-node cluster we will get 256 SNAT ports per VM/node. Assuming ~8 outbound connections per pod and 100 pods per node, I would need at least 4 public IPs (see the rough calculation below). Does this seem correct/expected?

https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-outbound-connections#pat

Also note that you cannot change the pod limitation on an existing cluster.
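
For reference, here is a back-of-the-envelope check of that arithmetic. It is only a sketch: the SNAT pre-allocation tiers are taken from the load balancer outbound-connections doc linked above as it read at the time, so treat the exact numbers as illustrative.

```python
import math

# Pre-allocated SNAT ports per VM, per outbound frontend IP, per the basic
# load balancer outbound-connections doc linked above (illustrative values).
SNAT_TIERS = [
    (50, 1024),   # 1-50 VMs in the backend pool
    (100, 512),   # 51-100 VMs
    (200, 256),   # 101-200 VMs
    (400, 128),   # 201-400 VMs
    (800, 64),    # 401-800 VMs
    (1000, 32),   # 801-1000 VMs
]

def snat_ports_per_vm(pool_size: int) -> int:
    """Ports pre-allocated to each VM for a single outbound frontend IP."""
    for max_vms, ports in SNAT_TIERS:
        if pool_size <= max_vms:
            return ports
    raise ValueError("backend pool larger than the documented tiers")

nodes, pods_per_node, conns_per_pod = 101, 100, 8

flows_per_node = pods_per_node * conns_per_pod    # ~800 concurrent flows
ports_per_node = snat_ports_per_vm(nodes)         # 256 at 101 nodes
ips_needed = math.ceil(flows_per_node / ports_per_node)

print(f"{flows_per_node} flows/node vs {ports_per_node} ports/node/IP "
      f"-> at least {ips_needed} public IPs")     # -> at least 4
```

Since each additional outbound frontend IP appears to give every VM another block of pre-allocated ports, adding public IPs (or reducing per-pod connection churn) is what relieves the pressure.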

@tomasr

tomasr commented Sep 7, 2018

Wanted to add that, while it doesn't seem to be documented, the maximum value for the maximum number of pods per node when using advanced networking seems to be around 250. This is because, as soon as a node is added to the cluster, every possible pod IP (based on the max pods setting) is taken from the vnet subnet and added to the node's network interface, thus hitting Azure's limit on the number of IP configurations per network interface.

I guess that's probably one good reason why the setting cannot be changed after cluster creation.
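
To make the pre-allocation concrete, here is a small sketch of the arithmetic, assuming one primary IP per node plus one secondary IP per potential pod; the 256-per-NIC cap below is an assumption inferred from the ~250 observation above, not an official figure.

```python
# Sketch of Azure CNI IP pre-allocation: each node reserves its own IP plus
# one IP per potential pod, all drawn from the vnet subnet up front.
ASSUMED_MAX_IPS_PER_NIC = 256  # inferred from maxPods topping out around 250

def ips_reserved_per_node(max_pods: int) -> int:
    return max_pods + 1  # potential pods + the node's primary IP

def subnet_ips_consumed(nodes: int, max_pods: int) -> int:
    return nodes * ips_reserved_per_node(max_pods)

for max_pods in (30, 110, 250):
    per_node = ips_reserved_per_node(max_pods)
    print(f"maxPods={max_pods:>3}: {per_node:>3} IPs/node "
          f"(fits on one NIC: {per_node <= ASSUMED_MAX_IPS_PER_NIC}), "
          f"6-node cluster reserves {subnet_ips_consumed(6, max_pods)} subnet IPs")
```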

@strtdusty
Author

It is documented, and the limit is 110. I think that is poor reasoning for limiting pods per node, however. Who is to say I cannot efficiently handle all of my allocated IPs on 100 nodes versus the ~145 nodes that would be allowed in a vnet with a 110-pod limit? There is just a limit of 16,000 (it used to be 4,000) IPs per vnet; how I allocate those across nodes should not be dictated.

Our use case doesn't demand a higher density but there are probably people out there who do need it.

I agree that is probably why you can't change the density after creation. I would hope that when node pools are available, you will be able to have a different density on different pools.

@tomasr

tomasr commented Sep 8, 2018

@strtdusty That's actually incorrect, in my experience. That document talks about the default values, not the actual limits. I have a cluster I created a few weeks ago with maxPods = 250:
[screenshot: maxPods = 250]

And Azure CNI:

[screenshot]
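
Relatedly, if anyone wants to verify what an existing node actually advertises, here is a quick sketch using the official Kubernetes Python client (assuming a working kubeconfig for the cluster): each node's reported pod capacity reflects the kubelet's max-pods setting.

```python
from kubernetes import client, config

# Assumes kubectl/kubeconfig is already set up against the AKS cluster.
config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    # status.capacity["pods"] is the kubelet's advertised max pods.
    print(node.metadata.name, "max pods:", node.status.capacity["pods"])
```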

@strtdusty
Author

You are right, @tomasr, I was looking at the default limits.

@jnoller
Contributor

jnoller commented Apr 4, 2019

Closing this issue as old/stale.

If this issue still comes up, please confirm you are running the latest AKS release. If you are on the latest release and the issue can be reproduced outside of your specific cluster, please open a new GitHub issue.

If you are only seeing this behavior on clusters with a unique configuration (such as custom DNS/VNet/etc) please open an Azure technical support ticket.

@jnoller jnoller closed this as completed Apr 4, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Jul 31, 2020