
AKS Advanced Networking model leads to frequent port exhaustion issues #637

Closed

strtdusty opened this issue Sep 5, 2018 · 7 comments

@strtdusty

What happened:
We experience latency on egress connections, and refused ingress connections, due to SNAT port exhaustion on our single public IP.

What you expected to happen:
Communication to be unhindered by port exhaustion issues.

How to reproduce it (as minimally and precisely as possible):
Use advanced networking with egress through a single public IP (PIP). We currently run a 6-node cluster with ~30 pods per node. Each pod has ~8 outbound connections (to things like Service Bus, Azure Storage, the management API, etc.).

Anything else we need to know?:
We use AKS with advanced networking, with egress traffic going through a single basic load balancer/PIP. This model allows us to evaluate all traffic using a next-gen firewall. I know that AKS has recently made some changes around azureproxy to help limit traffic to the master nodes, but this has only slightly helped the issue.

Environment:

  • Kubernetes version (use kubectl version): 1.10.2
  • Size of cluster (how many worker nodes are in the cluster?) 6 nodes
  • General description of workloads in the cluster: Service Bus-based microservices with differing backing storage (Cosmos DB, Storage, etc.)
  • Others:
@bremnes

bremnes commented Sep 5, 2018

@strtdusty, could this have something to do with the maximum-pods-per-node limitation for advanced networking? I noticed that you list ~30 pods per node, which is that limit.

(I haven't tried this myself, but I know of this limit because we want to move to AKS with advanced networking and are a bit skeptical of the 30-pod cap. Update: I see now that this can be increased, so I guess it's no problem.)

@strtdusty
Author

@bremnes If it is due to the 30-pod limit, then we are going to be in real trouble when we rebuild the cluster with a 100-pod limit. But yes, my understanding of the SNAT issue is that it will get worse the more pods we load onto a node. If my reading is correct, with a 101-node cluster we will get 256 SNAT ports per VM/node. Assuming ~8 outbound connections per pod and 100 pods per node, I would need at least 4 public IPs (see the rough calculation below). Does this seem correct/expected?

https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-outbound-connections#pat

Also note that you cannot change the pod limitation on an existing cluster.
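
For reference, here is a back-of-the-envelope check of that arithmetic. It is only a sketch: the SNAT pre-allocation tiers are taken from the load balancer outbound-connections doc linked above as it read at the time, so treat the exact numbers as illustrative.

```python
import math

# Pre-allocated SNAT ports per VM, per outbound frontend IP, per the basic
# load balancer outbound-connections doc linked above (illustrative values).
SNAT_TIERS = [
    (50, 1024),   # 1-50 VMs in the backend pool
    (100, 512),   # 51-100 VMs
    (200, 256),   # 101-200 VMs
    (400, 128),   # 201-400 VMs
    (800, 64),    # 401-800 VMs
    (1000, 32),   # 801-1000 VMs
]

def snat_ports_per_vm(pool_size: int) -> int:
    """Ports pre-allocated to each VM for a single outbound frontend IP."""
    for max_vms, ports in SNAT_TIERS:
        if pool_size <= max_vms:
            return ports
    raise ValueError("backend pool larger than the documented tiers")

nodes, pods_per_node, conns_per_pod = 101, 100, 8

flows_per_node = pods_per_node * conns_per_pod    # ~800 concurrent flows
ports_per_node = snat_ports_per_vm(nodes)         # 256 at 101 nodes
ips_needed = math.ceil(flows_per_node / ports_per_node)

print(f"{flows_per_node} flows/node vs {ports_per_node} ports/node/IP "
      f"-> at least {ips_needed} public IPs")     # -> at least 4
```

Since each additional outbound frontend IP appears to give every VM another block of pre-allocated ports, adding public IPs (or reducing per-pod connection churn) is what relieves the pressure.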

@tomasr

tomasr commented Sep 7, 2018

Wanted to add that, while it doesn't seem to be documented, the maximum value for the maximum number of pods per node when using advanced networking seems to be around 250. This is because, as soon as a node is added to the cluster, every possible pod IP (based on the max pods setting) is taken from the vnet subnet and added to the node's network interface, thus hitting Azure's limit on the number of IP configurations per network interface.

I guess that's probably one good reason why the setting cannot be changed after cluster creation.
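
To make the pre-allocation concrete, here is a small sketch of the arithmetic, assuming one primary IP per node plus one secondary IP per potential pod; the 256-per-NIC cap below is an assumption inferred from the ~250 observation above, not an official figure.

```python
# Sketch of Azure CNI IP pre-allocation: each node reserves its own IP plus
# one IP per potential pod, all drawn from the vnet subnet up front.
ASSUMED_MAX_IPS_PER_NIC = 256  # inferred from maxPods topping out around 250

def ips_reserved_per_node(max_pods: int) -> int:
    return max_pods + 1  # potential pods + the node's primary IP

def subnet_ips_consumed(nodes: int, max_pods: int) -> int:
    return nodes * ips_reserved_per_node(max_pods)

for max_pods in (30, 110, 250):
    per_node = ips_reserved_per_node(max_pods)
    print(f"maxPods={max_pods:>3}: {per_node:>3} IPs/node "
          f"(fits on one NIC: {per_node <= ASSUMED_MAX_IPS_PER_NIC}), "
          f"6-node cluster reserves {subnet_ips_consumed(6, max_pods)} subnet IPs")
```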

@strtdusty
Author

It is documented, and the limit is 110. I think that is poor reasoning for limiting pods per node, however. Who is to say I cannot efficiently handle all of my allocated IPs on 100 nodes versus the ~145 nodes that would be allowed in a vnet with a 110-pod limit? There is just a limit of 16,000 (it used to be 4,000) IPs per vnet; how I allocate those across nodes should not be dictated.

Our use case doesn't demand a higher density but there are probably people out there who do need it.

I agree that is probably why you can't change the density after creation. I would hope that when node pools are available, you will be able to have a different density on different pools.

@tomasr

tomasr commented Sep 8, 2018

@strtdusty That's actually incorrect, in my experience. That document talks about the default values, not the actual limits. I have a cluster I created a few weeks ago with maxPods = 250:
[screenshot: maxPods = 250]

And Azure CNI:

[screenshot]
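
Relatedly, if anyone wants to verify what an existing node actually advertises, here is a quick sketch using the official Kubernetes Python client (assuming a working kubeconfig for the cluster): each node's reported pod capacity reflects the kubelet's max-pods setting.

```python
from kubernetes import client, config

# Assumes kubectl/kubeconfig is already set up against the AKS cluster.
config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    # status.capacity["pods"] is the kubelet's advertised max pods.
    print(node.metadata.name, "max pods:", node.status.capacity["pods"])
```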

@strtdusty
Author

You are right, @tomasr, I was looking at the default limits.

@jnoller
Contributor

jnoller commented Apr 4, 2019

Closing this issue as old/stale.

If this issue still comes up, please confirm you are running the latest AKS release. If you are on the latest release and the issue can be reproduced outside of your specific cluster, please open a new GitHub issue.

If you are only seeing this behavior on clusters with a unique configuration (such as custom DNS/VNet/etc) please open an Azure technical support ticket.

@jnoller jnoller closed this as completed Apr 4, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Jul 31, 2020