Performance degradation for high levels of in-cluster kube-apiserver traffic #620
Comments
Were any changes made to azureproxy? How can customers check if they were upgraded? |
The azureproxy deployment has disappeared from two of my clusters and all kube-svc-redirect pods are in CrashLoop... is this the effect of the upgrade? |
see #626 |
Any updates on this issue? |
Can you please add some color around the issue you are still chasing? I opened #637 today and am wondering if it is related. |
Quick update, thanks for your patience. The configuration changes for […]. As part of the rollout, there were some circumstances where incorrect limits/requests were set on […].

What changed
While we don't yet drop a release version in a customer-visible spot, you can check your […] to confirm the update (one possible way to check is sketched after this comment).

What this does
DaemonSet change: […] Traffic destined for the K8s control plane will now remain local to the host originating the traffic, spreading the "query load" across a larger number of Azure VMs. Note that this moves the goalposts rather than completely fixing the problem; the workaround performs better the more cluster members there are.

Connection timeouts: Previous idle connection timeouts were too conservative and impacted watches. The symptoms would appear as connections being aborted or closed before a watch timeout; under most circumstances, the controller/informer loop would re-initialize the watch and carry on. Timeouts are now set to a minimum of 10 minutes.

Going forward: We've seen a large decrease in connection-related issues across the fleet, but we aren't yet out of the woods. Engineering continues to work on long-term networking updates to fully address high-volume in-cluster K8s workloads. |
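A minimal verification sketch, assuming the components keep the names mentioned earlier in this thread (a kube-svc-redirect DaemonSet with an azureproxy container in kube-system); the label selector below is a guess rather than a documented AKS identifier, so adjust names to whatever `kubectl -n kube-system get ds` actually shows:

```sh
# Is the DaemonSet present and fully rolled out?
kubectl -n kube-system get daemonset kube-svc-redirect -o wide

# What requests/limits are actually set on its containers?
kubectl -n kube-system get daemonset kube-svc-redirect \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.resources}{"\n"}{end}'

# Are the pods healthy rather than crash-looping? (label selector is an assumption)
kubectl -n kube-system get pods -l component=kube-svc-redirect
```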
@slack thanks a lot for the update. Understanding what is going on with these changes is really important for us. |
Nice. Finally :) Guys, could you please confirm that restarting AKS nodes is the way to get this patch? |
We restarted our nodes to get the patch (verified that […]). […] Could you please assist with this? |
Istio recently made some changes to reduce the number of watches set on the API server (istio/istio#7675 (comment)), and this has had positive effects on stability in at least one of our AKS clusters. Is this relevant to how you continue to tune AKS, or was the impact of the number of watches already known? In any case, would it be possible to tune AKS to at least handle a somewhat higher number of watches? |
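For anyone trying to gauge watch pressure on their own cluster: the API server exposes a gauge of registered watchers on its metrics endpoint. A rough sketch, assuming RBAC allows reading the raw /metrics path and keeping in mind that metric names vary by Kubernetes version:

```sh
# List the resource kinds with the most registered watches on the API server.
kubectl get --raw /metrics \
  | grep '^apiserver_registered_watchers' \
  | sort -k2 -rn \
  | head -n 20
```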
Where is the source code for svcredirect and azureproxy? |
I have an AKS cluster in the westeurope region with 58 nodes. My Prometheus server's scrapes are timing out regularly, mostly for the cadvisor & node metric targets. (We have seen similar timeouts in various API-related operations as well.) These kinds of problems arose somewhere during the growth of this cluster from 14 nodes to 44 nodes. Since Prometheus queries the API server to scrape node & cadvisor metrics, the traffic caused by Prometheus increased almost fivefold. Could this be a reason for the API performance degradation? Scrape conf samples: […] |
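The scrape config samples above were not preserved, but as a rough way to reason about the load: when cadvisor and node metrics are pulled through the API server proxy, every scrape is one extra API-server request per node per scrape interval. A back-of-the-envelope sketch (the Prometheus URL and port below are assumptions):

```sh
# How many targets is Prometheus actively scraping?
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets | length'

# Rough API-server request rate from proxied node/cadvisor scrapes:
#   requests/sec ≈ (nodes × proxied jobs) / scrape_interval_seconds
# e.g. 58 nodes × 2 jobs (cadvisor + node) every 30s ≈ ~3.9 req/s,
# which grows linearly with node count.
```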
I wonder if you are hitting SNAT port exhaustion? If you are using Advanced Networking for your cluster and UDR routing to next-hop to an NVA firewall, you could be maxing out SNAT ports, maybe? The Azure Load Balancer does have a metric for SNAT port usage, assuming your NVAs are behind one. |
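If SNAT exhaustion is the suspect, the load balancer metrics can be pulled with the Azure CLI. A sketch assuming a Standard SKU load balancer in front of the NVAs; the resource ID is a placeholder, and which SNAT-related metrics are available depends on the SKU:

```sh
az monitor metrics list \
  --resource "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network/loadBalancers/<lb-name>" \
  --metric SnatConnectionCount \
  --interval PT5M
```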
This issue has been automatically marked as stale because it has not had activity in 90 days. It will be closed if no further activity occurs. Thank you! |
This issue will now be closed because it hasn't had any activity for 15 days after being marked stale. @slack, feel free to comment again within the next 7 days to reopen it, or open a new issue after that time if you still have a question/issue or suggestion. |
The AKS team is aware of performance issues when in-cluster components (like Tiller or Istio) generate a large amount of traffic to the Kubernetes API. Symptoms include slow in-cluster API responses, a slow Kubernetes dashboard, long-running API watches timing out, or an inability to establish an outbound connection to the AKS cluster's external API endpoint (a quick way to spot-check in-cluster API latency is sketched at the end of this description).
In the short term, we have made a few changes to the AKS infrastructure that are expected to help with, but not eliminate, these timeouts. The updated configuration will begin a global rollout in the coming weeks. Customers who create new clusters or upgrade existing clusters will automatically receive this updated deployment.
In parallel, engineering is working on a long-term fix. As we make progress, we will update this GitHub issue.
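One way to observe the "slow in-cluster API responses" symptom directly is to time a request to the API server from inside a pod. A minimal sketch using the standard in-pod service-account paths; whether /healthz is readable depends on the cluster's RBAC configuration:

```sh
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt

# Print the HTTP status and total request time against the in-cluster endpoint.
curl --cacert "$CACERT" -H "Authorization: Bearer $TOKEN" \
     -s -o /dev/null -w 'healthz: HTTP %{http_code} in %{time_total}s\n' \
     https://kubernetes.default.svc/healthz
```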