[kube-prometheus-stack] Frequent errors for alert rule KubeAggregatedAPIErrors (aggregator_unavailable_apiservice_total) #3539
Comments
+1

Any feedback on this issue?
Also interested in more info about this. So far we've seen the following:
We are going to increase the CPU configuration for metrics-server and see if this helps. In the metrics-server logs we can't see anything strange, except a whole bunch of:
However, these do not correlate with the CPU throttling / timing of the alerts.
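For reference, by "increase the CPU configuration" we mean raising the container resources on the metrics-server deployment. A minimal sketch of such a stanza — the numbers are illustrative, not recommendations, and on managed platforms like AKS the metrics-server deployment may be reconciled by the platform and not freely editable:

```yaml
# Illustrative resources stanza for the metrics-server container.
# Values are examples only.
resources:
  requests:
    cpu: 200m
    memory: 300Mi
  limits:
    cpu: "1"
    memory: 500Mi
```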
We found a correlation between the KubeAggregatedAPIErrors alerts and what appear to be huge spikes in requests to the Kubernetes API. However, we have not been able to identify the source of these spikes, and they only seem to appear on Azure AKS. We raised a ticket with Microsoft support, but after some initial analysis they went very quiet and we haven't made any further progress.
Same here.
@chencivalue - are you running Strimzi Kafka in your AKS cluster?
@johnswarbrick-napier no, but I'm using the kafka-18.3.1 Helm chart.
Might be related:
We have the same thing in GKE 1.26.
Hi, I'm experiencing the same issue on AKS with Kubernetes v1.29.2. I see that your thoughts are directed at metrics-server too, so I'll share my observations. metrics-server uses the aggregated API layer of Kubernetes, and the Kubernetes documentation on the aggregation layer strongly advises keeping latency between the API server and the extended API server below 5s. This issue could be opened or mentioned elsewhere, as the Prometheus alert from the runbook (KubeAggregatedAPIErrors) seems legitimate.
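To see whether the aggregation layer is actually degraded, you can inspect the APIService object that metrics-server registers: `aggregator_unavailable_apiservice_total` on the kube-apiserver counts exactly these unavailability episodes. An illustrative excerpt of `kubectl get apiservice v1beta1.metrics.k8s.io -o yaml` during a flap (the endpoint IP and message are examples, not from this issue):

```yaml
# Excerpt of an APIService status while the backing service is unreachable.
# "Available: False" episodes are what increment
# aggregator_unavailable_apiservice_total for this APIService name.
status:
  conditions:
    - type: Available
      status: "False"
      reason: FailedDiscoveryCheck
      message: 'failing or missing response from https://10.0.123.45:443/apis/metrics.k8s.io/v1beta1'  # illustrative endpoint
```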
We noticed a strong correlation between this issue and KEDA cert rotation. (We are running KEDA on AKS.)
UPDATE:
We removed the KubeAggregatedAPIErrors alert from what we are alerted on in Prometheus, per Microsoft's response after a year of troubleshooting the causes. Hope it helps someone out there.
This proposal was made during my working time at WeScale. ;)
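For anyone wanting to do the same: recent kube-prometheus-stack versions let you switch off individual bundled alerts by name via `defaultRules.disabled`. A minimal values.yaml sketch, assuming a chart version that supports this toggle:

```yaml
# values.yaml for kube-prometheus-stack: keep the bundled rules,
# but drop the KubeAggregatedAPIErrors alert by name.
defaultRules:
  create: true
  disabled:
    KubeAggregatedAPIErrors: true
```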
I think the root cause is an incorrect alert definition, addressed in kubernetes-monitoring/kubernetes-mixin#774.
Describe the bug
Hi -
Running the latest kube-prometheus-stack 47.0.0 on Azure AKS, I'm getting frequent alerts for the bundled rule KubeAggregatedAPIErrors:
It's firing regularly across >100 Azure AKS clusters, but I don't know whether this is a true error or a false positive.
What does this alert mean, and do I need to tune or even disable it?
Thanks in advance! :)
What's your helm version?
version.BuildInfo{Version:"v3.12.1", GitCommit:"f32a527a060157990e2aa86bf45010dfb3cc8b8d", GitTreeState:"clean", GoVersion:"go1.20.4"}
What's your kubectl version?
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2", GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-17T14:20:07Z", GoVersion:"go1.20.4", Compiler:"gc", Platform:"linux/amd64"}
Which chart?
kube-prometheus-stack
What's the chart version?
47.0.0
What happened?
No response
What did you expect to happen?
No response
How to reproduce it?
No response
Enter the changed values of values.yaml.
No response
Enter the command that you executed that is failing/misfunctioning.
sum by (name, namespace, cluster) (increase(aggregator_unavailable_apiservice_total[10m])) > 4
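For context, this expression fires when an aggregated APIService (grouped by name, namespace, and cluster) is marked unavailable more than four times within 10 minutes. If the alert is too noisy but you don't want to lose the signal entirely, one option is to disable the bundled rule (see above) and add a loosened copy via the chart's `additionalPrometheusRulesMap`; the rule name, window, and threshold below are illustrative, not recommendations:

```yaml
# values.yaml: a hypothetical, less sensitive stand-in for KubeAggregatedAPIErrors.
additionalPrometheusRulesMap:
  custom-aggregated-api:
    groups:
      - name: custom-aggregated-api
        rules:
          - alert: KubeAggregatedAPIErrorsLoose  # hypothetical name
            # Same signal as the bundled rule, over a longer window with a higher threshold.
            expr: sum by (name, namespace, cluster) (increase(aggregator_unavailable_apiservice_total[30m])) > 20
            labels:
              severity: warning
            annotations:
              summary: An aggregated APIService has been repeatedly marked unavailable.
```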
Anything else we need to know?
No response