
[kube-prometheus-stack] Frequent errors for alert rule KubeAggregatedAPIErrors (aggregator_unavailable_apiservice_total) #3539

Open
johnswarbrick-napier opened this issue Jun 30, 2023 · 15 comments
Labels
bug Something isn't working

Comments

@johnswarbrick-napier
Contributor

Describe the bug

Hi -

Running the latest kube-prometheus-stack 47.0.0 on Azure AKS, I'm getting frequent alerts for the bundled rule KubeAggregatedAPIErrors:

[screenshot of the firing KubeAggregatedAPIErrors alerts]

It's firing regularly across >100 Azure AKS clusters, but I don't know if this is a true error or a false positive.

What does this alert mean, and do I need to tune or even disable it?

Thanks in advance! :)

What's your helm version?

version.BuildInfo{Version:"v3.12.1", GitCommit:"f32a527a060157990e2aa86bf45010dfb3cc8b8d", GitTreeState:"clean", GoVersion:"go1.20.4"}

What's your kubectl version?

Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2", GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-17T14:20:07Z", GoVersion:"go1.20.4", Compiler:"gc", Platform:"linux/amd64"}

Which chart?

kube-prometheus-stack

What's the chart version?

47.0.0

What happened?

No response

What you expected to happen?

No response

How to reproduce it?

No response

Enter the changed values of values.yaml?

No response

Enter the command that you execute that is failing/misfunctioning.

sum by (name, namespace, cluster) (increase(aggregator_unavailable_apiservice_total[10m])) > 4

Anything else we need to know?

No response
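For context, the expression above is the bundled kubernetes-mixin rule as rendered by the chart. Below is a minimal values.yaml sketch for tuning or disabling it, assuming a chart version that supports `defaultRules.disabled` and `additionalPrometheusRulesMap`; the 20m window and threshold of 10 are illustrative choices, not recommendations:

```yaml
# Sketch only: disable the bundled KubeAggregatedAPIErrors rule and replace it
# with a more tolerant copy. Assumes a kube-prometheus-stack version that
# supports defaultRules.disabled; the window and threshold are illustrative.
defaultRules:
  disabled:
    KubeAggregatedAPIErrors: true

additionalPrometheusRulesMap:
  aggregated-api-custom:
    groups:
      - name: kubernetes-system-custom
        rules:
          - alert: KubeAggregatedAPIErrors
            expr: sum by (name, namespace, cluster) (increase(aggregator_unavailable_apiservice_total[20m])) > 10
            labels:
              severity: warning
            annotations:
              summary: Kubernetes aggregated API has reported errors.
```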

@johnswarbrick-napier johnswarbrick-napier added the bug Something isn't working label Jun 30, 2023
@zeritti zeritti changed the title [prometheus-kube-stack] Frequent errors for alert rule KubeAggregatedAPIErrors (aggregator_unavailable_apiservice_total) [kube-prometheus-stack] Frequent errors for alert rule KubeAggregatedAPIErrors (aggregator_unavailable_apiservice_total) Jul 3, 2023
@kladiv

kladiv commented Jul 18, 2023

+1
The same here on k3s installed on Hetzner bare-metal servers.

@PhilipNO

Any feedback on this issue?

@Vandersteen

Vandersteen commented Sep 11, 2023

Also interested in more info about this.

So far we've seen the following:

  • The metrics-server was often crashlooping, so we gave it more resources (Azure support gave us some instructions on how to do so: here)
  • The crashlooping stopped; however, this error is still triggering often
    • (We only adjusted the memory settings of the metrics-server)
  • After some investigation, it seems the metrics-server is sometimes being CPU throttled, which appears to correlate with the timing of these alerts

[screenshot: metrics-server CPU throttling graph correlating with the alert timings]

We are going to increase the CPU configuration for the metrics-server and see if this helps.

In the metrics-server logs we can't see anything strange except a whole bunch of:

E0909 01:06:44.237015       1 nanny_lib.go:130] Get "https://xxx.hcp.westeurope.azmk8s.io:443/api/v1/nodes?resourceVersion=0": net/http: TLS handshake timeout

However, these do not correlate with the CPU throttling or the timing of the alerts.
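For reference, on AKS the managed metrics-server is sized by an addon-resizer ("nanny", which matches the nanny_lib.go log line above), and the Azure-documented way to give it more CPU/memory is a ConfigMap in kube-system. A sketch, with illustrative numbers rather than recommendations:

```yaml
# Sketch of the AKS-documented metrics-server sizing override (addon-resizer
# "nanny" configuration). The base*/ *PerNode values are illustrative only.
apiVersion: v1
kind: ConfigMap
metadata:
  name: metrics-server-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
    baseCPU: 150m
    cpuPerNode: 2m
    baseMemory: 100Mi
    memoryPerNode: 8Mi
```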

@johnswarbrick-napier
Contributor Author

We found a correlation between the KubeAggregatedAPIErrors alerts and what appear to be huge spikes in requests to the Kubernetes API:

[screenshot: spikes in Kubernetes API request rate coinciding with the alerts]

However, we have not been able to identify the source of these huge spikes, and they only seem to appear on Azure AKS.

We raised a ticket with Microsoft support, but after some initial analysis they went very quiet and we haven't made any further progress.

@chencivalue

The same here.

@johnswarbrick-napier
Contributor Author

@chencivalue - are you running Strimzi Kafka in your AKS cluster?

@chencivalue

@johnswarbrick-napier No, but I'm using the kafka-18.3.1 Helm chart.

@Vandersteen

Might be related:

Azure/AKS#3685

@elghazal-a

We have the same thing on GKS 1.26.

@damienvergnaud

Hi, I'm experiencing the same issue on AKS with Kubernetes v1.29.2.

I see that your thoughts are directed at metrics-server too, so I'll share my observations.

metrics-server uses the aggregated API layer of Kubernetes.

  • For this, a basic AKS installation seems to declare a path of the Kubernetes API (using an APIService object) so that the kube-apiserver forwards requests on that path directly to the extension API server of metrics-server.
    • In my case, the metrics-server APIService object is the only one NOT using "Local" as its service, in contrast to all the others.
    • Maybe that's your case too?
    • This COULD explain real latency between the API server and metrics-server, but I haven't been able to prove it yet.

The Kubernetes documentation about the aggregation layer strongly advises keeping latency between the API server and the extension API server under 5 seconds:
https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/#response-latency

This issue could be opened or mentioned elsewhere, as the Prometheus alert from the runbook (KubeAggregatedAPIErrors) seems legitimate.

In my case, the APIService events report a FailedDiscoveryCheck on v1beta1.metrics.k8s.io.
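For illustration, this is roughly what a service-backed APIService looks like for a default metrics-server install (the built-in "Local" APIServices have no spec.service); the Available condition is where the FailedDiscoveryCheck reason mentioned above shows up:

```yaml
# Approximate shape of the metrics-server APIService on a default install.
# Unlike the built-in "Local" APIServices, it has a spec.service, so the
# kube-apiserver proxies /apis/metrics.k8s.io/v1beta1 to this Service.
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  version: v1beta1
  groupPriorityMinimum: 100
  versionPriority: 100
  insecureSkipTLSVerify: true
  service:
    name: metrics-server
    namespace: kube-system
    port: 443
status:
  conditions:
    - type: Available        # reason becomes FailedDiscoveryCheck when the
      status: "True"         # aggregator cannot reach or verify the backend
```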

@jfouche-vendavo

jfouche-vendavo commented Aug 6, 2024

We noticed a strong correlation between this issue and KEDA cert rotation (we are running KEDA on AKS).
Unfortunately, fixing the cert rotation issue did not stop the KubeAggregatedAPIErrors alerts!

@jfouche-vendavo

UPDATE:
FYI, I have disabled cert rotation on KEDA as above, but this does not fix the KubeAggregatedAPIErrors. These errors must be happening elsewhere (possibly not in KEDA).


@damienvergnaud

damienvergnaud commented Nov 25, 2024

We actually removed this Prometheus alert (KubeAggregatedAPIErrors) from what we are alerted on, per Microsoft's response after a year of troubleshooting the causes. Hope it helps someone out there.

Summary/Resolution/Findings:

After reviewing similar cases in our history and all the tickets escalated to the Product Group, we have identified a single root cause analysis (RCA):

The alerts always coincide with the restarts of the API server pods, and for your cluster, this is indeed the case. This alert/error is inevitable because it is due to the destruction/creation (considering the container startup) of the API server pods, knowing that there are several replicas that coexist at all times to maintain a 100% SLA. According to your cluster metrics, the SLA has always been maintained at 100%. Conclusion: This Prometheus alert is unnecessary in a PaaS service like AKS because the control plane is completely managed on our side, and the API availability is automatically guaranteed, except in the case of a regional/global outage.

PS: The alerts coincide with the restarts, but this does not mean they are triggered at every restart. Restarts can occur several times a day.

Should you require any further assistance or have any questions regarding this matter, please do not hesitate to reach out. We are committed to providing you with the highest level of support and are here to help with any additional needs you may have.

Thank you for choosing Microsoft…!!

This proposal was made during my working time at WeScale. ;)
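For anyone who prefers to keep the rule but stop being paged on it, here is a minimal sketch of routing the alert to a no-op receiver through the chart's Alertmanager config. The receiver names are illustrative and need to be merged into your existing route tree, and the matchers syntax assumes Alertmanager ≥ 0.22:

```yaml
# Sketch only: keep the rule but send KubeAggregatedAPIErrors to a no-op
# receiver. Merge into your existing alertmanager.config; names illustrative.
alertmanager:
  config:
    route:
      receiver: default-receiver
      routes:
        - receiver: "null"
          matchers:
            - alertname="KubeAggregatedAPIErrors"
    receivers:
      - name: default-receiver   # your real notification integrations go here
      - name: "null"
```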

@sebastiangaiser
Contributor

sebastiangaiser commented Nov 27, 2024

I think the root cause is an incorrect alert expression, addressed in kubernetes-monitoring/kubernetes-mixin#774.
