Alert rule etcdHighNumberOfFailedGRPCRequests in Prometheus #13147
Comments
I wonder if #9166 is related
According to Prometheus data, the gRPC calls to etcd's Watch method end with either the "Cancelled" or the "Unknown" gRPC code. Check the graph of the expression below, e.g. over the last 24 hours:
So effectively, if there are no calls to the Watch API, the service looks healthy. If there are calls, they finish with a gRPC code other than "OK", and Prometheus produces an alert. Can someone comment on whether this behavior is expected?
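To make the arithmetic behind that observation concrete, here is a minimal Python sketch of a failure ratio computed from per-code request counts. The counts, the `failed_ratio` helper, and the `ignored` parameter are all hypothetical and are not taken from the etcd rules. The label spelling `Canceled` is also an assumption about how the metric reports that code. Whether the upstream fix works by excluding codes in this way is likewise an assumption.

```python
from typing import Dict, Iterable


def failed_ratio(counts: Dict[str, float], ignored: Iterable[str] = ()) -> float:
    """Return the fraction of requests whose gRPC code counts as a failure.

    counts  -- hypothetical request counts keyed by gRPC code
    ignored -- codes that should not be treated as failures (an assumption,
               not necessarily how the real alert rule is written)
    """
    ignored_set = set(ignored)
    total = sum(counts.values())
    if total == 0:
        # No Watch calls at all: nothing can fail, so the service looks healthy.
        return 0.0
    failed = sum(
        n for code, n in counts.items()
        if code != "OK" and code not in ignored_set
    )
    return failed / total


# Hypothetical sample: every Watch call ended as Canceled or Unknown.
watch_counts = {"Canceled": 40.0, "Unknown": 10.0}

# Treating every non-OK code as a failure gives 100% failed.
strict = failed_ratio(watch_counts)

# Not counting client-side cancellations leaves only the Unknown calls.
lenient = failed_ratio(watch_counts, ignored=["Canceled"])

print(strict, lenient)
```

With the strict definition, a long-lived Watch stream that the client simply closes is indistinguishable from a real server error, which would match the persistent alert described in this issue.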
I believe that when I looked into this before, it was #10289
(The fix appears in the etcd 3.5 changelog, though I have not upgraded yet)
kube-prometheus-stack currently cannot be easily updated with the latest etcd rules due to prometheus-community/helm-charts#225
I think this issue is fixed and can be closed.
@allenporter thanks for the findings!
This fixes false positives for `etcdHighNumberOfFailedGRPCRequests` alerts (see etcd-io/etcd#13147)
ISSUE TYPE
SUMMARY
We have a k8s cluster that uses kube-prometheus-stack for monitoring. This Prometheus has several alert rules that check the whole cluster. One of those alerts checks requests to the Kubernetes etcd. The problem is that this rule is persistently alerting:
message = etcd cluster "kube-etcd": 100% of requests for Watch failed on etcd instance ". However, the cluster is running properly.
We switched the etcd image to a newer one and still have the same problem.
ENVIRONMENT
EXPECTED RESULTS
We should not receive any alerts about etcd.
ACTUAL RESULTS
Labels
alertname = etcdHighNumberOfFailedGRPCRequests
grpc_method = Watch
grpc_service = etcdserverpb.Watch
instance = 192.168.251.221:2381
job = kube-etcd
prometheus = monitoring/kube-prometheus-stack-prometheus
severity = warning
Annotations
message = etcd cluster "kube-etcd": 100% of requests for Watch failed on etcd instance 192.168.251.221:2381.