K8s metadata for metricbeat Kubernetes module missing at startup #41213
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
I assigned this to myself, as I already built a fix in the process of debugging the root cause. I'll submit it after polishing it a bit.
@swiatekm could you please provide some more detail about the dashboards in the description? From what I can see, we have the same replicaset.name values in pod and state_pod. So if metadata were missing, I would expect more state_pod and pod entries and fewer replicas. Am I missing something?
Do you think this might have to do with the fact that your pods were unscheduled in some of your tests? Do you see the same phenomenon in clusters where scheduling is successful?
To be exact, I was using the following Helm Chart used internally by our SRE team: https://github.com/elastic/elastic-agent-service. This Chart runs agent as a sidecar to kube-state-metrics, alongside a separate agent DaemonSet which collects kubelet metrics. The cluster I tested this in has a lot of empty ReplicaSets, which is why the data looks the way it does, but I can easily reproduce it in a more normal-looking one. I suppose I should also be able to reproduce this effect with pod metadata. For the record, I don't have any unscheduled Pods in this cluster, just a lot of ReplicaSets scaled down to 0.
I will also try to reproduce it, but what I am trying to understand is how those visualisations prove that metadata is missing at some point.
OK, clear. The code, on the other hand, looks OK to me.
Thinking about it more, if this is a race condition in initialization, then having a lot of resources probably makes it easier to trigger. When you start a watcher, it gets events about all the existing resources - which in my case was ~7500 ReplicaSets. That's a fair amount of time for something weird to happen. On the other hand, I found this effect to be completely binary - an agent Pod would either attach metadata to all events, or to none. You can see it on my graph, where the changes are very discrete. I had three kube-state-metrics agents there, and the line always changes by a third of the total.
The graphs show metric events where kubernetes.deployment.name is present.
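To make the watcher startup behaviour described above concrete, here is a minimal client-go sketch (an illustration using the standard shared-informer API, not the actual Beats watcher code; the resync period and handler are assumptions): every handler registered before the informer starts receives one Add event per existing ReplicaSet during the initial list/sync, so ~7500 ReplicaSets translates into a non-trivial startup window.

```go
package main

import (
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	informer := factory.Apps().V1().ReplicaSets().Informer()

	// Handlers registered before Run() get one Add event per ReplicaSet that
	// already exists in the cluster; a handler registered after the initial
	// sync only sees subsequent changes.
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			rs := obj.(*appsv1.ReplicaSet)
			fmt.Println("existing or new ReplicaSet:", rs.Namespace+"/"+rs.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	go informer.Run(stop)

	// With thousands of ReplicaSets, this sync takes a noticeable amount of
	// time, which is the window in which a registration race could occur.
	cache.WaitForCacheSync(stop, informer.HasSynced)
	fmt.Println("initial list/sync complete")
}
```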
Ah, now I get it, this was the info that was missing. You check for kubernetes.deployment.name.
When using the kubernetes module, metadata can randomly be missing after metricbeat starts. I noticed this with Deployment metadata while testing fixes for elastic/elastic-agent#5623, but I suspect it can happen whenever a metadata watcher is shared between multiple enrichers.

The problem presents as metadata not being present for data coming from a given metricbeat instance. Restarting can fix the issue, or create it if it wasn't present. Modifying any of the resources which are the source of the metadata (in my case, ReplicaSets) makes the metadata appear, suggesting that the enricher doesn't see the initial resource list from the watcher, but does see subsequent changes.
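As a toy illustration of that symptom (a hypothetical sketch, not the actual enricher or watcher code in Beats): if a shared watcher replays its initial list only to the handlers registered at the moment it starts, an enricher that registers afterwards gets nothing until a resource changes, which matches "metadata appears once the ReplicaSet is modified".

```go
package main

import "fmt"

// Minimal model of a watcher shared by two enrichers. The initial list is
// delivered exactly once, to whichever handlers exist at start time.
type watcher struct {
	handlers []func(event, name string)
	started  bool
}

func (w *watcher) addHandler(h func(event, name string)) {
	w.handlers = append(w.handlers, h)
}

// start replays the initial list once; later callers are no-ops.
func (w *watcher) start(existing []string) {
	if w.started {
		return // the initial list is not replayed for late handlers
	}
	w.started = true
	for _, name := range existing {
		w.notify("initial", name)
	}
}

// update simulates a resource being modified after startup.
func (w *watcher) update(name string) { w.notify("update", name) }

func (w *watcher) notify(event, name string) {
	for _, h := range w.handlers {
		h(event, name)
	}
}

func main() {
	w := &watcher{}

	// Enricher A registers before the watcher starts and sees the full list.
	w.addHandler(func(ev, name string) { fmt.Println("A:", ev, name) })
	w.start([]string{"replicaset-1", "replicaset-2"})

	// Enricher B shares the watcher but registers after it has started; it
	// only learns about ReplicaSets once they change.
	w.addHandler(func(ev, name string) { fmt.Println("B:", ev, name) })
	w.start(nil) // no-op: already started

	w.update("replicaset-1") // only now does B see replicaset-1
}
```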
Here's a simple dashboard I've created to track the metadata across elastic-agent restarts:
I strongly suspect we have a race condition in the enricher initialization code, and I've fixed the issue in a PoC by getting rid of this initialization. However, I haven't been able to definitively explain how it could happen given the locking we do around it. In any case, what we do there is far from optimal, and worth improving if it fixes this issue.
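For what such a race could look like (again a hypothetical sketch, not the real initialization code), locking each step individually doesn't help if the "is the watcher already started?" check and the handler registration are separate critical sections; the initial sync can slip in between them.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// sharedWatcher locks every method, yet a check-then-act sequence spread
// across two calls is still racy.
type sharedWatcher struct {
	mu       sync.Mutex
	started  bool
	handlers []func(name string)
}

func (w *sharedWatcher) isStarted() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.started
}

func (w *sharedWatcher) register(h func(name string)) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.handlers = append(w.handlers, h)
}

func (w *sharedWatcher) start(initial []string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.started {
		return
	}
	w.started = true
	for _, name := range initial {
		for _, h := range w.handlers {
			h(name)
		}
	}
}

func main() {
	w := &sharedWatcher{}

	go func() {
		// Enricher goroutine: check, then register. If start() runs between
		// the two calls, this enricher misses the whole initial list.
		if !w.isStarted() {
			time.Sleep(time.Millisecond) // widen the race window for the demo
			w.register(func(name string) { fmt.Println("enriched with", name) })
		}
	}()

	w.start([]string{"replicaset-1"})
	time.Sleep(10 * time.Millisecond) // demo only: let the goroutine finish
	// Typically nothing prints above: start() won the race, so the handler
	// registered too late to receive the initial list.
}
```

Ensuring every enricher registers its handler before the shared watcher starts, or having the watcher replay its store to late registrants, would be one way to close this window, in line with the PoC fix mentioned above.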