K8s metadata for metricbeat Kubernetes module missing at startup #41213

Closed
swiatekm opened this issue Oct 14, 2024 · 7 comments · Fixed by #41216
Assignees
Labels
Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@swiatekm
Contributor

When using the kubernetes module, metadata can randomly be missing after metricbeat starts. I've noticed this with Deployment metadata while testing fixes for elastic/elastic-agent#5623, but I suspect it can happen whenever a metadata watcher is shared between multiple enrichers.

The problem presents as metadata not being present for data coming from a given metricbeat instance. Restarting can fix the issue, or create it if it wasn't present. Modifying any of the resources which are the source of the metadata (in my case, ReplicaSets) makes the metadata appear, suggesting that the enricher doesn't see the initial resource list from the watcher, but does see subsequent changes.

Here's a simple dashboard I've created to track the metadata across elastic-agent restarts:

[dashboard screenshot]

I strongly suspect we have a race condition in the enricher initialization code, and I've fixed the issue in a PoC by getting rid of this initialization. However, I haven't been able to definitively explain how it could happen given the locking we do around it. In any case, what we do there is far from optimal, and worth improving if it fixes this issue.
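
To make the suspected failure mode concrete, here's a minimal Go sketch (purely illustrative; the watcher, enricher, and event types are made up for the example and this is not the actual beats code) of how a race between starting the watcher and registering the enricher's handler would produce exactly this all-or-nothing behavior:

```go
// Illustrative sketch only, not the actual beats code. All names here
// (watcher, enricher, event) are made up for the example.
package main

import (
	"fmt"
	"sync"
	"time"
)

type event struct{ name, deployment string }

// watcher fans events out to whatever handlers are registered at the
// moment each event is delivered.
type watcher struct {
	mu       sync.Mutex
	handlers []func(event)
	initial  []event // existing resources, replayed once on Start
}

func (w *watcher) AddEventHandler(h func(event)) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.handlers = append(w.handlers, h)
}

func (w *watcher) Start() {
	go func() {
		// Replay the initial resource list. A handler registered after
		// this loop has finished never sees these events, only later
		// add/update/delete notifications.
		for _, ev := range w.initial {
			w.mu.Lock()
			hs := append([]func(event){}, w.handlers...)
			w.mu.Unlock()
			for _, h := range hs {
				h(ev)
			}
		}
	}()
}

// enricher caches ReplicaSet -> Deployment metadata from watcher events
// and later uses it to decorate metric events.
type enricher struct {
	mu    sync.Mutex
	cache map[string]string
}

func (e *enricher) register(w *watcher) {
	w.AddEventHandler(func(ev event) {
		e.mu.Lock()
		e.cache[ev.name] = ev.deployment
		e.mu.Unlock()
	})
}

func main() {
	w := &watcher{initial: []event{{name: "rs-1", deployment: "deploy-1"}}}
	e := &enricher{cache: map[string]string{}}

	// Start and register race each other. If Start's goroutine replays
	// the initial list first, "rs-1" has no metadata until it changes
	// again; if register wins, everything is enriched from the start.
	w.Start()
	e.register(w)

	time.Sleep(100 * time.Millisecond)
	e.mu.Lock()
	fmt.Printf("deployment metadata for rs-1: %q\n", e.cache["rs-1"])
	e.mu.Unlock()
}
```

The binary nature of the symptom falls out of a pattern like this: whichever side wins the race wins it for every resource in the initial list at once.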

@swiatekm swiatekm self-assigned this Oct 14, 2024
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Oct 14, 2024
@swiatekm swiatekm added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Oct 14, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Oct 14, 2024
@swiatekm
Contributor Author

I assigned this to myself, as I already built a fix in the process of debugging the root cause. I'll submit it after polishing it a bit.

@gizas
Contributor

gizas commented Oct 15, 2024

@swiatekm could you please provide some details about the dashboards in the description? From what I see, we have the same replicaset.name values in pod and state_pod. So if metadata was missing, I would expect to have more state_pod and pod entries and fewer replicas. Am I missing something?

> but does see subsequent changes.

Do you think this might have to do with the fact that your pods were unscheduled in some of your tests? Do you have the same phenomenon in clusters where scheduling is successful?

@swiatekm
Contributor Author

> @swiatekm could you please provide some details about the dashboards in the description? From what I see, we have the same replicaset.name values in pod and state_pod. So if metadata was missing, I would expect to have more state_pod and pod entries and fewer replicas. Am I missing something?
>
> > but does see subsequent changes.
>
> Do you think this might have to do with the fact that your pods were unscheduled in some of your tests? Do you have the same phenomenon in clusters where scheduling is successful?

To be exact, I was using the following Helm Chart used internally by our SRE: https://github.com/elastic/elastic-agent-service. This chart runs the agent as a sidecar to kube-state-metrics, plus a separate agent DaemonSet which collects kubelet metrics (so kubernetes.pod, for example). The difference you see in the graph simply has to do with which agent Pod the data passes through, and the changes over time are me restarting the agent Pods.

The cluster I tested this in has a lot of empty ReplicaSets, which is why the data looks the way it does, but I can easily reproduce it in a more normal-looking one. I suppose I should be able to reproduce this effect with pod metadata for state_pod and state_container as well - would that help convince you?

For the record, I don't have any unscheduled Pods in this cluster, just a lot of ReplicaSets scaled down to 0.

@gizas
Contributor

gizas commented Oct 15, 2024

I will also try to reproduce, but what I am trying to understand is how those visualisations prove that metadata is missing at some point.

> For the record, I don't have any unscheduled Pods in this cluster, just a lot of ReplicaSets scaled down to 0.

Ok clear.

The code on the other side looks ok.

@swiatekm
Contributor Author

Thinking about it more, if this is a race condition in initialization, then having a lot of resources probably makes it easier to trigger. When you start a watcher, it gets events about all the existing resources - which in my case was ~7500 ReplicaSets. That's a fair amount of time for something weird to happen.

On the other hand, I found this effect to be completely binary - an agent Pod would either attach metadata to all events, or none. You can see it on my graph, where the changes are very discrete. I had three kube-state-metrics agents there, and the line always changes by 1/3rd of the total.

> I will also try to reproduce, but what I am trying to understand is how those visualisations prove that metadata is missing at some point.

The graphs show metric events where kubernetes.replicaset.name is set. The configuration has deployment metadata enabled for these metricsets, so if they have kubernetes.replicaset.name, they should also have kubernetes.deployment.name. The top graph shows records where this is not the case, while the bottom one is the opposite.
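Roughly speaking, the top panel boils down to a filter along the lines of kubernetes.replicaset.name : * and not kubernetes.deployment.name : *, and the bottom panel to the inverse.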

@gizas
Contributor

gizas commented Oct 15, 2024

> The graphs show metric events where kubernetes.replicaset.name is set. The configuration has deployment metadata enabled for these metricsets, so if they have kubernetes.replicaset.name, they should also have kubernetes.deployment.name. The top graph shows records where this is not the case, while the bottom one is the opposite.

Ah, now I got it, this is the info that was missing. You check for kubernetes.deployment.name.
