K8s metadata for metricbeat Kubernetes module missing at startup #41213

Closed
swiatekm opened this issue Oct 14, 2024 · 7 comments · Fixed by #41216
Assignees
Labels
Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@swiatekm
Contributor

When using the kubernetes module, metadata can randomly be missing after metricbeat starts. I've noticed this with Deployment metadata while testing fixes for elastic/elastic-agent#5623, but I suspect it can happen whenever a metadata watcher is shared between multiple enrichers.

The problem presents as metadata not being present for data coming from a given metricbeat instance. Restarting can fix the issue, or create it if it wasn't present. Modifying any of the resources which are the source of the metadata (in my case, ReplicaSets) makes the metadata appear, suggesting that the enricher doesn't see the initial resource list from the watcher, but does see subsequent changes.

Here's a simple dashboard I've created to track the metadata across elastic-agent restarts:

[dashboard screenshot]

I strongly suspect we have a race condition in the enricher initialization code, and I've fixed the issue in a PoC by getting rid of this initialization. However, I haven't been able to definitively explain how it could happen given the locking we do around it. In any case, what we do there is far from optimal, and worth improving if it fixes this issue.
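
To make the suspected failure mode concrete, here's a minimal Go sketch (purely illustrative; the watcher, enricher, and event types are made up for the example and this is not the actual beats code) of how a race between starting the watcher and registering the enricher's handler would produce exactly this all-or-nothing behavior:

```go
// Illustrative sketch only, not the actual beats code. All names here
// (watcher, enricher, event) are made up for the example.
package main

import (
	"fmt"
	"sync"
	"time"
)

type event struct{ name, deployment string }

// watcher fans events out to whatever handlers are registered at the
// moment each event is delivered.
type watcher struct {
	mu       sync.Mutex
	handlers []func(event)
	initial  []event // existing resources, replayed once on Start
}

func (w *watcher) AddEventHandler(h func(event)) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.handlers = append(w.handlers, h)
}

func (w *watcher) Start() {
	go func() {
		// Replay the initial resource list. A handler registered after
		// this loop has finished never sees these events, only later
		// add/update/delete notifications.
		for _, ev := range w.initial {
			w.mu.Lock()
			hs := append([]func(event){}, w.handlers...)
			w.mu.Unlock()
			for _, h := range hs {
				h(ev)
			}
		}
	}()
}

// enricher caches ReplicaSet -> Deployment metadata from watcher events
// and later uses it to decorate metric events.
type enricher struct {
	mu    sync.Mutex
	cache map[string]string
}

func (e *enricher) register(w *watcher) {
	w.AddEventHandler(func(ev event) {
		e.mu.Lock()
		e.cache[ev.name] = ev.deployment
		e.mu.Unlock()
	})
}

func main() {
	w := &watcher{initial: []event{{name: "rs-1", deployment: "deploy-1"}}}
	e := &enricher{cache: map[string]string{}}

	// Start and register race each other. If Start's goroutine replays
	// the initial list first, "rs-1" has no metadata until it changes
	// again; if register wins, everything is enriched from the start.
	w.Start()
	e.register(w)

	time.Sleep(100 * time.Millisecond)
	e.mu.Lock()
	fmt.Printf("deployment metadata for rs-1: %q\n", e.cache["rs-1"])
	e.mu.Unlock()
}
```

The binary nature of the symptom falls out of a pattern like this: whichever side wins the race wins it for every resource in the initial list at once.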

@swiatekm swiatekm self-assigned this Oct 14, 2024
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Oct 14, 2024
@swiatekm swiatekm added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Oct 14, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Oct 14, 2024
@swiatekm
Contributor Author

I assigned this to myself, as I already built a fix in the process of debugging the root cause. I'll submit it after polishing it a bit.

@gizas
Contributor

gizas commented Oct 15, 2024

@swiatekm could you please provide some details about the dashboards in the description? From what I see, we have the same replicaset.name values in pod and state_pod. So if metadata was missing, I would expect to have more state_pod and pod entries and fewer replicas. Am I missing something?

> but does see subsequent changes.

Do you think this might have to do with the fact that your pods were unscheduled in some of your tests? Do you have the same phenomenon in clusters where scheduling is successful?

@swiatekm
Contributor Author

> @swiatekm could you please provide some details about the dashboards in the description? From what I see, we have the same replicaset.name values in pod and state_pod. So if metadata was missing, I would expect to have more state_pod and pod entries and fewer replicas. Am I missing something?
>
> > but does see subsequent changes.
>
> Do you think this might have to do with the fact that your pods were unscheduled in some of your tests? Do you have the same phenomenon in clusters where scheduling is successful?

To be exact, I was using the following Helm Chart used internally by our SRE: https://github.com/elastic/elastic-agent-service. This chart runs the agent as a sidecar to kube-state-metrics, plus a separate agent DaemonSet which collects kubelet metrics (so kubernetes.pod, for example). The difference you see in the graph simply has to do with which agent Pod the data passes through, and the changes over time are me restarting the agent Pods.

The cluster I tested this in has a lot of empty ReplicaSets, which is why the data looks the way it does, but I can easily reproduce it in a more normal-looking one. I suppose I should be able to reproduce this effect with pod metadata for state_pod and state_container as well - would that help convince you?

For the record, I don't have any unscheduled Pods in this cluster, just a lot of ReplicaSets scaled down to 0.

@gizas
Contributor

gizas commented Oct 15, 2024

I will also try to reproduce, but what I am trying to understand is how those visualisations prove that metadata is missing at some point.

> For the record, I don't have any unscheduled Pods in this cluster, just a lot of ReplicaSets scaled down to 0.

Ok clear.

The code on the other side looks ok.

@swiatekm
Contributor Author

Thinking about it more, if this is a race condition in initialization, then having a lot of resources probably makes it easier to trigger. When you start a watcher, it gets events about all the existing resources - which in my case was ~7500 ReplicaSets. That's a fair amount of time for something weird to happen.

On the other hand, I found this effect to be completely binary - an agent Pod would either attach metadata to all events, or none. You can see it on my graph, where the changes are very discrete. I had three kube-state-metrics agents there, and the line always changes by 1/3rd of the total.

> I will also try to reproduce, but what I am trying to understand is how those visualisations prove that metadata is missing at some point.

The graphs show metric events where kubernetes.replicaset.name is set. The configuration has deployment metadata enabled for these metricsets, so if they have kubernetes.replicaset.name, they should also have kubernetes.deployment.name. The top graph shows records where this is not the case, while the bottom one is the opposite.
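Roughly speaking, the top panel boils down to a filter along the lines of kubernetes.replicaset.name : * and not kubernetes.deployment.name : *, and the bottom panel to the inverse.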

@gizas
Contributor

gizas commented Oct 15, 2024

> The graphs show metric events where kubernetes.replicaset.name is set. The configuration has deployment metadata enabled for these metricsets, so if they have kubernetes.replicaset.name, they should also have kubernetes.deployment.name. The top graph shows records where this is not the case, while the bottom one is the opposite.

Ah, now I got it, this is the info that was missing. You check for kubernetes.deployment.name.
