Pod and container resource limit metrics missing intermittently #41432

Closed
swiatekm opened this issue Oct 24, 2024 · 3 comments · Fixed by #41453

@swiatekm
Contributor

In #41216, calculating metadata for events in the metricbeat kubernetes module was made lazy. As a result, it is only calculated when an event needs it, then cached, and the cache is invalidated when an update arrives from the API Server. However, the update code not only computes metadata, but also calculates the limit metrics. This now only happens when we attach Pod metadata to an event for the first time after an update, so the presence of these metrics depends on the ordering of metric fetches.

In the short term, we should decouple these metric updates from metadata enrichment and have the watcher apply them directly. In the longer term, maybe they can be made lazy as well?
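
A minimal Go sketch of the short-term proposal, under stated assumptions: every type and method name here (`Pod`, `watcher`, `OnUpdate`, `Metadata`, `memLimits`) is hypothetical and not taken from the beats codebase. It only illustrates the idea of the watcher applying limits directly on update while metadata stays lazily computed and cached.

```go
package main

// All type and method names below are hypothetical, chosen for illustration.

// Pod is a reduced stand-in for the Pod object received from the API Server.
type Pod struct {
	UID         string
	Labels      map[string]string
	MemoryLimit int64 // bytes
}

// watcher stores the latest Pod objects, a lazily filled metadata cache,
// and the limits that metricsets would read to compute limit metrics.
type watcher struct {
	pods      map[string]Pod
	metadata  map[string]map[string]string
	memLimits map[string]int64
}

func newWatcher() *watcher {
	return &watcher{
		pods:      map[string]Pod{},
		metadata:  map[string]map[string]string{},
		memLimits: map[string]int64{},
	}
}

// OnUpdate handles an update from the API Server.
func (w *watcher) OnUpdate(p Pod) {
	w.pods[p.UID] = p
	// Invalidate cached metadata; it is recomputed on next use.
	delete(w.metadata, p.UID)
	// Apply the limit directly, independent of metadata enrichment.
	w.memLimits[p.UID] = p.MemoryLimit
}

// Metadata returns cached metadata, computing it on first use after an update.
func (w *watcher) Metadata(uid string) map[string]string {
	if m, ok := w.metadata[uid]; ok {
		return m
	}
	p, ok := w.pods[uid]
	if !ok {
		return nil
	}
	m := map[string]string{"pod.uid": p.UID}
	for k, v := range p.Labels {
		m["labels."+k] = v
	}
	w.metadata[uid] = m
	return m
}
```

With this shape, the limit is available immediately after `OnUpdate`, regardless of whether or when `Metadata` is called for that Pod.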

For confirmed bugs, please report:

  • Version: 8.15.4-SNAPSHOT, 8.16.0-SNAPSHOT, 8.17.0-SNAPSHOT, 9.0.0-SNAPSHOT
  • Steps to Reproduce:
    • Start metricbeat in K8s with the kubernetes module enabled
    • Wait until one of the Pods is updated
    • Observe that the limit metrics are missing
@swiatekm swiatekm added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Oct 24, 2024
@swiatekm swiatekm self-assigned this Oct 24, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@nkvoll
Member

nkvoll commented Oct 30, 2024

Since this is marked as "intermittently": We've described what can cause it to become missing -- what causes these metrics to become available again?

@swiatekm
Contributor Author

Since this is marked as "intermittently": We've described what can cause it to become missing -- what causes these metrics to become available again?

They come back if other metrics related to Pods are fetched first, and there are no updates to Pods on the Node in the meantime. So, for example:

  1. We fetch Pod state metrics
  2. We compute metadata for them, also computing the resource limit metrics as a side effect
  3. We fetch Pod kubelet metrics (no Pod update happens between here and point 2)
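
The ordering dependence in the steps above can be modelled with a small Go sketch. This is not the metricbeat code: all names (`enricher`, `update`, `enrich`, `fetchState`, `fetchKubelet`) are hypothetical, and the sketch assumes that a Pod update drops the previously stored limit and that the kubelet fetch only reads the stored limit.

```go
package main

// Hypothetical names throughout; this models the behaviour described above,
// not the actual metricbeat implementation.

type pod struct {
	uid      string
	cpuLimit float64 // cores
}

type enricher struct {
	pods     map[string]pod
	enriched map[string]bool    // metadata computed since the last update
	limits   map[string]float64 // filled only as a side effect of enrich
}

func newEnricher() *enricher {
	return &enricher{
		pods:     map[string]pod{},
		enriched: map[string]bool{},
		limits:   map[string]float64{},
	}
}

// update models a Pod update: cached metadata is invalidated and, by
// assumption in this sketch, the previously stored limit is dropped too.
func (e *enricher) update(p pod) {
	e.pods[p.uid] = p
	delete(e.enriched, p.uid)
	delete(e.limits, p.uid)
}

// enrich lazily computes metadata and stores the limit as a side effect.
func (e *enricher) enrich(uid string) {
	if e.enriched[uid] {
		return
	}
	p, ok := e.pods[uid]
	if !ok {
		return
	}
	e.enriched[uid] = true
	e.limits[uid] = p.cpuLimit
}

// fetchState models a state metrics fetch, which attaches Pod metadata.
func (e *enricher) fetchState(uid string) {
	e.enrich(uid)
}

// fetchKubelet models a kubelet metrics fetch that only reads the stored
// limit. It reports false when the limit metric cannot be computed.
func (e *enricher) fetchKubelet(uid string, cpuUsage float64) (float64, bool) {
	limit, ok := e.limits[uid]
	if !ok || limit == 0 {
		return 0, false
	}
	return cpuUsage / limit, true
}
```

In this model, calling `fetchKubelet` right after `update` yields no limit metric, while calling `fetchState` first makes the metric available until the next `update`.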
