-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Node and container resource limit metrics missing intermittently #41453
Conversation
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
|
|
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
I've no idea what the problem is with linting on darwin. Looks like a build error, but we don't build metricbeat on MacOS, so it's difficult to diagnose. I can reproduce it locally by running golangci-lint with EDIT: Just added an exception, similar to #33649 . |
b4248b4
to
17cb914
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good overall, but I don't have deep knowledge of the integration. It would be helpful to get a review from another developer.
@swiatekm The changes look good to me. I tested them also and everything looks to be working as expected. |
@MichaelKatsoulis I'll update the documentation in a follow-up, I don't want to hold this PR up. |
…41453) (#41484) * Fix Pod and container resource limit metrics missing intermittently * Add another exception to typecheck linter (cherry picked from commit e7cc6fc) Co-authored-by: Mikołaj Świątek <[email protected]>
…41453) (#41483) * Fix Pod and container resource limit metrics missing intermittently * Add another exception to typecheck linter (cherry picked from commit e7cc6fc) Co-authored-by: Mikołaj Świątek <[email protected]>
…41453) (#41485) * Fix Pod and container resource limit metrics missing intermittently * Add another exception to typecheck linter (cherry picked from commit e7cc6fc) Co-authored-by: Mikołaj Świątek <[email protected]>
Proposed commit message
Fix Node and container resource limit metrics missing intermittently.
This is a bug very recently introduced by the refactor in #41216. Metadata watchers are not just responsible for updating metadata, but also Node and container metrics. Only updating the latter eagerly when metadata is requested leads to races, where the values may be missing depending on the order in which metrics are fetched.
This fix decouples metrics calculation from metadata calculation. Metrics now have their own handlers attached to the watcher, and are completely detached from metadata enrichers. I don't like the resulting architecture that much, as it concentrates a lot of logic in the watcher. But it is an improvement over the status quo, and I'd like to fix this bug promptly before we release it to users.
The bug was quite difficult to catch in E2E tests, as it could take some time to appear. I've tested this change much more carefully, and haven't seen any issues after hours of running it in my test cluster.
Checklist
How to test this PR locally
Simplest way is to install elastic-agent standalone and look at the default Kubernetes dashboard.
Related issues