Fix Node and container resource limit metrics missing intermittently #41453

Merged
merged 2 commits into from
Oct 30, 2024

swiatekm
Contributor

Proposed commit message

Fix Node and container resource limit metrics missing intermittently.

This is a bug very recently introduced by the refactor in #41216. Metadata watchers are not just responsible for updating metadata, but also Node and container metrics. Only updating the latter eagerly when metadata is requested leads to races, where the values may be missing depending on the order in which metrics are fetched.
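The order dependence described above can be illustrated with a minimal sketch. It is not the metricbeat code: `metricsStore`, `lazyEnricher`, and `limitPct` are hypothetical names, and the sketch assumes a single goroutine. It only shows how a limit that is written as a side effect of a metadata request can be missing when a percentage metric is computed first.

```go
package main

import "fmt"

// metricsStore is a hypothetical stand-in for the cache of resource limits
// that percentage metrics are computed against.
type metricsStore struct {
	containerCPULimit map[string]float64
}

// lazyEnricher mimics the problematic pattern: the store is only filled as a
// side effect of a metadata request.
type lazyEnricher struct {
	store  *metricsStore
	limits map[string]float64 // limits the watcher has already observed
}

// Metadata returns metadata for a container and, as a side effect, copies
// the container's CPU limit into the metrics store.
func (e *lazyEnricher) Metadata(id string) map[string]string {
	if limit, ok := e.limits[id]; ok {
		e.store.containerCPULimit[id] = limit
	}
	return map[string]string{"container.id": id}
}

// limitPct reports usage as a fraction of the stored limit. The second
// return value is false when the limit has not reached the store yet.
func limitPct(s *metricsStore, id string, usage float64) (float64, bool) {
	limit, ok := s.containerCPULimit[id]
	if !ok || limit == 0 {
		return 0, false
	}
	return usage / limit, true
}

func main() {
	store := &metricsStore{containerCPULimit: map[string]float64{}}
	e := &lazyEnricher{store: store, limits: map[string]float64{"c1": 2.0}}

	// Metric fetched before any metadata request: the limit is missing.
	_, ok := limitPct(store, "c1", 0.5)
	fmt.Println("limit available before metadata request:", ok)

	// Once metadata has been requested, the same lookup succeeds.
	e.Metadata("c1")
	pct, ok := limitPct(store, "c1", 0.5)
	fmt.Println("limit available after metadata request:", ok, pct)
}
```

Whether the percentage is emitted therefore depends on which metricset happens to run first, which matches the intermittent nature of the bug.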

This fix decouples metrics calculation from metadata calculation. Metrics now have their own handlers attached to the watcher, and are completely detached from metadata enrichers. I don't like the resulting architecture that much, as it concentrates a lot of logic in the watcher. But it is an improvement over the status quo, and I'd like to fix this bug promptly before we release it to users.
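As a rough sketch of the decoupled design, assuming a simplified watcher that dispatches each update to every attached handler: the types and functions below (`watcher`, `handler`, `newMetricsHandler`, `newMetadataHandler`) are hypothetical illustrations, not the actual metricbeat API, and locking is omitted on the assumption of a single goroutine.

```go
package main

import "fmt"

// container is a simplified resource carrying only what the sketch needs.
type container struct {
	ID       string
	CPULimit float64
}

// handler reacts to a resource being added or updated.
type handler func(c container)

// watcher is a hypothetical stand-in for a shared resource watcher. It
// dispatches every update to all attached handlers.
type watcher struct {
	handlers []handler
}

// AddHandler attaches another handler to the watcher.
func (w *watcher) AddHandler(h handler) {
	w.handlers = append(w.handlers, h)
}

// OnUpdate delivers an update to every handler in attachment order.
func (w *watcher) OnUpdate(c container) {
	for _, h := range w.handlers {
		h(c)
	}
}

// newMetricsHandler records container limits on every update, independently
// of whether anyone ever asks for metadata.
func newMetricsHandler(limits map[string]float64) handler {
	return func(c container) {
		limits[c.ID] = c.CPULimit
	}
}

// newMetadataHandler records metadata only; it knows nothing about metrics.
func newMetadataHandler(meta map[string]map[string]string) handler {
	return func(c container) {
		meta[c.ID] = map[string]string{"container.id": c.ID}
	}
}

func main() {
	limits := map[string]float64{}
	meta := map[string]map[string]string{}

	w := &watcher{}
	w.AddHandler(newMetricsHandler(limits))
	w.AddHandler(newMetadataHandler(meta))

	// A single watcher event updates both stores.
	w.OnUpdate(container{ID: "c1", CPULimit: 2.0})
	fmt.Println(limits["c1"], meta["c1"]["container.id"])
}
```

In this arrangement the limit is available as soon as the watcher sees the resource, regardless of the order in which metricsets later fetch their data. The trade-off mentioned above is visible too: the watcher becomes the place where all of this logic is wired together.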

The bug was quite difficult to catch in E2E tests, as it could take some time to appear. I've tested this change much more carefully, and haven't seen any issues after hours of running it in my test cluster.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works

How to test this PR locally

The simplest way is to install a standalone elastic-agent and look at the default Kubernetes dashboard.

Related issues

@swiatekm swiatekm added bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels Oct 25, 2024
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Oct 25, 2024
Contributor

mergify bot commented Oct 25, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @swiatekm? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

Contributor

mergify bot commented Oct 25, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Oct 25, 2024
@swiatekm swiatekm added backport-8.15 Automated backport to the 8.15 branch with mergify backport-8.16 Automated backport with mergify labels Oct 25, 2024
@swiatekm swiatekm marked this pull request as ready for review October 25, 2024 13:01
@swiatekm swiatekm requested a review from a team as a code owner October 25, 2024 13:01
@swiatekm swiatekm requested review from gizas and constanca-m October 25, 2024 13:01
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@swiatekm
Contributor Author

swiatekm commented Oct 25, 2024

I've no idea what the problem is with linting on darwin. It looks like a build error, but we don't build metricbeat on macOS, so it's difficult to diagnose. I can reproduce it locally by running golangci-lint with GOOS=darwin CGO_ENABLED=1, but the reported error is simply incorrect.

EDIT: Just added an exception, similar to #33649.

@swiatekm swiatekm force-pushed the fix-metricbeat-container-metrics branch from b4248b4 to 17cb914 Compare October 28, 2024 10:46
@swiatekm swiatekm requested a review from a team as a code owner October 28, 2024 10:46
@swiatekm swiatekm requested review from faec and VihasMakwana October 28, 2024 10:47
@pierrehilbert pierrehilbert requested review from mauri870 and removed request for faec October 28, 2024 11:43
Member

@mauri870 mauri870 left a comment

The code looks good overall, but I don't have deep knowledge of the integration. It would be helpful to get a review from another developer.

@MichaelKatsoulis
Contributor

@swiatekm The changes look good to me. I also tested them, and everything appears to work as expected.
Maybe you could update the enrichers.md file as well, so that it describes the new process.

@swiatekm
Contributor Author

@MichaelKatsoulis I'll update the documentation in a follow-up, I don't want to hold this PR up.

@swiatekm swiatekm merged commit e7cc6fc into main Oct 30, 2024
37 checks passed
@swiatekm swiatekm deleted the fix-metricbeat-container-metrics branch October 30, 2024 15:16
mergify bot pushed a commit that referenced this pull request Oct 30, 2024
…41453)

* Fix Pod and container resource limit metrics missing intermittently

* Add another exception to typecheck linter

(cherry picked from commit e7cc6fc)
swiatekm added a commit that referenced this pull request Oct 30, 2024
…41453) (#41484)

* Fix Pod and container resource limit metrics missing intermittently

* Add another exception to typecheck linter

(cherry picked from commit e7cc6fc)

Co-authored-by: Mikołaj Świątek <[email protected]>
pierrehilbert pushed a commit that referenced this pull request Oct 30, 2024
…41453) (#41483)

* Fix Pod and container resource limit metrics missing intermittently

* Add another exception to typecheck linter

(cherry picked from commit e7cc6fc)

Co-authored-by: Mikołaj Świątek <[email protected]>
pierrehilbert pushed a commit that referenced this pull request Oct 30, 2024
…41453) (#41485)

* Fix Pod and container resource limit metrics missing intermittently

* Add another exception to typecheck linter

(cherry picked from commit e7cc6fc)

Co-authored-by: Mikołaj Świątek <[email protected]>
Development

Successfully merging this pull request may close these issues.

Pod and container resource limit metrics missing intermittently
4 participants