Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix recovering from invalid output configuration under Elastic-Agent #36016

Merged
merged 5 commits into from
Jul 7, 2023

Conversation

belimawr
Copy link
Contributor

@belimawr belimawr commented Jul 6, 2023

What does this PR do?

This PR fixes two problems in the ManagerV2:

  1. If the first output configuration was invalid, it would never be
    "saved", so if a new output configuration was received, the ManagerV2
    would try to reload it even with stopOnOutputReload enabled.

  2. When stopOnOutputReload was set the output reload was skipped,
    but the ManagerV2 would still proceed with reloading inputs. This
    created a race condition where some inputs/runners would receive their
    stop signal before even start, effectively locking the shutdown
    process.

It also refactors some of the code used to mock the Elastic-Agent and
aggregates some shared code into a single package.

Why is it important?

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
    - [ ] I have made corresponding changes to the documentation
    - [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

## Author's Checklist

How to test this PR locally

Run the Go integration tests. Optionally you can also run the single test added by this PR:

cd x-pack/filebeat
mage buildSystemTestBinary && go test -v -count=1 -tags=integration -run=TestRecover ./tests/integration

Related issues

## Use cases
## Screenshots
## Logs

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jul 6, 2023
@mergify
Copy link
Contributor

mergify bot commented Jul 6, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@belimawr belimawr marked this pull request as ready for review July 6, 2023 16:47
@belimawr belimawr requested a review from a team as a code owner July 6, 2023 16:47
@belimawr belimawr requested review from rdner and leehinman July 6, 2023 16:47
@belimawr belimawr added Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team backport-v8.9.0 Automated backport with mergify labels Jul 6, 2023
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jul 6, 2023
@mergify
Copy link
Contributor

mergify bot commented Jul 6, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-agent-unhealthy-on-broken-output upstream/fix-agent-unhealthy-on-broken-output
git merge upstream/main
git push upstream fix-agent-unhealthy-on-broken-output

belimawr added 4 commits July 6, 2023 18:54
This commit fixes two problems in the `ManagerV2`:
- 1. If the first output configuration was invalid, it would never be
"saved", so if a new output configuration was received, the ManagerV2
would try to reload it even with `stopOnOutputReload` enabled.

2. When `stopOnOutputReload` was set the output reload was skipped,
but the ManagerV2 would still proceed with reloading inputs. This
created a race condition where some inputs/runners would receive their
stop signal before even start, effectively locking the shutdown
process.
This commit add tests to ensure Beats can recover from an invalid
output configuration when running under Elastic-Agent.

It also refactors some of the code used to mock the Elastic-Agent and
aggregates some shared code into a single package.
@belimawr belimawr force-pushed the fix-agent-unhealthy-on-broken-output branch from 24e4cbe to e3367ed Compare July 6, 2023 16:54
@elasticmachine
Copy link
Collaborator

elasticmachine commented Jul 6, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-07-07T09:20:44.710+0000

  • Duration: 93 min 25 sec

Test stats 🧪

Test Results
Failed 0
Passed 27494
Skipped 2023
Total 29517

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@belimawr belimawr force-pushed the fix-agent-unhealthy-on-broken-output branch from a3abbbe to 8c8bb9a Compare July 7, 2023 09:20
Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested locally, individually reverting the two fixes here (exiting reload when a restart is detected, unconditionally saving the output configuration) both cause the new test to fail.

@cmacknz cmacknz merged commit 15a9ec7 into elastic:main Jul 7, 2023
mergify bot pushed a commit that referenced this pull request Jul 7, 2023
…36016)

* Fix recovering from invalid output configuration under Elastic-Agent

This commit fixes two problems in the `ManagerV2`:
- 1. If the first output configuration was invalid, it would never be
"saved", so if a new output configuration was received, the ManagerV2
would try to reload it even with `stopOnOutputReload` enabled.

2. When `stopOnOutputReload` was set the output reload was skipped,
but the ManagerV2 would still proceed with reloading inputs. This
created a race condition where some inputs/runners would receive their
stop signal before even start, effectively locking the shutdown
process.

* Add and refactor tests

This commit add tests to ensure Beats can recover from an invalid
output configuration when running under Elastic-Agent.

It also refactors some of the code used to mock the Elastic-Agent and
aggregates some shared code into a single package.

* PR improvements

* Add changelog

* PR improvements

(cherry picked from commit 15a9ec7)

# Conflicts:
#	libbeat/tests/integration/framework.go
cmacknz added a commit that referenced this pull request Jul 8, 2023
…36016) (#36022)

* Fix recovering from invalid output configuration under Elastic-Agent

This commit fixes two problems in the `ManagerV2`:
- 1. If the first output configuration was invalid, it would never be
"saved", so if a new output configuration was received, the ManagerV2
would try to reload it even with `stopOnOutputReload` enabled.

2. When `stopOnOutputReload` was set the output reload was skipped,
but the ManagerV2 would still proceed with reloading inputs. This
created a race condition where some inputs/runners would receive their
stop signal before even start, effectively locking the shutdown
process.

* Add and refactor tests

This commit add tests to ensure Beats can recover from an invalid
output configuration when running under Elastic-Agent.

It also refactors some of the code used to mock the Elastic-Agent and
aggregates some shared code into a single package.

* PR improvements

* Add changelog

* PR improvements

(cherry picked from commit 15a9ec7)

# Conflicts:
#	libbeat/tests/integration/framework.go

Co-authored-by: Tiago Queiroz <[email protected]>
Co-authored-by: Craig MacKenzie <[email protected]>
Scholar-Li pushed a commit to Scholar-Li/beats that referenced this pull request Feb 5, 2024
…lastic#36016)

* Fix recovering from invalid output configuration under Elastic-Agent

This commit fixes two problems in the `ManagerV2`:
- 1. If the first output configuration was invalid, it would never be
"saved", so if a new output configuration was received, the ManagerV2
would try to reload it even with `stopOnOutputReload` enabled.

2. When `stopOnOutputReload` was set the output reload was skipped,
but the ManagerV2 would still proceed with reloading inputs. This
created a race condition where some inputs/runners would receive their
stop signal before even start, effectively locking the shutdown
process.

* Add and refactor tests

This commit add tests to ensure Beats can recover from an invalid
output configuration when running under Elastic-Agent.

It also refactors some of the code used to mock the Elastic-Agent and
aggregates some shared code into a single package.

* PR improvements

* Add changelog

* PR improvements
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.9.0 Automated backport with mergify Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Agents remain unhealthy on switching from invalid output to valid output.
3 participants