
Fix awscloudwatch worker allocation #38953

Merged
merged 6 commits into elastic:main from cloudwatch-fix on Apr 23, 2024

Conversation

@faec (Contributor) commented Apr 15, 2024

Fix a bug in cloudwatch worker allocation that could cause data loss (#38918).

The previous behavior wasn't really tested, since worker tasks were computed in cloudwatchPoller's polling loop, which required live AWS connections. So in addition to the basic logical fix, I did some refactoring of cloudwatchPoller that makes the task iteration visible to unit tests.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

@faec faec added bug Team:Elastic-Agent Label for the Agent team Team:Cloud-Monitoring Label for the Cloud Monitoring team labels Apr 15, 2024
@faec faec self-assigned this Apr 15, 2024
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Apr 15, 2024
@faec faec marked this pull request as ready for review April 15, 2024 18:42
@faec faec requested a review from a team as a code owner April 15, 2024 18:42
@elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

mergify bot commented Apr 15, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @faec? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8.x.0 is the label to automatically backport to the 8.x branch, where x is the minor version digit

elasticmachine commented Apr 15, 2024

💚 Build Succeeded


Build stats

  • Duration: 136 min 51 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

mergify bot commented Apr 16, 2024

This pull request is now in conflict. Could you fix it? 🙏
To fix it up, you can check out the pull request locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b cloudwatch-fix upstream/cloudwatch-fix
git merge upstream/main
git push upstream cloudwatch-fix

// main loop doesn't have to block on the workers
// while distributing new data.
workRequestChan: make(chan struct{}),
workResponseChan: make(chan workResponse, 10),
A reviewer (Contributor) commented:
Is this 10 chosen for a reason? Or just good practice?

@faec (Contributor, Author) replied:
The choice to buffer it is because it lets the polling loop send multiple responses in one scheduler interval, but the choice of 10 is just an arbitrary small number. (Heuristically, a buffer of 10 means roughly 90% less contention in the main loop than it would have had if the channel were synchronous. I don't know exactly how much contention the synchronous case has, but eliminating 90% of it should avoid any bottleneck.)

@cmacknz cmacknz added the backport-v8.14.0 Automated backport with mergify label Apr 18, 2024
@faec faec merged commit deece39 into elastic:main Apr 23, 2024
33 of 35 checks passed
@faec faec deleted the cloudwatch-fix branch April 23, 2024 16:28
mergify bot pushed a commit that referenced this pull request Apr 23, 2024
(cherry picked from commit deece39)
faec added a commit that referenced this pull request Apr 24, 2024
(cherry picked from commit deece39)

Co-authored-by: Fae Charlton <[email protected]>
faec added a commit that referenced this pull request May 9, 2024
…jects (#39353)

A large cleanup in the `aws-s3` input, reorganizing the file structure and splitting internal APIs into additional helpers.

This change is meant to have no functional effect; it is strictly a cleanup and reorganization in preparation for future changes. The hope is that the new layout makes initialization steps and logical dependencies clearer. The main changes are:

- Make `s3Poller` and `sqsReader` into standalone input objects, `s3PollerInput` and `sqsReaderInput`, that implement the `v2.Input` interface, instead of interleaving the two implementations within the same object.
  * Choose the appropriate input in `(*s3InputManager).Create` based on configuration
  * Move associated internal API out of the shared `input.go` into the new `s3_input.go` and `sqs_input.go`, while leaving `s3.go` and `sqs.go` for auxiliary helpers.
  * Give each input a copy of `config` and `awsConfig`, and remove redundant struct fields that simply shadowed fields already in those configs.
- In `sqsReaderInput`, use a fixed set of worker goroutines and track task allocation via channel-based work requests instead of creating ephemeral workers via the previous custom semaphore implementation (similar to the [recent cloudwatch cleanup](#38953)).
  * Delete `aws.Sem`, since this was its last remaining caller
- Collect the helpers related to approximate message count polling into a helper object, `messageCountMonitor`, so their role in the input is clearer.
- Generally, break larger steps up into smaller helper functions
- Generally, collect initialization dependencies in the same place so the sequencing is clearer.
Labels
backport-v8.14.0 Automated backport with mergify bug Team:Cloud-Monitoring Label for the Cloud Monitoring team Team:Elastic-Agent Label for the Agent team
Development

Successfully merging this pull request may close these issues.

awscloudwatch input drops data
4 participants