
Fix awscloudwatch worker allocation #38953

Merged
merged 6 commits into elastic:main from cloudwatch-fix on Apr 23, 2024

Conversation

@faec (Contributor) commented Apr 15, 2024

Fix a bug in cloudwatch worker allocation that could cause data loss (#38918).

The previous behavior wasn't really tested, since worker tasks were computed in cloudwatchPoller's polling loop, which required live AWS connections. So in addition to the basic logical fix, I did some refactoring of cloudwatchPoller that makes the task iteration visible to unit tests.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

@faec faec added bug Team:Elastic-Agent Label for the Agent team Team:Cloud-Monitoring Label for the Cloud Monitoring team labels Apr 15, 2024
@faec faec self-assigned this Apr 15, 2024
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Apr 15, 2024
@faec faec marked this pull request as ready for review April 15, 2024 18:42
@faec faec requested a review from a team as a code owner April 15, 2024 18:42
@elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

mergify bot commented Apr 15, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @faec? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8.x.0 is the label to automatically backport to the 8.x branch, where x is the minor version digit

elasticmachine commented Apr 15, 2024

💚 Build Succeeded


Build stats

  • Duration: 136 min 51 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

mergify bot commented Apr 16, 2024

This pull request is now in conflict. Could you fix it? 🙏
To fix it up, you can check out the pull request locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b cloudwatch-fix upstream/cloudwatch-fix
git merge upstream/main
git push upstream cloudwatch-fix

// main loop doesn't have to block on the workers
// while distributing new data.
workRequestChan: make(chan struct{}),
workResponseChan: make(chan workResponse, 10),
A reviewer (Contributor) commented:
Is this 10 chosen for a reason? Or just good practice?

@faec (Contributor, Author) replied:
The choice to buffer it is because it lets the polling loop send multiple responses in one scheduler interval, but the choice of 10 is just an arbitrary small number. (Heuristically, a buffer of 10 means roughly 90% less contention in the main loop than it would have had if the channel were synchronous. I don't know exactly how much contention the synchronous case has, but eliminating 90% of it should avoid any bottleneck.)

@cmacknz cmacknz added the backport-v8.14.0 Automated backport with mergify label Apr 18, 2024
@faec faec merged commit deece39 into elastic:main Apr 23, 2024
33 of 35 checks passed
@faec faec deleted the cloudwatch-fix branch April 23, 2024 16:28
mergify bot pushed a commit that referenced this pull request Apr 23, 2024
(cherry picked from commit deece39)
faec added a commit that referenced this pull request Apr 24, 2024
(cherry picked from commit deece39)

Co-authored-by: Fae Charlton <[email protected]>
faec added a commit that referenced this pull request May 9, 2024
…jects (#39353)

A large cleanup in the `aws-s3` input, reorganizing the file structure and splitting internal APIs into additional helpers.

This change is meant to have no functional effect; it is strictly a cleanup and reorganization in preparation for future changes. The hope is that the new layout makes initialization steps and logical dependencies clearer. The main changes are:

- Make `s3Poller` and `sqsReader` into standalone input objects, `s3PollerInput` and `sqsReaderInput`, that implement the `v2.Input` interface, instead of interleaving the two implementations within the same object.
  * Choose the appropriate input in `(*s3InputManager).Create` based on configuration
  * Move associated internal API out of the shared `input.go` into the new `s3_input.go` and `sqs_input.go`, while leaving `s3.go` and `sqs.go` for auxiliary helpers.
  * Give each input a copy of `config` and `awsConfig`, and remove redundant struct fields that simply shadowed fields already in those configs.
- In `sqsReaderInput`, use a fixed set of worker goroutines and track task allocation via channel-based work requests instead of creating ephemeral workers via the previous custom semaphore implementation (similar to the [recent cloudwatch cleanup](#38953)).
  * Delete `aws.Sem`, since this was its last remaining caller
- Collect the helpers related to approximate message count polling into a helper object, `messageCountMonitor`, so their role in the input is clearer.
- Generally, break larger steps up into smaller helper functions
- Generally, collect initialization dependencies in the same place so the sequencing is clearer.
Labels
backport-v8.14.0 Automated backport with mergify bug Team:Cloud-Monitoring Label for the Cloud Monitoring team Team:Elastic-Agent Label for the Agent team
Development

Successfully merging this pull request may close these issues.

awscloudwatch input drops data
4 participants