Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add asynchronous ACK handling to S3 and SQS inputs #40699

Merged
merged 41 commits into from
Oct 16, 2024

Conversation

faec
Copy link
Contributor

@faec faec commented Sep 5, 2024

Modify SQS ingestion to listen for ACKs asynchronously so that input workers can keep reading new objects after a previous one has been published, instead of blocking on full upstream ingestion. This addresses the bottleneck where ingesting many small objects is slow as each one waits for a full ingestion round trip.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

This can best be tested by ingesting data from a live S3 or SQS queue. The scenario that most highlights the changed performance is ingesting many small individual objects.

Related issues

@faec faec added Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team backport-8.15 Automated backport to the 8.15 branch with mergify labels Sep 5, 2024
@faec faec self-assigned this Sep 5, 2024
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Sep 5, 2024
@jlind23
Copy link
Collaborator

jlind23 commented Sep 10, 2024

@faec is this ready to be reviewed by O11y team?

@faec
Copy link
Contributor Author

faec commented Sep 24, 2024

@faec is this ready to be reviewed by O11y team?

I think we've talked directly since this ping, but for visibility: I don't consider this ready for review until I can try it on a live SQS queue, which has been deferred for a few reasons (previously SDH rotation and illness, currently the OTel remote-offsite). I expect that running on a live queue will immediately fail in some obvious fixable ways and I want to get those out of the way before proper review.

You also asked me to integrate #39709 (which was never merged to main) with this PR before finalizing, which hasn't been started and will probably take a day or two beyond the basic smoke check.

Copy link
Contributor

mergify bot commented Sep 24, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b awss3-ack-handling upstream/awss3-ack-handling
git merge upstream/main
git push upstream awss3-ack-handling

Copy link
Contributor

mergify bot commented Sep 24, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Sep 24, 2024
@pierrehilbert
Copy link
Collaborator

Let's separate the work here, I added issues in your next sprint to take care of #39718

@jlind23
Copy link
Collaborator

jlind23 commented Sep 24, 2024

Thanks @faec for the reply which close some of the knowledge gaps on my end.
I let you focus on the OTel related workshop and will come back in a few days.
The target remains unchanged and we should try to make all of those changes land before 8.16 feature freeze.

@faec faec requested a review from a team as a code owner October 14, 2024 23:08
@@ -46,6 +46,7 @@ https://github.com/elastic/beats/compare/v8.8.1\...main[Check the HEAD diff]
- Added `container.image.name` to `journald` Filebeat input's Docker-specific translated fields. {pull}40450[40450]
- Change log.file.path field in awscloudwatch input to nested object. {pull}41099[41099]
- Remove deprecated awscloudwatch field from Filebeat. {pull}41089[41089]
- `max_number_of_messages` config for S3 input's SQS mode is now ignored. Instead use `number_of_workers` to scale ingestion rate in both S3 and SQS modes. {pull}40699[40699]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What impact will this have for users that were relying on max_number_of_messages to try to tune the input?

Is it just going to perform better with no user intervention? Could there be unintended side effects, like agents running out of memory because they can now fill up their queues?

Under what circumstances would number_of_workers be something a user has to adjust?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What impact will this have for users that were relying on max_number_of_messages to try to tune the input?

For users that set the field to something unreasonably high to improve throughput (which was previously the only way to work around this bottleneck), they will instead get good performance by default with no added cost. The only scenario that should be "bad" in this change is if the input objects are very small (the worst-case bottleneck) and they intentionally tuned max_number_of_messages low so that ingestion would be extremely slow. In that case, ingestion speed will significantly increase (which costs a fair amount more network bandwidth, and marginally more CPU). [Though "I used this special config value to intentionally throttle my ingestion to one event per second" is very much in the "holding the escape key down to warm up the room" category imo -- total ingestion costs should not go up, it should just empty the initial queue faster.]

agents running out of memory because they can now fill up their queues?

It isn't impossible for this to happen -- if users configured a queue larger than they actually used, and were relying on the fact that this input never has more than 5-10 active events, then the new version will use more memory. However, even a full queue is usually less than 10% of Filebeat's memory footprint, so this wouldn't be a significant factor unless they had explicitly configured a very large queue that they never use.

Under what circumstances would number_of_workers be something a user has to adjust?

If they have a cap on network resources, or to a lesser extent CPU ("number of workers" is also "number of potentially parallel AWS API calls/downloads", and also number of downloaded json blobs parsed in parallel). Or in the other direction, I suppose, if they have a lot of cores and bandwidth and really want the SQS queue to move fast.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, let's add a little bit more context around this.

I think this changelog is a bit too modest, it should specifically mention the performance improvement and that tuning max_number_of_messages should not be necessary anymore.

You should also mention the small object use case and the potential to increase memory usage.

The changelogs here are usually terse but that isn't actually a requirement. You could add an entire sub-section if you wanted to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks!

@faec faec enabled auto-merge (squash) October 15, 2024 23:44
@faec faec merged commit d2867fd into elastic:main Oct 16, 2024
141 of 143 checks passed
mergify bot pushed a commit that referenced this pull request Oct 16, 2024
Modify SQS ingestion to listen for ACKs asynchronously so that input workers can keep reading new objects after a previous one has been published, instead of blocking on full upstream ingestion. This addresses the bottleneck where ingesting many small objects is slow as each one waits for a full ingestion round trip. With a default configuration, SQS queues with many small objects are now ingested up to 60x faster.

(cherry picked from commit d2867fd)

# Conflicts:
#	go.sum
#	x-pack/filebeat/input/awss3/input_benchmark_test.go
#	x-pack/filebeat/input/awss3/s3_objects.go
#	x-pack/filebeat/input/awss3/sqs_s3_event_test.go
mergify bot pushed a commit that referenced this pull request Oct 16, 2024
Modify SQS ingestion to listen for ACKs asynchronously so that input workers can keep reading new objects after a previous one has been published, instead of blocking on full upstream ingestion. This addresses the bottleneck where ingesting many small objects is slow as each one waits for a full ingestion round trip. With a default configuration, SQS queues with many small objects are now ingested up to 60x faster.

(cherry picked from commit d2867fd)

# Conflicts:
#	x-pack/filebeat/input/awss3/input_benchmark_test.go
#	x-pack/filebeat/input/awss3/sqs_s3_event_test.go
faec added a commit that referenced this pull request Oct 16, 2024
…puts (#41249)

* Add asynchronous ACK handling to S3 and SQS inputs (#40699)

Modify SQS ingestion to listen for ACKs asynchronously so that input workers can keep reading new objects after a previous one has been published, instead of blocking on full upstream ingestion. This addresses the bottleneck where ingesting many small objects is slow as each one waits for a full ingestion round trip. With a default configuration, SQS queues with many small objects are now ingested up to 60x faster.

(cherry picked from commit d2867fd)

# Conflicts:
#	x-pack/filebeat/input/awss3/input_benchmark_test.go
#	x-pack/filebeat/input/awss3/sqs_s3_event_test.go

* fix broken merge

---------

Co-authored-by: Fae Charlton <[email protected]>
@faec faec deleted the awss3-ack-handling branch October 16, 2024 13:45
belimawr pushed a commit to belimawr/beats that referenced this pull request Oct 18, 2024
Modify SQS ingestion to listen for ACKs asynchronously so that input workers can keep reading new objects after a previous one has been published, instead of blocking on full upstream ingestion. This addresses the bottleneck where ingesting many small objects is slow as each one waits for a full ingestion round trip. With a default configuration, SQS queues with many small objects are now ingested up to 60x faster.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aws Enable builds in the CI for aws cloud testing backport-8.x Automated backport to the 8.x branch with mergify backport-8.15 Automated backport to the 8.15 branch with mergify Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

aws-s3 input workers shouldn't wait for objects to be fully ingested before starting the next object
8 participants