decompression: worse throughput when using tower_http::decompression than manual impl with async-compression #520

Closed
1 task done
magurotuna opened this issue Sep 22, 2024 · 0 comments · Fixed by #521

  • I have looked for existing issues (including closed) about this

Bug Report

In some situations, decompression throughput is significantly worse when using tower_http::decompression than when implementing similar logic manually with the async-compression crate.

Version

Platform

Apple silicon macOS

(Not 100% sure, but this should happen on Linux as well)

Description

In Deno, we switched the inner implementation of fetch (the JavaScript API) from a reqwest-based one to a hyper-util-based one.

denoland/deno#24593

In the hyper-util based implementation, it uses tower_http::decompression to decompress the fetched data if necessary. Note here that reqwest doesn't use tower_http.

After this change, we started to see degraded throughput, especially when the server serves large compressed data. The following graph shows how long each Deno version takes to complete 2k requests, where it fetches compressed data from the upstream server and then forwards it to the end client.

[Figure: performance of proxying compressed data in different versions of Deno]

v1.45.2 is the last version before we switched to the hyper-based fetch implementation. Since v1.45.3, where the switch landed, the throughput has been 10x worse.

I then identified tower_http::decompression as the cause of this issue, and found that if we implement the decompression logic by using the async-compression crate directly, the performance returns to what it was. (See denoland/deno#25800 for how the manual implementation with async-compression affects the performance.)

You can find how I performed the benchmark at https://github.com/magurotuna/deno_fetch_decompression_throughput

magurotuna added a commit to magurotuna/tower-http that referenced this issue Sep 22, 2024
Currently, every time `WrapBody::poll_frame` is called, a new instance of
`BytesMut` is created with the default capacity, which is effectively
64 bytes. This results in a large number of memory allocations in certain
situations, making the throughput significantly worse.

To reduce allocations, `WrapBody` now holds a `BytesMut` as a field,
with an initial capacity of 4096 bytes. This buffer is reused as much
as possible across multiple `poll_frame` calls, and a new allocation of
another 4096 bytes is performed only when its capacity becomes 0.

Fixes: tower-rs#520
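
The allocation pattern described in the commit message can be illustrated with a rough, standard-library-only sketch. It is not the tower-http code and does not use `BytesMut`: a plain `Vec<u8>` stands in for the buffer, any `Read` stands in for the decoder, and `Stats`, `drain_with_fresh_buffers`, `ReusedBuffer` and `drain_with_reused_buffer` are hypothetical names introduced here. The sketch only counts frames and allocations; it does not hand the data on to a consumer.

```rust
use std::io::{self, Read};

/// Capacity of a freshly created buffer ("effectively 64 bytes" per the commit message).
const SMALL_CAPACITY: usize = 64;
/// Capacity of the reused buffer (4096 bytes per the commit message).
const REUSED_CAPACITY: usize = 4096;

/// Counters collected while draining a source (hypothetical helper type).
#[derive(Debug, Default)]
struct Stats {
    frames: usize,
    allocations: usize,
    bytes: usize,
}

/// Old pattern: a new small buffer is allocated on every poll.
fn drain_with_fresh_buffers<R: Read>(mut source: R) -> io::Result<Stats> {
    let mut stats = Stats::default();
    loop {
        let mut buf = vec![0u8; SMALL_CAPACITY]; // allocation on every poll
        stats.allocations += 1;
        let n = source.read(&mut buf)?;
        if n == 0 {
            return Ok(stats); // end of input
        }
        stats.frames += 1;
        stats.bytes += n;
    }
}

/// New pattern: the buffer is a field that survives across polls.
struct ReusedBuffer {
    buf: Vec<u8>,
    filled: usize,
    allocations: usize,
}

impl ReusedBuffer {
    fn new() -> Self {
        Self { buf: vec![0u8; REUSED_CAPACITY], filled: 0, allocations: 1 }
    }

    /// Reads into the unused tail of the buffer; allocates a new buffer
    /// only when no free space is left.
    fn poll<R: Read>(&mut self, source: &mut R) -> io::Result<usize> {
        if self.filled == self.buf.len() {
            self.buf = vec![0u8; REUSED_CAPACITY];
            self.filled = 0;
            self.allocations += 1;
        }
        let n = source.read(&mut self.buf[self.filled..])?;
        self.filled += n;
        Ok(n)
    }
}

fn drain_with_reused_buffer<R: Read>(mut source: R) -> io::Result<Stats> {
    let mut state = ReusedBuffer::new();
    let mut stats = Stats::default();
    loop {
        let n = state.poll(&mut source)?;
        if n == 0 {
            break; // end of input
        }
        stats.frames += 1;
        stats.bytes += n;
    }
    stats.allocations = state.allocations;
    Ok(stats)
}
```

Draining a 1 MiB in-memory slice (`&[u8]` implements `Read`) through both functions gives 16385 allocations for the fresh-buffer path and 257 for the reused-buffer path in this sketch. A slice always fills the whole buffer on each read; a real decoder may produce less per poll, in which case the reused buffer keeps filling its remaining space before a new allocation is needed.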