Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pass decompressed size to parquet Codec::decompress (#2956) #2959

Merged
merged 2 commits into from
Oct 29, 2022

Conversation

marioloko
Copy link
Contributor

Which issue does this PR close?

Closes #2956.

Rationale for this change

What changes are included in this PR?

Added optional argument uncompressed_size to Coded::decompress to do a better estimation of the required uncompress size.

  • snappy: Probably no much improvement as decompress_len is already accurate.
  • gzip: No improvement. Ignores the size hint.
  • brotli: Probably no much improvement. The buffer size will be equal to the uncompressed_size size.
  • lz4: No improvement. As the buffer is located at the stack there are no extra allocations. Then, it probably is better to keep it working as it is.
  • zstd: No improvement. Ignores the size hint.
  • lz4_raw: Improvement. The estimation method over-estimates, so knowin the uncompressed size reduces allocations.

Are there any user-facing changes?

Breaking changes on Codec trait. It only affects to users with experimental feature enable.

Added optional argument uncompressed_size to Coded::decompress to do a better
estimation of the required uncompress size.

* snappy: Probably no much improvement as `decompress_len` is already accurate.
* gzip: No improvement. Ignores the size hint.
* brotli: Probably no much improvement. The buffer size will be equal to the uncompressed_size size.
* lz4: No improvement. As the buffer is located at the stack there are no extra allocations. Then it probably is better to keep it working as it is.
* zstd: No improvement. Ignores the size hint.
* lz4_raw: Improvement. The estimation method over-estimates, so knowin the uncompressed size reduces allocations.
@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 27, 2022
A page may contain header, uncompressed size includes the header size. The `decompress` method expects to receive the `uncompress_size` for the compress block, that is without the page headers.
@marioloko
Copy link
Contributor Author

It seems that the estimation of lz4 uncompress size can cause overflow for small compress size. Any compress size smaller than 10 will overflow and as though it will panic.

So I see too options now:

  1. To change predictions formula to return 255 for any compressed size smaller than 10.
  2. To only allow lz4_raw if uncompressed_size is provided, and return an error saying 'LZ4_RAW without known uncompressed_size is unsupported'.

I would go with the second one, as even if the overflow error is only for small compression sizes, if the compressed size is 1G it will reserve ~250GB which is too much. So I would avoid prediction.

What do you think?

@tustvold tustvold merged commit 344c552 into apache:master Oct 29, 2022
@ursabot
Copy link

ursabot commented Oct 29, 2022

Benchmark runs are scheduled for baseline = 94a7f4b and contender = 344c552. 344c552 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pass Decompressed Size to Parquet Codec::decompress
3 participants