Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What is the highest compression level in gzip? #6282

Closed
JakkuSakura opened this issue Aug 21, 2024 · 9 comments · Fixed by #6453
Closed

What is the highest compression level in gzip? #6282

JakkuSakura opened this issue Aug 21, 2024 · 9 comments · Fixed by #6453
Labels
question Further information is requested

Comments

@JakkuSakura
Copy link

Which part is this question about
What is the highest compression level in gzip?

Describe your question
I see from other sources, including flate2, the highest compression level for gzip is 9 instead of 10. If we pass 10, it should be accepted by parquet but rejected by flate2. Am I getting misunderstanding somewhere?

impl CompressionLevel<u32> for GzipLevel {
    const MINIMUM_LEVEL: u32 = 0;
    const MAXIMUM_LEVEL: u32 = 10;
}
@JakkuSakura JakkuSakura added the question Further information is requested label Aug 21, 2024
@ByteBaker
Copy link
Contributor

It seems flate2 documentation is wrong.

/// Returns an integer representing the compression level, typically on a
/// scale of 0-9
pub fn level(&self) -> u32 {
    self.0
}

But internally, inside DeflateBackend::make they have debug_assert!(level.level() <= 10);. Using compression level up to 10 works fine but panics for bigger values.

@ByteBaker
Copy link
Contributor

ByteBaker commented Sep 23, 2024

@JakkuSakura After discussing with flate2 maintainers, I've confirmed that it's actually a documentation issue, but there's a slight caveat as well. flate2 supports both miniz and zlib backends, the former enabled by default.

For consistency with zlib (which supports up to 9), the documentation states the compression range as 0-9, but 10 is supported miniz, enabled by default. If the backend is switched to zlib, then an attempt to use a compression level 10 will cause a panic.

I've opened a pull request in flate2 to explicitly mention this in the docs, so that there's no confusion around this behavior. I hope this resolves your query?

@ByteBaker
Copy link
Contributor

One more thing to add here. Parquet (and the entire arrow project) uses flate2 with rust_backend feature enabled. Which uses miniz backend, thereby supporting level 10 compression, aka UBER COMPRESSION. Flate2 still chooses to call 9 as the best level of compression because with 10 we might run into performance issues on the user's device.

The PR I created in flate2 is merged. So the docs should mention this caveat very soon hopefully. But behaviorally speaking, using level = 10 in parquet shouldn't be a problem at all. Discretion is advised when using flate2 separately.

@alamb @tustvold what do you think? And should we close this?

@alamb
Copy link
Contributor

alamb commented Sep 24, 2024

Maybe we can add a note to the arrow documentation with a link to flate2 and close this issuse?

@ByteBaker
Copy link
Contributor

Sounds like a good idea. I'll make this a part of #37 exercise itself.

@ByteBaker
Copy link
Contributor

ByteBaker commented Sep 27, 2024

@alamb @tustvold have a look please before I add anything to the docs.

rust-lang/flate2-rs#427 (comment) and rust-lang/flate2-rs#429

@alamb
Copy link
Contributor

alamb commented Sep 28, 2024

@alamb @tustvold have a look please before I add anything to the docs.

rust-lang/flate2-rs#427 (comment) and rust-lang/flate2-rs#429

I recommend linking to the docs added in rust-lang/flate2-rs#430 -- they are pretty clear to me. Basically we can say the max is 10 but offer the caveat that nothing else will be able to read the parquet files

ByteBaker added a commit to ByteBaker/arrow-rs that referenced this issue Sep 29, 2024
- add pragma `#![warn(missing_docs)]` to the following
  - `arrow-flight`
  - `arrow-ipc`
  - `arrow-integration-test`
  - `arrow-integration-testing`
  - `object_store`

- also document the caveat with using level 10 GZIP compression in
  parquet. See apache#6282.
@JakkuSakura
Copy link
Author

Thanks for clarifying everything here. After the docs changes, nobody be confused by this

@ByteBaker
Copy link
Contributor

ByteBaker commented Sep 29, 2024

Summarizing for future visitors:

flate2 supports zlib and miniz backends. Both support levels 0-9 compression. However, miniz also supports a level 10, calling it UBER COMPRESSION. It also makes it clear that it's non-standard compression level and hence any data compressed using level 10 compression will not be readable by a standard GZIP reader.

Parquet (and Arrow) uses flate2 with miniz enabled. Thus supporting level 10 compression. By extension, using level 10 compression in Parquet would mean that the generated file won't be readable by any other Parquet reader.

Users should not use level 10 compression with parquet if they intend to read the file with other readers as well.

References:

rust-lang/flate2-rs#427 (comment)
rust-lang/flate2-rs#430
https://github.com/Frommi/miniz_oxide/blob/master/miniz/miniz.h#L235
https://github.com/Frommi/miniz_oxide/blob/master/miniz/miniz.h#L229

alamb pushed a commit that referenced this issue Oct 1, 2024
* chore: add docs, part of #37
- add pragma `#![warn(missing_docs)]` to the following
  - `arrow-flight`
  - `arrow-ipc`
  - `arrow-integration-test`
  - `arrow-integration-testing`
  - `object_store`

- also document the caveat with using level 10 GZIP compression in
  parquet. See #6282.

* chore: resolve PR comments from #6453
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants