Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet: don't truncate f16/decimal min/max stats #5154

Merged

Conversation

Jefffrey
Copy link
Contributor

@Jefffrey Jefffrey commented Dec 1, 2023

Which issue does this PR close?

Closes #5075

Rationale for this change

All logical types with BYTE_ARRAY/FIXED_LEN_BYTE_ARRAY underlying physical types share the sort order as their physical type, that being unsigned byte-wise comparison, except for Float16 and Decimal. Applying the truncation logic to their min/max statistics can therefore lead to inaccurate statistics.

What changes are included in this PR?

Add check before truncating min/max statistics to ensure Decimal and Float16 columns don't get truncated.

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Dec 1, 2023
@Jefffrey
Copy link
Contributor Author

Jefffrey commented Dec 1, 2023

For reference parquet-mr has similar logic, difference being instead of opting out of truncation for Decimal, they enforce that types need to opt-in:

https://github.com/apache/parquet-mr/blob/97e6ba8e8343a7274e530a83496a7933cc5ffc01/parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/BinaryTruncator.java#L182-L225

Which has similar effect in ensuring Decimal isn't truncated

(Float16 support isn't yet merged in for parquet-mr, but they aren't changing that logic in the Float16 PR so presumably will also not do truncation: apache/parquet-java#1142)

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the only place we perform truncation, or do we also need to do something similar for the ColumnChunkMetadata statistics?

@Jefffrey
Copy link
Contributor Author

Jefffrey commented Dec 5, 2023

Is this the only place we perform truncation, or do we also need to do something similar for the ColumnChunkMetadata statistics?

Those were the only places that called the truncate min/max functions

@tustvold
Copy link
Contributor

tustvold commented Dec 5, 2023

What about in write_column_metadata?

Comment on lines +967 to +969
Statistics::FixedLenByteArray(stats)
if (stats.has_min_max_set() && self.can_truncate_value()) =>
{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about in write_column_metadata?

This covers the truncation in write_column_metadata(..)

@tustvold tustvold merged commit b8d3f33 into apache:master Dec 5, 2023
16 checks passed
@Jefffrey Jefffrey deleted the disable_f16_decimal_stats_truncation branch December 5, 2023 12:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet: don't truncate min/max statistics for float16 and decimal when writing file
2 participants