Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable truncation of binary statistics columns #5076

Merged
merged 7 commits into from
Nov 15, 2023

Conversation

emcake
Copy link
Contributor

@emcake emcake commented Nov 14, 2023

Which issue does this PR close?

Closes #5037.

Rationale for this change

Similar to parquet-mr (apache/parquet-java#696), we allow truncation of statistics for binary and fix-length binary columns.

What changes are included in this PR?

7b37fd4 introduces the min/max exactness parameters and parses them for various statistics, and ensures round-tripping.

e57634d creates a new writer property, and implements the truncation. It's tested for both strings and for decimals, and in the decimal case we ensure that re-constructed min and max decimals of the correct byte length will properly bound the true value.

Are there any user-facing changes?

Introduction of new functionality to set the truncation length, but no breaking changes.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Nov 14, 2023
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this, I left some comments that I think might reduce the diff resulting from this

Comment on lines 93 to 94
is_max_value_exact: bool,
is_min_value_exact: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing this function signature will result in some non-trivial code churn, what do you think of keeping this function as-is, defaulting the values to true and then adding two methods like

pub fn with_is_max_value_exact(self, exact: bool) -> Self {
     ...
}

pub fn with_is_min_value_exact(self, exact: bool) -> Self {
     ...
}

@@ -152,6 +158,12 @@ pub fn from_thrift(
stats.max_value
};

// Whether or not the min/max values are exact. Due to pre-existing truncation
// in other libraries such as parquet-mr, we can't assume that any given parquet file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does parquet-mr truncate non-binary columns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, parquet-mr only applies this to binary statistics: https://github.com/apache/parquet-mr/pull/696/files#diff-1afc9f89a782ddd4e7cd17546ca048954091627d7a31597ab88892eb2a7a76abR618

Pertaining to the conversation above as well - I could reduce churn by only allowing the setting of min/max exactness on the constructors for binary-like stats, by splitting the statistics_new_func macro into a statistics_new_func_always_exact and statistics_new_func_inexact that generates a binary_with_inexact method? Given that there's only one place in the code in column/mod.rs where we set these to something other than true, would reduce the churn significantly.

Copy link
Contributor

@tustvold tustvold Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, parquet-mr only applies this to binary statistics

In which case perhaps we can have slightly less pessimistic defaulting behaviour here?

by splitting the statistics_new_func macro into a statistics_new_func_always_exact and statistics_new_func_inexact

I think I would prefer to avoid this being a breaking change at all, the approach in #5076 (comment) would be consistent with other structures in this codebase and would be my preference unless there is a reason it is insufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will update to that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tustvold should be OK to re-review? I've also fixed the lints that were causing issues with the checks.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me thank you

@tustvold tustvold merged commit 7941577 into apache:master Nov 15, 2023
16 checks passed
@emcake emcake deleted the stats-truncation branch November 15, 2023 14:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Binary columns do not receive truncated statistics
2 participants