Enable truncation of binary statistics columns #5076
Thank you for this. I left some comments that I think might reduce the diff resulting from this change.
parquet/src/file/statistics.rs
is_max_value_exact: bool,
is_min_value_exact: bool,
Changing this function signature will result in some non-trivial code churn. What do you think of keeping this function as-is, defaulting the values to true, and then adding two methods like:
pub fn with_is_max_value_exact(self, exact: bool) -> Self {
...
}
pub fn with_is_min_value_exact(self, exact: bool) -> Self {
...
}
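A minimal sketch of how that suggestion could look, using a hypothetical `ValueStatistics` struct as a stand-in (the name, fields, and byte-vector value type are assumptions for illustration, not the crate's actual types). The constructor keeps its original shape and defaults both flags to true, so existing call sites would not need to change:

```rust
/// Hypothetical stand-in for a typed statistics value; not the crate's real type.
#[derive(Debug, Clone, PartialEq)]
pub struct ValueStatistics {
    pub min: Option<Vec<u8>>,
    pub max: Option<Vec<u8>>,
    is_max_value_exact: bool,
    is_min_value_exact: bool,
}

impl ValueStatistics {
    /// The constructor signature is unchanged; exactness defaults to `true`.
    pub fn new(min: Option<Vec<u8>>, max: Option<Vec<u8>>) -> Self {
        Self {
            min,
            max,
            is_max_value_exact: true,
            is_min_value_exact: true,
        }
    }

    /// Builder-style setter: consumes `self` and returns the updated value.
    pub fn with_is_max_value_exact(self, exact: bool) -> Self {
        Self { is_max_value_exact: exact, ..self }
    }

    /// Builder-style setter: consumes `self` and returns the updated value.
    pub fn with_is_min_value_exact(self, exact: bool) -> Self {
        Self { is_min_value_exact: exact, ..self }
    }

    pub fn is_max_value_exact(&self) -> bool {
        self.is_max_value_exact
    }

    pub fn is_min_value_exact(&self) -> bool {
        self.is_min_value_exact
    }
}
```

Only the one call site that truncates would then need to chain `.with_is_max_value_exact(false)` or `.with_is_min_value_exact(false)`.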
parquet/src/file/statistics.rs
@@ -152,6 +158,12 @@ pub fn from_thrift(
stats.max_value
};

// Whether or not the min/max values are exact. Due to pre-existing truncation
// in other libraries such as parquet-mr, we can't assume that any given parquet file
Does parquet-mr truncate non-binary columns?
No, parquet-mr only applies this to binary statistics: https://github.com/apache/parquet-mr/pull/696/files#diff-1afc9f89a782ddd4e7cd17546ca048954091627d7a31597ab88892eb2a7a76abR618
Pertaining to the conversation above as well: I could reduce churn by only allowing min/max exactness to be set on the constructors for binary-like stats, by splitting the statistics_new_func macro into a statistics_new_func_always_exact and a statistics_new_func_inexact that generates a binary_with_inexact method. Given that there's only one place in the code, in column/mod.rs, where we set these to something other than true, this would reduce the churn significantly.
> No, parquet-mr only applies this to binary statistics
In which case, perhaps we can have slightly less pessimistic defaulting behaviour here?
> by splitting the statistics_new_func macro into a statistics_new_func_always_exact and statistics_new_func_inexact
I think I would prefer to avoid this being a breaking change at all. The approach in #5076 (comment) would be consistent with other structures in this codebase and would be my preference, unless there is a reason it is insufficient.
OK, will update to that.
@tustvold should be OK to re-review? I've also fixed the lints that were causing issues with the checks.
This looks good to me, thank you.
Which issue does this PR close?
Closes #5037.
Rationale for this change
Similar to parquet-mr (apache/parquet-java#696), we allow truncation of statistics for binary and fixed-length binary columns.
What changes are included in this PR?
7b37fd4 introduces the min/max exactness parameters and parses them for various statistics, and ensures round-tripping.
e57634d creates a new writer property, and implements the truncation. It's tested for both strings and for decimals, and in the decimal case we ensure that re-constructed min and max decimals of the correct byte length will properly bound the true value.
Are there any user-facing changes?
Introduction of new functionality to set the truncation length, but no breaking changes.
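To illustrate the truncation described under "What changes are included in this PR?", here is a rough sketch of how a binary min/max pair could be shortened while still bounding the true value. The function names `truncate_min` and `truncate_max` are hypothetical, and this is not the PR's actual implementation. It assumes values are compared as unsigned bytes in lexicographic order, and it does not cover the decimal case mentioned above.

```rust
/// Truncate a minimum to at most `len` bytes.
/// Any prefix of a value sorts at or below the value itself, so the
/// result is still a valid lower bound.
pub fn truncate_min(value: &[u8], len: usize) -> Vec<u8> {
    value[..len.min(value.len())].to_vec()
}

/// Truncate a maximum to at most `len` bytes.
/// A bare prefix would sort below the original value, so the last byte
/// that can be incremented is bumped to keep the result an upper bound.
/// Returns `None` when no shorter upper bound exists (every kept byte is 0xFF).
pub fn truncate_max(value: &[u8], len: usize) -> Option<Vec<u8>> {
    if value.len() <= len {
        // Nothing to truncate; the value is already exact.
        return Some(value.to_vec());
    }
    let mut prefix = value[..len].to_vec();
    // Trailing 0xFF bytes cannot be incremented, so drop them first.
    while prefix.last() == Some(&0xFF) {
        prefix.pop();
    }
    // If the prefix is now empty, there is nothing left to increment.
    let last = prefix.last_mut()?;
    *last += 1;
    Some(prefix)
}
```

Whenever either function returns something other than the original bytes, the corresponding exactness flag would be set to false, which is what the min/max exactness parameters record.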