-
Notifications
You must be signed in to change notification settings - Fork 861
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Truncate Min/Max values in the Column Index #4389
Truncate Min/Max values in the Column Index #4389
Conversation
Co-authored-by: Will Jones <[email protected]>
@AdamGS I guess we can easily truncate min. For maximum, a |
I think this attempt should be more correct and compatible with parquet-mr. I also added a note in the PR description about the changed max values. |
parquet/src/column/writer/mod.rs
Outdated
/// Try and increment the bytes from right to left. | ||
fn increment(data: &mut [u8]) { | ||
for byte in data.iter_mut().rev() { | ||
if *byte == u8::MAX { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the sequence is 0xFF 0xFF 0xFF 0xFF
. I guess we cannot truncate it if that. (Parquet-mr handles this well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is still outstanding, right? Is the solution to just return None
?
That seems to be what parquet-mr does: https://github.com/apache/parquet-mr/blob/9d80330ae4948787ac0bf4e4b0d990917f106440/parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/BinaryTruncator.java#L133-L133
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adopted their (parquet-mr) general structure with functions returning None when they can't truncate/increment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really like where this is headed, thank you for driving this forward
@@ -868,13 +868,13 @@ impl ColumnIndexBuilder { | |||
pub fn append( | |||
&mut self, | |||
null_page: bool, | |||
min_value: &[u8], | |||
max_value: &[u8], | |||
min_value: Vec<u8>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This an API change, which is fine
parquet/src/column/writer/mod.rs
Outdated
} | ||
|
||
fn truncate_max_value(&self, data: &[u8]) -> Vec<u8> { | ||
// Even if the user disables value truncation, we want to make sure to increase the max value so the user doesn't miss it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should only increment if we are truncating, I think it would be a bit surprising for users to write b"hello"
and get back b"hellp"
. Whereas writing "really long string"
and getting back an obviously truncated string is perhaps more understandable "really long su"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, it should be handled correctly now
Co-authored-by: Raphael Taylor-Davies <[email protected]>
Co-authored-by: Raphael Taylor-Davies <[email protected]>
Co-authored-by: Raphael Taylor-Davies <[email protected]>
parquet/src/column/writer/mod.rs
Outdated
/// Try and increment the bytes from right to left. | ||
fn increment(data: &mut [u8]) { | ||
for byte in data.iter_mut().rev() { | ||
if *byte == u8::MAX { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is still outstanding, right? Is the solution to just return None
?
That seems to be what parquet-mr does: https://github.com/apache/parquet-mr/blob/9d80330ae4948787ac0bf4e4b0d990917f106440/parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/BinaryTruncator.java#L133-L133
Co-authored-by: Will Jones <[email protected]>
Improved error handling to something closer to what parquet-mr does, and also fixed some issues and hopefully improved naming a bit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. Thanks for handling those edge cases.
I'll wait to merge until Monday in case Raphael has any final comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good, left some minor comments
parquet/src/column/writer/mod.rs
Outdated
} | ||
} | ||
|
||
unreachable!() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unreachable!() | |
None |
I think this is reachable.
Consider data
containing a single character with a 3 byte encoding, and a length
of 1
. data.len() >= length
but the loop will fail to find a character
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about this today but was drawing a blank when trying to formulate at test case for this issue. I think I originally assumed that any UTF8 character has a valid sub-character, which is wrong. Also added a test case to validate that we're handling this case correctly.
parquet/src/column/writer/mod.rs
Outdated
// We return values like that at an earlier stage in the process. | ||
assert!(data.len() >= length); | ||
// If all bytes are already maximal, no need to truncate | ||
if data.iter().all(|b| *b == u8::MAX) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can truncate if they are all maximal, we just can't increment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can because &[0xFF]
< &[0xFF, 0xFF]
and we want to maintain this info for max values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that is only because the increment step can't proceed - which is what will then return None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tustvold good point about the increment failing, I'll change that. I'm keeping the truncate_binary
return type as an Option
, I think the composition it creates is very nice here and there is very little mental/compute overhead to handling it.
parquet/src/column/writer/mod.rs
Outdated
*byte = byte.checked_add(1).unwrap_or(0); | ||
|
||
if *byte != 0 { | ||
return Some(data); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*byte = byte.checked_add(1).unwrap_or(0); | |
if *byte != 0 { | |
return Some(data); | |
} | |
let (incremented, overflow) = byte.overflowing_add(1); | |
*byte = incremented; | |
if !overflow { | |
return Some(data); | |
} |
let original = data[idx]; | ||
let mut byte = data[idx].checked_add(1).unwrap_or(0); | ||
|
||
// Until overflow: 0xFF -> 0x00 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use overflow_add here, might be a touch clearer
/// If set to `None` - there's no effective limit. | ||
pub fn set_column_index_truncate_length(mut self, max_length: Option<usize>) -> Self { | ||
if let Some(value) = max_length { | ||
assert!(value > 0, "Cannot have a 0 column index truncate length. If you wish to disable min/max value truncation, set it to `None`."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
parquet/src/column/writer/mod.rs
Outdated
} | ||
|
||
fn truncate_min_value(&self, data: &[u8]) -> Vec<u8> { | ||
let effective_column_index_truncate_length = self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be written more concisely as
self.props
.column_index_truncate_length()
.filter(|l| data.len() > *l)
.and_then(|l| match str::from_utf8(data) {
Ok(str_data) => truncate_utf8(str_data, l),
Err(_) => truncate_binary(data, l),
})
.unwrap_or_else(|| data.to_vec())
The unwrap_or_else
also avoids allocating a vec if not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm all for the functional style, I just find that some people are less open to it and I didn't want to presume here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a fan of whichever makes the intent clearer and more concise, there are definitely cases (e.g. involving lifetimes or async) where the functional style becomes obtuse
parquet/src/column/writer/mod.rs
Outdated
// We return values like that at an earlier stage in the process. | ||
assert!(data.len() >= length); | ||
// If all bytes are already maximal, no need to truncate | ||
if data.iter().all(|b| *b == u8::MAX) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that is only because the increment step can't proceed - which is what will then return None
?
Thank you for this, very nice work 👍 |
@tustvold @mapleFU @wjones127 Thank you all so much for the patient and quick reviews! |
} | ||
} | ||
|
||
// update the offset index | ||
self.offset_index_builder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change breaks the offset index for entirely null pages - #4459
Which issue does this PR close?
Closes #4126 .
Rationale for this change
For use cases that store large binary or string values in Parquet files, page-level statistics might blow up as the min/max values are stored in them as part of the Column Index. This "feature" is part of the spec, and is implemented in other languages.
What changes are included in this PR?
This adds a new member to
WriterProperties
and its builder, and truncates min/max values at a specific length, while still allowing shorter ones or disabeling truncation all together (maintaing the current default behavior).Are there any user-facing changes?
There are two user-facing changes in this PR:
WriterPropertiesBuilder
to specify the max length of min/max values before they are truncated.