Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Truncate Min/Max values in the Column Index #4389

Merged
merged 15 commits into from
Jun 11, 2023
Merged

Truncate Min/Max values in the Column Index #4389

merged 15 commits into from
Jun 11, 2023

Conversation

AdamGS
Copy link
Contributor

@AdamGS AdamGS commented Jun 8, 2023

Which issue does this PR close?

Closes #4126 .

Rationale for this change

For use cases that store large binary or string values in Parquet files, page-level statistics might blow up as the min/max values are stored in them as part of the Column Index. This "feature" is part of the spec, and is implemented in other languages.

What changes are included in this PR?

This adds a new member to WriterProperties and its builder, and truncates min/max values at a specific length, while still allowing shorter ones or disabeling truncation all together (maintaing the current default behavior).

Are there any user-facing changes?

There are two user-facing changes in this PR:

  1. It adds a new option to WriterPropertiesBuilderto specify the max length of min/max values before they are truncated.
  2. It will change the default behavior to truncating at the 128 byte line. That might cause some performance changes to users that have long binary arrays with shared prefixes.
  3. Max values will now be "Incremented", like in this example from the spec.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 8, 2023
@AdamGS AdamGS changed the title Initial work Truncate Min/Max values in the Colum Index Jun 8, 2023
@AdamGS AdamGS marked this pull request as ready for review June 8, 2023 20:53
parquet/src/column/writer/mod.rs Outdated Show resolved Hide resolved
parquet/src/column/writer/mod.rs Outdated Show resolved Hide resolved
@mapleFU
Copy link
Member

mapleFU commented Jun 9, 2023

@AdamGS I guess we can easily truncate min. For maximum, a increment and incrementUtf8 would be help. Maybe it need to handle cascading carry byte

@AdamGS
Copy link
Contributor Author

AdamGS commented Jun 9, 2023

I think this attempt should be more correct and compatible with parquet-mr. I also added a note in the PR description about the changed max values.

@tustvold tustvold added the api-change Changes to the arrow API label Jun 9, 2023
@AdamGS AdamGS requested review from wjones127 and mapleFU June 9, 2023 13:53
/// Try and increment the bytes from right to left.
fn increment(data: &mut [u8]) {
for byte in data.iter_mut().rev() {
if *byte == u8::MAX {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the sequence is 0xFF 0xFF 0xFF 0xFF. I guess we cannot truncate it if that. (Parquet-mr handles this well)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@AdamGS AdamGS Jun 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adopted their (parquet-mr) general structure with functions returning None when they can't truncate/increment.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really like where this is headed, thank you for driving this forward

@@ -868,13 +868,13 @@ impl ColumnIndexBuilder {
pub fn append(
&mut self,
null_page: bool,
min_value: &[u8],
max_value: &[u8],
min_value: Vec<u8>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This an API change, which is fine

parquet/src/file/properties.rs Outdated Show resolved Hide resolved
parquet/src/file/properties.rs Outdated Show resolved Hide resolved
}

fn truncate_max_value(&self, data: &[u8]) -> Vec<u8> {
// Even if the user disables value truncation, we want to make sure to increase the max value so the user doesn't miss it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should only increment if we are truncating, I think it would be a bit surprising for users to write b"hello" and get back b"hellp". Whereas writing "really long string" and getting back an obviously truncated string is perhaps more understandable "really long su"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it should be handled correctly now

parquet/src/column/writer/mod.rs Outdated Show resolved Hide resolved
parquet/src/column/writer/mod.rs Outdated Show resolved Hide resolved
parquet/src/column/writer/mod.rs Show resolved Hide resolved
parquet/src/column/writer/mod.rs Outdated Show resolved Hide resolved
parquet/src/column/writer/mod.rs Outdated Show resolved Hide resolved
parquet/src/column/writer/mod.rs Outdated Show resolved Hide resolved
/// Try and increment the bytes from right to left.
fn increment(data: &mut [u8]) {
for byte in data.iter_mut().rev() {
if *byte == u8::MAX {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parquet/src/column/writer/mod.rs Show resolved Hide resolved
parquet/src/column/writer/mod.rs Outdated Show resolved Hide resolved
parquet/src/column/writer/mod.rs Outdated Show resolved Hide resolved
@AdamGS
Copy link
Contributor Author

AdamGS commented Jun 10, 2023

Improved error handling to something closer to what parquet-mr does, and also fixed some issues and hopefully improved naming a bit.

Copy link
Member

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Thanks for handling those edge cases.

I'll wait to merge until Monday in case Raphael has any final comments.

@AdamGS AdamGS changed the title Truncate Min/Max values in the Colum Index Truncate Min/Max values in the Column Index Jun 11, 2023
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good, left some minor comments

}
}

unreachable!()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
unreachable!()
None

I think this is reachable.

Consider data containing a single character with a 3 byte encoding, and a length of 1. data.len() >= length but the loop will fail to find a character

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this today but was drawing a blank when trying to formulate at test case for this issue. I think I originally assumed that any UTF8 character has a valid sub-character, which is wrong. Also added a test case to validate that we're handling this case correctly.

// We return values like that at an earlier stage in the process.
assert!(data.len() >= length);
// If all bytes are already maximal, no need to truncate
if data.iter().all(|b| *b == u8::MAX) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can truncate if they are all maximal, we just can't increment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can because &[0xFF] < &[0xFF, 0xFF] and we want to maintain this info for max values.

Copy link
Contributor

@tustvold tustvold Jun 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that is only because the increment step can't proceed - which is what will then return None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tustvold good point about the increment failing, I'll change that. I'm keeping the truncate_binary return type as an Option, I think the composition it creates is very nice here and there is very little mental/compute overhead to handling it.

Comment on lines 1250 to 1254
*byte = byte.checked_add(1).unwrap_or(0);

if *byte != 0 {
return Some(data);
}
Copy link
Contributor

@tustvold tustvold Jun 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
*byte = byte.checked_add(1).unwrap_or(0);
if *byte != 0 {
return Some(data);
}
let (incremented, overflow) = byte.overflowing_add(1);
*byte = incremented;
if !overflow {
return Some(data);
}

let original = data[idx];
let mut byte = data[idx].checked_add(1).unwrap_or(0);

// Until overflow: 0xFF -> 0x00
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use overflow_add here, might be a touch clearer

/// If set to `None` - there's no effective limit.
pub fn set_column_index_truncate_length(mut self, max_length: Option<usize>) -> Self {
if let Some(value) = max_length {
assert!(value > 0, "Cannot have a 0 column index truncate length. If you wish to disable min/max value truncation, set it to `None`.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

fn truncate_min_value(&self, data: &[u8]) -> Vec<u8> {
let effective_column_index_truncate_length = self
Copy link
Contributor

@tustvold tustvold Jun 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be written more concisely as

self.props
    .column_index_truncate_length()
    .filter(|l| data.len() > *l)
    .and_then(|l| match str::from_utf8(data) {
        Ok(str_data) => truncate_utf8(str_data, l),
        Err(_) => truncate_binary(data, l),
    })
    .unwrap_or_else(|| data.to_vec())

The unwrap_or_else also avoids allocating a vec if not needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm all for the functional style, I just find that some people are less open to it and I didn't want to presume here.

Copy link
Contributor

@tustvold tustvold Jun 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a fan of whichever makes the intent clearer and more concise, there are definitely cases (e.g. involving lifetimes or async) where the functional style becomes obtuse

parquet/src/column/writer/mod.rs Outdated Show resolved Hide resolved
parquet/src/column/writer/mod.rs Outdated Show resolved Hide resolved
// We return values like that at an earlier stage in the process.
assert!(data.len() >= length);
// If all bytes are already maximal, no need to truncate
if data.iter().all(|b| *b == u8::MAX) {
Copy link
Contributor

@tustvold tustvold Jun 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that is only because the increment step can't proceed - which is what will then return None?

@tustvold
Copy link
Contributor

Thank you for this, very nice work 👍

@AdamGS
Copy link
Contributor Author

AdamGS commented Jun 11, 2023

@tustvold @mapleFU @wjones127 Thank you all so much for the patient and quick reviews!

}
}

// update the offset index
self.offset_index_builder
Copy link
Contributor

@tustvold tustvold Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change breaks the offset index for entirely null pages - #4459

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Truncate ColumnIndex ByteArray Statistics
4 participants