Truncate Min/Max values in the Column Index #4389

AdamGS · 2023-06-08T20:43:56Z

Which issue does this PR close?

Closes #4126 .

Rationale for this change

For use cases that store large binary or string values in Parquet files, page-level statistics might blow up as the min/max values are stored in them as part of the Column Index. This "feature" is part of the spec, and is implemented in other languages.

What changes are included in this PR?

This adds a new member to WriterProperties and its builder, and truncates min/max values at a specific length, while still allowing shorter ones or disabling truncation altogether (maintaining the current default behavior).

Are there any user-facing changes?

There are three user-facing changes in this PR:

It adds a new option to WriterPropertiesBuilder to specify the max length of min/max values before they are truncated.
It changes the default behavior to truncating at 128 bytes. That might cause some performance changes for users that have long binary arrays with shared prefixes.
Max values will now be "incremented", like in this example from the spec.
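
To make the effect on statistics concrete, here is a minimal sketch of what truncating a min value and truncating plus incrementing a max value looks like for plain bytes. The helper names `truncate_min` and `truncate_max` are hypothetical and are not the functions in this PR; carry handling for 0xFF bytes is deliberately left out, so in that case the sketch falls back to the untruncated value.

```rust
/// Hypothetical helper: a prefix never sorts after the full value,
/// so a shortened min value is still a valid lower bound.
fn truncate_min(value: &[u8], len: usize) -> &[u8] {
    &value[..value.len().min(len)]
}

/// Hypothetical helper: a bare prefix would sort *before* the full value,
/// so the last kept byte is bumped by one to keep a valid upper bound.
fn truncate_max(value: &[u8], len: usize) -> Vec<u8> {
    if value.len() <= len {
        // Nothing is cut off, so the exact value is kept unchanged.
        return value.to_vec();
    }
    match value[..len].split_last() {
        Some((&last, head)) if last < u8::MAX => {
            let mut out = head.to_vec();
            out.push(last + 1);
            out
        }
        // Empty prefix or a trailing 0xFF: this sketch does not carry,
        // it simply keeps the full value.
        _ => value.to_vec(),
    }
}
```

With a limit of 14 bytes, `b"really long string"` would get the min `b"really long st"` and the max `b"really long su"`, while a short value such as `b"hello"` is stored as-is.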

mapleFU · 2023-06-09T06:33:16Z

@AdamGS I guess we can easily truncate min. For max, an increment and incrementUtf8 would help. It may need to handle a cascading carry byte.

AdamGS · 2023-06-09T09:42:03Z

I think this attempt should be more correct and compatible with parquet-mr. I also added a note in the PR description about the changed max values.

mapleFU · 2023-06-09T14:08:39Z

parquet/src/column/writer/mod.rs

+/// Try and increment the bytes from right to left.
+fn increment(data: &mut [u8]) {
+    for byte in data.iter_mut().rev() {
+        if *byte == u8::MAX {


What if the sequence is 0xFF 0xFF 0xFF 0xFF? I guess we cannot truncate it in that case. (Parquet-mr handles this well.)

I think this is still outstanding, right? Is the solution to just return None?
That seems to be what parquet-mr does: https://github.com/apache/parquet-mr/blob/9d80330ae4948787ac0bf4e4b0d990917f106440/parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/BinaryTruncator.java#L133-L133

Adopted their (parquet-mr) general structure with functions returning None when they can't truncate/increment.
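
The behavior discussed in this thread, incrementing from right to left with a carry and giving up when every byte is 0xFF, can be sketched as follows. This is an illustration under the assumption that the function takes ownership of the bytes and returns an `Option`; the exact signature in the PR may differ.

```rust
/// Increment a byte string as if it were a big-endian number.
/// Returns `None` when every byte is 0xFF (or the input is empty),
/// because no same-length value is larger.
fn increment(mut data: Vec<u8>) -> Option<Vec<u8>> {
    for i in (0..data.len()).rev() {
        let (incremented, overflow) = data[i].overflowing_add(1);
        data[i] = incremented;
        if !overflow {
            // No carry needed: the result is strictly greater than the input.
            return Some(data);
        }
        // 0xFF wrapped to 0x00: carry into the next byte to the left.
    }
    None
}
```

Returning `None` lets the caller decide what to do when no incremented value exists, for example by keeping the untruncated max.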

tustvold

Really like where this is headed, thank you for driving this forward

tustvold · 2023-06-09T13:41:44Z

parquet/src/file/metadata.rs

@@ -868,13 +868,13 @@ impl ColumnIndexBuilder {
    pub fn append(
        &mut self,
        null_page: bool,
-        min_value: &[u8],
-        max_value: &[u8],
+        min_value: Vec<u8>,


This is an API change, which is fine.

tustvold · 2023-06-09T13:48:41Z

parquet/src/column/writer/mod.rs

+    }
+
+    fn truncate_max_value(&self, data: &[u8]) -> Vec<u8> {
+        // Even if the user disables value truncation, we want to make sure to increase the max value so the user doesn't miss it.


I think we should only increment if we are truncating. It would be a bit surprising for users to write b"hello" and get back b"hellp", whereas writing "really long string" and getting back an obviously truncated string such as "really long su" is perhaps more understandable.

Agreed, it should be handled correctly now

wjones127 · 2023-06-09T23:20:18Z

parquet/src/column/writer/mod.rs

+/// Try and increment the bytes from right to left.
+fn increment(data: &mut [u8]) {
+    for byte in data.iter_mut().rev() {
+        if *byte == u8::MAX {


I think this is still outstanding, right? Is the solution to just return None?
That seems to be what parquet-mr does: https://github.com/apache/parquet-mr/blob/9d80330ae4948787ac0bf4e4b0d990917f106440/parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/BinaryTruncator.java#L133-L133

AdamGS · 2023-06-10T13:06:51Z

Improved error handling to something closer to what parquet-mr does, and also fixed some issues and hopefully improved naming a bit.

wjones127

This looks good to me. Thanks for handling those edge cases.

I'll wait to merge until Monday in case Raphael has any final comments.

tustvold

I think this looks good, left some minor comments

tustvold · 2023-06-11T11:37:08Z

parquet/src/column/writer/mod.rs

+        }
+    }
+
+    unreachable!()


Suggested change

-    unreachable!()
+    None

I think this is reachable.

Consider data containing a single character with a 3 byte encoding, and a length of 1. data.len() >= length but the loop will fail to find a character

I was thinking about this today but was drawing a blank when trying to formulate a test case for this issue. I think I originally assumed that any UTF-8 character has a valid sub-character, which is wrong. I also added a test case to validate that we're handling this case correctly.
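
The edge case above can be reproduced with a small sketch. `truncate_utf8` here is a hypothetical stand-in written for illustration, not the PR's implementation: it walks back from the requested length to the nearest character boundary and returns `None` when no non-empty prefix fits.

```rust
/// Hypothetical sketch: cut `data` to at most `length` bytes without
/// splitting a multi-byte character. Returns `None` if even the first
/// character does not fit.
fn truncate_utf8(data: &str, length: usize) -> Option<Vec<u8>> {
    let upper = length.min(data.len());
    // Search downwards for a byte offset that sits between two characters.
    let split = (1..=upper).rev().find(|&i| data.is_char_boundary(i))?;
    Some(data.as_bytes()[..split].to_vec())
}
```

For a single three-byte character such as "€" and a length of 1, no boundary exists inside the limit, so the function returns `None` rather than hitting an unreachable branch.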

tustvold · 2023-06-11T11:38:26Z

parquet/src/column/writer/mod.rs

+    // We return values like that at an earlier stage in the process.
+    assert!(data.len() >= length);
+    // If all bytes are already maximal, no need to truncate
+    if data.iter().all(|b| *b == u8::MAX) {


We can truncate if they are all maximal, we just can't increment?

I don't think we can because &[0xFF] < &[0xFF, 0xFF] and we want to maintain this info for max values.

But that is only because the increment step can't proceed - which is what will then return None?

@tustvold good point about the increment failing, I'll change that. I'm keeping the truncate_binary return type as an Option, I think the composition it creates is very nice here and there is very little mental/compute overhead to handling it.
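
The composition described in this thread can be sketched as below, assuming a `truncate_binary` that only cuts and an increment step that is the sole place where an all-0xFF prefix fails. All three function names are illustrative; the fallback to the full value is an assumption about how the caller handles `None`.

```rust
/// Cut to `length` bytes; `None` only if `data` is shorter than `length`.
fn truncate_binary(data: &[u8], length: usize) -> Option<Vec<u8>> {
    data.get(..length).map(|prefix| prefix.to_vec())
}

/// Increment with carry; `None` if every byte is 0xFF or the input is empty.
fn increment_bytes(mut data: Vec<u8>) -> Option<Vec<u8>> {
    for i in (0..data.len()).rev() {
        let (incremented, overflow) = data[i].overflowing_add(1);
        data[i] = incremented;
        if !overflow {
            return Some(data);
        }
    }
    None
}

/// Truncate, then increment; keep the full value if either step fails.
fn truncated_max(data: &[u8], length: usize) -> Vec<u8> {
    truncate_binary(data, length)
        .and_then(increment_bytes)
        .unwrap_or_else(|| data.to_vec())
}
```

Because byte slices compare lexicographically, `[0xFF]` sorts before `[0xFF, 0xFF]`, so a bare all-0xFF prefix would be an invalid upper bound. In this sketch the increment step rejects it and the untruncated value is kept.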

tustvold · 2023-06-11T11:41:20Z

parquet/src/column/writer/mod.rs

+        *byte = byte.checked_add(1).unwrap_or(0);
+
+        if *byte != 0 {
+            return Some(data);
+        }


Suggested change

-        *byte = byte.checked_add(1).unwrap_or(0);
-
-        if *byte != 0 {
-            return Some(data);
-        }
+        let (incremented, overflow) = byte.overflowing_add(1);
+        *byte = incremented;
+
+        if !overflow {
+            return Some(data);
+        }

tustvold · 2023-06-11T11:44:13Z

parquet/src/column/writer/mod.rs

+        let original = data[idx];
+        let mut byte = data[idx].checked_add(1).unwrap_or(0);
+
+        // Until overflow: 0xFF -> 0x00


You could use overflowing_add here, which might be a touch clearer.
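
As a quick check that the two spellings agree, the following sketch compares the `checked_add(1).unwrap_or(0)` form with `overflowing_add(1)` for a single byte. The helper names are hypothetical and exist only for this comparison.

```rust
/// The original form: wrap to 0 on overflow, then infer overflow from the 0.
fn bump_checked(byte: u8) -> (u8, bool) {
    let next = byte.checked_add(1).unwrap_or(0);
    (next, next == 0)
}

/// The suggested form: the standard library reports the wrap explicitly.
fn bump_overflowing(byte: u8) -> (u8, bool) {
    byte.overflowing_add(1)
}
```

Both return the wrapped value together with an overflow flag, but the second form states the intent directly instead of relying on the result being zero.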

tustvold · 2023-06-11T11:45:58Z

parquet/src/file/properties.rs

+    /// If set to `None` - there's no effective limit.
+    pub fn set_column_index_truncate_length(mut self, max_length: Option<usize>) -> Self {
+        if let Some(value) = max_length {
+            assert!(value > 0, "Cannot have a 0 column index truncate length. If you wish to disable min/max value truncation, set it to `None`.");
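
The validation in this setter can be illustrated with a self-contained stand-in. `PropsBuilder` below is a hypothetical mock, not the real `WriterPropertiesBuilder`, and its default of `None` is arbitrary; it only mirrors the rule that `Some(0)` is rejected while `None` disables truncation.

```rust
/// Hypothetical stand-in for a properties builder with one option.
#[derive(Debug, Default, PartialEq)]
struct PropsBuilder {
    column_index_truncate_length: Option<usize>,
}

impl PropsBuilder {
    /// `None` means no limit; `Some(0)` is rejected because a zero-length
    /// min/max value carries no information.
    fn set_column_index_truncate_length(mut self, max_length: Option<usize>) -> Self {
        if let Some(value) = max_length {
            assert!(value > 0, "use `None` to disable truncation");
        }
        self.column_index_truncate_length = max_length;
        self
    }
}
```

Using `Option<usize>` rather than a sentinel such as 0 keeps "disabled" and "invalid" as separate states.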


tustvold · 2023-06-11T12:02:00Z

parquet/src/column/writer/mod.rs

+    }
+
+    fn truncate_min_value(&self, data: &[u8]) -> Vec<u8> {
+        let effective_column_index_truncate_length = self


This can be written more concisely as

self.props
    .column_index_truncate_length()
    .filter(|l| data.len() > *l)
    .and_then(|l| match str::from_utf8(data) {
        Ok(str_data) => truncate_utf8(str_data, l),
        Err(_) => truncate_binary(data, l),
    })
    .unwrap_or_else(|| data.to_vec())

The unwrap_or_else also avoids allocating a vec if not needed

I'm all for the functional style, I just find that some people are less open to it and I didn't want to presume here.

I'm a fan of whichever makes the intent clearer and more concise, there are definitely cases (e.g. involving lifetimes or async) where the functional style becomes obtuse
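
For readers who want to run the combinator chain from this thread outside the writer, here is a self-contained sketch. The free function `truncate_min_value` takes the limit as a parameter instead of reading it from `self.props`, and the two helpers are simplified assumptions rather than the PR's code.

```rust
/// Simplified helper: cut at the nearest character boundary at or below `length`.
fn truncate_utf8(data: &str, length: usize) -> Option<Vec<u8>> {
    let upper = length.min(data.len());
    let split = (1..=upper).rev().find(|&i| data.is_char_boundary(i))?;
    Some(data.as_bytes()[..split].to_vec())
}

/// Simplified helper: cut raw bytes at `length`.
fn truncate_binary(data: &[u8], length: usize) -> Option<Vec<u8>> {
    data.get(..length).map(|prefix| prefix.to_vec())
}

/// The chain from the review: skip values that already fit, pick the
/// UTF-8 or binary path, and fall back to a copy of the input.
fn truncate_min_value(limit: Option<usize>, data: &[u8]) -> Vec<u8> {
    limit
        .filter(|l| data.len() > *l)
        .and_then(|l| match std::str::from_utf8(data) {
            Ok(str_data) => truncate_utf8(str_data, l),
            Err(_) => truncate_binary(data, l),
        })
        .unwrap_or_else(|| data.to_vec())
}
```

Each stage can bail out with `None`, and the single `unwrap_or_else` at the end covers all of those cases: no limit set, value already short enough, or no valid truncation point.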

tustvold · 2023-06-11T12:16:46Z

parquet/src/column/writer/mod.rs

+    // We return values like that at an earlier stage in the process.
+    assert!(data.len() >= length);
+    // If all bytes are already maximal, no need to truncate
+    if data.iter().all(|b| *b == u8::MAX) {


But that is only because the increment step can't proceed - which is what will then return None?

tustvold · 2023-06-11T12:34:34Z

Thank you for this, very nice work 👍

AdamGS · 2023-06-11T12:37:04Z

@tustvold @mapleFU @wjones127 Thank you all so much for the patient and quick reviews!

tustvold · 2023-06-28T17:30:52Z

parquet/src/column/writer/mod.rs

                }
            }
+
+            // update the offset index
+            self.offset_index_builder


This change breaks the offset index for entirely null pages - #4459

Initial work

ef1b6b2

github-actions bot added the parquet Changes to the parquet crate label Jun 8, 2023

AdamGS changed the title ~~Initial work~~ Truncate Min/Max values in the Colum Index Jun 8, 2023

Slight rename

00bcfb5

AdamGS marked this pull request as ready for review June 8, 2023 20:53

wjones127 requested changes Jun 9, 2023

Update parquet/src/column/writer/mod.rs

314c2ac

Co-authored-by: Will Jones <[email protected]>

Handle utf8 vs binary truncation, include increases

23d16ee

Small cleanup

a089918

tustvold added the api-change Changes to the arrow API label Jun 9, 2023

AdamGS requested review from wjones127 and mapleFU June 9, 2023 13:53

mapleFU reviewed Jun 9, 2023

tustvold reviewed Jun 9, 2023

AdamGS and others added 6 commits June 9, 2023 17:30

Update parquet/src/file/properties.rs

1c68fb5

Co-authored-by: Raphael Taylor-Davies <[email protected]>

Update parquet/src/file/properties.rs

6a56a51

Co-authored-by: Raphael Taylor-Davies <[email protected]>

Update parquet/src/column/writer/mod.rs

0c0520a

Co-authored-by: Raphael Taylor-Davies <[email protected]>

Merge branch 'master' into truncate-column-index-byte-array-statistics

f07cbca

Review notes

6abc63c

Added handeling for some more cases - including not truncating non-Bi…

84cbb32

…naryArray data

wjones127 reviewed Jun 9, 2023

AdamGS and others added 2 commits June 10, 2023 11:14

Update parquet/src/column/writer/mod.rs

59020a8

Co-authored-by: Will Jones <[email protected]>

Handels increment better and some refactoring

93c08f7

wjones127 approved these changes Jun 11, 2023

AdamGS changed the title ~~Truncate Min/Max values in the Colum Index~~ Truncate Min/Max values in the Column Index Jun 11, 2023

Nicer handeling of physical type

13431b4

tustvold approved these changes Jun 11, 2023

tustvold reviewed Jun 11, 2023

tustvold reviewed Jun 11, 2023

More review notes

430c612

tustvold approved these changes Jun 11, 2023

tustvold merged commit 2462d36 into apache:master Jun 11, 2023

AdamGS deleted the truncate-column-index-byte-array-statistics branch June 12, 2023 07:52

tustvold mentioned this pull request Jun 12, 2023

Faster UTF-8 truncation #4399

Merged

alamb mentioned this pull request Jun 16, 2023

Truncate ColumnIndex ByteArray Statistics #4126

Closed

mapleFU mentioned this pull request Jun 17, 2023

[C++][Parquet] Allow Truncate min-max Statistics apache/arrow#36139

Open

tustvold mentioned this pull request Jun 28, 2023

Regression in in parquet 42.0.0 : Bad parquet column indexes for All Null Columns, resulting in Parquet error: StructArrayReader out of sync on read #4459

Closed

tustvold reviewed Jun 28, 2023

alamb mentioned this pull request Jun 28, 2023

Fix empty offset index for all null columns (#4459) #4460

Merged

wjones127 mentioned this pull request Nov 4, 2023

feat: add configuration key to allow filtering stats written by column delta-io/delta-rs#1792

Closed

This was referenced Nov 4, 2023

Delta Stats for binary columns are not truncated delta-io/delta-rs#1805

Open

Binary columns do not receive truncated statistics #5037

Closed
