-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use native types in PageIndex (#3575) #3578
Conversation
@@ -132,7 +132,7 @@ fn compute_row_counts(offset_index: &[PageLocation], rows: i64) -> Vec<i64> { | |||
} | |||
|
|||
/// Prints index information for a single column chunk | |||
fn print_index<T: std::fmt::Debug>( | |||
fn print_index<T: std::fmt::Display>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relates to #3573 FYI @bmmeijers
The indexes now all implement Display 🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to format the types for bytearray/fixedlenbytearray with hex layout (using :x?
) by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be willing to review a PR that altered the display implementation
impl FromBytes for Int96 { | ||
type Buffer = [u8; 12]; | ||
fn from_le_bytes(bs: Self::Buffer) -> Self { | ||
let mut i = Int96::new(); | ||
i.set_data( | ||
from_le_slice(&bs[0..4]), | ||
from_le_slice(&bs[4..8]), | ||
from_le_slice(&bs[8..12]), | ||
); | ||
i | ||
} | ||
fn from_be_bytes(_bs: Self::Buffer) -> Self { | ||
unimplemented!() | ||
} | ||
fn from_ne_bytes(bs: Self::Buffer) -> Self { | ||
let mut i = Int96::new(); | ||
i.set_data( | ||
from_ne_slice(&bs[0..4]), | ||
from_ne_slice(&bs[4..8]), | ||
from_ne_slice(&bs[8..12]), | ||
); | ||
i | ||
} | ||
} | ||
|
||
// FIXME Needed to satisfy the constraint of many decoding functions but ByteArray does not | ||
// appear to actual be converted directly from bytes | ||
impl FromBytes for ByteArray { | ||
type Buffer = Vec<u8>; | ||
fn from_le_bytes(bs: Self::Buffer) -> Self { | ||
ByteArray::from(bs) | ||
} | ||
fn from_be_bytes(_bs: Self::Buffer) -> Self { | ||
unreachable!() | ||
} | ||
fn from_ne_bytes(bs: Self::Buffer) -> Self { | ||
ByteArray::from(bs) | ||
} | ||
} | ||
|
||
impl FromBytes for FixedLenByteArray { | ||
type Buffer = Vec<u8>; | ||
|
||
fn from_le_bytes(bs: Self::Buffer) -> Self { | ||
Self(ByteArray::from(bs)) | ||
} | ||
fn from_be_bytes(_bs: Self::Buffer) -> Self { | ||
unreachable!() | ||
} | ||
fn from_ne_bytes(bs: Self::Buffer) -> Self { | ||
Self(ByteArray::from(bs)) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These have been moved to live alongside the other FromBytes implementations
@@ -1575,20 +1576,6 @@ mod tests { | |||
}); | |||
} | |||
|
|||
fn check_bytes_page_index( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We no longer need to treat different index types differently 🎉
let mut b = T::Buffer::default(); | ||
{ | ||
let b = b.as_mut(); | ||
let bs = &bs[..b.len()]; | ||
b.copy_from_slice(bs); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic was problematic as for ByteArray
and FixedLenByteArray
use Vec
as Self::Buffer
, this would then have a length of 0
and so wouldn't copy anything across...
fn from_be_bytes(bs: Self::Buffer) -> Self { | ||
<$ty>::from_be_bytes(bs) | ||
} | ||
fn from_ne_bytes(bs: Self::Buffer) -> Self { | ||
<$ty>::from_ne_bytes(bs) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Data in parquet is always stored as little endian - https://github.com/apache/parquet-format/blob/master/Encodings.md#plain-plain--0
The use of ne_bytes was actually incorrect
@@ -181,11 +181,11 @@ pub fn from_thrift( | |||
// min/max statistics for INT96 columns. | |||
let min = min.map(|data| { | |||
assert_eq!(data.len(), 12); | |||
from_ne_slice::<Int96>(&data) | |||
from_le_slice::<Int96>(&data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The data is little endian, not native endian - https://github.com/apache/parquet-format/blob/master/Encodings.md#plain-plain--0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For native types, this outputs the data as little endian. Floating point types are encoded in IEEE.
👍
} | ||
T::from_le_bytes(b) | ||
} | ||
|
||
pub trait FromBytes: Sized { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this trait is not part of the public API
BOOLEAN(NativeIndex<bool>), | ||
INT32(NativeIndex<i32>), | ||
INT64(NativeIndex<i64>), | ||
INT96(NativeIndex<Int96>), | ||
FLOAT(NativeIndex<f32>), | ||
DOUBLE(NativeIndex<f64>), | ||
BYTE_ARRAY(ByteArrayIndex), | ||
FIXED_LEN_BYTE_ARRAY(ByteArrayIndex), | ||
BYTE_ARRAY(NativeIndex<ByteArray>), | ||
FIXED_LEN_BYTE_ARRAY(NativeIndex<ByteArray>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the breaking change
let bs = &bs[..b.len()]; | ||
b.copy_from_slice(bs); | ||
fn array_from_slice<const N: usize>(bs: &[u8]) -> Result<[u8; N]> { | ||
// Need to slice as may be called with zero-padded values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've confirmed this optimised as you would hope
cc @Ted-Jiang |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not a super expert in this area but I reviewed the code and test changes / coverage carefully and it looks 👍 to me
The only thing I think worth doing is to review the coverage for FIXED_LEN_BYTE_ARRAY in parquet/src/file/serialized_reader.rs (this PR didn't change the coverage, but it seems like there may be missing statistics coverage for that type)
@@ -156,12 +156,12 @@ fn print_index<T: std::fmt::Debug>( | |||
idx, o.offset, o.compressed_page_size, row_count | |||
); | |||
match &c.min { | |||
Some(m) => print!(", min {:>10?}", m), | |||
Some(m) => print!(", min {:>10}", m), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -181,11 +181,11 @@ pub fn from_thrift( | |||
// min/max statistics for INT96 columns. | |||
let min = min.map(|data| { | |||
assert_eq!(data.len(), 12); | |||
from_ne_slice::<Int96>(&data) | |||
from_le_slice::<Int96>(&data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For native types, this outputs the data as little endian. Floating point types are encoded in IEEE.
👍
use crate::util::{ | ||
bit_util::{self, from_ne_slice, BitReader, BitWriter, FromBytes}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to other reviewers, this PR changes from native endian to little endian (parquet is always little endian) -- more comments on this below
@@ -1502,7 +1503,7 @@ mod tests { | |||
//col9->date_string_col: BINARY UNCOMPRESSED DO:0 FPO:332847 SZ:111948/111948/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 01/01/09, max: 12/31/10, num_nulls: 0] | |||
assert!(!&page_indexes[0][8].is_sorted()); | |||
if let Index::BYTE_ARRAY(index) = &page_indexes[0][8] { | |||
check_bytes_page_index( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -1515,7 +1516,7 @@ mod tests { | |||
//col10->string_col: BINARY UNCOMPRESSED DO:0 FPO:444795 SZ:45298/45298/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 9, num_nulls: 0] | |||
assert!(&page_indexes[0][9].is_sorted()); | |||
if let Index::BYTE_ARRAY(index) = &page_indexes[0][9] { | |||
check_bytes_page_index( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any coverage for FIXED_LENGTH_BYTE_ARRAY?
Benchmark runs are scheduled for baseline = 892a803 and contender = de381ec. de381ec is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #3575
Relates to #3577
Rationale for this change
See tickets
What changes are included in this PR?
Are there any user-facing changes?