Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use native types in PageIndex (#3575) #3578

Merged
merged 5 commits into from
Jan 23, 2023

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #3575
Relates to #3577

Rationale for this change

See tickets

What changes are included in this PR?

Are there any user-facing changes?

@tustvold tustvold added the api-change Changes to the arrow API label Jan 20, 2023
@github-actions github-actions bot added the parquet Changes to the parquet crate label Jan 20, 2023
@@ -132,7 +132,7 @@ fn compute_row_counts(offset_index: &[PageLocation], rows: i64) -> Vec<i64> {
}

/// Prints index information for a single column chunk
fn print_index<T: std::fmt::Debug>(
fn print_index<T: std::fmt::Display>(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relates to #3573 FYI @bmmeijers

The indexes now all implement Display 🎉

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to format the types for bytearray/fixedlenbytearray with hex layout (using :x?) by default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be willing to review a PR that altered the display implementation

Comment on lines -1229 to -1281
impl FromBytes for Int96 {
type Buffer = [u8; 12];
fn from_le_bytes(bs: Self::Buffer) -> Self {
let mut i = Int96::new();
i.set_data(
from_le_slice(&bs[0..4]),
from_le_slice(&bs[4..8]),
from_le_slice(&bs[8..12]),
);
i
}
fn from_be_bytes(_bs: Self::Buffer) -> Self {
unimplemented!()
}
fn from_ne_bytes(bs: Self::Buffer) -> Self {
let mut i = Int96::new();
i.set_data(
from_ne_slice(&bs[0..4]),
from_ne_slice(&bs[4..8]),
from_ne_slice(&bs[8..12]),
);
i
}
}

// FIXME Needed to satisfy the constraint of many decoding functions but ByteArray does not
// appear to actual be converted directly from bytes
impl FromBytes for ByteArray {
type Buffer = Vec<u8>;
fn from_le_bytes(bs: Self::Buffer) -> Self {
ByteArray::from(bs)
}
fn from_be_bytes(_bs: Self::Buffer) -> Self {
unreachable!()
}
fn from_ne_bytes(bs: Self::Buffer) -> Self {
ByteArray::from(bs)
}
}

impl FromBytes for FixedLenByteArray {
type Buffer = Vec<u8>;

fn from_le_bytes(bs: Self::Buffer) -> Self {
Self(ByteArray::from(bs))
}
fn from_be_bytes(_bs: Self::Buffer) -> Self {
unreachable!()
}
fn from_ne_bytes(bs: Self::Buffer) -> Self {
Self(ByteArray::from(bs))
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These have been moved to live alongside the other FromBytes implementations

@@ -1575,20 +1576,6 @@ mod tests {
});
}

fn check_bytes_page_index(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We no longer need to treat different index types differently 🎉

Comment on lines -26 to -31
let mut b = T::Buffer::default();
{
let b = b.as_mut();
let bs = &bs[..b.len()];
b.copy_from_slice(bs);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic was problematic as for ByteArray and FixedLenByteArray use Vec as Self::Buffer, this would then have a length of 0 and so wouldn't copy anything across...

@tustvold tustvold marked this pull request as draft January 20, 2023 14:33
Comment on lines -61 to -66
fn from_be_bytes(bs: Self::Buffer) -> Self {
<$ty>::from_be_bytes(bs)
}
fn from_ne_bytes(bs: Self::Buffer) -> Self {
<$ty>::from_ne_bytes(bs)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data in parquet is always stored as little endian - https://github.com/apache/parquet-format/blob/master/Encodings.md#plain-plain--0

The use of ne_bytes was actually incorrect

@@ -181,11 +181,11 @@ pub fn from_thrift(
// min/max statistics for INT96 columns.
let min = min.map(|data| {
assert_eq!(data.len(), 12);
from_ne_slice::<Int96>(&data)
from_le_slice::<Int96>(&data)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For native types, this outputs the data as little endian. Floating point types are encoded in IEEE.

👍

}
T::from_le_bytes(b)
}

pub trait FromBytes: Sized {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this trait is not part of the public API

Comment on lines +56 to +63
BOOLEAN(NativeIndex<bool>),
INT32(NativeIndex<i32>),
INT64(NativeIndex<i64>),
INT96(NativeIndex<Int96>),
FLOAT(NativeIndex<f32>),
DOUBLE(NativeIndex<f64>),
BYTE_ARRAY(ByteArrayIndex),
FIXED_LEN_BYTE_ARRAY(ByteArrayIndex),
BYTE_ARRAY(NativeIndex<ByteArray>),
FIXED_LEN_BYTE_ARRAY(NativeIndex<ByteArray>),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the breaking change

@tustvold tustvold marked this pull request as ready for review January 20, 2023 15:05
let bs = &bs[..b.len()];
b.copy_from_slice(bs);
fn array_from_slice<const N: usize>(bs: &[u8]) -> Result<[u8; N]> {
// Need to slice as may be called with zero-padded values
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've confirmed this optimised as you would hope

https://rust.godbolt.org/z/4WPdn4j8j

@alamb
Copy link
Contributor

alamb commented Jan 20, 2023

cc @Ted-Jiang

@tustvold tustvold requested a review from alamb January 22, 2023 09:14
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a super expert in this area but I reviewed the code and test changes / coverage carefully and it looks 👍 to me

The only thing I think worth doing is to review the coverage for FIXED_LEN_BYTE_ARRAY in parquet/src/file/serialized_reader.rs (this PR didn't change the coverage, but it seems like there may be missing statistics coverage for that type)

@@ -156,12 +156,12 @@ fn print_index<T: std::fmt::Debug>(
idx, o.offset, o.compressed_page_size, row_count
);
match &c.min {
Some(m) => print!(", min {:>10?}", m),
Some(m) => print!(", min {:>10}", m),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -181,11 +181,11 @@ pub fn from_thrift(
// min/max statistics for INT96 columns.
let min = min.map(|data| {
assert_eq!(data.len(), 12);
from_ne_slice::<Int96>(&data)
from_le_slice::<Int96>(&data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For native types, this outputs the data as little endian. Floating point types are encoded in IEEE.

👍

use crate::util::{
bit_util::{self, from_ne_slice, BitReader, BitWriter, FromBytes},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to other reviewers, this PR changes from native endian to little endian (parquet is always little endian) -- more comments on this below

@@ -1502,7 +1503,7 @@ mod tests {
//col9->date_string_col: BINARY UNCOMPRESSED DO:0 FPO:332847 SZ:111948/111948/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 01/01/09, max: 12/31/10, num_nulls: 0]
assert!(!&page_indexes[0][8].is_sorted());
if let Index::BYTE_ARRAY(index) = &page_indexes[0][8] {
check_bytes_page_index(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirmed there is exiting coverage for boolean a few lines above (bonus that the interface hasn't changed 👍 )

Screenshot 2023-01-23 at 6 59 52 AM

@@ -1515,7 +1516,7 @@ mod tests {
//col10->string_col: BINARY UNCOMPRESSED DO:0 FPO:444795 SZ:45298/45298/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 9, num_nulls: 0]
assert!(&page_indexes[0][9].is_sorted());
if let Index::BYTE_ARRAY(index) = &page_indexes[0][9] {
check_bytes_page_index(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any coverage for FIXED_LENGTH_BYTE_ARRAY?

@tustvold tustvold merged commit de381ec into apache:master Jan 23, 2023
@ursabot
Copy link

ursabot commented Jan 23, 2023

Benchmark runs are scheduled for baseline = 892a803 and contender = de381ec. de381ec is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use Native Types in PageIndex
4 participants