-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add unpack8, unpack16, unpack64 (#2276) ~10-50% faster #2278
Conversation
Roughly ~10% performance improvement, which isn't bad given half the objective was cleaning up the code with an eventual view to implementing #2257. The major return for DeltaBitPackDecoder would likely only be realizable with #2282 and larger miniblock sizes. Performance vs master
For completeness, performance vs master with
|
Running the benchmarks with
I wonder if we should just bump the default block size to 256 for 64-bit integers 🤔 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @sunchao This new bit pack utils look useful to us.
parquet/src/util/bit_pack.rs
Outdated
) | ||
}; | ||
|
||
unroll!(for i in 0..$bits { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LLVM was very reticent to unroll this loop, this is not all that surprising given it isn't immediately obvious how much it collapses down. We therefore give it a helping hand 😄
parquet/src/util/bit_pack.rs
Outdated
|
||
//! Vectorised bit-packing utilities | ||
|
||
macro_rules! unroll_impl { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kind of gross, but macros can't actually do maths, so we "emulate" it by building an expression tree of "1 + 1 + ...". This will all get compiled away
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get some comments in this macro so that future readers can have a hope of understanding what it does without hours of study ;)
Starting with inlining the comments from the PR is probably a good start :)
parquet/src/util/bit_pack.rs
Outdated
unroll_impl!(1, $offset + 1, $v, $c); | ||
}}; | ||
(4, $offset:expr, $v:ident, $c:block) => {{ | ||
unroll_impl!(2, 0, __v4, { unroll_impl!(2, __v4 * 2 + $offset, $v, $c) }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, where does __v4
comes? Seems not macro parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's an identifier created by this macro instantiation, and then passed into the child. This was the only way I could work out to do arithmetic in a macro, it's kind of wild 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference this is what it expands to
unroll!(for i in 0..4 {
vec.push(i);
});
#[allow(non_upper_case_globals)]
{
{
{
{
const __v4: usize = 0;
{
{
{
const i: usize = (__v4 * 2 + 0);
{
vec.push(i);
}
}
{
const i: usize = ((__v4 * 2 + 0) + 1);
{
vec.push(i);
}
}
}
}
}
{
const __v4: usize = (0 + 1);
{
{
{
const i: usize = (__v4 * 2 + 0);
{
vec.push(i);
}
}
{
const i: usize = ((__v4 * 2 + 0) + 1);
{
vec.push(i);
}
}
}
}
}
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth adding some comments for future reference :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parquet/src/util/bit_pack.rs
Outdated
|
||
//! Vectorised bit-packing utilities | ||
|
||
macro_rules! unroll_impl { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get some comments in this macro so that future readers can have a hope of understanding what it does without hours of study ;)
Starting with inlining the comments from the PR is probably a good start :)
parquet/src/util/bit_pack.rs
Outdated
use super::*; | ||
use rand::{thread_rng, Rng}; | ||
|
||
#[inline(never)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this annotation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops hold over from some godbolt shenanigans, will remove
parquet/src/util/bit_pack.rs
Outdated
/// | ||
/// - force unrolling of a loop so that LLVM optimises it properly | ||
/// - call a const generic function with the loop value | ||
macro_rules! unroll { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An example might help to document this -- btw the use of matching patterns for unrolling i in 0..end
is quite neat
let val = r(end_byte); | ||
let b = val << (NUM_BITS - end_bit_offset); | ||
|
||
output[i] = a | (b & mask); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could somehow remove the output[i]
bounds check too? Perhaps with an iterator or append or something 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As output is a fixed length slice, the bounds check is automatically elided
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah good call, I was confusing that with input: &[u8]
🤦
_ => unreachable!(), | ||
} | ||
|
||
// Try to read smaller batches if possible |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Try to read smaller batches if possible | |
// Try to read remainder, if any, in batches if possible |
@@ -476,17 +477,6 @@ impl BitReader { | |||
|
|||
let mut i = 0; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it isn't clear to me what the num_bits
input parameter means (is it the number of bits to read after batch.len()
whole T
values?) -- can you possibly update the comments too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally renamed this to bit_width, but we seem to use num_bits
as the name for this concept in a lot of places. I've updated the doc comment
match i { | ||
0..=8 => test_get_batch_helper::<u8>(*s, i), | ||
9..=16 => test_get_batch_helper::<u16>(*s, i), | ||
_ => test_get_batch_helper::<u32>(*s, i), | ||
17..=32 => test_get_batch_helper::<u32>(*s, i), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks great to reduce the code a lot. Thanks for the work. It is nicer if we can add some comments as @alamb suggested, as at first look, the macros look complicated.
Thanks, this looks very nice. I'll take a look too today. BTW I think it's possible to apply SIMD on this code path to further improve the performance. Arrow C++ already did it, see https://issues.apache.org/jira/browse/ARROW-9702 and related. |
The motivation for the manual unrolling of the loops is so that LLVM does this for us. If you check the godbolt output linked in the PR you'll see it is using AVX2 instructions. This fits with what we've seen in general with the arrow compute kernels, where only AVX512 seems to need manual implementation. |
parquet/src/util/bit_pack.rs
Outdated
(16, $offset:expr, $v:ident, $c:block) => {{ | ||
unroll_impl!(4, 0, __v16, { | ||
unroll_impl!(4, __v16 * 4 + $offset, $v, $c) | ||
}); | ||
}}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use unroll_impl!(8, ...)
here?
/// Macro that generates an unpack function taking the number of bits as a const generic | ||
macro_rules! unpack_impl { | ||
($t:ty, $bytes:literal, $bits:tt) => { | ||
pub fn unpack<const NUM_BITS: usize>(input: &[u8], output: &mut [$t; $bits]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 when the original version was written const generic is not stabilized yet.
parquet/src/util/bit_pack.rs
Outdated
unroll_impl!(1, $offset + 1, $v, $c); | ||
}}; | ||
(4, $offset:expr, $v:ident, $c:block) => {{ | ||
unroll_impl!(2, 0, __v4, { unroll_impl!(2, __v4 * 2 + $offset, $v, $c) }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth adding some comments for future reference :)
parquet/src/util/bit_pack.rs
Outdated
/// Unpack packed `input` into `output` with a bit width of `num_bits` | ||
pub fn $name(input: &[u8], output: &mut [$t; $bits], num_bits: usize) { | ||
// This will get optimised into a jump table | ||
unroll!(for i in 0..$bits { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel it might be worth checkouting https://docs.rs/seq-macro/latest/seq_macro/, the generated output is easier to understand.
For example the output of unpack8
becomes:
pub fn unpack8(input: &[u8], output: &mut [u8; 8], num_bits: usize) {
if 0 == num_bits { return unpack8::unpack::<0>(input, output); }
if 1 == num_bits { return unpack8::unpack::<1>(input, output); }
if 2 == num_bits { return unpack8::unpack::<2>(input, output); }
if 3 == num_bits { return unpack8::unpack::<3>(input, output); }
if 4 == num_bits { return unpack8::unpack::<4>(input, output); }
if 5 == num_bits { return unpack8::unpack::<5>(input, output); }
if 6 == num_bits { return unpack8::unpack::<6>(input, output); }
if 7 == num_bits { return unpack8::unpack::<7>(input, output); };
if num_bits == 8 { return unpack8::unpack::<8>(input, output); }
::core::panicking::panic_fmt(::core::fmt::Arguments::new_v1(&["internal error: entered unreachable code: "],
&[::core::fmt::ArgumentV1::new_display(&::core::fmt::Arguments::new_v1(&["invalid num_bits "],
&[::core::fmt::ArgumentV1::new_display(&num_bits)]))]));
}
instead of
/// Unpack packed `input` into `output` with a bit width of `num_bits`
pub fn unpack8(input: &[u8], output: &mut [u8; 8], num_bits: usize) {
#[allow(non_upper_case_globals)]
{
{
{
{
const __v8: usize = 0;
{
{
{
{
const __v4: usize = 0;
{
{
{
const i: usize = __v4 * 2 + (__v8 * 4 + 0);
{
if i == num_bits {
return unpack8::unpack::<i>(input, output);
}
}
};
{
const i: usize = __v4 * 2 + (__v8 * 4 + 0) + 1;
{
if i == num_bits {
return unpack8::unpack::<i>(input, output);
}
}
};
}
}
};
{
const __v4: usize = 0 + 1;
{
{
{
const i: usize = __v4 * 2 + (__v8 * 4 + 0);
{
if i == num_bits {
return unpack8::unpack::<i>(input, output);
}
}
};
{
const i: usize = __v4 * 2 + (__v8 * 4 + 0) + 1;
{
if i == num_bits {
return unpack8::unpack::<i>(input, output);
}
}
};
}
}
};
};
}
}
};
{
const __v8: usize = 0 + 1;
{
{
{
{
const __v4: usize = 0;
{
{
{
const i: usize = __v4 * 2 + (__v8 * 4 + 0);
{
if i == num_bits {
return unpack8::unpack::<i>(input, output);
}
}
};
{
const i: usize = __v4 * 2 + (__v8 * 4 + 0) + 1;
{
if i == num_bits {
return unpack8::unpack::<i>(input, output);
}
}
};
}
}
};
{
const __v4: usize = 0 + 1;
{
{
{
const i: usize = __v4 * 2 + (__v8 * 4 + 0);
{
if i == num_bits {
return unpack8::unpack::<i>(input, output);
}
}
};
{
const i: usize = __v4 * 2 + (__v8 * 4 + 0) + 1;
{
if i == num_bits {
return unpack8::unpack::<i>(input, output);
}
}
};
}
}
};
};
}
}
};
};
}
};
if num_bits == 8 { return unpack8::unpack::<8>(input, output); }
::core::panicking::panic_fmt(::core::fmt::Arguments::new_v1(&["internal error: entered unreachable code: "],
&[::core::fmt::ArgumentV1::new_display(&::core::fmt::Arguments::new_v1(&["invalid num_bits "],
&[::core::fmt::ArgumentV1::new_display(&num_bits)]))]));
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and you may no longer need the hack in unroll_impl
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to avoid introducing a new dependency, as that has historically been controversial
I think in this case using a dependency could be preferrable and lead to even faster code. A macro relying on implementation details of how identifiers are generated is indeed a bit ugly. I know that parquet2 is using the bitpacking crate, which uses optimized vectorized implementations without relying too much on llvm. |
I looked at the bitpacking crate but it only supports the equivalent of unpack32 which it achieves in much the same way relying on LLVM to optimise it. I'll switch this to using seq_macro as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I say 🚢 🇮🇹 and I will keep my 🤐 about adding new dependencies -- I think seq
is ok given what I see of the https://crates.io/crates/seq-macro
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM too. seq-macro
doesn't carry any transitive dependency and it seems well maintained, so I think it's fine to add it.
Thank you all for reviewing and for the feedback 😃 |
Benchmark runs are scheduled for baseline = 2cf4cd8 and contender = 8a092e3. 8a092e3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
#2319 bumps the defaults to make the most of this change, PTAL 😄 |
This PR ports apache/arrow-rs#2278 to parquet2. Credit to the design and implementation of the unpacking path go to @tustvold - it is 5-10% faster than the bitpacking crate 🚀 Additionally, it adds the corresponding packing code path, thereby completely replacing the dependency on bitpacking. It also adds some traits that allows code to be written via generics. A curious observation is that, with this PR, parquet2 no longer executes unsafe code (bitpacking had some) 🎉 Backward changes: renamed parquet2::encoding::bitpacking to parquet2::encoding::bitpacked parquet2::encoding::bitpacked::Decoder now has a generic parameter (output type) parquet2::encoding::bitpacked::Decoder::new's second parameter is now a usize
This PR ports apache/arrow-rs#2278 to parquet2. Credit to the design and implementation of the unpacking path go to @tustvold - it is 5-10% faster than the bitpacking crate 🚀 Additionally, it adds the corresponding packing code path, thereby completely replacing the dependency on bitpacking. It also adds some traits that allows code to be written via generics. A curious observation is that, with this PR, parquet2 no longer executes unsafe code (bitpacking had some) 🎉 Backward changes: renamed parquet2::encoding::bitpacking to parquet2::encoding::bitpacked parquet2::encoding::bitpacked::Decoder now has a generic parameter (output type) parquet2::encoding::bitpacked::Decoder::new's second parameter is now a usize
Which issue does this PR close?
Closes #2276
Rationale for this change
Eliminates a load of unsafe code, and allows unpacking into the native type instead of always having to go via
u32
.I'm not confident the generated code is necessarily optimal, but godbolt would suggest that LLVM at least takes a decent stab at using the bit-shuffling instructions - https://rust.godbolt.org/z/1aMdKT8qc
What changes are included in this PR?
Are there any user-facing changes?