-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: bitpack with miniblock #3067
feat: bitpack with miniblock #3067
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3067 +/- ##
==========================================
- Coverage 77.60% 77.48% -0.12%
==========================================
Files 240 240
Lines 78683 78752 +69
Branches 78683 78752 +69
==========================================
- Hits 61059 61024 -35
- Misses 14496 14590 +94
- Partials 3128 3138 +10
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really cool! A few changes are needed but I'm excited to make progress on this.
@@ -561,7 +561,7 @@ impl IndexWorker { | |||
} | |||
} | |||
|
|||
pub(crate) struct PostingReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes don't seem relevant? Can we remove them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, sure, might be some automatic lint fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
protos/encodings.proto
Outdated
|
||
// The items in the list | ||
Buffer buffer = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably don't need buffer
. That can be a concern for structural encodings (e.g. PageLayout
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. Thanks!
@@ -263,6 +271,7 @@ message ArrayEncoding { | |||
FixedSizeBinary fixed_size_binary = 11; | |||
BitpackedForNonNeg bitpacked_for_non_neg = 12; | |||
Constant constant = 13; | |||
Bitpack2 bitpack2 = 14; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need it for this PR but I'm thinking of putting the new 2.1 encodings in message Compression { }
instead of ArrayEncoding
. On the plus side, we can probably name it bitpack
and not bitpack2
once we do that. Let's save for future work though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gotcha, good idea.
python/play.py
Outdated
@@ -0,0 +1,96 @@ | |||
import pyarrow as pa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be included in this PR? If so, we should rename it and move it in a benchmark somwhere. Although I'd rather just keep it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, sorry, will remove it. I keep git rm
this file, then I get some lint issues then I git add .
it back haha
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, thanks.
rust/lance-encoding/src/encoder.rs
Outdated
) -> Result<Box<dyn MiniBlockCompressor>> { | ||
assert!(field.data_type().byte_width() > 0); | ||
if let DataBlock::FixedWidth(ref fixed_width_data) = data { | ||
if fixed_width_data.bits_per_value <= 64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also check that it is a multiple of 8
? Or maybe that it is one of [8, 16, 32, 64]
? Would your logic work (for example) with fixed-size-list<int8, 3>
which has a width of 24
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, good idea. I should make sure it is one of 8, 16, 32, 64
.
one interesting issue may need to be considered:
when input is fixed-size-list<int8, 4>
, we can bitpack it and get back the decoding result back correctly, but the compression ratio is probably bad, we may also need to make sure don't use bitpack when hit this case
// and the bit-width parameter has the same bit-width as the uncompressed DataBlock | ||
// for example, if the input DataBlock has `bits_per_value` of `16`, there will be 2 bytes(16 bits) | ||
// in front of each chunk storing the `bit-width` parameter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just always use 1 byte?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, this is related to the type of output
vector, it's type is the same as input data type(to satisfy signature of Bitpack::unchecked_pack), so when we are filling in bit-width
into output
, it's type is same as input data type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for fastlanes bitpack compression to work, its start output buffer position must be aligned to the same input data type alignment.
I will do same experiments to test it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it will panic to squeeze in a u8
into a vector of u64
and then treat the rest of vector as u64
, because the alignment requirement doesn't hold anymore.
let bit_widths = $data | ||
.get_stat(Stat::BitWidth) | ||
.expect("FixedWidthDataBlock should have valid bit width statistics"); | ||
println!("bit_widths statistics got"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
println!("bit_widths statistics got"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, thanks.
.downcast_ref::<PrimitiveArray<UInt64Type>>() | ||
.unwrap(); | ||
|
||
let (packed_chunk_sizes, total_size) = bit_widths_array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are chunk_size
and total_size
a "number of values" measurement or a "number of bytes" measurement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
packed_chunk_sizes and total_size are "number of values" measurement, it needs to be this way because the function signature Bitpack::unchecked_pack
requires the output
slice type to be the same as the input slice, so when we use total_size
to
let mut output: Vec<$data_type> = Vec::with_capacity(total_size);
the output size is total_size
with number of bytes of total_size * sizeof<each element>
} | ||
chunks.push(MiniBlockChunk { | ||
num_bytes: ((1 + packed_chunk_sizes[i]) * std::mem::size_of::<$data_type>()) as u16, | ||
log_num_values: 10, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since ELEMS_PER_CHUNK
is a constant it would be good for this to be a constant too. Maybe just LOG_ELEMENTS_PER_CHUNK
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, thanks!
// Copy for memory alignment | ||
let chunk_in_u8: Vec<u8> = data.to_vec(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the memory alignment concern now. Let's not worry about it now. In the future to fix this I think we need...
- Each page must be aligned (preferably to something largish like 64 bytes)
- The miniblock layout is responsible for ensuring that the mini-block buffer (that we get from the compressor) is either at the start of the page or is padded to sufficient alignment (again, can just do 64)
- Each miniblock chunk should be aligned. The compressor can return a preferred alignment (we want to be more conservative here and use the smallest alignment we can) and the layout can be responsible for doing the actual padding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here, I need to cast the raw data fetched from memory into slice of u8, u16, u32, or u64
based on their uncompressed_bit_width
, to make this cast successful, I need to make sure the data being cast has the alignment requirement that a slice of u8, u16, u32, u64
needs(multiple of 8 bytes).
the bit-packed chunk itself guarantees that(it's a multiple of 1024 bits(128 bytes), if mini-chunk layout can guarantee that, then this copy may be eliminated
This PR tries to add bit-packing encoding in the mini-block encoding path.
In this PR, each chunk(1024 values) has it's own bit-width parameter and it's stored in each chunk
I found that the current implementation to get the
bit_width
of every 1024 values is very slow and hurts the write speed significantly, more investigation needed, I will deal with it by filling a different issue and PR.#3052