Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: bitpack with miniblock #3067

Merged
merged 12 commits into from
Oct 31, 2024

Conversation

broccoliSpicy
Copy link
Contributor

@broccoliSpicy broccoliSpicy commented Oct 30, 2024

This PR tries to add bit-packing encoding in the mini-block encoding path.
In this PR, each chunk(1024 values) has it's own bit-width parameter and it's stored in each chunk

I found that the current implementation to get the bit_width of every 1024 values is very slow and hurts the write speed significantly, more investigation needed, I will deal with it by filling a different issue and PR.

#3052

@github-actions github-actions bot added enhancement New feature or request python labels Oct 30, 2024
@codecov-commenter
Copy link

codecov-commenter commented Oct 30, 2024

Codecov Report

Attention: Patch coverage is 81.42857% with 13 lines in your changes missing coverage. Please review.

Project coverage is 77.48%. Comparing base (270cab3) to head (dcabd0f).
Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
...coding/src/encodings/physical/bitpack_fastlanes.rs 80.39% 10 Missing ⚠️
rust/lance-encoding/src/encoder.rs 77.77% 2 Missing ⚠️
rust/lance-encoding/src/encodings/physical.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3067      +/-   ##
==========================================
- Coverage   77.60%   77.48%   -0.12%     
==========================================
  Files         240      240              
  Lines       78683    78752      +69     
  Branches    78683    78752      +69     
==========================================
- Hits        61059    61024      -35     
- Misses      14496    14590      +94     
- Partials     3128     3138      +10     
Flag Coverage Δ
unittests 77.48% <81.42%> (-0.12%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@broccoliSpicy broccoliSpicy changed the title feat: bitpack with miniblock feat: bitpack with miniblock, each chunk has it's own bit-width parameter and it's stored in each chunk Oct 30, 2024
Copy link

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@broccoliSpicy broccoliSpicy changed the title feat: bitpack with miniblock, each chunk has it's own bit-width parameter and it's stored in each chunk feat: bitpack with miniblock Oct 30, 2024
Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really cool! A few changes are needed but I'm excited to make progress on this.

@@ -561,7 +561,7 @@ impl IndexWorker {
}
}

pub(crate) struct PostingReader {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes don't seem relevant? Can we remove them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, sure, might be some automatic lint fix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Comment on lines 217 to 219

// The items in the list
Buffer buffer = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably don't need buffer. That can be a concern for structural encodings (e.g. PageLayout).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. Thanks!

@@ -263,6 +271,7 @@ message ArrayEncoding {
FixedSizeBinary fixed_size_binary = 11;
BitpackedForNonNeg bitpacked_for_non_neg = 12;
Constant constant = 13;
Bitpack2 bitpack2 = 14;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need it for this PR but I'm thinking of putting the new 2.1 encodings in message Compression { } instead of ArrayEncoding. On the plus side, we can probably name it bitpack and not bitpack2 once we do that. Let's save for future work though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha, good idea.

python/play.py Outdated
@@ -0,0 +1,96 @@
import pyarrow as pa
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be included in this PR? If so, we should rename it and move it in a benchmark somwhere. Although I'd rather just keep it out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, sorry, will remove it. I keep git rm this file, then I get some lint issues then I git add . it back haha

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, thanks.

) -> Result<Box<dyn MiniBlockCompressor>> {
assert!(field.data_type().byte_width() > 0);
if let DataBlock::FixedWidth(ref fixed_width_data) = data {
if fixed_width_data.bits_per_value <= 64 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also check that it is a multiple of 8? Or maybe that it is one of [8, 16, 32, 64]? Would your logic work (for example) with fixed-size-list<int8, 3> which has a width of 24?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, good idea. I should make sure it is one of 8, 16, 32, 64.
one interesting issue may need to be considered:
when input is fixed-size-list<int8, 4>, we can bitpack it and get back the decoding result back correctly, but the compression ratio is probably bad, we may also need to make sure don't use bitpack when hit this case

Comment on lines +1570 to +1572
// and the bit-width parameter has the same bit-width as the uncompressed DataBlock
// for example, if the input DataBlock has `bits_per_value` of `16`, there will be 2 bytes(16 bits)
// in front of each chunk storing the `bit-width` parameter.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just always use 1 byte?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this is related to the type of output vector, it's type is the same as input data type(to satisfy signature of Bitpack::unchecked_pack), so when we are filling in bit-width into output, it's type is same as input data type

Copy link
Contributor Author

@broccoliSpicy broccoliSpicy Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for fastlanes bitpack compression to work, its start output buffer position must be aligned to the same input data type alignment.
I will do same experiments to test it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will panic to squeeze in a u8 into a vector of u64 and then treat the rest of vector as u64, because the alignment requirement doesn't hold anymore.

let bit_widths = $data
.get_stat(Stat::BitWidth)
.expect("FixedWidthDataBlock should have valid bit width statistics");
println!("bit_widths statistics got");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
println!("bit_widths statistics got");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, thanks.

.downcast_ref::<PrimitiveArray<UInt64Type>>()
.unwrap();

let (packed_chunk_sizes, total_size) = bit_widths_array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are chunk_size and total_size a "number of values" measurement or a "number of bytes" measurement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

packed_chunk_sizes and total_size are "number of values" measurement, it needs to be this way because the function signature Bitpack::unchecked_pack requires the output slice type to be the same as the input slice, so when we use total_size to

      let mut output: Vec<$data_type> = Vec::with_capacity(total_size);

the output size is total_size with number of bytes of total_size * sizeof<each element>

}
chunks.push(MiniBlockChunk {
num_bytes: ((1 + packed_chunk_sizes[i]) * std::mem::size_of::<$data_type>()) as u16,
log_num_values: 10,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since ELEMS_PER_CHUNK is a constant it would be good for this to be a constant too. Maybe just LOG_ELEMENTS_PER_CHUNK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, thanks!

Comment on lines +1729 to +1730
// Copy for memory alignment
let chunk_in_u8: Vec<u8> = data.to_vec();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the memory alignment concern now. Let's not worry about it now. In the future to fix this I think we need...

  • Each page must be aligned (preferably to something largish like 64 bytes)
  • The miniblock layout is responsible for ensuring that the mini-block buffer (that we get from the compressor) is either at the start of the page or is padded to sufficient alignment (again, can just do 64)
  • Each miniblock chunk should be aligned. The compressor can return a preferred alignment (we want to be more conservative here and use the smallest alignment we can) and the layout can be responsible for doing the actual padding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here, I need to cast the raw data fetched from memory into slice of u8, u16, u32, or u64 based on their uncompressed_bit_width, to make this cast successful, I need to make sure the data being cast has the alignment requirement that a slice of u8, u16, u32, u64 needs(multiple of 8 bytes).
the bit-packed chunk itself guarantees that(it's a multiple of 1024 bits(128 bytes), if mini-chunk layout can guarantee that, then this copy may be eliminated

@broccoliSpicy broccoliSpicy merged commit 54053e6 into lancedb:main Oct 31, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants