-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support miniblock with binary data #3099
Conversation
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
00b1084
to
a6053ca
Compare
rust/lance-encoding/src/data.rs
Outdated
@@ -293,6 +293,55 @@ impl FixedWidthDataBlock { | |||
} | |||
} | |||
|
|||
pub struct VariableWidthDataBlockBuilder1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for variable width data with offset u64, I will do
pub struct VariableWidthDataBlockBuilder2 {
offsets: Vec<u64>,
bytes: Vec<u8>,
}
it will be added in a follow up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use better names please. Even if it's just VariableWidthDataBlockBuilder
and LargeVariableWidthDataBlockBuilder
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha, sure! VariableWidthDataBlockBuilder
and LargeVariableWidthDataBlockBuilder
are way better than ...1
, ...2
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
rust/lance-encoding/src/format.rs
Outdated
@@ -133,6 +133,12 @@ impl ProtobufUtils { | |||
} | |||
} | |||
|
|||
pub fn binary_miniblock() -> ArrayEncoding { | |||
ArrayEncoding { | |||
array_encoding: Some(ArrayEncodingEnum::BinaryMiniblock(BinaryMiniBlock {})), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not important to this PR but I defined proto buffer
message BinaryMiniBlock {
}
but here it is BinaryMiniblock
B
-> b
don't know the reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is because you have BinaryMiniBlock binary_miniblock = 15;
and not BinaryMiniBlock binary_mini_block = 15;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh! gotcha, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3099 +/- ##
==========================================
+ Coverage 77.08% 77.13% +0.05%
==========================================
Files 240 240
Lines 80412 80652 +240
Branches 80412 80652 +240
==========================================
+ Hits 61987 62213 +226
- Misses 15263 15275 +12
- Partials 3162 3164 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great. We will need test cases though. For fixed width data we just extended the 2.0 tests to run on 2.1 as well:
#[rstest]
#[test_log::test(tokio::test)]
async fn test_value_primitive(
#[values(LanceFileVersion::V2_0, LanceFileVersion::V2_1)] version: LanceFileVersion,
) {
for data_type in PRIMITIVE_TYPES {
log::info!("Testing encoding for {:?}", data_type);
let field = Field::new("", data_type.clone(), false);
check_round_trip_encoding_random(field, version).await;
}
}
You can probably do the same thing using the tests in physical/binary.rs
.
Also, I think we have a problem with alignment. I have an idea I'll put together real quick.
rust/lance-encoding/src/data.rs
Outdated
@@ -293,6 +293,55 @@ impl FixedWidthDataBlock { | |||
} | |||
} | |||
|
|||
pub struct VariableWidthDataBlockBuilder1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use better names please. Even if it's just VariableWidthDataBlockBuilder
and LargeVariableWidthDataBlockBuilder
.
rust/lance-encoding/src/data.rs
Outdated
for i in selection.start..selection.end { | ||
let this_value_len = offsets[i as usize + 1] - offsets[i as usize]; | ||
self.offsets.push(previous_len as u32 + this_value_len); | ||
previous_len += this_value_len as usize; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is inefficient I think (don't want offsets[]
in a loop because each call does bounds checking) but we can optimize later if you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the suggestion, fixed.
DataType::Binary | DataType::Utf8 => { | ||
let column_info = column_infos.expect_next()?; | ||
let scheduler = Box::new(StructuralPrimitiveFieldScheduler::try_new( | ||
column_info.as_ref(), | ||
self.decompressor_strategy.as_ref(), | ||
)?); | ||
column_infos.next_top_level(); | ||
Ok(scheduler) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of doing this can we add Binary
and Utf8
to Self::is_primitive
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a good idea, I am a bit worried doing this may cause some issues in lance version priori to 2.1, I will verify this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
turns out it does cause troubles with backward compatibility, for example, the test in python/tests/test_backwards_compatibility.py
will fail. kept CoreDecompressorStrategy::is_primitive
as is and kept this logic.
rust/lance-encoding/src/encoder.rs
Outdated
let max_len = max_len | ||
.as_any() | ||
.downcast_ref::<PrimitiveArray<UInt64Type>>() | ||
.unwrap(); | ||
if max_len.value(0) < 128 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why verify max_len
again here? It's not harmful, just curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's just being cautious, and subconsciously the fear of writing bug code. I think it's better to remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed.
bytes_start_offset: usize, | ||
// every chunk is padded to 8 bytes. | ||
// we need to interpret every chunk as &[u32] so we need it to padded at least to 4 bytes, | ||
// 8 bytes are a more conservative behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is some confusing between padding and alignment here. You want to be able to take the offsets of a chunk and interpret them as &[u32]
. In order for this to work the slice needs to be aligned to (not padded) 4 bytes (and it is quite concretely 4, let's not make it 8). The means the pointer to the start of the slice needs to be divisible by 4.
Padding the buffer to make the length divisible by 4 does not guarantee this, except possibly accidentally.
In fact, getting this alignment correct is kind of tricky, and is going to take some tweaking of the miniblock encoder itself 😦.
Let me throw together a PR. We can probably change our "pad to 6" logic to just be "pad to 8" and then pad the start of the data buffer appropriately so that the start of the data buffer is on an 8-byte aligned pointer. This will ensure that each chunk starts with an 8-byte aligned pointer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a copy of the entire chunk in decompress
(hope we can eliminate it after your PR).
because I did a copy, I think I can safely reinterpret the chunk as &[u32]
as long as it's length in number of bytes
is some multiple of 4.
// `this_chunk_offsets` are offsets that points to bytes in this chunk, | ||
let this_chunk_offsets = offsets | ||
[chunk.chunk_start_offset_in_orig_idx..chunk.chunk_last_offset_in_orig_idx + 1] | ||
.iter() | ||
.map(|offset| { | ||
offset - offsets[chunk.chunk_start_offset_in_orig_idx] | ||
+ chunk.bytes_start_offset as u32 | ||
}) | ||
.collect::<Vec<_>>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears you are copying n+1
offsets and normalizing them so they always start at 0. If you're going to normalize, there is no need to copy the 0.
Alternatively, you could skip the normalization, and copy the n+1 values. The decompressor would then need to subtract the first value from everything.
Not sure which approach is better, and this approach is fine for now, but it is technically doing more work than needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears you are copying n+1 offsets and normalizing them so they always start at 0.
they normalize to chunk.bytes_start_offset
.
I can skip the normalization here but then I also need to store the chunk.bytes_start_offset
in the output
buffer so the decompressor
know the chunk.bytes_start_offset
fn decompress(&self, data: LanceBuffer, num_values: u64) -> Result<DataBlock> { | ||
assert!(data.len() >= 8); | ||
let data = data.to_vec(); | ||
let offsets: &[u32] = cast_slice(&data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be good to use try_cast_slice
instead of cast_slice
.
rust/lance-encoding/src/format.rs
Outdated
@@ -133,6 +133,12 @@ impl ProtobufUtils { | |||
} | |||
} | |||
|
|||
pub fn binary_miniblock() -> ArrayEncoding { | |||
ArrayEncoding { | |||
array_encoding: Some(ArrayEncodingEnum::BinaryMiniblock(BinaryMiniBlock {})), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is because you have BinaryMiniBlock binary_miniblock = 15;
and not BinaryMiniBlock binary_mini_block = 15;
actually, the |
…o binary-miniblock
…o binary-miniblock
after merging PR #3101, I tried to get rid of
and misalignment panic still happens when executing
will deal with it in a separate PR. |
rust/lance-encoding/src/encoder.rs
Outdated
@@ -1429,6 +1444,7 @@ pub mod tests { | |||
assert!(!is_dict_encoding_applicable(vec![Some("a"), Some("a")], 3)); | |||
} | |||
|
|||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove this?
@@ -579,6 +579,7 @@ fn rows_in_buffer( | |||
bits_in_buffer / bits_per_value | |||
} | |||
|
|||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I fixed this (well, more worked around it). Can you uncomment and verify these tests still fail?
This PR enables miniblock encoding with binary data type.