feat: add full zip encoding for wide data types #3114

Merged: 3 commits, Nov 13, 2024

Conversation

westonpace
Contributor

The encoding is only tested on tensors for now. It should be able to encode variable-width data but, without a repetition index, we cannot yet schedule or decode variable-width data. In addition, I've created a few TODOs for follow-up.

@github-actions github-actions bot added the enhancement New feature or request label Nov 11, 2024
@westonpace westonpace force-pushed the feat/full-zip-encoder branch from 34d793d to bb03752 Compare November 11, 2024 14:14
codecov-commenter commented Nov 11, 2024

Codecov Report

Attention: Patch coverage is 77.96178% with 173 lines in your changes missing coverage. Please review.

Project coverage is 77.18%. Comparing base (961cd95) to head (7d8a714).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-encoding/src/repdef.rs | 69.45% | 113 Missing ⚠️ |
| .../lance-encoding/src/encodings/logical/primitive.rs | 82.77% | 36 Missing and 10 partials ⚠️ |
| ...encoding/src/encodings/physical/fixed_size_list.rs | 86.27% | 5 Missing and 2 partials ⚠️ |
| rust/lance-encoding/src/encoder.rs | 73.68% | 4 Missing and 1 partial ⚠️ |
| rust/lance-core/src/utils/bit.rs | 96.66% | 1 Missing ⚠️ |
| rust/lance-encoding/src/decoder.rs | 88.88% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3114      +/-   ##
==========================================
+ Coverage   77.15%   77.18%   +0.03%     
==========================================
  Files         240      240              
  Lines       80759    81517     +758     
  Branches    80759    81517     +758     
==========================================
+ Hits        62309    62920     +611     
- Misses      15278    15385     +107     
- Partials     3172     3212      +40     
| Flag | Coverage Δ |
|---|---|
| unittests | 77.18% <77.96%> (+0.03%) ⬆️ |

@@ -223,6 +226,7 @@ impl LanceBuffer {
///
/// If the underlying buffer is not properly aligned, this will involve a copy of the data
pub fn borrow_to_typed_slice<T: ArrowNativeType>(&mut self) -> impl AsRef<[T]> {
check_little_endian();
Contributor

How about we just disallow building Lance when the targeted platform is not little endian?
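
For context on why endianness matters for a function like `borrow_to_typed_slice`, here is a minimal sketch of the "borrow if aligned, otherwise copy" pattern that the doc comment describes. The function name `bytes_as_u32s` and the use of `Cow` are assumptions made for illustration; this is not the code from the PR.

```rust
use std::borrow::Cow;

/// Hypothetical helper: view a byte slice as `u32` values.
///
/// If the bytes are suitably aligned and the length is a multiple of 4, the
/// data is borrowed in place. Otherwise the values are copied into a new Vec
/// (any trailing bytes that do not form a full `u32` are dropped).
pub fn bytes_as_u32s(bytes: &[u8]) -> Cow<'_, [u32]> {
    // SAFETY: every bit pattern is a valid `u32`, so reinterpreting the
    // aligned middle section of the byte slice is sound.
    let (prefix, middle, suffix) = unsafe { bytes.align_to::<u32>() };
    if prefix.is_empty() && suffix.is_empty() {
        // Zero-copy path: the bytes are read in the machine's native order.
        // This only matches the little-endian copy path below on a
        // little-endian target, which is why such code needs an endian check.
        Cow::Borrowed(middle)
    } else {
        // Copy path: decode each 4-byte chunk explicitly as little-endian.
        Cow::Owned(
            bytes
                .chunks_exact(4)
                .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
                .collect(),
        )
    }
}
```

On a little-endian machine both paths yield the same values, so callers cannot tell whether a copy happened. On a big-endian machine the borrowed path would silently return byte-swapped values, which is the failure a build-time restriction would rule out.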

let control_word: u32 = (((next.0 & self.rep_mask) as u32) << self.def_width)
+ ((next.1 & self.def_mask) as u32);
let control_word = control_word.to_le_bytes();
buf.push(control_word[0]);
Contributor

Is there any reason not to use extend_from_slice here?

buf.extend_from_slice(&control_word.to_le_bytes())

Contributor Author

Maybe I'm just being paranoid. However, my reasoning was that buf.extend_from_slice will invoke a memcpy where buf.push should not. However, there may be a fast path in extend_from_slice somewhere or it may get optimized away or...
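
To make the two options in this thread concrete, here is a minimal sketch that packs a repetition/definition pair into a control word the same way as the snippet under review, then appends its low bytes both ways. The function names and the `num_bytes` parameter are assumptions for illustration; the PR's actual encoder is structured differently. Whether one form is faster is a question for a benchmark or the generated assembly, which this sketch does not settle.

```rust
/// Packs a repetition level and a definition level into one control word.
/// The masked definition level occupies the low `def_width` bits and the
/// masked repetition level sits directly above it.
pub fn pack_control_word(rep: u16, def: u16, rep_mask: u16, def_mask: u16, def_width: u32) -> u32 {
    (((rep & rep_mask) as u32) << def_width) + ((def & def_mask) as u32)
}

/// Appends the low `num_bytes` bytes of `word` (little-endian), one push at a time.
/// `num_bytes` is a hypothetical parameter and must be at most 4.
pub fn append_with_push(buf: &mut Vec<u8>, word: u32, num_bytes: usize) {
    let bytes = word.to_le_bytes();
    for b in bytes.iter().take(num_bytes) {
        buf.push(*b);
    }
}

/// Appends the same bytes with a single `extend_from_slice` call.
/// Note the `&`: `to_le_bytes` returns an array, and a slice is required.
pub fn append_with_extend(buf: &mut Vec<u8>, word: u32, num_bytes: usize) {
    let bytes = word.to_le_bytes();
    buf.extend_from_slice(&bytes[..num_bytes]);
}
```

Both helpers leave identical bytes in the buffer, so the choice between them is purely about performance and readability.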

@broccoliSpicy broccoliSpicy left a comment

Great work!

@@ -225,8 +222,8 @@ impl LanceBuffer {
/// Reinterprets a LanceBuffer into a Vec<T>
///
/// If the underlying buffer is not properly aligned, this will involve a copy of the data
#[cfg(target_endian = "little")]
Contributor

I would prefer to enforce the machine-endianness requirement somewhere more visible, for example in a build.rs and/or Cargo.toml at the project root.

Contributor Author

Alright, I put a directive in lib.rs. I prefer to avoid fiddling with build.rs if possible.
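
A crate-level directive of this kind could look roughly like the sketch below. The error message and the `is_little_endian_target` helper are assumptions for illustration; the exact wording in the PR's lib.rs is not shown in this thread.

```rust
// Hypothetical crate-root guard (e.g. near the top of lib.rs).
// On any target that is not little-endian, compilation stops with this message,
// so no build.rs is needed.
#[cfg(not(target_endian = "little"))]
compile_error!("this crate currently supports only little-endian targets");

/// Hypothetical helper that exposes the same compile-time fact as a value,
/// which can be handy in tests or diagnostics.
pub const fn is_little_endian_target() -> bool {
    cfg!(target_endian = "little")
}
```

Because `compile_error!` is evaluated at compile time, individual functions would no longer need their own `#[cfg(target_endian = "little")]` attributes or runtime checks.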

@westonpace westonpace force-pushed the feat/full-zip-encoder branch from e903775 to 7d8a714 Compare November 13, 2024 14:43
@westonpace westonpace merged commit ec76db4 into lancedb:main Nov 13, 2024
26 checks passed