-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add full zip encoding for wide data types #3114
Conversation
34d793d
to
bb03752
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3114 +/- ##
==========================================
+ Coverage 77.15% 77.18% +0.03%
==========================================
Files 240 240
Lines 80759 81517 +758
Branches 80759 81517 +758
==========================================
+ Hits 62309 62920 +611
- Misses 15278 15385 +107
- Partials 3172 3212 +40
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
rust/lance-encoding/src/buffer.rs
Outdated
@@ -223,6 +226,7 @@ impl LanceBuffer { | |||
/// | |||
/// If the underlying buffer is not properly aligned, this will involve a copy of the data | |||
pub fn borrow_to_typed_slice<T: ArrowNativeType>(&mut self) -> impl AsRef<[T]> { | |||
check_little_endian(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we just disallow building Lance when the targeted platform is not little endian?
let control_word: u32 = (((next.0 & self.rep_mask) as u32) << self.def_width) | ||
+ ((next.1 & self.def_mask) as u32); | ||
let control_word = control_word.to_le_bytes(); | ||
buf.push(control_word[0]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any reason we don't like to use extend_from_slice?
buf.extend_from_slice(control_word.to_le_bytes())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm just being paranoid. However, my reasoning was that buf.extend_from_slice
will invoke a memcpy
where buf.push
should not. However, there may be a fast path in extend_from_slice
somewhere or it may get optimized away or...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
rust/lance-encoding/src/buffer.rs
Outdated
@@ -225,8 +222,8 @@ impl LanceBuffer { | |||
/// Reinterprets a LanceBuffer into a Vec<T> | |||
/// | |||
/// If the underlying buffer is not properly aligned, this will involve a copy of the data | |||
#[cfg(target_endian = "little")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to have this enforcing machine endian
logic somewhere more public, for example a build.rs
and/or Cargo.toml
in project root.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I put a directive in lib.rs
. I prefer to avoid fiddling with build.rs
if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e903775
to
7d8a714
Compare
The encoding is only tested on tensors for now. It should encode variable-width data but, without a repetition index, we are not yet able to schedule / decode variable width data. In addition, I've created a few todos for follow-up.