feat: adds list decode support for mini-block encoded data #3241

westonpace · 2024-12-12T23:37:24Z

Lists are encoded using rep/def levels and a repetition index. At decode time we use all of this information to fetch individual ranges of lists.
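As a rough illustration of what repetition levels carry, here is a minimal sketch that turns a flat run of rep levels into list offsets and then maps a range of lists onto a range of items. It assumes a single level of nesting, no null or empty lists (those need the def levels), and the convention used in the illustration further down this thread, where `1` marks the first item of a new list and `0` continues the current one. The function names are hypothetical and are not part of the PR.

```rust
use std::ops::Range;

/// Convert flat repetition levels into Arrow-style list offsets.
/// Assumption: `1` starts a new list, `0` continues the current list,
/// and the input starts on a list boundary.
fn rep_to_offsets(rep: &[u8]) -> Vec<usize> {
    let mut offsets = Vec::new();
    for (item_index, &level) in rep.iter().enumerate() {
        if level == 1 {
            // A new list starts at this item.
            offsets.push(item_index);
        }
    }
    // The final offset closes the last list.
    offsets.push(rep.len());
    offsets
}

/// Map a range of lists (rows) onto the range of items that back them.
fn list_range_to_item_range(offsets: &[usize], lists: Range<usize>) -> Range<usize> {
    offsets[lists.start]..offsets[lists.end]
}
```

With rep levels `1 0 0 1 0` there are two lists, and fetching only the second list touches items 3..5.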

codecov-commenter · 2024-12-13T00:05:54Z

Codecov Report

Attention: Patch coverage is 97.40000% with 39 lines in your changes missing coverage. Please review.

Project coverage is 78.96%. Comparing base (83b8efd) to head (3d46212).
Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
.../lance-encoding/src/encodings/logical/primitive.rs	98.00%	12 Missing and 12 partials ⚠️
rust/lance-encoding/src/encodings/logical/list.rs	68.75%	5 Missing ⚠️
rust/lance-encoding/src/repdef.rs	97.56%	2 Missing and 2 partials ⚠️
rust/lance-encoding/src/testing.rs	0.00%	4 Missing ⚠️
rust/lance-arrow/src/list.rs	98.03%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3241      +/-   ##
==========================================
+ Coverage   78.47%   78.96%   +0.48%     
==========================================
  Files         245      246       +1     
  Lines       85088    86313    +1225     
  Branches    85088    86313    +1225     
==========================================
+ Hits        66772    68153    +1381     
+ Misses      15501    15341     -160     
- Partials     2815     2819       +4

Flag	Coverage Δ
unittests	`78.96% <97.40%> (+0.48%)`	⬆️


wjones127

Nice work! I have a few minor questions.

wjones127 · 2024-12-13T22:57:41Z

rust/lance-encoding/src/encodings/logical/primitive.rs

+        debug_assert!(
+            buf.len()
+                <= bytes_rep
+                                + bytes_def
+                                + bytes_val
+                                + 6
+                                + 1 // P1
+                                + (2 * MINIBLOCK_MAX_PADDING) // P2/P3
+        );


Kinda funky formatting

I think it's because of the inline comments. I can break it up but there is a good chance this particular code will be changing within the month.

wjones127 · 2024-12-13T23:11:50Z

rust/lance-encoding/src/encodings/logical/primitive.rs

-            vals_in_chunk: self.vals_in_chunk,
-            ranges: self.ranges.clone(),
-            vals_targeted: self.vals_targeted,
+// TODO: Add test cases for the all-preamble and all-trailer cases


Is this TODO done? or a follow up?

A follow-up. I'll go through and make follow-ups before I merge, I think there are 2 or 3 in this PR.

wjones127 · 2024-12-13T23:14:38Z

rust/lance-encoding/src/encodings/logical/primitive.rs

+//
+// So if `ChunkInstructions` is "skip preamble, skip 10, take 50, take trailer" and we are decoding in
+// batches of size 10 we might have a `ChunkDrainInstructions` that targets that chunk and has its own
+// skip of 17 and take of 10.  This would mean we decode the chunk, skip the preamble and 27 rows, and


Maybe?

Suggested change

-// skip of 17 and take of 10. This would mean we decode the chunk, skip the preamble and 27 rows, and
+// skip of 17 and take of 10. This would mean we decode the chunk, skip the preamble and 17 rows, and

No, it's additive:

// Our instructions tell us which rows we want to take from this chunk
let row_range_start = instructions.rows_to_skip + instructions.chunk_instructions.rows_to_skip;
let row_range_end = row_range_start + instructions.rows_to_take;

We add instructions.rows_to_skip (17 in the above example) and instructions.chunk_instructions.rows_to_skip (10 in the above example) to get the actual range in the chunk (27..37 in the above example)
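The arithmetic above can be checked with a small self-contained sketch. The two structs below are trimmed-down stand-ins that reuse the names from the PR but keep only the fields needed for this calculation, so they should be read as an assumption about shape rather than the real definitions.

```rust
use std::ops::Range;

/// Stand-in for the per-chunk read plan (only the field used here).
struct ChunkInstructions {
    rows_to_skip: u64,
}

/// Stand-in for the per-batch drain plan that targets one chunk.
struct ChunkDrainInstructions {
    chunk_instructions: ChunkInstructions,
    rows_to_skip: u64,
    rows_to_take: u64,
}

/// The drain skip is relative to the chunk instructions, so the two skips add up.
fn row_range(instructions: &ChunkDrainInstructions) -> Range<u64> {
    let row_range_start =
        instructions.rows_to_skip + instructions.chunk_instructions.rows_to_skip;
    let row_range_end = row_range_start + instructions.rows_to_take;
    row_range_start..row_range_end
}
```

For a chunk skip of 10, a drain skip of 17, and a drain take of 10, `row_range` yields 27..37, matching the example in the comment.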

Oh I see. I was mixing up with the 10 from the take.

wjones127 · 2024-12-13T23:17:08Z

rust/lance-encoding/src/encodings/logical/primitive.rs

+// "no preamble, skip 5, take 10, take trailer" and we are draining 20 rows then the
+// first drain instructions will have no preamble, skip 0, take 11 and the second chunk
+// instructions will have take preamble, skip 0, take 9 (assuming the second chunk has at
+// least 9 rows)


Do we ignore the skips? Also why 11 when the total is take 10? Is that because the trailer is counting as 1 here?

Maybe this could make more sense in some other way but in my head:

The chunk instructions describe a range into the chunk. The drain instructions describe a range into the chunk instructions.

And yes, the total is 11 because of the trailer. When we're actually zoomed into the rep levels and mapping the ranges, the trailer looks like a normal row (e.g. look at the middle chunk below):

1 0 0 0 1 0 | 0 0 1 0 0 1 0 0 1 0 0 1 0 0 | 0 0 1 0

So the logic was (at the time) simpler to just treat it as a row by the time we hit map_ranges. Maybe it would be simpler if we keep it a separate boolean?
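To make the "trailer looks like a normal row" point concrete, here is a small sketch that classifies one chunk of rep levels from the illustration above. It assumes `1` marks the start of a new list and `0` a continuation, as drawn in that illustration, and it takes the first rep level of the following chunk as an input so it can tell whether the last list spills over. `ChunkShape` and `chunk_shape` are hypothetical names, not code from the PR.

```rust
#[derive(Debug, PartialEq)]
struct ChunkShape {
    /// The chunk begins with items that finish a list started in an earlier chunk.
    has_preamble: bool,
    /// Number of lists that start in this chunk. This count includes the trailer,
    /// which is why the trailer looks like an ordinary row at this level.
    rows_started: usize,
    /// The last list started here continues into the next chunk.
    has_trailer: bool,
}

fn chunk_shape(rep: &[u8], next_chunk_first: Option<u8>) -> ChunkShape {
    let has_preamble = rep.first() == Some(&0);
    let rows_started = rep.iter().filter(|&&level| level == 1).count();
    // If the next chunk opens with a continuation, our last row is incomplete.
    let has_trailer = rows_started > 0 && next_chunk_first == Some(0);
    ChunkShape {
        has_preamble,
        rows_started,
        has_trailer,
    }
}
```

For the middle chunk this reports a preamble, four started rows, and a trailer, i.e. three complete rows plus one trailer that is counted like a row.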

I don't think we need a separate boolean, just wanted to clarify the logic here. Also TBH I'm confused about the concept of draining 20 rows from a take 10.

It was meant to be draining 20 rows from a series of chunks starting with a chunk that only has 10 rows and a trailer. However, I cleaned up the wording and made the example more explicit:

// One very confusing bit is that `rows_to_take` includes the trailer. So if we have two chunks:
//  - no preamble, skip 5, take 10, take trailer
//  - take preamble, skip 0, take 50, no trailer
//
// and we are draining 20 rows then the drain instructions for the first batch will be:
//  - no preamble, skip 0 (from chunk 0), take 11 (from chunk 0)
//  - take preamble (from chunk 1), skip 0 (from chunk 1), take 9 (from chunk 1)
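The reworded example can be reproduced with a short sketch. It only plans the first batch, it counts a taken trailer as one row (the row is completed by the next chunk's preamble, which is not counted again), and every type and function name here is a hypothetical simplification rather than the PR's implementation.

```rust
/// Simplified per-chunk plan: `rows_to_take` counts full rows only.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
struct ChunkPlan {
    take_preamble: bool,
    rows_to_skip: u64,
    rows_to_take: u64,
    take_trailer: bool,
}

/// Simplified drain step: `rows_to_take` includes the trailer when it is taken,
/// and `rows_to_skip` is relative to the range already described by the chunk plan.
#[derive(Debug, PartialEq)]
struct DrainStep {
    chunk: usize,
    take_preamble: bool,
    rows_to_skip: u64,
    rows_to_take: u64,
}

fn first_batch(plans: &[ChunkPlan], batch_rows: u64) -> Vec<DrainStep> {
    let mut steps = Vec::new();
    let mut remaining = batch_rows;
    // True when the previous step took a trailer that still needs its preamble.
    let mut pending_trailer = false;
    for (chunk, plan) in plans.iter().enumerate() {
        if remaining == 0 && !pending_trailer {
            break;
        }
        // The trailer is counted as one more row available in this chunk.
        let available = plan.rows_to_take + plan.take_trailer as u64;
        let take = available.min(remaining);
        steps.push(DrainStep {
            chunk,
            take_preamble: plan.take_preamble,
            rows_to_skip: 0,
            rows_to_take: take,
        });
        remaining -= take;
        pending_trailer = plan.take_trailer && take == available;
    }
    steps
}
```

Draining 20 rows from the two chunks in the comment yields a take of 11 from chunk 0 and a take of 9, with preamble, from chunk 1.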

wjones127 · 2024-12-13T23:18:03Z

rust/lance-encoding/src/encodings/logical/primitive.rs

+impl ChunkInstructions {
+    // Given a repetition index and a set of user ranges we need to figure out how to read from the chunks
+    //
+    // We assume that `user_ranges` are in sorted order and non-overlapping


Should we add a debug_assert for this?

Maybe but we should probably put it higher up in decoder.rs. This is an assumption we make throughout the decoders and many of them rely on it.

This comment was less "oh, in this spot we have this extra guarantee" and more "remember that we know this"

If you have something higher up that is fine.
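For reference, a check of this kind could look like the sketch below. It is only an illustration of the invariant being discussed: `is_sorted_and_disjoint` and `rows_requested` are hypothetical helpers, and where such an assertion would live in `decoder.rs` is left open, as in the discussion above.

```rust
use std::ops::Range;

/// Returns true if each range ends at or before the start of the next one,
/// which implies the ranges are sorted by start and do not overlap.
fn is_sorted_and_disjoint(ranges: &[Range<u64>]) -> bool {
    ranges.windows(2).all(|pair| pair[0].end <= pair[1].start)
}

/// Example caller: the assertion is compiled out in release builds.
fn rows_requested(user_ranges: &[Range<u64>]) -> u64 {
    debug_assert!(
        is_sorted_and_disjoint(user_ranges),
        "user_ranges must be sorted and non-overlapping"
    );
    user_ranges.iter().map(|range| range.end - range.start).sum()
}
```

Adjacent ranges such as 0..5 and 5..10 pass the check, while 0..5 followed by 4..10 does not.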

broccoliSpicy · 2024-12-15T01:29:27Z

rust/lance-encoding/src/encodings/logical/list.rs

@@ -1359,6 +1358,9 @@ impl StructuralFieldScheduler for StructuralListScheduler {
    }
 }

+/// Scheduling job for list data
+///
+/// It doesn't really do anything right now because list


Seems like an unfinished sentence?

Oops, yes it is 🤦. Changed to:

/// Scheduling job for list data
///
/// Scheduling is handled by the primitive encoder and nothing special
/// happens here.

github-actions bot added the enhancement New feature or request label Dec 12, 2024

broccoliSpicy approved these changes Dec 13, 2024

wjones127 approved these changes Dec 13, 2024

broccoliSpicy reviewed Dec 15, 2024

westonpace added 4 commits December 16, 2024 10:51

Added the decode path for miniblock & lists

21c1a85

Cleanup

8a595bf

Add license header

cfdf222

Address review comments

d21781d

westonpace force-pushed the feat/v2.1-lists-miniblock branch from 61ef307 to d21781d on December 16, 2024 18:56

westonpace added 2 commits December 16, 2024 10:59

Remove unused default as it was breaking Rust MSRV

9cc1417

More correctly removed not-quite-so-unused default

3d46212

westonpace merged commit 64fcfcc into lancedb:main Dec 17, 2024
26 checks passed


wjones127 Dec 13, 2024

westonpace Dec 14, 2024

wjones127 Dec 13, 2024

westonpace Dec 14, 2024

wjones127 Dec 13, 2024

westonpace Dec 14, 2024

wjones127 Dec 16, 2024

wjones127 Dec 13, 2024

westonpace Dec 14, 2024

westonpace Dec 14, 2024

wjones127 Dec 16, 2024

westonpace Dec 16, 2024

wjones127 Dec 13, 2024

westonpace Dec 14, 2024

wjones127 Dec 16, 2024

broccoliSpicy Dec 15, 2024

westonpace Dec 16, 2024
