Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: adds list decode support for mini-block encoded data #3241

Merged
merged 6 commits into from
Dec 17, 2024

Conversation

westonpace
Copy link
Contributor

Lists are encoded using rep/def levels and a repetition index. At decode time we take all this information to be able to fetch individual ranges of lists.

@github-actions github-actions bot added the enhancement New feature or request label Dec 12, 2024
@codecov-commenter
Copy link

codecov-commenter commented Dec 13, 2024

Codecov Report

Attention: Patch coverage is 97.40000% with 39 lines in your changes missing coverage. Please review.

Project coverage is 78.96%. Comparing base (83b8efd) to head (3d46212).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
.../lance-encoding/src/encodings/logical/primitive.rs 98.00% 12 Missing and 12 partials ⚠️
rust/lance-encoding/src/encodings/logical/list.rs 68.75% 5 Missing ⚠️
rust/lance-encoding/src/repdef.rs 97.56% 2 Missing and 2 partials ⚠️
rust/lance-encoding/src/testing.rs 0.00% 4 Missing ⚠️
rust/lance-arrow/src/list.rs 98.03% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3241      +/-   ##
==========================================
+ Coverage   78.47%   78.96%   +0.48%     
==========================================
  Files         245      246       +1     
  Lines       85088    86313    +1225     
  Branches    85088    86313    +1225     
==========================================
+ Hits        66772    68153    +1381     
+ Misses      15501    15341     -160     
- Partials     2815     2819       +4     
Flag Coverage Δ
unittests 78.96% <97.40%> (+0.48%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I have a few minor questions.

Comment on lines +605 to +613
debug_assert!(
buf.len()
<= bytes_rep
+ bytes_def
+ bytes_val
+ 6
+ 1 // P1
+ (2 * MINIBLOCK_MAX_PADDING) // P2/P3
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kinda funky formatting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's because of the inline comments. I can break it up but there is a good chance this particular code will be changing within the month.

vals_in_chunk: self.vals_in_chunk,
ranges: self.ranges.clone(),
vals_targeted: self.vals_targeted,
// TODO: Add test cases for the all-preamble and all-trailer cases
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this TODO done? or a follow up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A follow-up. I'll go through and make follow-ups before I merge, I think there are 2 or 3 in this PR.

//
// So if `ChunkInstructions` is "skip preamble, skip 10, take 50, take trailer" and we are decoding in
// batches of size 10 we might have a `ChunkDrainInstructions` that targets that chunk and has its own
// skip of 17 and take of 10. This would mean we decode the chunk, skip the preamble and 27 rows, and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe?

Suggested change
// skip of 17 and take of 10. This would mean we decode the chunk, skip the preamble and 27 rows, and
// skip of 17 and take of 10. This would mean we decode the chunk, skip the preamble and 17 rows, and

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's additive:

            // Our instructions tell us which rows we want to take from this chunk
            let row_range_start =
                instructions.rows_to_skip + instructions.chunk_instructions.rows_to_skip;
            let row_range_end = row_range_start + instructions.rows_to_take;

We add instructions.rows_to_skip (17 in the above example) and instructions.chunk_instructions.rows_to_skip (10 in the above example) to get the actual range in the chunk (27..37 in the above example)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. I was mixing up with the 10 from the take.

Comment on lines 1240 to 1243
// "no preamble, skip 5, take 10, take trailer" and we are draining 20 rows then the
// first drain instructions will have no preamble, skip 0, take 11 and the second chunk
// instructions will have take preamble, skip 0, take 9 (assuming the second chunk has at
// least 9 rows)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we ignore the skips? Also why 11 when the total is take 10? Is that because the trailer is counting as 1 here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this could make more sense in some other way but in my head:

The chunk instructions describe a range into the chunk. The drain instructions describe a range into the chunk instructions:

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yes, the total is 11 because of the trailer. When we're acutally zoomed into the rep levels and mapping the ranges the trailer looks like a normal row (e.g. look at the middle chunk below):

1 0 0 0 1 0 | 0 0 1 0 0 1 0 0 1 0 0 1 0 0 | 0 0 1 0

So the logic was (at the time) simpler to just treat it as a row by the time we hit map_ranges. Maybe it would be simpler if we keep it a separate boolean?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need a separate boolean, just wanted to clarify the logic here. Also TBH I'm confused about the concept of draining 20 rows from a take 10.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was meant to be draining 20 rows from a series of chunks starting with a chunk that only has 10 rows and a trailer. However, I cleaned up the wording and made the example more explicit:

// One very confusing bit is that `rows_to_take` includes the trailer.  So if we have two chunks:
//  -no preamble, skip 5, take 10, take trailer
//  -take preamble, skip 0, take 50, no trailer
//
// and we are draining 20 rows then the drain instructions for the first batch will be:
//  - no preamble, skip 0 (from chunk 0), take 11 (from chunk 0)
//  - take preamble (from chunk 1), skip 0 (from chunk 1), take 9 (from chunk 1)

impl ChunkInstructions {
// Given a repetition index and a set of user ranges we need to figure out how to read from the chunks
//
// We assume that `user_ranges` are in sorted order and non-overlapping
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a debug_assert for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe but we should probably put it higher up in decoder.rs. This is an assumption we make throughout the decoders and many of them rely on it.

This comment was less "oh, in this spot we have this extra guarantee" and more "remember that we know this"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have something higher up that is fine.

@@ -1359,6 +1358,9 @@ impl StructuralFieldScheduler for StructuralListScheduler {
}
}

/// Scheduling job for list data
///
/// It doesn't really do anything right now because list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like a unfinished sentence?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, yes it is 🤦. Changed to:

/// Scheduling job for list data
///
/// Scheduling is handled by the primitive encoder and nothing special
/// happens here.

@westonpace westonpace force-pushed the feat/v2.1-lists-miniblock branch from 61ef307 to d21781d Compare December 16, 2024 18:56
@westonpace westonpace merged commit 64fcfcc into lancedb:main Dec 17, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants