Support List and LargeList in Row format (#3159) #3251
///
/// ```text
///                    ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
/// [1_u8, 2_u8, 3_u8] │01│01│01│02│01│03│00│00│00│02│00│00│00│02│00│00│00│02│00│00│00│03│
I get the `│01│01│01│02│01│03│` prefix and the `│00│00│00│03│` suffix. But where do the other bytes come from?
They are the lengths of each encoded row, followed by the number of elements.
So basically `00│00│00│02│00│00│00│02│00│00│00│02│00│00│00│03` represents `[element0_len, element1_len, element2_len, element count]` --> `[2, 2, 2, 3]`
maybe we can add this explanation to the docstring?
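The layout discussed in this thread can be checked with a small sketch. It assumes, based only on the bytes in the docstring example, that each length and the final element count is a 4-byte big-endian integer. `EXAMPLE_ROW`, `be_u32`, and `split_list_row` are hypothetical names for illustration and are not part of the arrow-rs API.

```rust
/// The encoded row for `[1_u8, 2_u8, 3_u8]`, copied from the docstring example.
const EXAMPLE_ROW: [u8; 22] = [
    0x01, 0x01, 0x01, 0x02, 0x01, 0x03, // three 2-byte element encodings
    0x00, 0x00, 0x00, 0x02, // element0_len
    0x00, 0x00, 0x00, 0x02, // element1_len
    0x00, 0x00, 0x00, 0x02, // element2_len
    0x00, 0x00, 0x00, 0x03, // element count
];

/// Reads a 4-byte big-endian integer (width and byte order are assumptions).
fn be_u32(bytes: &[u8]) -> u32 {
    u32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]])
}

/// Hypothetical helper: splits an encoded list row into its element
/// encodings and the trailing lengths, assuming the layout
/// `[elements..., len_0, ..., len_{n-1}, n]`.
fn split_list_row(row: &[u8]) -> (Vec<&[u8]>, Vec<u32>) {
    // The last four bytes hold the element count.
    let count_start = row.len() - 4;
    let count = be_u32(&row[count_start..]) as usize;

    // The `count` lengths sit immediately before the element count.
    let lengths_start = count_start - 4 * count;
    let lengths: Vec<u32> = row[lengths_start..count_start]
        .chunks_exact(4)
        .map(be_u32)
        .collect();

    // Walk the prefix using the lengths to recover each element encoding.
    let mut elements = Vec::with_capacity(count);
    let mut offset = 0;
    for &len in &lengths {
        let end = offset + len as usize;
        elements.push(&row[offset..end]);
        offset = end;
    }
    (elements, lengths)
}
```

Calling `split_list_row(&EXAMPLE_ROW)` recovers the lengths `[2, 2, 2]` and the three two-byte elements `01 01`, `01 02`, and `01 03`, matching the `[2, 2, 2, 3]` reading above.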
Co-authored-by: Marco Neumann
I went through the code and tests carefully. It is a little mind bending but I think it is very nicely done 🏆
!= sort_field.options.descending,
};

let field = SortField::new_with_options(f.data_type().clone(), options);
I don't understand how descending sort is achieved if the list is always encoded with descending false 🤔
However, I see it is tested below, so 👍
let options = SortOptions {
descending: true,
nulls_first: false,
};
Only the elements are encoded with descending false; they are then encoded using variable-length encoding, which may reorder them. Yes, it is mind-bending 😅
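A rough way to picture this two-stage scheme is the sketch below. It rests on an assumption this thread does not confirm: that the outer stage produces descending order by inverting every byte of the already-encoded elements. It also covers only keys of equal length, whereas the real variable-length encoding must handle differing lengths as well. `invert` and `encode_list` are hypothetical helpers, and the `0x01` marker before each value simply mirrors the byte pattern in the docstring example.

```rust
/// Hypothetical helper: flips every byte, which reverses byte-wise
/// comparison between keys of equal length.
fn invert(bytes: &[u8]) -> Vec<u8> {
    bytes.iter().map(|&b| !b).collect()
}

/// Hypothetical helper: encodes each `u8` element ascending as
/// `[0x01, value]`, then applies the requested direction to the whole
/// buffer in a single outer step.
fn encode_list(values: &[u8], descending: bool) -> Vec<u8> {
    // Inner stage: elements are always written in ascending form.
    let ascending: Vec<u8> = values.iter().flat_map(|&v| [0x01, v]).collect();
    // Outer stage: the sort direction is applied once, to the whole buffer.
    if descending {
        invert(&ascending)
    } else {
        ascending
    }
}
```

With these helpers, `encode_list(&[32, 42], false)` compares less than `encode_list(&[32, 52], false)`, while the same lists encoded with `descending` set to `true` compare the other way round, even though the elements themselves were never encoded descending.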
builder.values().append_value(32);
builder.values().append_value(52);
builder.append(true);
builder.values().append_value(32); // MASKED
Masked means this row is NULL, so these values should be ignored.

assert!(rows.row(0) < rows.row(1)); // [32, 52, 32] < [32, 52, 12]
assert!(rows.row(2) > rows.row(1)); // [32, 42] > [32, 52, 12]
assert!(rows.row(3) < rows.row(2)); // null < [32, 42]
👍
Thank you for the comments -- they make the tests easy to follow
// ]
let options = SortOptions {
    descending: false,
    nulls_first: true,
I recommend adding a test for nested lists with `nulls_first: false`, and verifying that `assert!(rows.row(0) < rows.row(1));` holds.
Co-authored-by: Andrew Lamb
Benchmark runs are scheduled for baseline = de3828c and contender = 9833288. 9833288 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Part of #3159
Rationale for this change
The longer-term goal is to make the row format support enough types to be used in DataFusion, so that it can serve as the basis for a unified GroupBy operation.
What changes are included in this PR?
Add support for encoding/decoding lists from the row format
Are there any user-facing changes?