-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support skip_def_levels for ColumnLevelDecoder #2111
Conversation
@tustvold @thinkharderdev PTAL 😊, this block the #2106 because all type are R:0 D:1.
|
Codecov Report
@@ Coverage Diff @@
## master #2111 +/- ##
==========================================
- Coverage 82.86% 82.85% -0.02%
==========================================
Files 237 237
Lines 61296 61381 +85
==========================================
+ Hits 50792 50856 +64
- Misses 10504 10525 +21
Help us with your feedback. Take ten seconds to tell us how you rate us. |
@@ -399,6 +442,55 @@ mod tests { | |||
assert_eq!(decoded.as_slice(), expected.as_slice()); | |||
} | |||
|
|||
#[test] | |||
fn test_packed_decoder_skip() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately max_def_level = 1 is an insufficient condition for using a PackedDecoder, as the nulls may not exist at the leaf level.
The way this currently works is the ArrayReader
implementation will pass either a packed or full DefinitionLevelBuffer
to RecordReader
, which will in turn pass this to DefinitionLevelBufferDecoder
. This is a bit of a hack but avoided leaking arrow-specific details into the GenericRecordReader
.
Unfortunately this means that DefinitionLevelBufferDecoder
does not know if it should use a PackedDecoder until it has actually been asked to read data.
I will have a brief play and see if I can sort this out for you, it's my mess to cleanup 😆
) -> Result<(usize, usize)> { | ||
Err(nyi_err!("https://github.com/apache/arrow-rs/issues/1792")) | ||
// For now only support max_def_level == 1 | ||
if max_def_level == 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't technically correct, if you have a nullable StructArray, containing a non-nullable child. The child will have a max definition level of 1, but it will be using ColumnLevelDecoderImpl
not PackedDecoder
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, i understand it 🤔 thanks!
I think when using PackedDecoder
just need the bitmask
of the col.
And the unit of the bitmask
is per row.
So you use the max_def_level == 1
and max_rep_level == 0
as the checker
am i right 😂?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite, we only preserve the bitmask if this is a nullable leaf.
I.e we would preserve the bitmask for this
message Schema {
OPTIONAL int32 field;
}
But not
message Schema {
OPTIONAL GROUP group {
REQUIRED int32 field;
}
}
Despite field having a max_def_level == 1 in both cases.
Following #2117 I would remove all max_def_level
logic, and just match the type of decoder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it
One question: if we preserve the bitmask if this is a nullable leaf
like
message Schema {
OPTIONAL GROUP group {
OPTIONAL int32 field;
}
}
we can not use the PackedDecoder
right ? because the value can be 0,1,2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are decoding the level data, not the value data - see https://akshays-blog.medium.com/wrapping-head-around-repetition-and-definition-levels-in-dremel-powering-bigquery-c1a33c9695da and https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.
In this case the definition level data can only be 0s or 1s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, i will read carefully. Seems you know the everything about parquet 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've created #2116 which should hopefully help with the DefinitionLevelBufferDecoder logic, which I think is currently incorrect in this PR. |
#2117 has now been merged which should make it possible to significantly simplify this PR (i.e. remove the max_def_level shenanigans) |
@Ted-Jiang I meant #2116 |
for _ in 0..num_levels { | ||
// Values are delimited by max_def_level | ||
if max_def_level | ||
== reader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will test in #2106
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, just some minor nits
let indicator_value = self.decode_header()?; | ||
let indicator_value = self | ||
.decode_header() | ||
.expect("decode_header fail in PackedDecoder"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before met not set decode_header
when coding. Better have the error info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before we propogated the error to the caller, instead of panicking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! my fault. it should use previous code pass the error. fix in fix comment
Benchmark runs are scheduled for baseline = 7746e7d and contender = fce6626. fce6626 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #2107 .
Rationale for this change
This pr only support skip def_levels in
DefinitionLevelBufferDecoder
,@tustvold Just make sure all page with max_def_level = 1 and max_rep_level = 0,
will only use
DefinitionLevelBufferDecoder
decode def_level.In another word, this will support the skip val in col like
OPTIONAL type R:0 D:1
Added :
Seems it still use
ColumnLevelDecoderImpl
in some case facingR:0 D:1
when decoding some col`s def_level
What changes are included in this PR?
Are there any user-facing changes?