-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Predicate Pushdown for Parquet Lists (#2108) #2999
Support Predicate Pushdown for Parquet Lists (#2108) #2999
Conversation
Will update with rep_levels support |
https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/ may be helpful here |
I will review this carefully tomorrow morning! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice improvement! ❤️ @tustvold
@@ -305,12 +337,25 @@ impl ColumnLevelDecoder for ColumnLevelDecoderImpl { | |||
} | |||
} | |||
|
|||
fn read(&mut self, out: &mut Self::Slice, range: Range<usize>) -> Result<usize> { | |||
fn read(&mut self, out: &mut Self::Slice, mut range: Range<usize>) -> Result<usize> { | |||
let read_from_buffer = match self.buffer.is_empty() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
loop { | ||
if self.buffer.is_empty() { | ||
// Read SKIP_BUFFER_SIZE as we don't know how many to read |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
// Find end of record | ||
while to_skip < self.buffer.len() && self.buffer[to_skip] != 0 { | ||
to_skip += 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice check !
} | ||
|
||
level_skip += to_skip; | ||
if to_skip >= self.buffer.len() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here only need ==
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the hope is that >= helps LLVM elide bound checks that follow. Not sure it makes any difference, just habit 😅
let mut read = 0; | ||
let mut decoded = vec![]; | ||
let mut expected = vec![]; | ||
while read < encoded.len() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice fuzz tests ! 👍
@@ -264,9 +264,13 @@ impl<T: DataType> ColumnValueDecoder for ColumnValueDecoderImpl<T> { | |||
} | |||
} | |||
|
|||
const SKIP_BUFFER_SIZE: usize = 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tustvold choose SKIP_BUFFER_SIZE: usize = 1024
, 1024* i16 just 2Kb, which supported by AVX_256, so this is the best choice 🤔 (i am not familiar with this, could you give me some inforamtion😂)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a somewhat arbitrary number, it tries to strike a balance between small enough to not be overly wasteful but large enough that we don't end up repeatedly reading too few values
Benchmark runs are scheduled for baseline = fc58036 and contender = e2c4199. e2c4199 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #2108
Rationale for this change
This PR adds support for skipping repetition levels by adding a buffer to ColumnLevelDecoderImpl. This buffer is also used to speed up skipping definition levels, as it allows for vectorised decoding.
What changes are included in this PR?
Are there any user-facing changes?