Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Predicate Pushdown for Parquet Lists (#2108) #2999

Merged
merged 5 commits into from
Nov 5, 2022

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Nov 2, 2022

Which issue does this PR close?

Closes #2108

Rationale for this change

This PR adds support for skipping repetition levels by adding a buffer to ColumnLevelDecoderImpl. This buffer is also used to speed up skipping definition levels, as it allows for vectorised decoding.

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Nov 2, 2022
@tustvold tustvold marked this pull request as draft November 2, 2022 19:05
@tustvold
Copy link
Contributor Author

tustvold commented Nov 2, 2022

Will update with rep_levels support

@tustvold tustvold changed the title Add buffer to ColumnLevelDecoderImpl (#2108) Implement skip_rep_levels for ColumnLevelDecoderImpl (#2108) Nov 3, 2022
@tustvold tustvold changed the title Implement skip_rep_levels for ColumnLevelDecoderImpl (#2108) Support Predicate Pushdown for Parquet Lists (#2108) Nov 4, 2022
@tustvold tustvold requested a review from Ted-Jiang November 4, 2022 02:39
@tustvold tustvold marked this pull request as ready for review November 4, 2022 02:39
@tustvold tustvold requested a review from alamb November 4, 2022 02:39
@tustvold
Copy link
Contributor Author

tustvold commented Nov 4, 2022

https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/ may be helpful here

@Ted-Jiang
Copy link
Member

I will review this carefully tomorrow morning!

Copy link
Member

@Ted-Jiang Ted-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement! ❤️ @tustvold

@@ -305,12 +337,25 @@ impl ColumnLevelDecoder for ColumnLevelDecoderImpl {
}
}

fn read(&mut self, out: &mut Self::Slice, range: Range<usize>) -> Result<usize> {
fn read(&mut self, out: &mut Self::Slice, mut range: Range<usize>) -> Result<usize> {
let read_from_buffer = match self.buffer.is_empty() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


loop {
if self.buffer.is_empty() {
// Read SKIP_BUFFER_SIZE as we don't know how many to read
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


// Find end of record
while to_skip < self.buffer.len() && self.buffer[to_skip] != 0 {
to_skip += 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice check !

}

level_skip += to_skip;
if to_skip >= self.buffer.len() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here only need ==

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the hope is that >= helps LLVM elide bound checks that follow. Not sure it makes any difference, just habit 😅

let mut read = 0;
let mut decoded = vec![];
let mut expected = vec![];
while read < encoded.len() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fuzz tests ! 👍

@@ -264,9 +264,13 @@ impl<T: DataType> ColumnValueDecoder for ColumnValueDecoderImpl<T> {
}
}

const SKIP_BUFFER_SIZE: usize = 1024;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tustvold choose SKIP_BUFFER_SIZE: usize = 1024, 1024* i16 just 2Kb, which supported by AVX_256, so this is the best choice 🤔 (i am not familiar with this, could you give me some inforamtion😂)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a somewhat arbitrary number, it tries to strike a balance between small enough to not be overly wasteful but large enough that we don't end up repeatedly reading too few values

@tustvold tustvold merged commit e2c4199 into apache:master Nov 5, 2022
@ursabot
Copy link

ursabot commented Nov 5, 2022

Benchmark runs are scheduled for baseline = fc58036 and contender = e2c4199. e2c4199 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support skip_rep_levels in ColumnLevelDecoder
3 participants