Simplify parquet filter predicate test, fix Page Filtering Incorrectly Handles Pages with Different Row Counts #4743

tustvold · 2022-12-27T13:11:19Z

Which issue does this PR close?

Closes #4744

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

tustvold · 2022-12-27T13:12:46Z

datafusion/core/tests/parquet/filter_pushdown.rs


-    let mut set = tokio::task::JoinSet::new();


I reverted the JoinSet change because the test now runs in under 2 seconds. The JoinSet also made test failures hard to debug: the stack trace pointed to where the JoinHandle is unwrapped rather than to the case that actually failed, and the failures were non-deterministic.

cc @waynexia

Looks good to me. It also took me some time to figure out which sub-case failed. If the run time is now much shorter, I agree we don't need to pay the cost of parallelism.

datafusion/core/tests/parquet/filter_pushdown.rs

tustvold · 2022-12-27T13:33:23Z

datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs

@@ -251,7 +251,7 @@ fn prune_pages_in_one_row_group(
                let mut sum_row = *row_vec.first().unwrap();
                let mut selected = *values.first().unwrap();
                trace!("Pruned to to {:?} using {:?}", values, pruning_stats);
-                for (i, &f) in values.iter().skip(1).enumerate() {
+                for (i, &f) in values.iter().enumerate().skip(1) {


This is the cause of #4744
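To make the one-line change above concrete, here is a minimal, simplified sketch of the run-merging loop. It is not the actual DataFusion code: `merge_runs`, `merge_buggy`, `merge_fixed`, and the `(bool, usize)` run representation are hypothetical stand-ins for the real `RowSelector` logic, and the inputs are assumed to be non-empty and of equal length. With `skip(1).enumerate()` the index restarts at 0, so each page is paired with the row count of the previous page; with `enumerate().skip(1)` the index stays aligned with the page.

```rust
/// Collapse per-page pruning results into (selected, row_count) runs.
/// `rest` yields (index, selected) pairs for every page after the first;
/// the index is used to look up that page's row count.
fn merge_runs(
    first: (bool, usize),
    rest: impl Iterator<Item = (usize, bool)>,
    row_counts: &[usize],
) -> Vec<(bool, usize)> {
    let (mut selected, mut sum_row) = first;
    let mut runs = Vec::new();
    for (i, f) in rest {
        if f == selected {
            // Same decision as the current run: extend it.
            sum_row += row_counts[i];
        } else {
            // Decision flipped: close the current run and start a new one.
            runs.push((selected, sum_row));
            sum_row = row_counts[i];
            selected = f;
        }
    }
    runs.push((selected, sum_row));
    runs
}

/// Old ordering: the index restarts at 0 after `skip(1)`,
/// so page 1 is paired with the row count of page 0, and so on.
fn merge_buggy(values: &[bool], row_counts: &[usize]) -> Vec<(bool, usize)> {
    let rest = values.iter().copied().skip(1).enumerate();
    merge_runs((values[0], row_counts[0]), rest, row_counts)
}

/// New ordering: indices are assigned first, then the first page is skipped,
/// so every page is paired with its own row count.
fn merge_fixed(values: &[bool], row_counts: &[usize]) -> Vec<(bool, usize)> {
    let rest = values.iter().copied().enumerate().skip(1);
    merge_runs((values[0], row_counts[0]), rest, row_counts)
}
```

For pages with row counts `[10, 20, 30]` and pruning results `[true, true, false]`, `merge_fixed` yields `[(true, 30), (false, 30)]`, covering all 60 rows, while `merge_buggy` yields `[(true, 20), (false, 20)]`. When every page has the same row count the two orderings agree, which would explain why the problem only shows up for pages with different row counts.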

alamb

Thank you @tustvold

alamb · 2022-12-27T13:56:58Z

datafusion/core/tests/parquet/filter_pushdown.rs


-    let mut set = tokio::task::JoinSet::new();


cc @waynexia

alamb · 2022-12-27T13:58:55Z

datafusion/core/tests/parquet/filter_pushdown.rs

        .with_filter(
            conjunction(vec![
-                col("request_bytes").gt(lit(2000000000)),
+                col("client_addr").eq(lit("58.242.143.99")),


Is pruning with a non-equality predicate on a non-dictionary-encoded column covered elsewhere?

Ted-Jiang

Thanks for this 👍
I have been down with COVID these past few days, sorry for the late response.

tustvold · 2022-12-28T17:12:42Z

I'm looking into the other test failures; I'm hoping the tests are wrong.

tustvold · 2022-12-28T17:42:33Z

datafusion/core/tests/parquet/page_pruning.rs

-    //  vec.push(RowSelector::skip(894));
-    //  vec.push(RowSelector::select(339));
-    //  vec.push(RowSelector::skip(3330));
+    // `month = 1` or `month = 2` from the page index should create below RowSelection


These tests were simply wrong. I have fixed them, using apache/arrow-rs#3405 to verify that the new expectations are correct.

ursabot · 2022-12-28T22:12:17Z

Benchmark runs are scheduled for baseline = 3abbffb and contender = 40bf559. 40bf559 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Simplify filter predicate test

2aedc3a

github-actions bot added the core Core DataFusion crate label Dec 27, 2022

tustvold commented Dec 27, 2022

tustvold changed the title ~~Simplify filter predicate test~~ Simplify parquet filter predicate test Dec 27, 2022

Fix pruning logic (apache#4744)

523466c

tustvold commented Dec 27, 2022

tustvold marked this pull request as ready for review December 27, 2022 13:35

Format

6654e44

tustvold force-pushed the simplify-filter-predicate-test branch from 36dcfd0 to 6654e44 Compare December 27, 2022 13:35

tustvold requested a review from alamb December 27, 2022 13:36

tustvold mentioned this pull request Dec 27, 2022

refactor: parallelize parquet_exec test case single_file #4735

Merged

alamb changed the title ~~Simplify parquet filter predicate test~~ Simplify parquet filter predicate test, fix Page Filtering Incorrectly Handles Pages with Different Row Counts Dec 27, 2022

alamb requested a review from Ted-Jiang December 27, 2022 13:55

alamb approved these changes Dec 27, 2022

Ted-Jiang approved these changes Dec 28, 2022

tustvold added 2 commits December 28, 2022 13:59

Fix oom_sort

4b1058d

Merge remote-tracking branch 'upstream/master' into simplify-filter-p…

9a4bd13

…redicate-test

Fix tests

f5bdd3f

tustvold commented Dec 28, 2022

tustvold added 3 commits December 28, 2022 17:45

Format

427672c

Clippy

c95ca24

Clippy

cc5336d

tustvold merged commit 40bf559 into apache:master Dec 28, 2022
