-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support "A column is known to be entirely NULL" in PruningPredicate
#9223
Support "A column is known to be entirely NULL" in PruningPredicate
#9223
Conversation
A column is known to be entirely NULL
in PruningPredicate
PruningPredicate
69bb592
to
db95018
Compare
…row_count THEN false ELSE ... END`
…counts chore: fix pruning_predicate in slt tests chore: clippy
f652f74
to
750cb16
Compare
The major feature is implemented, but there are two things I plan to do before I open it for review:
I plan to continue the work on the week of March 4, 2024 |
Picking up this PR again. I'm able to test this branch against InfluxDB IOx by using the new |
I plan to review this PR later today |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @appletreeisyellow -- I reviewed this PR quite carefully and I think it looks really nice.
I had a few minor comment suggestions that I view as optional.
I think there is one more important test to add(provide row counts, but not null counts) but otherwise this PR looks ready to go to me.
Also, thank you for very dilligently updating the comments to reflect the new field
cc @viirya as I believe you have previously been interested in the pruning logic
END \ | ||
AND (\ | ||
CASE \ | ||
WHEN c2_null_count@5 = c2_row_count@6 THEN false \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice in the future to combine these clauses (so we didn't have the repeated CASE expression) but for now I think this is good enought
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Tracked in #9171 (comment)
@@ -138,7 +138,7 @@ physical_plan | |||
SortPreservingMergeExec: [column1@0 ASC NULLS LAST] | |||
--CoalesceBatchesExec: target_batch_size=8192 | |||
----FilterExec: column1@0 != 42 | |||
------ParquetExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:0..202], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..207], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:207..414], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:202..405]]}, projection=[column1], output_ordering=[column1@0 ASC NULLS LAST], predicate=column1@0 != 42, pruning_predicate=column1_min@0 != 42 OR 42 != column1_max@1, required_guarantees=[column1 not in (42)] | |||
------ParquetExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:0..202], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..207], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:207..414], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:202..405]]}, projection=[column1], output_ordering=[column1@0 ASC NULLS LAST], predicate=column1@0 != 42, pruning_predicate=CASE WHEN column1_null_count@2 = column1_row_count@3 THEN false ELSE column1_min@0 != 42 OR 42 != column1_max@1 END, required_guarantees=[column1 not in (42)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 now that the pruning predicate is getting more complex, perhaps we should not display it by default anymore in explain plans. Maybe we can add a config option (as a follow on PR) that is disabled by default 🤔 to control if it is displayed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Tracked in #9171 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default, we can show part of pruning predicate (i.e., truncated one) so users can know there is pruning predicate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of showing truncated pruning predicate by default. I updated this idea in #9171 (comment)
Some(0), // no nulls | ||
Some(1), // 1 null | ||
None, // unknown nulls | ||
Some(4), // 4 nulls, which is the same as the row counts, i.e. this column is all null (don't keep) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
// i | [-11,-1] | Unknown | Unknown | ==> All rows must pass (must keep) | ||
// i | [NULL, NULL] | 4 | 4 | ==> The column is all null (not keep) | ||
// i | [1, NULL] | 10 | 0 | ==> No rows can pass (not keep) | ||
let expected_ret = &[true, false, true, false, false]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the 4th element in this array is false
which is different than the 4th element when null counts aren't known . 👍
Co-authored-by: Andrew Lamb <[email protected]> docs: take more feedback
0977d1b
to
d69202a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great to me -- thank you vevry much @appletreeisyellow
I plan to merge this PR on Monday unless anyone else would like more time to comment |
@@ -1320,14 +1457,56 @@ fn build_statistics_expr( | |||
); | |||
} | |||
}; | |||
let statistics_expr = wrap_case_expr(statistics_expr, expr_builder)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is no null_count or row_count statistics, do we still need to rewrite it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theoretically, if there is no null_count
or row_count
statistics, we don't need to rewrite it. Practically, we don't know about null_count
or row_count
statistics when the rewrite takes place, because the rewrite happens in PruningPredicate
first and then null_count
and row_count
statistics come in later in PruningStatistics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, then I think the rules order could be changed? It sounds not making sense to rewrite predicates based on statistics before the statistics are ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, I think the doc I wrote was not clear, I updated the doc in a8639fb. What do you think now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for updating it. Looks better now.
Thank you @appletreeisyellow and @viirya for the review |
/// `x = 5` | `CASE WHEN x_null_count = x_row_count THEN false ELSE x_min <= 5 AND 5 <= x_max END` | ||
/// `x < 5` | `CASE WHEN x_null_count = x_row_count THEN false ELSE x_max < 5 END` | ||
/// `x = 5 AND y = 10` | `CASE WHEN x_null_count = x_row_count THEN false ELSE x_min <= 5 AND 5 <= x_max END AND CASE WHEN y_null_count = y_row_count THEN false ELSE y_min <= 10 AND 10 <= y_max END` | ||
/// `x IS NULL` | `CASE WHEN x_null_count = x_row_count THEN false ELSE x_null_count > 0 END` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@appletreeisyellow @alamb Sorry i am confused here, i think here just need x IS NULL
| x_null_count > 0 END
🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are right @Ted-Jiang. Nicely spotted -- I double checked and indeed all that is done is null_count > 0
. I'll a PR to fix.
❯ explain select duration_nano from traces where duration_nano IS NULL;
| | ParquetExec: file_groups={16 groups: [...]}, projection=[duration_nano], predicate=duration_nano@1 IS NULL, pruning_predicate=duration_nano_null_count@0 > 0, required_guarantees=[] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update: #9986
Part of #9171
Rationale for this change
What changes are included in this PR?
PruningStatistics::row_counts()
to get the total row counts in each container.PruningStatistics::row_counts()
andPruningStatistics::null_counts()
to determine whether to prune containers that have columns with all NULL. This is done by wrapping aCASE
expression around the re-written pruning predicate expression:Example 1
If a query has a predicate like:
instead of re-writing to
we want the re-written expression to be
Example 2
Another more complicated example:
instead of re-writing to
we want the re-written expression to be
Are these changes tested?
Yes, updated and added more test coverage
Are there any user-facing changes?
Yes, there is a new API for
PruningPredicate
calledPruningPredicate::row_counts()