[SPARK-4453][SPARK-4213][SQL] Simplifies Parquet filter generation code #3317
Conversation
Test build #23480 has started for PR 3317 at commit
import parquet.filter2.compat.FilterCompat.Filter
import parquet.filter2.compat.RowGroupFilter

import org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.blockLocationCache
indent issue?
Hm, thanks, dunno why IDEA went insane here :(
Test build #23481 has started for PR 3317 at commit
Test build #23480 has finished for PR 3317 at commit
Test PASSed.
Test build #23481 has finished for PR 3317 at commit
Test PASSed.
I tested for ByteType, ShortType, DateType and TimestampType and it's LGTM so far.
@@ -26,6 +26,7 @@ import org.apache.spark.sql.catalyst.util.Metadata
object NamedExpression {
  private val curId = new java.util.concurrent.atomic.AtomicLong()
  def newExprId = ExprId(curId.getAndIncrement())
  def unapply(expr: NamedExpression): Option[(String, DataType)] = Some(expr.name, expr.dataType)
Cute. At some point we should probably write up a guide of all the ways we use pattern matching in Catalyst, perhaps as part of your big refactoring PR. I'm generally in support of these shortcuts, but I want to make sure that we are coherent about how they are used. BTW, I've been holding off on merging your other PR because I didn't want to create conflicts close to the release, but I would like to revisit this soon.
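For context, here is a minimal sketch (not part of this PR) of how the companion-object `unapply` above reads at a call site, assuming Catalyst's `expressions` package is on the classpath; the object and method names are illustrative only:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, NamedExpression}

object NamedExpressionMatchSketch {
  // With the unapply above, any NamedExpression (e.g. an Alias or AttributeReference)
  // can be deconstructed directly into its name and data type in a pattern match.
  def describe(e: Expression): String = e match {
    case NamedExpression(name, dataType) => s"$name: $dataType"
    case other                           => s"unnamed expression: ${other.getClass.getSimpleName}"
  }
}
```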
I like any PR that fixes bugs while deleting 532 lines of code :) I think it would be a good idea to add some more test coverage in this area, but would not block this PR on it if you think this is ready to go.
@marmbrus I have test cases that I used for checking this PR, and I can add them to ParquetQuerySuite.
@sarutak You can open a PR against this PR branch :)
Thanks! I merged this into master and 1.2, but it would be great to still include more tests in a follow-up PR.
This PR has been merged, so should I open a new PR for adding the test cases?
Yes please.
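For readers following along, here is a rough sketch of the kind of predicate push-down check being discussed for ParquetQuerySuite, assuming the Spark 1.2-era SchemaRDD API; the `Record` class, object name, and output path are hypothetical, and this is not the follow-up PR's actual code:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Hypothetical record type used only for this sketch.
case class Record(key: Int, value: String)

object ParquetFilterPushdownSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "parquet-filter-sketch")
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // 1.2-era implicit: RDD[Product] => SchemaRDD

    // Write a small Parquet table, then query it with a predicate that the new
    // filter generation code should translate into a Parquet record filter.
    val path = "/tmp/parquet-filter-sketch"  // hypothetical path
    sc.parallelize(1 to 100).map(i => Record(i, s"val_$i")).saveAsParquetFile(path)
    sqlContext.parquetFile(path).registerTempTable("t")

    // Correctness check: the pushed-down filter must not change query results.
    val keys = sqlContext.sql("SELECT key FROM t WHERE key > 90").collect().map(_.getInt(0))
    assert(keys.sorted.sameElements(91 to 100))

    sc.stop()
  }
}
```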
While reviewing PR #3083 and #3161, I noticed that Parquet record filter generation code can be simplified significantly according to the clue stated in [SPARK-4453](https://issues.apache.org/jira/browse/SPARK-4213). This PR addresses both SPARK-4453 and SPARK-4213 with this simplification.

While generating the `ParquetTableScan` operator, we need to remove all Catalyst predicates that have already been pushed down to Parquet. Originally, we first generate the record filter, and then call `findExpression` to traverse the generated filter to find out all pushed down predicates [[1](https://github.com/apache/spark/blob/64c6b9bad559c21f25cd9fbe37c8813cdab939f2/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L213-L228)]. In this way, we have to introduce the `CatalystFilter` class hierarchy to bind the Catalyst predicates together with their generated Parquet filters, which complicates the code base a lot.

The basic idea of this PR is that we don't need `findExpression` after filter generation, because we already know a predicate can be pushed down if we can successfully generate its corresponding Parquet filter. SPARK-4213 is fixed by returning `None` for any unsupported predicate type.

Author: Cheng Lian <[email protected]>

Closes #3317 from liancheng/simplify-parquet-filters and squashes the following commits:

d6a9499 [Cheng Lian] Fixes import styling issue
43760e8 [Cheng Lian] Simplifies Parquet filter generation logic

(cherry picked from commit 36b0956)
Signed-off-by: Michael Armbrust <[email protected]>
I opened the PR for additional test cases.
…ates with literals on the left hand side

For expressions like `10 < someVar`, we should create an `Operators.Gt` filter, but right now an `Operators.Lt` is created. This issue affects all inequality predicates with literals on the left hand side.

(This bug existed before apache#3317 and affects branch-1.1. apache#3338 was opened to backport this to branch-1.1.)

Author: Cheng Lian <[email protected]>

Closes apache#3334 from liancheng/fix-parquet-comp-filter and squashes the following commits:

0130897 [Cheng Lian] Fixes Parquet comparison filter generation
…ates with literals on the left hand side

For expressions like `10 < someVar`, we should create an `Operators.Gt` filter, but right now an `Operators.Lt` is created. This issue affects all inequality predicates with literals on the left hand side.

(This bug existed before #3317 and affects branch-1.1. #3338 was opened to backport this to branch-1.1.)

Author: Cheng Lian <[email protected]>

Closes #3334 from liancheng/fix-parquet-comp-filter and squashes the following commits:

0130897 [Cheng Lian] Fixes Parquet comparison filter generation

(cherry picked from commit 423baea)
Signed-off-by: Michael Armbrust <[email protected]>
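As an illustration of the flip described in these commit messages (not the actual Spark code), here is a minimal sketch using Parquet's pre-shaded `parquet.filter2` API, which the imports in this PR also use; the object and method names are made up for the example:

```scala
import parquet.filter2.predicate.FilterApi.intColumn
import parquet.filter2.predicate.{FilterApi, FilterPredicate}

object ComparisonFlipSketch {
  // When the literal sits on the *left* of a `<`, the generated Parquet operator
  // has to be flipped: `10 < x` really means `x > 10`, i.e. an Operators.Gt filter,
  // not an Operators.Lt one.
  def lessThan(column: String, literal: Int, literalOnLeft: Boolean): FilterPredicate = {
    val col = intColumn(column)
    if (literalOnLeft) FilterApi.gt(col, Int.box(literal))  // 10 < x  =>  x > 10
    else               FilterApi.lt(col, Int.box(literal))  //  x < 10 stays as-is
  }
}
```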
While reviewing PR #3083 and #3161, I noticed that Parquet record filter generation code can be simplified significantly according to the clue stated in SPARK-4453. This PR addresses both SPARK-4453 and SPARK-4213 with this simplification.

While generating the `ParquetTableScan` operator, we need to remove all Catalyst predicates that have already been pushed down to Parquet. Originally, we first generate the record filter, and then call `findExpression` to traverse the generated filter to find out all pushed down predicates [1]. In this way, we have to introduce the `CatalystFilter` class hierarchy to bind the Catalyst predicates together with their generated Parquet filters, which complicates the code base a lot.

The basic idea of this PR is that we don't need `findExpression` after filter generation, because we already know a predicate can be pushed down if we can successfully generate its corresponding Parquet filter. SPARK-4213 is fixed by returning `None` for any unsupported predicate type.
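To make the idea concrete, here is a minimal sketch of this Option-based approach, assuming Catalyst expression classes and Parquet's `FilterApi`; it is not the implementation in this PR, and the object name, `splitPredicates` helper, and the handful of supported cases are illustrative only:

```scala
import org.apache.spark.sql.catalyst.expressions._
import parquet.filter2.predicate.{FilterApi, FilterPredicate}

object FilterGenerationSketch {
  // A predicate is pushed down exactly when a Parquet filter can be generated for
  // it, so no findExpression-style back-tracking is needed after generation.
  def createFilter(predicate: Expression): Option[FilterPredicate] = predicate match {
    case EqualTo(a: AttributeReference, Literal(v: Int, _)) =>
      Some(FilterApi.eq(FilterApi.intColumn(a.name), Int.box(v)))
    case GreaterThan(a: AttributeReference, Literal(v: Int, _)) =>
      Some(FilterApi.gt(FilterApi.intColumn(a.name), Int.box(v)))
    case And(left, right) =>
      // Only push down a conjunction if both sides are convertible.
      for (l <- createFilter(left); r <- createFilter(right)) yield FilterApi.and(l, r)
    case _ =>
      None  // unsupported predicate types are simply not pushed down
  }

  // Predicates whose filter generation succeeds are handled by Parquet;
  // everything else stays as a Catalyst filter on top of the scan.
  def splitPredicates(predicates: Seq[Expression]): (Seq[Expression], Seq[Expression]) =
    predicates.partition(p => createFilter(p).isDefined)
}
```

The design point is that `Option` encodes "supported or not" directly, so the set of pushed-down predicates falls out of a single `partition` call with no second traversal of the generated filter.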