[REVERT][SPARK-4213][SQL] ParquetFilters - No support for LT, LTE, GT, GTE operators #3161
Conversation
Test build #23065 has started for PR 3161 at commit.
Force-pushed from 414812d to 9a1fae7.
Why are you trying to revert? Does the other patch make things worse, or does it just still have issues? Also, this does not seem to apply cleanly.
Ah, I forgot to rebase.
Test build #23066 has started for PR 3161 at commit.
Force-pushed from 9a1fae7 to 76c62af.
Test build #23068 has started for PR 3161 at commit.
Test build #23068 has finished for PR 3161 at commit.
Test FAILed.
Force-pushed from 76c62af to d640907.
I've just finished the revert.
Test build #23069 has started for PR 3161 at commit.
Test build #23065 has finished for PR 3161 at commit.
Test FAILed.
Test build #23066 has finished for PR 3161 at commit.
Test PASSed.
Test build #23069 has finished for PR 3161 at commit.
Test PASSed.
@sarutak Could you please provide some details about the problems you mentioned above? I can see that if we don't enumerate all data types in functions like
@liancheng One of the problems is the match error, as you mentioned. Another is a restriction in the Parquet filter2 API: filter2.ValidTypeMap restricts which types can be used as filter operands.
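The match-error problem above can be sketched in miniature. This is an illustrative toy, not Spark's actual code: the predicate ADT and `toFilter*` functions below are hypothetical stand-ins showing why a non-exhaustive pattern match on predicate types throws `scala.MatchError` at runtime, and how returning an `Option` sidesteps it.

```scala
// Hypothetical predicate ADT standing in for Catalyst expressions.
sealed trait Predicate
case class Eq(col: String, v: Any) extends Predicate
case class Lt(col: String, v: Any) extends Predicate
case class Custom(desc: String)    extends Predicate // not translatable to a Parquet filter

// Fragile version: the match does not cover Custom, so passing one
// throws scala.MatchError at runtime.
def toFilterUnsafe(p: Predicate): String = p match {
  case Eq(c, v) => s"$c = $v"
  case Lt(c, v) => s"$c < $v"
}

// Safer version: return None for any predicate we cannot translate,
// so unsupported types are simply skipped instead of crashing.
def toFilter(p: Predicate): Option[String] = p match {
  case Eq(c, v) => Some(s"$c = $v")
  case Lt(c, v) => Some(s"$c < $v")
  case _        => None
}
```

The same shape applies to type restrictions like `ValidTypeMap`: any operand type outside the supported set maps to `None` rather than a runtime failure.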
I see, thanks for the explanation. Then I'd suggest to fix the
Yes. I've resolved the match error and removed push-down filters for types that are not supported by Parquet. Now I'm writing test cases and will push soon.
Hey @sarutak, I noticed that the whole Parquet record filter generation logic could be significantly simplified while reviewing your PR, so I opened #3317 for the simplification. However, it also (inevitably) fixes SPARK-4213. I think this refactoring is fairly safe, but the PR is big, so it would be really great if you have some time to help review :)
@liancheng Thanks for working on the issue. OK, I'll check the PR with some test cases.
Can we close this issue now that #3317 is open?
Yes. I'll close this PR.
While reviewing PR #3083 and #3161, I noticed that the Parquet record filter generation code can be simplified significantly, following the clue stated in [SPARK-4453](https://issues.apache.org/jira/browse/SPARK-4453). This PR addresses both SPARK-4453 and SPARK-4213 with this simplification.

While generating the `ParquetTableScan` operator, we need to remove all Catalyst predicates that have already been pushed down to Parquet. Originally, we first generate the record filter, and then call `findExpression` to traverse the generated filter and find all pushed-down predicates [[1](https://github.com/apache/spark/blob/64c6b9bad559c21f25cd9fbe37c8813cdab939f2/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L213-L228)]. This forces us to introduce the `CatalystFilter` class hierarchy to bind the Catalyst predicates together with their generated Parquet filters, which complicates the code base a lot.

The basic idea of this PR is that we don't need `findExpression` after filter generation, because we already know a predicate can be pushed down if we can successfully generate its corresponding Parquet filter. SPARK-4213 is fixed by returning `None` for any unsupported predicate type.

Author: Cheng Lian <[email protected]>

Closes #3317 from liancheng/simplify-parquet-filters and squashes the following commits:

d6a9499 [Cheng Lian] Fixes import styling issue
43760e8 [Cheng Lian] Simplifies Parquet filter generation logic

(cherry picked from commit 36b0956)
Signed-off-by: Michael Armbrust <[email protected]>
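The idea behind the simplification can be sketched as follows. This is a hedged toy sketch, not Spark's actual API: `Pred`, `createFilter`, and the string "filters" are all hypothetical stand-ins. The point is that once filter generation itself reports success or failure via `Option`, a single `partition` over the predicates separates pushed-down from remaining ones, with no `findExpression`-style traversal afterwards.

```scala
// Hypothetical predicate ADT; in Spark these would be Catalyst expressions.
sealed trait Pred
case class GtE(col: String, v: Int)       extends Pred
case class Like(col: String, pat: String) extends Pred // assume unsupported

// Filter generation doubles as the support check: Some(...) means the
// predicate can be pushed down, None means it cannot.
def createFilter(p: Pred): Option[String] = p match {
  case GtE(c, v) => Some(s"$c >= $v")
  case _         => None // unsupported predicate type => not pushed down
}

val predicates = Seq(GtE("age", 21), Like("name", "A%"))

// One pass splits the predicates; no separate traversal of the generated
// filter tree (and hence no CatalystFilter-style bookkeeping) is needed.
val (pushed, remaining) = predicates.partition(p => createFilter(p).isDefined)
```

Predicates in `remaining` would still be evaluated by Spark after the scan, so correctness is preserved even when push-down declines a predicate.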
This PR is to revert #3083.