
[SPARK-4453][SPARK-4213][SQL] Simplifies Parquet filter generation code #3317

Closed · wants to merge 2 commits

Conversation

liancheng (Contributor)

While reviewing PR #3083 and #3161, I noticed that the Parquet record filter generation code can be simplified significantly, following the clue stated in SPARK-4453. This PR addresses both SPARK-4453 and SPARK-4213 with this simplification.

While generating the ParquetTableScan operator, we need to remove all Catalyst predicates that have already been pushed down to Parquet. Originally, we first generated the record filter, and then called findExpression to traverse the generated filter and find all pushed-down predicates [1]. This forced us to introduce the CatalystFilter class hierarchy to bind each Catalyst predicate to its generated Parquet filter, which complicated the code base considerably.

The basic idea of this PR is that we don't need findExpression after filter generation, because we already know a predicate can be pushed down if we can successfully generate its corresponding Parquet filter. SPARK-4213 is fixed by returning None for any unsupported predicate type.
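To make the idea concrete, here is a minimal, self-contained sketch (illustrative names only, not the actual Spark or Parquet APIs): a predicate is pushed down exactly when filter generation returns `Some`, so the set of pushed-down predicates falls out of generation itself and no second `findExpression` traversal is needed.

```scala
// Illustrative stand-ins for Catalyst predicates and Parquet filters.
sealed trait Predicate
case class Eq(column: String, value: Int) extends Predicate
case class Lt(column: String, value: Int) extends Predicate
case class Custom(description: String) extends Predicate // e.g. a UDF, not pushable

sealed trait ParquetFilter
case class ColumnEq(column: String, value: Int) extends ParquetFilter
case class ColumnLt(column: String, value: Int) extends ParquetFilter

// Returning None for unsupported predicate types is the SPARK-4213 fix in
// spirit: unsupported predicates simply stay on the Spark side.
def makeFilter(p: Predicate): Option[ParquetFilter] = p match {
  case Eq(c, v) => Some(ColumnEq(c, v))
  case Lt(c, v) => Some(ColumnLt(c, v))
  case _        => None
}

// Predicates whose generation succeeds are pushed down; the rest are kept for
// post-scan evaluation. No traversal of the generated filter is required.
def split(predicates: Seq[Predicate]): (Seq[ParquetFilter], Seq[Predicate]) = {
  val (pushable, kept) = predicates.partition(makeFilter(_).isDefined)
  (pushable.flatMap(makeFilter), kept)
}
```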

@liancheng liancheng changed the title [SPARK-4453][SPARK-4213] Simplifies Parquet filter generation code [SPARK-4453][SPARK-4213][SQL] Simplifies Parquet filter generation code Nov 17, 2014
@SparkQA

SparkQA commented Nov 17, 2014

Test build #23480 has started for PR 3317 at commit 43760e8.

  • This patch merges cleanly.

import parquet.filter2.compat.FilterCompat.Filter
import parquet.filter2.compat.RowGroupFilter

import org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.blockLocationCache
Contributor

indent issue?

Contributor Author

Hm, thanks, dunno why IDEA went insane here :(

@SparkQA

SparkQA commented Nov 17, 2014

Test build #23481 has started for PR 3317 at commit d6a9499.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 17, 2014

Test build #23480 has finished for PR 3317 at commit 43760e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23480/

@SparkQA

SparkQA commented Nov 17, 2014

Test build #23481 has finished for PR 3317 at commit d6a9499.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ExternalSort(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23481/

@sarutak
Member

sarutak commented Nov 17, 2014

I tested ByteType, ShortType, DateType, and TimestampType; LGTM so far.

@@ -26,6 +26,7 @@ import org.apache.spark.sql.catalyst.util.Metadata
object NamedExpression {
private val curId = new java.util.concurrent.atomic.AtomicLong()
def newExprId = ExprId(curId.getAndIncrement())
def unapply(expr: NamedExpression): Option[(String, DataType)] = Some(expr.name, expr.dataType)
Contributor

Cute. At some point we should probably write up a guide of all the ways we use pattern matching in catalyst, perhaps as a part of your big refactoring PR. I'm generally in support of these shortcuts, but I want to make sure that we are coherent about how they are used. BTW, I've been holding off on merging your other PR because I didn't want to create conflicts close to the release, but I would like to revisit this soon.
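For readers unfamiliar with the shortcut being discussed, here is a small self-contained sketch of the extractor pattern (simplified stand-ins, not the actual Catalyst classes):

```scala
// Simplified stand-ins for Catalyst's NamedExpression and DataType, just to
// show what a companion-object unapply buys in pattern matches.
sealed trait DataType
case object IntType extends DataType
case object StringType extends DataType

trait NamedExpression { def name: String; def dataType: DataType }
case class AttributeRef(name: String, dataType: DataType) extends NamedExpression

object NamedExpression {
  // Mirrors the one-liner in the diff: expose (name, dataType) for matching.
  def unapply(expr: NamedExpression): Option[(String, DataType)] =
    Some((expr.name, expr.dataType))
}

// With the extractor, any NamedExpression subclass can be destructured
// directly on its name and type, without naming the concrete class:
def describe(e: NamedExpression): String = e match {
  case NamedExpression(n, IntType) => s"$n: int"
  case NamedExpression(n, _)       => s"$n: other"
}
```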

@marmbrus
Contributor

I like any PR that fixes bugs while deleting 532 lines of code :)

I think it would be a good idea to add some more test coverage in this area, but would not block this PR on it if you think this is ready to go.

@sarutak
Member

sarutak commented Nov 17, 2014

@marmbrus I have the test cases I used to check this PR, and I can add them to ParquetQuerySuite.

@liancheng
Contributor Author

@sarutak You can open a PR against this PR branch :)

@marmbrus
Contributor

Thanks! I merged this into master and 1.2, but it would be great to still include more tests in a follow-up PR.

@sarutak
Member

sarutak commented Nov 18, 2014

This PR has been merged, so should I open a new PR to add the test cases?

@marmbrus
Contributor

Yes please.

@asfgit asfgit closed this in 36b0956 Nov 18, 2014
asfgit pushed a commit that referenced this pull request Nov 18, 2014
While reviewing PR #3083 and #3161, I noticed that the Parquet record filter generation code can be simplified significantly, following the clue stated in [SPARK-4453](https://issues.apache.org/jira/browse/SPARK-4453). This PR addresses both SPARK-4453 and SPARK-4213 with this simplification.

While generating the `ParquetTableScan` operator, we need to remove all Catalyst predicates that have already been pushed down to Parquet. Originally, we first generated the record filter, and then called `findExpression` to traverse the generated filter and find all pushed-down predicates [[1](https://github.com/apache/spark/blob/64c6b9bad559c21f25cd9fbe37c8813cdab939f2/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L213-L228)]. This forced us to introduce the `CatalystFilter` class hierarchy to bind each Catalyst predicate to its generated Parquet filter, which complicated the code base considerably.

The basic idea of this PR is that we don't need `findExpression` after filter generation, because we already know a predicate can be pushed down if we can successfully generate its corresponding Parquet filter. SPARK-4213 is fixed by returning `None` for any unsupported predicate type.


Author: Cheng Lian <[email protected]>

Closes #3317 from liancheng/simplify-parquet-filters and squashes the following commits:

d6a9499 [Cheng Lian] Fixes import styling issue
43760e8 [Cheng Lian] Simplifies Parquet filter generation logic

(cherry picked from commit 36b0956)
Signed-off-by: Michael Armbrust <[email protected]>
@sarutak
Member

sarutak commented Nov 18, 2014

I opened the PR for additional test cases.
#3333

@liancheng liancheng deleted the simplify-parquet-filters branch November 18, 2014 02:32
marmbrus pushed a commit to marmbrus/spark that referenced this pull request Nov 19, 2014
…ates with literals on the left hand side

For expressions like `10 < someVar`, we should create an `Operators.Gt` filter, but right now an `Operators.Lt` is created. This issue affects all inequality predicates with literals on the left hand side.

(This bug existed before apache#3317 and affects branch-1.1. apache#3338 was opened to backport this to branch-1.1.)


Author: Cheng Lian <[email protected]>

Closes apache#3334 from liancheng/fix-parquet-comp-filter and squashes the following commits:

0130897 [Cheng Lian] Fixes Parquet comparison filter generation
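The commit above fixes operator mirroring for literals on the left hand side; a minimal sketch of the idea, using illustrative types rather than the actual Parquet `Operators` API:

```scala
// Illustrative expression and filter types (not the real Spark/Parquet classes).
sealed trait Expr
case class Literal(value: Int) extends Expr
case class Column(name: String) extends Expr
case class LessThan(left: Expr, right: Expr) extends Expr

sealed trait CompFilter
case class Gt(column: String, value: Int) extends CompFilter
case class Lt(column: String, value: Int) extends CompFilter

// `col < lit` maps to a less-than filter, but `lit < col` must map to a
// greater-than filter: the comparison is mirrored, not copied, when the
// literal sits on the left hand side. The bug described above produced a
// less-than filter in both cases.
def toFilter(e: Expr): Option[CompFilter] = e match {
  case LessThan(Column(c), Literal(v)) => Some(Lt(c, v))
  case LessThan(Literal(v), Column(c)) => Some(Gt(c, v))
  case _                               => None
}
```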
liancheng added a commit that referenced this pull request Nov 19, 2014
…ates with literals on the left hand side

For expressions like `10 < someVar`, we should create an `Operators.Gt` filter, but right now an `Operators.Lt` is created. This issue affects all inequality predicates with literals on the left hand side.

(This bug existed before #3317 and affects branch-1.1. #3338 was opened to backport this to branch-1.1.)


Author: Cheng Lian <[email protected]>

Closes #3334 from liancheng/fix-parquet-comp-filter and squashes the following commits:

0130897 [Cheng Lian] Fixes Parquet comparison filter generation

(cherry picked from commit 423baea)
Signed-off-by: Michael Armbrust <[email protected]>