
[SPARK-4453][SPARK-4213][SQL] Simplifies Parquet filter generation code #3317

Closed · wants to merge 2 commits

Conversation

liancheng (Contributor)

While reviewing PR #3083 and #3161, I noticed that the Parquet record filter generation code can be simplified significantly, following the clue stated in SPARK-4453. This PR addresses both SPARK-4453 and SPARK-4213 with this simplification.

While generating the ParquetTableScan operator, we need to remove all Catalyst predicates that have already been pushed down to Parquet. Originally, we first generated the record filter, and then called findExpression to traverse the generated filter and find all pushed-down predicates [1]. This forced us to introduce the CatalystFilter class hierarchy to bind each Catalyst predicate to its generated Parquet filter, which complicated the code base considerably.

The basic idea of this PR is that we don't need findExpression after filter generation, because we already know a predicate can be pushed down if we can successfully generate its corresponding Parquet filter. SPARK-4213 is fixed by returning None for any unsupported predicate type.
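To make the idea concrete, here is a minimal, self-contained sketch (illustrative names only, not the actual Spark or Parquet APIs): a predicate is pushed down exactly when filter generation returns `Some`, so the set of pushed-down predicates falls out of generation itself and no second `findExpression` traversal is needed.

```scala
// Illustrative stand-ins for Catalyst predicates and Parquet filters.
sealed trait Predicate
case class Eq(column: String, value: Int) extends Predicate
case class Lt(column: String, value: Int) extends Predicate
case class Custom(description: String) extends Predicate // e.g. a UDF, not pushable

sealed trait ParquetFilter
case class ColumnEq(column: String, value: Int) extends ParquetFilter
case class ColumnLt(column: String, value: Int) extends ParquetFilter

// Returning None for unsupported predicate types is the SPARK-4213 fix in
// spirit: unsupported predicates simply stay on the Spark side.
def makeFilter(p: Predicate): Option[ParquetFilter] = p match {
  case Eq(c, v) => Some(ColumnEq(c, v))
  case Lt(c, v) => Some(ColumnLt(c, v))
  case _        => None
}

// Predicates whose generation succeeds are pushed down; the rest are kept for
// post-scan evaluation. No traversal of the generated filter is required.
def split(predicates: Seq[Predicate]): (Seq[ParquetFilter], Seq[Predicate]) = {
  val (pushable, kept) = predicates.partition(makeFilter(_).isDefined)
  (pushable.flatMap(makeFilter), kept)
}
```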

@liancheng liancheng changed the title [SPARK-4453][SPARK-4213] Simplifies Parquet filter generation code [SPARK-4453][SPARK-4213][SQL] Simplifies Parquet filter generation code Nov 17, 2014
@SparkQA

SparkQA commented Nov 17, 2014

Test build #23480 has started for PR 3317 at commit 43760e8.

  • This patch merges cleanly.

import parquet.filter2.compat.FilterCompat.Filter
import parquet.filter2.compat.RowGroupFilter

import org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.blockLocationCache
Contributor

indent issue?

Contributor Author

Hm, thanks, dunno why IDEA went insane here :(

@SparkQA

SparkQA commented Nov 17, 2014

Test build #23481 has started for PR 3317 at commit d6a9499.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 17, 2014

Test build #23480 has finished for PR 3317 at commit 43760e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23480/

@SparkQA

SparkQA commented Nov 17, 2014

Test build #23481 has finished for PR 3317 at commit d6a9499.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ExternalSort(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23481/

@sarutak
Member

sarutak commented Nov 17, 2014

I tested ByteType, ShortType, DateType, and TimestampType; LGTM so far.

@@ -26,6 +26,7 @@ import org.apache.spark.sql.catalyst.util.Metadata
object NamedExpression {
private val curId = new java.util.concurrent.atomic.AtomicLong()
def newExprId = ExprId(curId.getAndIncrement())
def unapply(expr: NamedExpression): Option[(String, DataType)] = Some(expr.name, expr.dataType)
Contributor

Cute. At some point we should probably write up a guide of all the ways we use pattern matching in catalyst, perhaps as a part of your big refactoring PR. I'm generally in support of these shortcuts, but I want to make sure that we are coherent about how they are used. BTW, I've been holding off on merging your other PR because I didn't want to create conflicts close to the release, but I would like to revisit this soon.
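For readers unfamiliar with the shortcut being discussed, here is a small self-contained sketch of the extractor pattern (simplified stand-ins, not the actual Catalyst classes):

```scala
// Simplified stand-ins for Catalyst's NamedExpression and DataType, just to
// show what a companion-object unapply buys in pattern matches.
sealed trait DataType
case object IntType extends DataType
case object StringType extends DataType

trait NamedExpression { def name: String; def dataType: DataType }
case class AttributeRef(name: String, dataType: DataType) extends NamedExpression

object NamedExpression {
  // Mirrors the one-liner in the diff: expose (name, dataType) for matching.
  def unapply(expr: NamedExpression): Option[(String, DataType)] =
    Some((expr.name, expr.dataType))
}

// With the extractor, any NamedExpression subclass can be destructured
// directly on its name and type, without naming the concrete class:
def describe(e: NamedExpression): String = e match {
  case NamedExpression(n, IntType) => s"$n: int"
  case NamedExpression(n, _)       => s"$n: other"
}
```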

@marmbrus
Contributor

I like any PR that fixes bugs while deleting 532 lines of code :)

I think it would be a good idea to add some more test coverage in this area, but would not block this PR on it if you think this is ready to go.

@sarutak
Member

sarutak commented Nov 17, 2014

@marmbrus I have the test cases I used to check this PR, and I can add them to ParquetQuerySuite.

@liancheng
Contributor Author

@sarutak You can open a PR against this PR branch :)

@marmbrus
Contributor

Thanks! I merged this into master and 1.2, but it would be great to still include more tests in a follow-up PR.

@sarutak
Member

sarutak commented Nov 18, 2014

This PR has been merged, so should I open a new PR to add the test cases?

@marmbrus
Contributor

Yes please.

@asfgit asfgit closed this in 36b0956 Nov 18, 2014
asfgit pushed a commit that referenced this pull request Nov 18, 2014
While reviewing PR #3083 and #3161, I noticed that the Parquet record filter generation code can be simplified significantly, following the clue stated in [SPARK-4453](https://issues.apache.org/jira/browse/SPARK-4453). This PR addresses both SPARK-4453 and SPARK-4213 with this simplification.

While generating the `ParquetTableScan` operator, we need to remove all Catalyst predicates that have already been pushed down to Parquet. Originally, we first generated the record filter, and then called `findExpression` to traverse the generated filter and find all pushed-down predicates [[1](https://github.com/apache/spark/blob/64c6b9bad559c21f25cd9fbe37c8813cdab939f2/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L213-L228)]. This forced us to introduce the `CatalystFilter` class hierarchy to bind each Catalyst predicate to its generated Parquet filter, which complicated the code base considerably.

The basic idea of this PR is that we don't need `findExpression` after filter generation, because we already know a predicate can be pushed down if we can successfully generate its corresponding Parquet filter. SPARK-4213 is fixed by returning `None` for any unsupported predicate type.


Author: Cheng Lian <[email protected]>

Closes #3317 from liancheng/simplify-parquet-filters and squashes the following commits:

d6a9499 [Cheng Lian] Fixes import styling issue
43760e8 [Cheng Lian] Simplifies Parquet filter generation logic

(cherry picked from commit 36b0956)
Signed-off-by: Michael Armbrust <[email protected]>
@sarutak
Member

sarutak commented Nov 18, 2014

I opened the PR for additional test cases.
#3333

@liancheng liancheng deleted the simplify-parquet-filters branch November 18, 2014 02:32
marmbrus pushed a commit to marmbrus/spark that referenced this pull request Nov 19, 2014
…ates with literals on the left hand side

For expressions like `10 < someVar`, we should create an `Operators.Gt` filter, but right now an `Operators.Lt` is created. This issue affects all inequality predicates with literals on the left hand side.

(This bug existed before apache#3317 and affects branch-1.1. apache#3338 was opened to backport this to branch-1.1.)


Author: Cheng Lian <[email protected]>

Closes apache#3334 from liancheng/fix-parquet-comp-filter and squashes the following commits:

0130897 [Cheng Lian] Fixes Parquet comparison filter generation
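The commit above fixes operator mirroring for literals on the left hand side; a minimal sketch of the idea, using illustrative types rather than the actual Parquet `Operators` API:

```scala
// Illustrative expression and filter types (not the real Spark/Parquet classes).
sealed trait Expr
case class Literal(value: Int) extends Expr
case class Column(name: String) extends Expr
case class LessThan(left: Expr, right: Expr) extends Expr

sealed trait CompFilter
case class Gt(column: String, value: Int) extends CompFilter
case class Lt(column: String, value: Int) extends CompFilter

// `col < lit` maps to a less-than filter, but `lit < col` must map to a
// greater-than filter: the comparison is mirrored, not copied, when the
// literal sits on the left hand side. The bug described above produced a
// less-than filter in both cases.
def toFilter(e: Expr): Option[CompFilter] = e match {
  case LessThan(Column(c), Literal(v)) => Some(Lt(c, v))
  case LessThan(Literal(v), Column(c)) => Some(Gt(c, v))
  case _                               => None
}
```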
liancheng added a commit that referenced this pull request Nov 19, 2014
…ates with literals on the left hand side

For expressions like `10 < someVar`, we should create an `Operators.Gt` filter, but right now an `Operators.Lt` is created. This issue affects all inequality predicates with literals on the left hand side.

(This bug existed before #3317 and affects branch-1.1. #3338 was opened to backport this to branch-1.1.)


Author: Cheng Lian <[email protected]>

Closes #3334 from liancheng/fix-parquet-comp-filter and squashes the following commits:

0130897 [Cheng Lian] Fixes Parquet comparison filter generation

(cherry picked from commit 423baea)
Signed-off-by: Michael Armbrust <[email protected]>