
[REVERT][SPARK-4213][SQL] ParquetFilters - No support for LT, LTE, GT, GTE operators #3161

Closed
wants to merge 1 commit into from

Conversation

sarutak
Member

@sarutak sarutak commented Nov 7, 2014

This PR is to revert #3083.

@sarutak sarutak changed the title [REVERT][SPARK-4213][SQL] ParquetFilters - No support for LT, LTE, GT, GTE operators [REVERT][SPARK-4213][SQL][WIP] ParquetFilters - No support for LT, LTE, GT, GTE operators Nov 7, 2014
@SparkQA

SparkQA commented Nov 7, 2014

Test build #23065 has started for PR 3161 at commit 414812d.

  • This patch does not merge cleanly.

@marmbrus
Contributor

marmbrus commented Nov 7, 2014

Why are you trying to revert? Does the other patch make things worse or does it just still have issues? Also this does not seem to apply cleanly.

@sarutak
Member Author

sarutak commented Nov 7, 2014

@marmbrus The initial push in this PR was wrong. As I mentioned, #3083 still has problems, but comparisons against String literals work, so I'd like to revert the rest of the comparisons I added in #3083 (e.g., Date, Timestamp, etc.).

@sarutak
Member Author

sarutak commented Nov 7, 2014

Ah, I forgot to rebase.

@SparkQA

SparkQA commented Nov 7, 2014

Test build #23066 has started for PR 3161 at commit 9a1fae7.

  • This patch does not merge cleanly.

@SparkQA

SparkQA commented Nov 7, 2014

Test build #23068 has started for PR 3161 at commit 76c62af.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 7, 2014

Test build #23068 has finished for PR 3161 at commit 76c62af.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23068/
Test FAILed.

@sarutak
Member Author

sarutak commented Nov 7, 2014

I've just finished the revert.

@SparkQA

SparkQA commented Nov 7, 2014

Test build #23069 has started for PR 3161 at commit d640907.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 7, 2014

Test build #23065 has finished for PR 3161 at commit 414812d.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • final class ByteColumn(columnPath: ColumnPath)
    • final class ShortColumn(columnPath: ColumnPath)
    • final class DateColumn(columnPath: ColumnPath)
    • final class TimestampColumn(columnPath: ColumnPath)
    • final class DecimalColumn(columnPath: ColumnPath)
    • final case class WrappedDate(val date: Date) extends Comparable[WrappedDate]
    • final case class WrappedTimestamp(val timestamp: Timestamp) extends Comparable[WrappedTimestamp]

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23065/
Test FAILed.

@SparkQA

SparkQA commented Nov 7, 2014

Test build #23066 has finished for PR 3161 at commit 9a1fae7.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23066/
Test PASSed.

@SparkQA

SparkQA commented Nov 7, 2014

Test build #23069 has finished for PR 3161 at commit d640907.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait ScalaReflection

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23069/
Test PASSed.

@sarutak sarutak changed the title [REVERT][SPARK-4213][SQL][WIP] ParquetFilters - No support for LT, LTE, GT, GTE operators [REVERT][SPARK-4213][SQL] ParquetFilters - No support for LT, LTE, GT, GTE operators Nov 7, 2014
@liancheng
Contributor

@sarutak Could you please provide some details about the problems you mentioned above? I can see that if we don't enumerate all data types in functions like createEqualityFilter, there can be a match error. We can either support all data types as it is now (but you said there are problems), or make functions like createEqualityFilter return an Option[CatalystFilter].
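The `Option[CatalystFilter]` suggestion might look roughly like the sketch below. The types here are made-up stand-ins for Catalyst expressions and filters, not the actual Spark internals; only the shape of the match matters:

```scala
// Sketch only: Literal and EqFilter are stand-ins for Spark's Catalyst
// expressions and CatalystFilter; the real code lives in ParquetFilters.scala.
sealed trait Literal
case class IntLit(v: Int) extends Literal
case class StringLit(v: String) extends Literal
case class DateLit(v: java.sql.Date) extends Literal

case class EqFilter(column: String, operand: String)

object FilterSketch {
  // A match with a None fallback: unsupported literal types no longer
  // throw a MatchError, they simply aren't pushed down.
  def createEqualityFilter(column: String, lit: Literal): Option[EqFilter] =
    lit match {
      case IntLit(v)    => Some(EqFilter(column, v.toString))
      case StringLit(v) => Some(EqFilter(column, v))
      case _            => None // e.g. Date: not supported by Parquet filter2
    }
}
```

The caller can then treat `None` as "evaluate this predicate in Spark instead of pushing it to Parquet".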

@sarutak
Member Author

sarutak commented Nov 16, 2014

@liancheng One of the problems is the match error, as you mentioned. Another is a restriction in the Parquet filter2 API: filter2.ValidTypeMap restricts which types can be used as filter operands.
https://github.com/Parquet/parquet-mr/blob/master/parquet-column/src/main/java/parquet/filter2/predicate/ValidTypeMap.java#L48
Because of this restriction, when I wrote tests I noticed that Decimal, WrappedDate, WrappedTimestamp, and some other types cannot be used.
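The restriction described here can be imitated with a simple whitelist check. The set below is an assumption modeled on the primitive operand types parquet-mr's filter2 API works with (int, long, float, double, boolean, binary); it is not the actual ValidTypeMap code:

```scala
// Hypothetical whitelist mirroring the kinds of operand classes that
// filter2.ValidTypeMap accepts; Decimal, Date, and Timestamp are absent,
// which is why filters over those types cannot be pushed down.
val supportedOperandClasses: Set[Class[_]] = Set(
  classOf[java.lang.Integer],
  classOf[java.lang.Long],
  classOf[java.lang.Float],
  classOf[java.lang.Double],
  classOf[java.lang.Boolean],
  classOf[Array[Byte]] // binary; strings are encoded as binary
)

def isValidFilterOperand(value: Any): Boolean =
  supportedOperandClasses.exists(_.isInstance(value))
```

A filter generator could consult such a check before emitting a Parquet predicate, falling back to in-memory evaluation otherwise.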

@liancheng
Contributor

I see, thanks for the explanation. Then I'd suggest fixing the incomplete pattern matching problem first. Are you already working on this? Otherwise I can open a PR for it.


@sarutak
Member Author

sarutak commented Nov 17, 2014

Yes. I've resolved the match error and removed push-down filters for types that are not supported by Parquet. Now I'm writing test cases and will push soon.

@liancheng
Contributor

Hey @sarutak, while reviewing your PR I noticed that the whole Parquet record filter generation logic could be simplified significantly, so I opened #3317 for the simplification. However, it also (inevitably) fixes SPARK-4213. I believe this refactoring is fairly safe, but the PR is big, so it would be really great if you have some time to help review :)

@sarutak
Member Author

sarutak commented Nov 17, 2014

@liancheng Thanks for working on the issue. OK, I'll check the PR with some test cases.

@marmbrus
Contributor

Can we close this issue now that #3317 is open?

@sarutak
Member Author

sarutak commented Nov 17, 2014

Yes, I'll close this PR.

@sarutak sarutak closed this Nov 17, 2014
mengxr pushed a commit to mengxr/spark that referenced this pull request Nov 18, 2014
While reviewing PR apache#3083 and apache#3161, I noticed that the Parquet record filter generation code can be simplified significantly, following the clue stated in [SPARK-4453](https://issues.apache.org/jira/browse/SPARK-4453). This PR addresses both SPARK-4453 and SPARK-4213 with this simplification.

While generating the `ParquetTableScan` operator, we need to remove all Catalyst predicates that have already been pushed down to Parquet. Originally, we first generate the record filter, and then call `findExpression` to traverse the generated filter and find all pushed-down predicates [[1](https://github.com/apache/spark/blob/64c6b9bad559c21f25cd9fbe37c8813cdab939f2/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L213-L228)]. This way, we have to introduce the `CatalystFilter` class hierarchy to bind the Catalyst predicates to their generated Parquet filters, which complicates the code base a lot.

The basic idea of this PR is that, we don't need `findExpression` after filter generation, because we already know a predicate can be pushed down if we can successfully generate its corresponding Parquet filter. SPARK-4213 is fixed by returning `None` for any unsupported predicate type.
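The `findExpression` replacement described above amounts to partitioning predicates on whether filter generation succeeds. The sketch below uses stand-in types, not the actual #3317 code:

```scala
// Stand-ins for Catalyst predicates; only the shapes matter here.
sealed trait Predicate
case class GreaterThan(col: String, v: Int) extends Predicate
case class StartsWith(col: String, s: String) extends Predicate

// Pretend filter generator: GreaterThan is convertible to a Parquet
// filter, StartsWith is not, so it returns None.
def createFilter(p: Predicate): Option[String] = p match {
  case GreaterThan(c, v) => Some(s"gt($c, $v)")
  case _                 => None
}

// A predicate is pushed down exactly when generation succeeds, so no
// post-hoc traversal of the generated filter is needed.
def splitPredicates(ps: Seq[Predicate]): (Seq[Predicate], Seq[Predicate]) =
  ps.partition(p => createFilter(p).isDefined)
```

The first half of the pair goes to Parquet; the residual predicates stay in the Spark plan.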


Author: Cheng Lian <[email protected]>

Closes apache#3317 from liancheng/simplify-parquet-filters and squashes the following commits:

d6a9499 [Cheng Lian] Fixes import styling issue
43760e8 [Cheng Lian] Simplifies Parquet filter generation logic
asfgit pushed a commit that referenced this pull request Nov 18, 2014

(cherry picked from commit 36b0956)
Signed-off-by: Michael Armbrust <[email protected]>
@sarutak sarutak deleted the SPARK-4213-revert branch April 11, 2015 05:20