
[SPARK-4213][SQL][WIP] ParquetFilters - No support for LT, LTE, GT, GTE operators #3083

Closed · wants to merge 3 commits

Conversation

@sarutak (Member) commented Nov 4, 2014

The following description is quoted from JIRA:

When I issue an HQL query against a HiveContext where my predicate uses a column of string type with one of the LT, LTE, GT, or GTE operators, I get the following error:
scala.MatchError: StringType (of class org.apache.spark.sql.catalyst.types.StringType$)
Looking at the code in org.apache.spark.sql.parquet.ParquetFilters, StringType is absent from the corresponding functions for creating these filters.
To reproduce, in a Hive 0.13.1 shell, I created the following table (at a specified DB):

create table sparkbug (
id int,
event string
) stored as parquet;

Insert some sample data:

insert into table sparkbug select 1, '2011-06-18' from <some table> limit 1;
insert into table sparkbug select 2, '2012-01-01' from <some table> limit 1;

Launch a spark shell and create a HiveContext to the metastore where the table above is located.

import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
hc.setConf("spark.sql.shuffle.partitions", "10")
hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")
hc.setConf("spark.sql.parquet.compression.codec", "snappy")
import hc._
hc.hql("select * from <db>.sparkbug where event >= '2011-12-01'")

A scala.MatchError will appear in the output.
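
To make the failure mode concrete, here is a minimal, self-contained Scala illustration (hypothetical types, not Spark's actual ParquetFilters code): a non-exhaustive match on the column's data type compiles, but throws scala.MatchError at runtime for the missing StringType case.

    sealed trait DataType
    case object IntegerType extends DataType
    case object FloatType   extends DataType
    case object StringType  extends DataType

    def describeLtFilter(dt: DataType): String = dt match {
      case IntegerType => "FilterApi.lt(intColumn(name), intLiteral)"
      case FloatType   => "FilterApi.lt(floatColumn(name), floatLiteral)"
      // No case for StringType, so:
    }

    describeLtFilter(StringType) // throws scala.MatchError: StringType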

@sarutak (Member, Author) commented Nov 4, 2014

CC @marmbrus

@sarutak sarutak changed the title [SPARK-4213] ParquetFilters - No support for LT, LTE, GT, GTE operators [SPARK-4213][SQL] ParquetFilters - No support for LT, LTE, GT, GTE operators Nov 4, 2014
@SparkQA commented Nov 4, 2014

Test build #22840 has finished for PR 3083 at commit 9a1fae7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NullType(PrimitiveType):
    • case class ScalaUdfBuilder[T: TypeTag](f: AnyRef)

@@ -111,6 +111,11 @@ private[sql] object ParquetFilters {
          name,
          FilterApi.lt(floatColumn(name), literal.value.asInstanceOf[java.lang.Float]),
          predicate)
        case StringType =>
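
The hunk is truncated at the new case. As a hedged sketch (not necessarily the committed code), the body plausibly mirrors the FloatType branch above, comparing strings as Parquet binary values; binaryColumn and Binary.fromString are parquet-mr FilterApi/Binary calls, and the ComparisonFilter wrapper is assumed from the visible argument shape:

    case StringType =>
      new ComparisonFilter(
        name,
        FilterApi.lt(
          binaryColumn(name),
          Binary.fromString(literal.value.asInstanceOf[String])),
        predicate)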
Contributor commented on the diff:

What about DateType and the others? Do we need a default case that avoids match errors when we have predicates on types that parquet can't handle pushdown for?

Member Author replied:

Yeah, exactly. I'll try for the rest of the types.
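
To sketch the defensive default being discussed (hypothetical shapes, not Spark's API): wrapping the result in Option turns "unsupported type" into "no pushdown" instead of a crash, which is also the direction the follow-up work took.

    sealed trait ColType
    case object IntCol   extends ColType
    case object StrCol   extends ColType
    case object ArrayCol extends ColType // no comparison pushdown in Parquet

    // Returns Some(filter) for supported types, None otherwise.
    def tryLtFilter(name: String, tpe: ColType): Option[String] = tpe match {
      case IntCol => Some(s"lt(intColumn($name), intLit)")
      case StrCol => Some(s"lt(binaryColumn($name), binaryLit)")
      case _      => None // fall back to evaluating the predicate in Spark
    }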

@SparkQA commented Nov 5, 2014

Test build #22903 has finished for PR 3083 at commit 4ab6e56.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final class ByteColumn(columnPath: ColumnPath)
    • final class ShortColumn(columnPath: ColumnPath)
    • final class DateColumn(columnPath: ColumnPath)
    • final class TimestampColumn(columnPath: ColumnPath)
    • final class DecimalColumn(columnPath: ColumnPath)
    • final class WrappedDate(val date: Date) extends Comparable[WrappedDate]
    • final class WrappedTimestamp(val timestamp: Timestamp) extends Comparable[WrappedTimestamp]
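
The wrapper classes in this list plausibly exist because parquet-mr's filter API requires literal values to be comparable with their own type, while java.sql.Date is only Comparable[java.util.Date]. A hedged sketch of such a wrapper (the committed implementation may differ):

    import java.sql.Date

    // Restores the self-comparable bound that java.sql.Date lacks, so date
    // literals can participate in Parquet comparison filters.
    final class WrappedDate(val date: Date) extends Comparable[WrappedDate] {
      override def compareTo(other: WrappedDate): Int = date.compareTo(other.date)
    }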

@marmbrus (Contributor) commented Nov 7, 2014

Thanks! Merged to master and 1.2

@sarutak (Member, Author) commented Nov 7, 2014

@marmbrus Please wait. I found that this change still has a problem. I'll address it.

@marmbrus (Contributor) commented Nov 7, 2014

Sorry, too late. Please address it in a follow-up.

Can you put [WIP] in the title when it's not ready to merge?

@sarutak sarutak changed the title [SPARK-4213][SQL] ParquetFilters - No support for LT, LTE, GT, GTE operators [SPARK-4213][SQL][WIP] ParquetFilters - No support for LT, LTE, GT, GTE operators Nov 7, 2014
asfgit pushed a commit that referenced this pull request Nov 7, 2014
[SPARK-4213][SQL] ParquetFilters - No support for LT, LTE, GT, GTE operators

(The commit message repeats the JIRA description quoted at the top of this PR.)

Author: Kousuke Saruta <[email protected]>

Closes #3083 from sarutak/SPARK-4213 and squashes the following commits:

4ab6e56 [Kousuke Saruta] WIP
b6890c6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4213
9a1fae7 [Kousuke Saruta] Fixed ParquetFilters so that compare Strings

(cherry picked from commit 14c54f1)
Signed-off-by: Michael Armbrust <[email protected]>
asfgit closed this in 14c54f1 on Nov 7, 2014
mengxr pushed a commit to mengxr/spark that referenced this pull request Nov 18, 2014
(Same commit message as the asfgit commit below, with apache# cross-repository references.)
asfgit pushed a commit that referenced this pull request Nov 18, 2014
While reviewing PRs #3083 and #3161, I noticed that the Parquet record filter generation code can be simplified significantly, following the clue stated in [SPARK-4453](https://issues.apache.org/jira/browse/SPARK-4453). This PR addresses both SPARK-4453 and SPARK-4213 with this simplification.

While generating the `ParquetTableScan` operator, we need to remove all Catalyst predicates that have already been pushed down to Parquet. Originally, we first generated the record filter and then called `findExpression` to traverse the generated filter and find all pushed-down predicates [[1](https://github.com/apache/spark/blob/64c6b9bad559c21f25cd9fbe37c8813cdab939f2/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L213-L228)]. This forced us to introduce the `CatalystFilter` class hierarchy to bind each Catalyst predicate to its generated Parquet filter, complicating the code base a lot.

The basic idea of this PR is that we don't need `findExpression` after filter generation, because we already know a predicate can be pushed down if we can successfully generate its corresponding Parquet filter. SPARK-4213 is fixed by returning `None` for any unsupported predicate type.


Author: Cheng Lian <[email protected]>

Closes #3317 from liancheng/simplify-parquet-filters and squashes the following commits:

d6a9499 [Cheng Lian] Fixes import styling issue
43760e8 [Cheng Lian] Simplifies Parquet filter generation logic

(cherry picked from commit 36b0956)
Signed-off-by: Michael Armbrust <[email protected]>
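
A minimal, self-contained sketch of the simplification described above (assumed shapes, not the actual Spark or Parquet APIs): because filter generation returns Option, a predicate is pushed down exactly when generation succeeds, so one partition pass replaces the findExpression traversal and the CatalystFilter hierarchy.

    // Hypothetical predicate shape, for illustration only.
    case class Pred(column: String, op: String, value: String)

    // Some(filter) when the predicate is supported, None otherwise.
    def toParquetFilter(p: Pred): Option[String] = p.op match {
      case "<" | "<=" | ">" | ">=" | "=" =>
        Some(s"${p.op}(${p.column}, ${p.value})")
      case _ => None // e.g. LIKE: evaluate in Spark instead
    }

    val predicates = Seq(
      Pred("event", ">=", "'2011-12-01'"),
      Pred("event", "like", "'2011%'"))

    // The Option result itself says which predicates were pushed down and
    // which must remain in the Spark plan.
    val (pushed, remaining) = predicates.partition(p => toParquetFilter(p).isDefined)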
sarutak deleted the SPARK-4213 branch on April 11, 2015