[SPARK-4213][SQL][WIP] ParquetFilters - No support for LT, LTE, GT, GTE operators #3083

sarutak · 2014-11-04T00:52:18Z

The following description is quoted from JIRA:

When I issue an hql query against a HiveContext where my predicate uses a column of string type with one of the LT, LTE, GT, or GTE operators, I get the following error:
scala.MatchError: StringType (of class org.apache.spark.sql.catalyst.types.StringType$)
Looking at the code in org.apache.spark.sql.parquet.ParquetFilters, StringType is absent from the corresponding functions for creating these filters.
To reproduce, in a Hive 0.13.1 shell, I created the following table (in a specific database):

create table sparkbug (
id int,
event string
) stored as parquet;

Insert some sample data:

insert into table sparkbug select 1, '2011-06-18' from <some table> limit 1;
insert into table sparkbug select 2, '2012-01-01' from <some table> limit 1;

Launch a spark shell and create a HiveContext to the metastore where the table above is located.

import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
hc.setConf("spark.sql.shuffle.partitions", "10")
hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")
hc.setConf("spark.sql.parquet.compression.codec", "snappy")
import hc._
hc.hql("select * from <db>.sparkbug where event >= '2011-12-01'")

A scala.MatchError will appear in the output.
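
The failure mode described above can be reproduced without Spark or Parquet. The following is only a minimal sketch: `DataType`, the type objects, and `createLessThanFilter` are hypothetical stand-ins defined here, not the real Catalyst or `ParquetFilters` code, and the returned strings merely imitate filter descriptions. It shows how a pattern match with no case for `StringType` throws `scala.MatchError` at runtime.

```scala
object MatchErrorSketch {
  // Hypothetical stand-ins for Catalyst data types (not the real classes).
  trait DataType
  case object IntegerType extends DataType
  case object FloatType extends DataType
  case object StringType extends DataType

  // Mimics a filter factory that only knows about some types.
  // There is deliberately no case for StringType and no default case,
  // so passing StringType throws scala.MatchError.
  def createLessThanFilter(dataType: DataType, name: String): String =
    dataType match {
      case IntegerType => s"lt(intColumn($name))"
      case FloatType   => s"lt(floatColumn($name))"
    }
}
```

Calling `MatchErrorSketch.createLessThanFilter(MatchErrorSketch.StringType, "event")` throws a `MatchError`, while the integer and float cases return normally.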

sarutak · 2014-11-04T00:53:25Z

CC @marmbrus

SparkQA · 2014-11-04T01:57:24Z

Test build #22840 has finished for PR 3083 at commit 9a1fae7.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class NullType(PrimitiveType):
- case class ScalaUdfBuilder[T: TypeTag](f: AnyRef)

marmbrus · 2014-11-04T01:59:29Z

sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetFilters.scala

@@ -111,6 +111,11 @@ private[sql] object ParquetFilters {
          name,
          FilterApi.lt(floatColumn(name), literal.value.asInstanceOf[java.lang.Float]),
          predicate)
+      case StringType =>


What about DateType and the others? Do we need a default case that avoids match errors when we have predicates on types that parquet can't handle pushdown for?

Yeah, exactly. I'll try for the rest of the types.
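
One way to read the suggestion above is a default case that returns `None` for types that cannot be pushed down. This is a sketch under assumptions, not the code in this patch: the types and `createGreaterThanFilter` are hypothetical stand-ins, and the strings only imitate filter descriptions.

```scala
object DefaultCaseSketch {
  // Hypothetical stand-ins for Catalyst data types (not the real classes).
  trait DataType
  case object IntegerType extends DataType
  case object StringType extends DataType
  case object DateType extends DataType

  // Returns Some(filter) for supported types and None otherwise.
  // The default case means an unsupported type never causes a MatchError;
  // the caller can simply leave that predicate to be evaluated by Spark.
  def createGreaterThanFilter(dataType: DataType, name: String): Option[String] =
    dataType match {
      case IntegerType => Some(s"gt(intColumn($name))")
      case StringType  => Some(s"gt(binaryColumn($name))")
      case _           => None
    }
}
```

With this shape, adding support for another type is a matter of adding one case, and forgetting a type degrades to no pushdown rather than a query failure.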

SparkQA · 2014-11-05T01:33:32Z

Test build #22903 has finished for PR 3083 at commit 4ab6e56.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- final class ByteColumn(columnPath: ColumnPath)
- final class ShortColumn(columnPath: ColumnPath)
- final class DateColumn(columnPath: ColumnPath)
- final class TimestampColumn(columnPath: ColumnPath)
- final class DecimalColumn(columnPath: ColumnPath)
- final class WrappedDate(val date: Date) extends Comparable[WrappedDate]
- final class WrappedTimestamp(val timestamp: Timestamp) extends Comparable[WrappedTimestamp]

marmbrus · 2014-11-07T19:57:31Z

Thanks! Merged to master and 1.2

sarutak · 2014-11-07T20:05:05Z

@marmbrus Please wait. I found that this change still has a problem. I'll address it.

marmbrus · 2014-11-07T20:11:12Z

Sorry, too late. Please address in a follow up.

Can you put [WIP] in the title when it's not ready to merge?

Author: Kousuke Saruta <[email protected]>
Closes #3083 from sarutak/SPARK-4213 and squashes the following commits:
4ab6e56 [Kousuke Saruta] WIP
b6890c6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4213
9a1fae7 [Kousuke Saruta] Fixed ParquetFilters so that compare Strings
(cherry picked from commit 14c54f1)
Signed-off-by: Michael Armbrust <[email protected]>

While reviewing PR #3083 and #3161, I noticed that Parquet record filter generation code can be simplified significantly according to the clue stated in [SPARK-4453](https://issues.apache.org/jira/browse/SPARK-4213). This PR addresses both SPARK-4453 and SPARK-4213 with this simplification.

While generating `ParquetTableScan` operator, we need to remove all Catalyst predicates that have already been pushed down to Parquet. Originally, we first generate the record filter, and then call `findExpression` to traverse the generated filter to find out all pushed down predicates [[1](https://github.com/apache/spark/blob/64c6b9bad559c21f25cd9fbe37c8813cdab939f2/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L213-L228)]. In this way, we have to introduce the `CatalystFilter` class hierarchy to bind the Catalyst predicates together with their generated Parquet filter, and complicate the code base a lot.

The basic idea of this PR is that, we don't need `findExpression` after filter generation, because we already know a predicate can be pushed down if we can successfully generate its corresponding Parquet filter. SPARK-4213 is fixed by returning `None` for any unsupported predicate type.

Author: Cheng Lian <[email protected]>
Closes #3317 from liancheng/simplify-parquet-filters and squashes the following commits:
d6a9499 [Cheng Lian] Fixes import styling issue
43760e8 [Cheng Lian] Simplifies Parquet filter generation logic
(cherry picked from commit 36b0956)
Signed-off-by: Michael Armbrust <[email protected]>
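
The idea in the commit message above (a predicate is known to be pushed down exactly when a filter can be generated for it) can be sketched roughly as follows. This is an illustration under assumptions, not the code from #3317: `Predicate`, its case classes, `createFilter`, and `split` are hypothetical names defined here, and filters are represented as plain strings.

```scala
object PushdownSketch {
  // Hypothetical stand-ins for Catalyst predicates (not the real classes).
  sealed trait Predicate
  final case class GreaterThanOrEqual(column: String, value: Any) extends Predicate
  final case class Like(column: String, pattern: String) extends Predicate

  // Some(filter) when the predicate can be translated, None when it cannot.
  def createFilter(p: Predicate): Option[String] = p match {
    case GreaterThanOrEqual(c, v: Int)    => Some(s"gtEq($c, $v)")
    case GreaterThanOrEqual(c, v: String) => Some(s"gtEq($c, '$v')")
    case _                                => None
  }

  // Splits predicates into (generated filters, predicates that were not
  // translated). No traversal of the generated filter is needed afterwards:
  // a predicate was pushed down if and only if createFilter returned Some.
  def split(predicates: Seq[Predicate]): (Seq[String], Seq[Predicate]) = {
    val translated = predicates.map(p => p -> createFilter(p))
    val pushed     = translated.collect { case (_, Some(f)) => f }
    val remaining  = translated.collect { case (p, None) => p }
    (pushed, remaining)
  }
}
```

For example, `split(Seq(GreaterThanOrEqual("event", "2011-12-01"), Like("event", "2011%")))` yields one generated filter for the comparison and leaves the `Like` predicate in the second sequence for the engine to evaluate.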

Fixed ParquetFilters so that compare Strings

9a1fae7

sarutak changed the title ~~[SPARK-4213] ParquetFilters - No support for LT, LTE, GT, GTE operators~~ [SPARK-4213][SQL] ParquetFilters - No support for LT, LTE, GT, GTE operators Nov 4, 2014

marmbrus reviewed Nov 4, 2014

sarutak added 2 commits November 4, 2014 06:31

Merge branch 'master' of git://git.apache.org/spark into SPARK-4213

b6890c6

WIP

4ab6e56

sarutak changed the title ~~[SPARK-4213][SQL] ParquetFilters - No support for LT, LTE, GT, GTE operators~~ [SPARK-4213][SQL][WIP] ParquetFilters - No support for LT, LTE, GT, GTE operators Nov 7, 2014

asfgit closed this in 14c54f1 Nov 7, 2014

sarutak mentioned this pull request Nov 7, 2014

[REVERT][SPARK-4213][SQL] ParquetFilters - No support for LT, LTE, GT, GTE operators #3161

Closed

liancheng mentioned this pull request Nov 17, 2014

[SPARK-4453][SPARK-4213][SQL] Simplifies Parquet filter generation code #3317

Closed

sarutak deleted the SPARK-4213 branch April 11, 2015 05:23
