
[SPARK-24934][SQL] Explicitly whitelist supported types in upper/lower bounds for in-memory partition pruning #21882

Closed
wants to merge 3 commits into master from HyukjinKwon:stats-filter

Conversation

HyukjinKwon
Member

@HyukjinKwon commented Jul 26, 2018

What changes were proposed in this pull request?

It looks like we intentionally set null for the upper/lower bounds of complex types and don't use them. However, these bounds appear to be used in in-memory partition pruning, which ends up producing incorrect results.

This PR proposes to explicitly whitelist the supported types.

```scala
val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol")
df.cache().filter("arrayCol > array('a', 'b')").show()
```

```scala
val df = sql("select cast('a' as binary) as a")
df.cache().filter("a == cast('a' as binary)").show()
```

Before:

```
+--------+
|arrayCol|
+--------+
+--------+
```

```
+---+
|  a|
+---+
+---+
```

After:

```
+--------+
|arrayCol|
+--------+
|  [c, d]|
+--------+
```

```
+----+
|   a|
+----+
|[61]|
+----+
```

How was this patch tested?

Unit tests were added, and the change was also tested manually.

@HyukjinKwon
Member Author

cc @cloud-fan, mind taking a look please?

@pwoody

pwoody commented Jul 26, 2018

This is the same problem as #20935, yeah?

@HyukjinKwon
Member Author

Oops, it is. I didn't know. But this PR can target lower branches too, since it's a correctness issue :-)

@HyukjinKwon
Member Author

Also, I believe this issue can still happen even after your PR?

@pwoody

pwoody commented Jul 26, 2018

The test case here works fine at least. The linked PR focuses on accurately collecting stats, so null bounds should be correct if they occur.

@mgaido91
Contributor

@pwoody I am not sure I 100% agree with your last sentence. I agree that we should correct null bounds, but letting users hit bugs that return wrong results while we track down all the possible cases we have not thought of is not the right way to go, I think. Moreover, if a datatype is not orderable, we cannot even fix the lack of an upper and lower bound...

I think the approach proposed here is safer and I like it. It would be great (but I am not sure it is feasible) if we could emit a WARN message when a null is found (in testing we can throw an exception using AssertNotNull), in order to let users know that they are hitting a case which should not happen, so that they can report it to us.
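For illustration, a minimal sketch of the kind of safeguard being suggested; the object and method names here are hypothetical and not part of this PR:

```scala
// Hypothetical sketch only: warn when a stats bound is unexpectedly null for
// an attribute whose type should have bounds, so that users can report it.
import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.expressions.AttributeReference

object BoundSanityCheck extends Logging {
  def warnIfNullBound(attr: AttributeReference, lowerBound: Any, upperBound: Any): Unit = {
    if (lowerBound == null || upperBound == null) {
      logWarning(s"Null upper/lower bound found for attribute '${attr.name}' of type " +
        s"${attr.dataType}; in-memory partition pruning will skip this filter.")
    }
  }
}
```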

@HyukjinKwon
Member Author

HyukjinKwon commented Jul 26, 2018

We should backport this one anyway. Actually, the stats are logged at DEBUG level, so I think we are fine. I guess there is no harm in adding this safeguard and closing the hole that was found, and this doesn't block your PR either. We can proceed orthogonally.

// bounds.
private def nullSafeEval(
    attr: AttributeReference)(func: AttributeReference => Expression): Expression = {
  attr.isNull || func(attr)
Contributor

this adds an extra runtime null check and may introduce a perf regression. How about we follow the Hive partition pruning and only create filters for non-complex types? e.g.

    object ExtractableLiteral {
      def unapply(expr: Expression): Option[Expression] = {
        if (expr.dataType.isInstanceOf[AtomicType]) Some(expr) else None
      }
    }
...
case EqualTo(a: AttributeReference, ExtractableLiteral(l)) =>

Member Author

Thanks, @cloud-fan.

Contributor

This basically means turning off the filtering for complex types. Although this may not be a big deal, since we probably won't have complex types here often, can't we instead add the isNull filter only for complex types?
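For reference, a rough sketch of the alternative being asked about, assuming a hypothetical helper (this is not the code in this PR): keep the plain filter for atomic types and add the null guard only for complex types.

```scala
// Hypothetical sketch: guard against null bounds only for complex types and
// keep the unguarded bound filter for atomic types.
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}
import org.apache.spark.sql.types.AtomicType

def guardedEval(attr: AttributeReference)(func: AttributeReference => Expression): Expression = {
  attr.dataType match {
    case _: AtomicType => func(attr)     // bounds are populated; no extra null check needed
    case _ => attr.isNull || func(attr)  // complex types: bounds may be null
  }
}
```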

Member Author

Hmmmm... if this can whitelist the cases we support, I think it's okay to use the suggestion above. BTW, it looks like we should exclude binary type too. It will still support isNull or isNotNull though.
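A sketch of the whitelist shape being discussed; the first lines match the extractor quoted later in this thread, and the remaining arms are a best-guess completion of that truncated diff:

```scala
// Only extract literals of atomic types, excluding BinaryType, so that
// upper/lower-bound filters are built only for types whose bounds are usable.
// IsNull/IsNotNull filters do not go through this extractor, so they keep
// working for all types.
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}
import org.apache.spark.sql.types.{AtomicType, BinaryType}

private object ExtractableLiteral {
  def unapply(expr: Expression): Option[Literal] = expr match {
    case lit: Literal => lit.dataType match {
      case BinaryType => None          // binary bounds are not supported
      case _: AtomicType => Some(lit)  // whitelist atomic types
      case _ => None                   // complex types are excluded
    }
    case _ => None
  }
}
```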

checkBatchPruning("SELECT _1 FROM pruningArrayData WHERE _1 <= array(1)", 5, 10)(Seq(Array(1)))
checkBatchPruning("SELECT _1 FROM pruningArrayData WHERE _1 >= array(1)", 5, 10)(
testArrayData.map(_._1))

Contributor

nit: unneeded blank line

@HyukjinKwon HyukjinKwon changed the title [SPARK-24934][SQL] Handle missing upper/lower bounds case in in-memory partition pruning [SPARK-24934][SQL] Explicitly whitelist supported types in upper/lower bounds for in-memory partition pruning Jul 26, 2018
@SparkQA

SparkQA commented Jul 26, 2018

Test build #93597 has finished for PR 21882 at commit ea38b56.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

HyukjinKwon commented Jul 26, 2018

Oh BTW, please allow me to merge this one when there are some sign-offs and we are ready. I should test #21880 :-) ... I tested this against another PR. It's fine now.

@SparkQA

SparkQA commented Jul 26, 2018

Test build #93608 has finished for PR 21882 at commit 8cd100e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 27, 2018

Test build #93649 has finished for PR 21882 at commit 7f1040e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private object ExtractableLiteral {
  def unapply(expr: Expression): Option[Literal] = expr match {
    case lit: Literal => lit.dataType match {
      case BinaryType => None
Contributor

can we also add a test for binary type?

Member Author

Will add late tonight or tomorrow

@cloud-fan
Contributor

LGTM

checkBatchPruning("SELECT _1 FROM pruningArrayData WHERE _1 >= array(1)", 5, 10)(
testArrayData.map(_._1))
// Do not filter on binary type
checkBatchPruning(
Member Author

Before this change, this test failed with `Expected Array(Array(1)), but got Array()` (wrong query result).

pruningArrayData.createOrReplaceTempView("pruningArrayData")
spark.catalog.cacheTable("pruningArrayData")

val pruningBinaryData = sparkContext.makeRDD(testBinaryData, 5).toDF()
Member Author

scala> spark.sparkContext.makeRDD((1 to 100).map { key => Tuple1(Array.fill(key)(key.toByte)) }, 5).toDF().printSchema()
root
 |-- _1: binary (nullable = true)

// Do not filter on binary type
checkBatchPruning(
  title = "SELECT _1 FROM pruningBinaryData WHERE _1 == 0x01 (binary literal)",
  actual = spark.table("pruningBinaryData").filter($"_1".equalTo(Array[Byte](1.toByte))),
Member Author
@HyukjinKwon Jul 28, 2018

The problem here is that there seems to be no SQL binary literal, so I had to use the Scala API.

Contributor

not very elegant, but we can do binary(chr(5)) in order to get a binary literal
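For example, something along these lines should work against the pruningBinaryData view used in these tests (a usage sketch, not code from the PR):

```scala
// SQL-only binary literal via binary(chr(...)), avoiding the Scala API.
spark.sql("SELECT _1 FROM pruningBinaryData WHERE _1 = binary(chr(1))").show()
```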

Member Author

Oops right.

@SparkQA

SparkQA commented Jul 28, 2018

Test build #93716 has finished for PR 21882 at commit 3e7b319.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 28, 2018

Test build #93720 has finished for PR 21882 at commit 1a0a2d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 28, 2018

Test build #93722 has finished for PR 21882 at commit fe3c0a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

override protected def afterEach(): Unit = {
  try {
    spark.catalog.uncacheTable("pruningData")
    spark.catalog.uncacheTable("pruningStringData")
    spark.catalog.uncacheTable("pruningArrayData")
Contributor

uncache the pruningBinaryData too
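That is, the cleanup block would presumably also need (a sketch of the suggested addition):

```scala
spark.catalog.uncacheTable("pruningBinaryData")
```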

@SparkQA

SparkQA commented Jul 30, 2018

Test build #93761 has finished for PR 21882 at commit deb20ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jul 30, 2018
[SPARK-24934][SQL] Explicitly whitelist supported types in upper/lower bounds for in-memory partition pruning

Author: hyukjinkwon <[email protected]>

Closes #21882 from HyukjinKwon/stats-filter.

(cherry picked from commit bfe60fc)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Contributor

thanks, merging to master/2.3!

@asfgit asfgit closed this in bfe60fc Jul 30, 2018
@HyukjinKwon
Member Author

Thank you @pwoody, @mgaido91 and @cloud-fan.
