[SPARK-27411][SQL] DataSourceV2Strategy should not eliminate subquery #24321
Conversation
cc @cloud-fan
ok to test
test("SPARK-27411: DataSourceV2Strategy should not eliminate subquery") { | ||
val t2 = spark.read.format(classOf[SimpleDataSourceV2].getName).load() | ||
sql("create temporary view t1 (a int) using parquet") |
let's wrap the test with `withTempView`, so that the view can be cleaned up
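Something like this (a sketch; `withTempView` is the helper from Spark's `SQLTestUtils`, which the suite is assumed to mix in):

```scala
withTempView("t1") {
  sql("create temporary view t1 (a int) using parquet")
  // ... rest of the test; the temporary view is dropped when this
  // block exits, even if an assertion fails
}
```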
Yes
```scala
// Collect the subqueries attached to every node of the executed physical plan.
val subqueries = df.queryExecution.executedPlan.collect {
  case p => p.subqueries
}.flatten
// The scalar subquery from the filter must survive planning.
assert(subqueries.length == 1)
```
let's also check the result. I think this is a correctness bug?
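For example (a sketch; `checkAnswer` is the result-comparison helper from Spark's `QueryTest`, and the expected rows below are hypothetical placeholders):

```scala
// Hypothetical expected rows -- the real values depend on what
// SimpleDataSourceV2 produces and what the subquery over t1 returns.
checkAnswer(df, Seq(Row(0), Row(1), Row(2)))
```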
Sure, thanks.
Test build #104423 has finished for PR 24321 at commit
Nice catch! LGTM
thanks, merging to master!
Test build #104426 has finished for PR 24321 at commit
What changes were proposed in this pull request?

In `DataSourceV2Strategy`, it seems we eliminate the subqueries by mistake after normalizing filters.

We have a SQL query with a scalar subquery:

```scala
val plan = spark.sql("select * from t2 where t2a > (select max(t1a) from t1)")
plan.explain(true)
```

And we get the following log info from `DataSourceV2Strategy`:

```
Pushing operators to csv:examples/src/main/resources/t2.txt
Pushed Filters:
Post-Scan Filters: isnotnull(t2a#30)
Output: t2a#30, t2b#31
```

The `Post-Scan Filters` should contain the scalar subquery, but we eliminate it by mistake:

```
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('t2a > scalar-subquery#56 [])
   :  +- 'Project [unresolvedalias('max('t1a), None)]
   :     +- 'UnresolvedRelation `t1`
   +- 'UnresolvedRelation `t2`

== Analyzed Logical Plan ==
t2a: string, t2b: string
Project [t2a#30, t2b#31]
+- Filter (t2a#30 > scalar-subquery#56 [])
   :  +- Aggregate [max(t1a#13) AS max(t1a)#63]
   :     +- SubqueryAlias `t1`
   :        +- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt
   +- SubqueryAlias `t2`
      +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt

== Optimized Logical Plan ==
Filter (isnotnull(t2a#30) && (t2a#30 > scalar-subquery#56 []))
:  +- Aggregate [max(t1a#13) AS max(t1a)#63]
:     +- Project [t1a#13]
:        +- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt
+- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt

== Physical Plan ==
*(1) Project [t2a#30, t2b#31]
+- *(1) Filter isnotnull(t2a#30)
   +- *(1) BatchScan[t2a#30, t2b#31] class org.apache.spark.sql.execution.datasources.v2.csv.CSVScan
```

How was this patch tested?

Added a unit test.

Closes apache#24321 from francis0407/SPARK-27411.

Authored-by: francis0407 <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
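The shape of the fix this implies (a sketch, not necessarily the exact patch): split the pushdown candidates before normalization, so predicates containing a subquery are kept as post-scan filters instead of being dropped. Catalyst already provides `SubqueryExpression.hasSubquery` for this check:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, SubqueryExpression}

// Predicates containing a subquery cannot be normalized or pushed into the
// data source; set them aside so they are not lost.
def splitSubqueryFilters(
    filters: Seq[Expression]): (Seq[Expression], Seq[Expression]) =
  filters.partition(SubqueryExpression.hasSubquery)
```

The strategy would then normalize and push down only the second group, and append the first group back onto the post-scan filters it returns.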