
# [SPARK-17409] [SQL] Do Not Optimize Query in CTAS More Than Once #15048

Closed · wants to merge 5 commits

Conversation

@gatorsmile (Member) commented Sep 11, 2016

### What changes were proposed in this pull request?

As explained in #14797:

> Some analyzer rules make assumptions about logical plans, and the optimizer may break those assumptions. We should not pass an optimized query plan into QueryExecution (where it will be analyzed again); otherwise we may hit some weird bugs.
> For example, we have a rule for decimal calculation that promotes the precision before binary operations, using PromotePrecision as a placeholder to indicate that the rule should not be applied twice. But an Optimizer rule removes this placeholder, which breaks the assumption; the rule is then applied twice and produces wrong results.

We should not optimize the query in CTAS more than once. For example,

```Scala
spark.range(99, 101).createOrReplaceTempView("tab1")
val sqlStmt = "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18)) as num FROM tab1"
sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt")
checkAnswer(spark.table("tab2"), sql(sqlStmt))
```

Before this PR, the results do not match:

```
== Results ==
!== Correct Answer - 2 ==       == Spark Answer - 2 ==
![100,100.000000000000000000]   [100,null]
 [99,99.000000000000000000]     [99,99.000000000000000000]
```

After this PR, the results match:

```
+---+----------------------+
|id |num                   |
+---+----------------------+
|99 |99.000000000000000000 |
|100|100.000000000000000000|
+---+----------------------+
```

In this PR, we do not treat the `query` in CTAS as a child. Thus, the `query` will not be optimized when optimizing the CTAS statement. However, we still need to analyze it so that the CTAS can be normalized and verified in the Analyzer. We therefore do this in the analyzer rule `PreprocessDDL`, since so far only that rule needs the analyzed plan of the `query`.
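For reference, here is a minimal sketch of the resulting shape (a simplification, not the exact Spark source; the normalization and checking logic is omitted):

```Scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan}

// Simplified sketch: the CTAS query lives in a plain field rather than in
// `children`, so optimizer rules that transform a plan's children never
// touch it. Only an explicit analysis step (in the analyzer rule) sees it.
case class CreateTable(
    tableDesc: CatalogTable,
    mode: SaveMode,
    query: Option[LogicalPlan]) extends LeafNode {
  // LeafNode defines `children` as empty, so `query` is not a child.
  override def output: Seq[Attribute] = Seq.empty[Attribute]
}
```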

### How was this patch tested?

Added a test.

@SparkQA commented Sep 11, 2016

Test build #65217 has finished for PR 15048 at commit da7deed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member Author)

cc @cloud-fan @yhuai @davies

```diff
@@ -37,7 +38,9 @@ case class CreateTable(tableDesc: CatalogTable, mode: SaveMode, query: Option[LogicalPlan]

 override def output: Seq[Attribute] = Seq.empty[Attribute]

-override def children: Seq[LogicalPlan] = query.toSeq
+override def children: Seq[LogicalPlan] = Seq.empty[LogicalPlan]
```
Contributor

extend `LeafNode`?

Member Author

Yeah. : )

@hvanhovell (Contributor)

@gatorsmile so should we check all commands? It might also be an idea to have `Command` extend `LeafNode` (and make `children` final). I think @davies did something similar for #14797.

@gatorsmile (Member Author)

@hvanhovell Sure, will do it. Thanks!

@SparkQA commented Sep 12, 2016

Test build #65233 has finished for PR 15048 at commit ae335ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait Command extends LeafNode
    • trait RunnableCommand extends logical.Command
    • case class CreateTable(
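In other words, the hierarchy listed above looks roughly like this (a sketch; member lists are simplified):

```Scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LeafNode

// Every command is now a leaf node, so no command (CTAS included) exposes an
// inner query plan as an optimizable child.
trait Command extends LeafNode {
  override def output: Seq[Attribute] = Seq.empty
}

// Runnable commands are executed eagerly; downstream code consumes their
// result rows rather than transforming their inner plans.
trait RunnableCommand extends Command {
  def run(sparkSession: SparkSession): Seq[Row]
}
```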

```diff
@@ -68,7 +68,7 @@ class ResolveDataSource(sparkSession: SparkSession) extends Rule[LogicalPlan] {
 /**
  * Preprocess some DDL plans, e.g. [[CreateTable]], to do some normalization and checking.
```
Contributor

We should update the comments to say that this rule will also analyze the query. (We may also want to update the rule name.)

Member Author

Sure, let me do it now. Thanks!

@SparkQA commented Sep 13, 2016

Test build #65283 has finished for PR 15048 at commit 4c3c955.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AnalyzeCreateTable(sparkSession: SparkSession) extends Rule[LogicalPlan]
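A sketch of what the renamed rule does with the not-yet-analyzed `query` (an assumed shape based on this thread; the normalization and checking steps are omitted):

```Scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.datasources.CreateTable

case class AnalyzeCreateTable(sparkSession: SparkSession) extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Since `query` is no longer a child of CreateTable, the analyzer does
    // not resolve it automatically; this rule analyzes it explicitly so the
    // CTAS can still be normalized and checked against an analyzed plan.
    case c @ CreateTable(_, _, Some(query)) if !query.resolved =>
      val analyzedQuery = sparkSession.sessionState.executePlan(query).analyzed
      c.copy(query = Some(analyzedQuery))
  }
}
```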

@cloud-fan (Contributor)

thanks, merging to master!

@asfgit closed this in 52738d4 on Sep 14, 2016
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
### What changes were proposed in this pull request?
As explained in apache#14797:
> Some analyzer rules make assumptions about logical plans, and the optimizer may break those assumptions. We should not pass an optimized query plan into QueryExecution (where it will be analyzed again); otherwise we may hit some weird bugs.
> For example, we have a rule for decimal calculation that promotes the precision before binary operations, using PromotePrecision as a placeholder to indicate that the rule should not be applied twice. But an Optimizer rule removes this placeholder, which breaks the assumption; the rule is then applied twice and produces wrong results.

We should not optimize the query in CTAS more than once. For example,
```Scala
spark.range(99, 101).createOrReplaceTempView("tab1")
val sqlStmt = "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18)) as num FROM tab1"
sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt")
checkAnswer(spark.table("tab2"), sql(sqlStmt))
```
Before this PR, the results do not match
```
== Results ==
!== Correct Answer - 2 ==       == Spark Answer - 2 ==
![100,100.000000000000000000]   [100,null]
 [99,99.000000000000000000]     [99,99.000000000000000000]
```
After this PR, the results match.
```
+---+----------------------+
|id |num                   |
+---+----------------------+
|99 |99.000000000000000000 |
|100|100.000000000000000000|
+---+----------------------+
```

In this PR, we do not treat the `query` in CTAS as a child. Thus, the `query` will not be optimized when optimizing the CTAS statement. However, we still need to analyze it so that the CTAS can be normalized and verified in the Analyzer. We therefore do this in the analyzer rule `PreprocessDDL`, since so far only that rule needs the analyzed plan of the `query`.

### How was this patch tested?
Added a test.

Author: gatorsmile <[email protected]>

Closes apache#15048 from gatorsmile/ctasOptimized.
@yhuai (Contributor) commented Oct 12, 2016

@gatorsmile We should also backport this to branch 2.0, right?

@yhuai (Contributor) commented Oct 12, 2016

@gatorsmile Also, does it affect CTAS for creating a Hive serde table?

@gatorsmile (Member Author)

Yeah. We should backport it to 2.0.

Yeah. It affects both data source tables and Hive serde tables. To fix it in Spark 2.0, we need to rewrite the fix, since Spark 2.0 does not have a unified logical plan, AFAIK. Let me submit a PR to backport it.

@yhuai (Contributor) commented Oct 12, 2016

Thanks! BTW, does this patch cover Hive tables?

@yhuai (Contributor) commented Oct 12, 2016

Also, another good test for this is:

```Scala
val df = sql("select 0 as id")
df.registerTempTable("foo")
val df2 = sql("""select * from foo group by id""")
df2.write.mode("overwrite").saveAsTable("bar")
```

Without this fix, you will get an exception like `org.apache.spark.sql.AnalysisException: GROUP BY position 0 is not in select list (valid range is [1, 1]); line 1 pos 7`.

@yhuai (Contributor) commented Oct 12, 2016

Also, can we add a test for hive tables?

@gatorsmile (Member Author)

Yeah, based on my understanding, it should cover Hive serde tables. I will submit a PR to make sure of that and also include the test case you provided above. Thank you!

asfgit pushed a commit that referenced this pull request Oct 17, 2016
…15048

### What changes were proposed in this pull request?
This PR is to backport #15048 and #15459.

However, in 2.0, we do not have a unified logical node `CreateTable`, and the analyzer rule `PreWriteCheck` is also different. To minimize the code changes, this PR adds a new rule `AnalyzeCreateTableAsSelect`. Please review it as a new PR. Thanks!

As explained in #14797:
> Some analyzer rules make assumptions about logical plans, and the optimizer may break those assumptions. We should not pass an optimized query plan into QueryExecution (where it will be analyzed again); otherwise we may hit some weird bugs.
> For example, we have a rule for decimal calculation that promotes the precision before binary operations, using PromotePrecision as a placeholder to indicate that the rule should not be applied twice. But an Optimizer rule removes this placeholder, which breaks the assumption; the rule is then applied twice and produces wrong results.

We should not optimize the query in CTAS more than once. For example,
```Scala
spark.range(99, 101).createOrReplaceTempView("tab1")
val sqlStmt = "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18)) as num FROM tab1"
sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt")
checkAnswer(spark.table("tab2"), sql(sqlStmt))
```
Before this PR, the results do not match
```
== Results ==
!== Correct Answer - 2 ==       == Spark Answer - 2 ==
![100,100.000000000000000000]   [100,null]
 [99,99.000000000000000000]     [99,99.000000000000000000]
```
After this PR, the results match.
```
+---+----------------------+
|id |num                   |
+---+----------------------+
|99 |99.000000000000000000 |
|100|100.000000000000000000|
+---+----------------------+
```

In this PR, we do not treat the `query` in CTAS as a child. Thus, the `query` will not be optimized when optimizing the CTAS statement. However, we still need to analyze it so that the CTAS can be normalized and verified in the Analyzer. We therefore do this in the analyzer rule `PreprocessDDL`, since so far only that rule needs the analyzed plan of the `query`.

### How was this patch tested?

Author: gatorsmile <[email protected]>

Closes #15502 from gatorsmile/ctasOptimize2.0.
ghost pushed a commit to dbtsai/spark that referenced this pull request Oct 25, 2016
… Once

### What changes were proposed in this pull request?
This follow-up PR is for addressing the [comment](apache#15048).

We added two test cases based on the suggestion from yhuai. One is a new test case that uses the `saveAsTable` API to create a data source table; the other is for CTAS on a Hive serde table.

Note: No need to backport this PR to 2.0. A new PR will be submitted to backport the whole fix, with the new test cases, to Spark 2.0.

### How was this patch tested?
N/A

Author: gatorsmile <[email protected]>

Closes apache#15459 from gatorsmile/ctasOptimizedTestCases.
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
ghost pushed a commit to dbtsai/spark that referenced this pull request Dec 19, 2017
## What changes were proposed in this pull request?
We could get incorrect results by running `DecimalPrecision` twice. This PR resolves the original issue found in apache#15048 and apache#14797. After this PR, it becomes easier to change it back to using `children` instead of `innerChildren`.

## How was this patch tested?
The existing test.

Author: gatorsmile <[email protected]>

Closes apache#20000 from gatorsmile/keepPromotePrecision.
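For context, `innerChildren` lets a plan carry a nested plan for display (explain/tree output) without exposing it to tree transformations. A sketch of the distinction (assumed shape, not the exact Spark source):

```Scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.QueryPlan
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan}

case class CreateTable(
    tableDesc: CatalogTable,
    mode: SaveMode,
    query: Option[LogicalPlan]) extends LeafNode {
  override def output: Seq[Attribute] = Seq.empty
  // `innerChildren` shows up in explain output, but transformDown/transformUp
  // only walk `children` (empty for a leaf), so the optimizer skips `query`.
  override protected def innerChildren: Seq[QueryPlan[_]] = query.toSeq
}
```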