
# [SPARK-17409] [SQL] Do Not Optimize Query in CTAS More Than Once #15048

Closed · wants to merge 5 commits

Conversation

@gatorsmile (Member) commented Sep 11, 2016

### What changes were proposed in this pull request?

As explained in #14797:

> Some analyzer rules make assumptions about logical plans, and the optimizer may break those assumptions. We should not pass an optimized query plan into QueryExecution (where it will be analyzed again); otherwise we may hit some weird bugs.
> For example, we have a rule for decimal calculation that promotes the precision before binary operations, using PromotePrecision as a placeholder to indicate that the rule should not be applied twice. But an Optimizer rule removes this placeholder, which breaks the assumption; the rule is then applied twice and produces wrong results.

We should not optimize the query in CTAS more than once. For example,

```Scala
spark.range(99, 101).createOrReplaceTempView("tab1")
val sqlStmt = "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18)) as num FROM tab1"
sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt")
checkAnswer(spark.table("tab2"), sql(sqlStmt))
```

Before this PR, the results do not match:

```
== Results ==
!== Correct Answer - 2 ==       == Spark Answer - 2 ==
![100,100.000000000000000000]   [100,null]
 [99,99.000000000000000000]     [99,99.000000000000000000]
```

After this PR, the results match:

```
+---+----------------------+
|id |num                   |
+---+----------------------+
|99 |99.000000000000000000 |
|100|100.000000000000000000|
+---+----------------------+
```

In this PR, we do not treat the `query` in CTAS as a child. Thus, the `query` will not be optimized when optimizing the CTAS statement. However, we still need to analyze it so that the CTAS can be normalized and verified in the Analyzer. We therefore do this in the analyzer rule `PreprocessDDL`, since so far only that rule needs the analyzed plan of the `query`.
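For reference, here is a minimal sketch of the resulting shape (a simplification, not the exact Spark source; the normalization and checking logic is omitted):

```Scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan}

// Simplified sketch: the CTAS query lives in a plain field rather than in
// `children`, so optimizer rules that transform a plan's children never
// touch it. Only an explicit analysis step (in the analyzer rule) sees it.
case class CreateTable(
    tableDesc: CatalogTable,
    mode: SaveMode,
    query: Option[LogicalPlan]) extends LeafNode {
  // LeafNode defines `children` as empty, so `query` is not a child.
  override def output: Seq[Attribute] = Seq.empty[Attribute]
}
```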

### How was this patch tested?

Added a test.

@SparkQA commented Sep 11, 2016

Test build #65217 has finished for PR 15048 at commit da7deed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member Author)

cc @cloud-fan @yhuai @davies

```diff
@@ -37,7 +38,9 @@ case class CreateTable(tableDesc: CatalogTable, mode: SaveMode, query: Option[LogicalPlan]

 override def output: Seq[Attribute] = Seq.empty[Attribute]

-override def children: Seq[LogicalPlan] = query.toSeq
+override def children: Seq[LogicalPlan] = Seq.empty[LogicalPlan]
```
Contributor

extend `LeafNode`?

Member Author

Yeah. : )

@hvanhovell (Contributor)

@gatorsmile so should we check all commands? It might also be an idea to have `Command` extend `LeafNode` (and make `children` final). I think @davies did something similar for #14797.

@gatorsmile (Member Author)

@hvanhovell Sure, will do it. Thanks!

@SparkQA commented Sep 12, 2016

Test build #65233 has finished for PR 15048 at commit ae335ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait Command extends LeafNode
    • trait RunnableCommand extends logical.Command
    • case class CreateTable(
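In other words, the hierarchy listed above looks roughly like this (a sketch; member lists are simplified):

```Scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LeafNode

// Every command is now a leaf node, so no command (CTAS included) exposes an
// inner query plan as an optimizable child.
trait Command extends LeafNode {
  override def output: Seq[Attribute] = Seq.empty
}

// Runnable commands are executed eagerly; downstream code consumes their
// result rows rather than transforming their inner plans.
trait RunnableCommand extends Command {
  def run(sparkSession: SparkSession): Seq[Row]
}
```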

```diff
@@ -68,7 +68,7 @@ class ResolveDataSource(sparkSession: SparkSession) extends Rule[LogicalPlan] {
 /**
  * Preprocess some DDL plans, e.g. [[CreateTable]], to do some normalization and checking.
```
Contributor

We should update the comments to say that this rule will also analyze the query. (We may also want to update the rule name.)

Member Author

Sure, let me do it now. Thanks!

@SparkQA commented Sep 13, 2016

Test build #65283 has finished for PR 15048 at commit 4c3c955.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AnalyzeCreateTable(sparkSession: SparkSession) extends Rule[LogicalPlan]
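A sketch of what the renamed rule does with the not-yet-analyzed `query` (an assumed shape based on this thread; the normalization and checking steps are omitted):

```Scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.datasources.CreateTable

case class AnalyzeCreateTable(sparkSession: SparkSession) extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Since `query` is no longer a child of CreateTable, the analyzer does
    // not resolve it automatically; this rule analyzes it explicitly so the
    // CTAS can still be normalized and checked against an analyzed plan.
    case c @ CreateTable(_, _, Some(query)) if !query.resolved =>
      val analyzedQuery = sparkSession.sessionState.executePlan(query).analyzed
      c.copy(query = Some(analyzedQuery))
  }
}
```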

@cloud-fan (Contributor)

thanks, merging to master!

@asfgit closed this in 52738d4 on Sep 14, 2016
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
### What changes were proposed in this pull request?
As explained in apache#14797:
> Some analyzer rules make assumptions about logical plans, and the optimizer may break those assumptions. We should not pass an optimized query plan into QueryExecution (where it will be analyzed again); otherwise we may hit some weird bugs.
> For example, we have a rule for decimal calculation that promotes the precision before binary operations, using PromotePrecision as a placeholder to indicate that the rule should not be applied twice. But an Optimizer rule removes this placeholder, which breaks the assumption; the rule is then applied twice and produces wrong results.

We should not optimize the query in CTAS more than once. For example,
```Scala
spark.range(99, 101).createOrReplaceTempView("tab1")
val sqlStmt = "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18)) as num FROM tab1"
sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt")
checkAnswer(spark.table("tab2"), sql(sqlStmt))
```
Before this PR, the results do not match
```
== Results ==
!== Correct Answer - 2 ==       == Spark Answer - 2 ==
![100,100.000000000000000000]   [100,null]
 [99,99.000000000000000000]     [99,99.000000000000000000]
```
After this PR, the results match.
```
+---+----------------------+
|id |num                   |
+---+----------------------+
|99 |99.000000000000000000 |
|100|100.000000000000000000|
+---+----------------------+
```

In this PR, we do not treat the `query` in CTAS as a child. Thus, the `query` will not be optimized when optimizing the CTAS statement. However, we still need to analyze it so that the CTAS can be normalized and verified in the Analyzer. We therefore do this in the analyzer rule `PreprocessDDL`, since so far only that rule needs the analyzed plan of the `query`.

### How was this patch tested?
Added a test.

Author: gatorsmile <[email protected]>

Closes apache#15048 from gatorsmile/ctasOptimized.
@yhuai (Contributor) commented Oct 12, 2016

@gatorsmile We should also backport this to branch 2.0, right?

@yhuai (Contributor) commented Oct 12, 2016

@gatorsmile Also, does it affect CTAS for creating a Hive serde table?

@gatorsmile (Member Author)

Yeah. We should backport it to 2.0.

Yeah. It affects both data source tables and Hive serde tables. To fix it in Spark 2.0, we need to rewrite the fix, since Spark 2.0 does not have a unified logical plan, AFAIK. Let me submit a PR to backport it.

@yhuai (Contributor) commented Oct 12, 2016

Thanks! BTW, does this patch cover Hive tables?

@yhuai (Contributor) commented Oct 12, 2016

Also, another good test for this is:

```Scala
val df = sql("select 0 as id")
df.registerTempTable("foo")
val df2 = sql("""select * from foo group by id""")
df2.write.mode("overwrite").saveAsTable("bar")
```

Without this fix, you will get an exception like `org.apache.spark.sql.AnalysisException: GROUP BY position 0 is not in select list (valid range is [1, 1]); line 1 pos 7`.

@yhuai (Contributor) commented Oct 12, 2016

Also, can we add a test for hive tables?

@gatorsmile (Member Author)

Yeah, based on my understanding, it should cover Hive serde tables. I will submit a PR to make sure of that and also include the test case you provided above. Thank you!

asfgit pushed a commit that referenced this pull request Oct 17, 2016
…15048

### What changes were proposed in this pull request?
This PR is to backport #15048 and #15459.

However, in 2.0, we do not have a unified logical node `CreateTable`, and the analyzer rule `PreWriteCheck` is also different. To minimize the code changes, this PR adds a new rule `AnalyzeCreateTableAsSelect`. Please review it as a new PR. Thanks!

As explained in #14797:
> Some analyzer rules make assumptions about logical plans, and the optimizer may break those assumptions. We should not pass an optimized query plan into QueryExecution (where it will be analyzed again); otherwise we may hit some weird bugs.
> For example, we have a rule for decimal calculation that promotes the precision before binary operations, using PromotePrecision as a placeholder to indicate that the rule should not be applied twice. But an Optimizer rule removes this placeholder, which breaks the assumption; the rule is then applied twice and produces wrong results.

We should not optimize the query in CTAS more than once. For example,
```Scala
spark.range(99, 101).createOrReplaceTempView("tab1")
val sqlStmt = "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18)) as num FROM tab1"
sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt")
checkAnswer(spark.table("tab2"), sql(sqlStmt))
```
Before this PR, the results do not match
```
== Results ==
!== Correct Answer - 2 ==       == Spark Answer - 2 ==
![100,100.000000000000000000]   [100,null]
 [99,99.000000000000000000]     [99,99.000000000000000000]
```
After this PR, the results match.
```
+---+----------------------+
|id |num                   |
+---+----------------------+
|99 |99.000000000000000000 |
|100|100.000000000000000000|
+---+----------------------+
```

In this PR, we do not treat the `query` in CTAS as a child. Thus, the `query` will not be optimized when optimizing the CTAS statement. However, we still need to analyze it so that the CTAS can be normalized and verified in the Analyzer. We therefore do this in the analyzer rule `PreprocessDDL`, since so far only that rule needs the analyzed plan of the `query`.

### How was this patch tested?

Author: gatorsmile <[email protected]>

Closes #15502 from gatorsmile/ctasOptimize2.0.
ghost pushed a commit to dbtsai/spark that referenced this pull request Oct 25, 2016
… Once

### What changes were proposed in this pull request?
This follow-up PR is for addressing the [comment](apache#15048).

We added two test cases based on the suggestion from yhuai. One is a new test case that uses the `saveAsTable` API to create a data source table; the other is for CTAS on a Hive serde table.

Note: No need to backport this PR to 2.0. A new PR will be submitted to backport the whole fix, with the new test cases, to Spark 2.0.

### How was this patch tested?
N/A

Author: gatorsmile <[email protected]>

Closes apache#15459 from gatorsmile/ctasOptimizedTestCases.
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
ghost pushed a commit to dbtsai/spark that referenced this pull request Dec 19, 2017
## What changes were proposed in this pull request?
We could get incorrect results by running `DecimalPrecision` twice. This PR resolves the original issue found in apache#15048 and apache#14797. After this PR, it becomes easier to change it back to using `children` instead of `innerChildren`.

## How was this patch tested?
The existing test.

Author: gatorsmile <[email protected]>

Closes apache#20000 from gatorsmile/keepPromotePrecision.
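For context, `innerChildren` lets a plan carry a nested plan for display (explain/tree output) without exposing it to tree transformations. A sketch of the distinction (assumed shape, not the exact Spark source):

```Scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.QueryPlan
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan}

case class CreateTable(
    tableDesc: CatalogTable,
    mode: SaveMode,
    query: Option[LogicalPlan]) extends LeafNode {
  override def output: Seq[Attribute] = Seq.empty
  // `innerChildren` shows up in explain output, but transformDown/transformUp
  // only walk `children` (empty for a leaf), so the optimizer skips `query`.
  override protected def innerChildren: Seq[QueryPlan[_]] = query.toSeq
}
```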