
[SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWriter #21821

Closed
wants to merge 5 commits

Conversation

gatorsmile
Member

What changes were proposed in this pull request?

```Scala
val udf1 = udf((x: Int, y: Int) => x + y)
val df = spark.range(0, 3).toDF("a")
  .withColumn("b", udf1($"a", udf1($"a", lit(10))))
df.cache()
df.write.saveAsTable("t")
```

The cache is not used because the plan no longer matches the cached plan. This is a regression caused by the changes we made in AnalysisBarrier, since not all Analyzer rules are idempotent.
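To see the kind of non-idempotence at play, here is a minimal, self-contained sketch. The types and the rule below are illustrative stand-ins, not Spark's actual Catalyst classes or `HandleNullInputsForUDF`:

```scala
// Toy expression tree standing in for Catalyst expressions (illustrative only).
sealed trait Expr
case class Attr(name: String) extends Expr
case class Udf(child: Expr) extends Expr
// Models the If(IsNull(x), null, udf(x)) wrapper that a null-handling rule adds.
case class NullCheck(child: Expr) extends Expr

// A non-idempotent rule: it re-wraps a Udf even when it is already wrapped.
def handleNullInputs(e: Expr): Expr = e match {
  case Udf(c)       => NullCheck(Udf(handleNullInputs(c)))
  case NullCheck(c) => NullCheck(handleNullInputs(c))
  case a: Attr      => a
}

val once  = handleNullInputs(Udf(Attr("a")))   // NullCheck(Udf(Attr(a)))
val twice = handleNullInputs(once)             // NullCheck(NullCheck(Udf(Attr(a))))
// Re-analysis changed the plan, so an equality-based cache lookup misses.
println(once == twice) // false
```

Because the write path re-ran analysis on an already-analyzed plan, a rule like this mutates it again, and the resulting plan no longer equals the one registered in the cache; wrapping the plan in AnalysisBarrier keeps the analyzer from revisiting it.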

How was this patch tested?

Added a test.

Also found a bug in the DSV1 write path. It is not a regression, so I opened a separate JIRA: https://issues.apache.org/jira/browse/SPARK-24869

```diff
@@ -254,7 +254,7 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
     val writer = ws.createWriter(jobId, df.logicalPlan.schema, mode, options)
     if (writer.isPresent) {
       runCommand(df.sparkSession, "save") {
-        WriteToDataSourceV2(writer.get(), df.logicalPlan)
+        WriteToDataSourceV2(writer.get(), df.planWithBarrier)
```

gatorsmile (Member Author)

This change is not needed but it is safe to have.

@SparkQA

SparkQA commented Jul 19, 2018

Test build #93305 has finished for PR 21821 at commit 23ec09f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 20, 2018

Test build #93309 has finished for PR 21821 at commit 4030e17.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

cc @cloud-fan @hvanhovell @rxin

@SparkQA

SparkQA commented Jul 20, 2018

Test build #93319 has finished for PR 21821 at commit 9edc28f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Jul 20, 2018

Test build #93323 has finished for PR 21821 at commit 9edc28f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile gatorsmile changed the title [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWriter [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWriter [WIP] Jul 20, 2018
@gatorsmile
Member Author

gatorsmile commented Jul 20, 2018

~~We need a separate rule to eliminate barriers for the write path and CTAS, since the input queries are not always children of these nodes. Thus, the current EliminateBarriers does not work.~~

So far, all the relevant plans since 2.3 have extended DataWritingCommand. Thus, EliminateBarriers will still eliminate the barrier at the end of Analyzer.

@cloud-fan
Contributor

shall we fix the non-idempotent analyzer rule for 2.3.2?

```diff
@@ -891,8 +891,9 @@ object DDLUtils {
    * Throws exception if outputPath tries to overwrite inputpath.
    */
   def verifyNotReadPath(query: LogicalPlan, outputPath: Path) : Unit = {
-    val inputPaths = query.collect {
+    val inputPaths = EliminateBarriers(query).collect {
       case LogicalRelation(r: HadoopFsRelation, _, _, _) => r.location.rootPaths
```

gatorsmile (Member Author)

AnalysisBarrier is a leaf node. That is one of the reasons it could easily break other code.
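The leaf-node problem can be sketched with a toy plan tree (illustrative stand-ins, not Spark's actual TreeNode classes): because a barrier reports no children, generic traversals like `collect` never see the plan beneath it, which is why `verifyNotReadPath` has to strip barriers first.

```scala
// Toy logical-plan tree (illustrative only).
sealed trait Plan { def children: Seq[Plan] }
case class Relation(path: String) extends Plan { def children = Nil }
case class Project(child: Plan) extends Plan { def children = Seq(child) }
// Like AnalysisBarrier: holds a plan but exposes no children, so traversals stop here.
case class Barrier(hidden: Plan) extends Plan { def children = Nil }

// A collect-style traversal that walks `children`, as query.collect does.
def collectRelations(p: Plan): Seq[Relation] = p match {
  case r: Relation => Seq(r)
  case other       => other.children.flatMap(collectRelations)
}

// Strips barriers, analogous to EliminateBarriers(query) in the patch.
def eliminateBarriers(p: Plan): Plan = p match {
  case Barrier(h)  => eliminateBarriers(h)
  case Project(c)  => Project(eliminateBarriers(c))
  case r: Relation => r
}

val plan = Barrier(Project(Relation("/input")))
println(collectRelations(plan))                    // List(): the barrier hides the relation
println(collectRelations(eliminateBarriers(plan))) // List(Relation(/input))
```

Without the `EliminateBarriers(query)` call, the input-path check would silently find nothing under a barrier and fail to detect that a write overwrites its own input.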

@gatorsmile
Member Author

@cloud-fan This sounds good to me. @maryannxue Could you fix the rule HandleNullInputsForUDF?

@SparkQA

SparkQA commented Jul 22, 2018

Test build #93416 has finished for PR 21821 at commit e8bf33c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maryannxue
Contributor

Yes, @gatorsmile. Code is ready. Will post a PR shortly.

@gatorsmile gatorsmile changed the title [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWriter [WIP] [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWriter Jul 23, 2018
@gatorsmile
Member Author

cc @hvanhovell

@hvanhovell
Contributor

@gatorsmile do we still need this patch if maryann fixes this?

@gatorsmile
Member Author

gatorsmile commented Jul 23, 2018

@hvanhovell The question is whether HandleNullInputsForUDF is the only rule that is non-idempotent. If it is not, we still need to add an AnalysisBarrier. Either way, it sounds like the changes are still safe to apply?

@maryannxue
Contributor

LGTM.

@hvanhovell
Contributor

LGTM

@SparkQA

SparkQA commented Jul 25, 2018

Test build #93553 has finished for PR 21821 at commit 328addd.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor

Is this still valid since #21822 is going on? Shall we have this only on the 2.3 maintenance branch?

@gatorsmile
Member Author

@mgaido91 See the comment #21821 (comment)

@gatorsmile
Member Author

gatorsmile commented Jul 25, 2018

This PR is mainly for the Spark 2.3 branch.

The code changes will be removed from the master branch when #21822 is merged. However, the test cases are still valid.

@mgaido91
Contributor

LGTM

@SparkQA

SparkQA commented Jul 25, 2018

Test build #93555 has finished for PR 21821 at commit ddbd9f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jul 26, 2018

Author: Xiao Li <[email protected]>

Closes #21821 from gatorsmile/testMaster22.

(cherry picked from commit d2e7deb)
Signed-off-by: Xiao Li <[email protected]>
@asfgit asfgit closed this in d2e7deb Jul 26, 2018
@gatorsmile
Member Author

Thanks! Merged to master/2.3

6 participants