[SPARK-19024][SQL] Implement new approach to write a permanent view #16613

jiangxb1987 · 2017-01-17T09:50:15Z

What changes were proposed in this pull request?

When creating or altering a view, we no longer need to generate a SQL text string from the LogicalPlan; instead we store the SQL query text, the output column names of the query plan, and the current database in the CatalogTable. Permanent views created this way can be resolved by the current view resolution approach.

The main advantages are:

1. If you update an underlying view, the current view also gets updated;
2. This gives us a chance to get rid of SQL generation for operators.

Major changes of this PR:

1. Generate the view-specific properties (e.g. view default database, view query output column names) during permanent view creation and store them as properties in the CatalogTable;
2. Update the commands CreateViewCommand and AlterViewAsCommand to remove SQL generation from them.
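
As a rough illustration of the first change, the sketch below builds view-specific properties from a default database and the query's output column names. It is plain Scala with no Spark dependency; the object name `ViewPropertiesSketch`, the property keys, and the function signature are assumptions chosen for illustration and may not match what the patch actually uses.

```scala
object ViewPropertiesSketch {
  // Hypothetical property keys, for illustration only; the keys used by the patch may differ.
  val DefaultDatabaseKey = "view.default.database"
  val QueryOutputPrefix = "view.query.out."
  val NumColumnsKey = QueryOutputPrefix + "numCols"
  val ColumnNamePrefix = QueryOutputPrefix + "col."

  /** Merge view-specific entries into the user-supplied table properties. */
  def generateViewProperties(
      properties: Map[String, String],
      currentDatabase: String,
      queryOutput: Seq[String]): Map[String, String] = {
    // Duplicate output names would make the stored column mapping ambiguous.
    require(queryOutput.distinct.size == queryOutput.size,
      s"The view output ${queryOutput.mkString("(", ",", ")")} contains duplicate column names.")

    // Drop entries left over from a previous definition (relevant for ALTER VIEW).
    val cleaned = properties.filterNot { case (key, _) =>
      key == DefaultDatabaseKey || key.startsWith(QueryOutputPrefix)
    }

    // One indexed entry per output column, so the order of the columns is preserved.
    val columnProps = queryOutput.zipWithIndex.map { case (name, index) =>
      s"$ColumnNamePrefix$index" -> name
    }

    cleaned +
      (DefaultDatabaseKey -> currentDatabase) +
      (NumColumnsKey -> queryOutput.size.toString) ++
      columnProps
  }
}
```

Calling `ViewPropertiesSketch.generateViewProperties(Map("owner" -> "alice"), "default", Seq("id", "name"))` keeps the `owner` entry and adds the database, the column count, and one entry per column (for example `view.query.out.col.0 -> id`). Storing the names as indexed properties rather than as regenerated SQL is what lets the view be re-resolved from its original text later.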

How was this patch tested?

Existing tests.

SparkQA · 2017-01-17T11:20:37Z

Test build #71501 has finished for PR 16613 at commit 917ca04.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-17T14:39:03Z

Test build #71507 has finished for PR 16613 at commit 9d582a4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jiangxb1987 · 2017-01-17T15:13:08Z

cc @cloud-fan @yhuai @hvanhovell @gatorsmile

cloud-fan · 2017-01-18T03:13:36Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala

+      properties: Map[String, String],
+      session: SparkSession,
+      viewText: String): Map[String, String] = {
+    // Try to analyze the viewText, throw an AnalysisException if the query is invalid.


do we need to do this? the passed in query is already the parsed and analyzed plan of viewText, isn't it?

Good catch! The query is not analyzed, perhaps we should use the analyzedPlan.

yea, we should use analyzedPlan

cloud-fan · 2017-01-18T03:14:34Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala

+    // Generate the query column names, throw an AnalysisException if there exists duplicate column
+    // names.
+    val queryOutput = queryPlan.schema.fieldNames
+    assert(queryOutput.toSet.size == queryOutput.size,


nit: queryOutput.distinct.size == queryOutput.size
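
For reference, `distinct.size` and `toSet.size` detect duplicates equally well; the suggestion is about readability. A minimal sketch of the check in plain Scala follows, with a hypothetical helper (`duplicates`, not part of the patch) that also reports which names repeat, which could make an error message more useful.

```scala
object DuplicateColumnsSketch {
  /** True when every column name appears exactly once. */
  def hasUniqueNames(names: Seq[String]): Boolean =
    names.distinct.size == names.size

  /** Hypothetical helper: the names that occur more than once, sorted for stable output. */
  def duplicates(names: Seq[String]): Seq[String] =
    names.groupBy(identity).collect {
      case (name, occurrences) if occurrences.size > 1 => name
    }.toSeq.sorted
}
```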

cloud-fan · 2017-01-18T03:17:01Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala

      viewOriginalText = originalText,
-      viewText = Some(viewSQL),
+      viewText = originalText,


something we can clean up: Hive will expand the view text, so it needs 2 fields: originalText and viewText. Since we don't expand the view text, but only add table properties, I think we only need a single field viewText in CatalogTable.

I'm afraid that would require changes in several dozen places; should we do that in a separate PR?

yea, in a separate PR.

SparkQA · 2017-01-18T07:42:48Z

Test build #71579 has started for PR 16613 at commit 2d49ef2.

cloud-fan · 2017-01-18T07:43:50Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala

+   * @param session the spark session.
+   * @param aliasedPlan if `userSpecifiedColumns` is defined, the aliased plan outputs the user
+   *                    specified columns, else it is the same as the `analyzedPlan`.
+   * @param analyzedPlan the analyzed logical plan that represents the child of a view.


why do we need both aliasedPlan and analyzedPlan?

We generate the queryColumnNames from analyzedPlan and the view schema from aliasedPlan; the two are not the same when userSpecifiedColumns is defined, so we have to pass both params to this function.
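
To make the distinction concrete, here is a minimal sketch in plain Scala (no Spark types). `AliasSketch`, `Column`, and `viewSchemaNames` are hypothetical names used only for illustration: the query's own output names play the role of what analyzedPlan yields, while the returned names play the role of the aliased plan's schema.

```scala
object AliasSketch {
  // Hypothetical stand-in for a user-specified view column (name plus optional comment).
  final case class Column(name: String, comment: Option[String] = None)

  /**
   * Names the view exposes to its readers: the user-specified names when given,
   * otherwise the query's own output names.
   */
  def viewSchemaNames(
      queryOutput: Seq[String],
      userSpecifiedColumns: Seq[Column]): Seq[String] = {
    if (userSpecifiedColumns.isEmpty) {
      queryOutput
    } else {
      // Each user-specified column must line up with exactly one query output column.
      require(userSpecifiedColumns.size == queryOutput.size,
        "The number of columns produced by the query does not match the specified column list.")
      userSpecifiedColumns.map(_.name)
    }
  }
}
```

For a statement like `CREATE VIEW v (c1, c2) AS SELECT id, name FROM t`, the query column names are `id, name` while the view schema names are `c1, c2`, which is why both inputs are needed.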

cloud-fan · 2017-01-18T07:44:36Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala

+   * @param analyzedPlan the analyzed logical plan that represents the child of a view.
+   * @return new view properties including view default database and query column names properties.
+   */
+  def generateViewProperties(


looks like all other methods in this class can be private?

yea, will update that.

cloud-fan · 2017-01-18T08:46:41Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala

@@ -173,7 +165,8 @@ case class CreateViewCommand(
      }
    } else {
      // Create the view if it doesn't exist.
-      catalog.createTable(prepareTable(sparkSession, aliasedPlan), ignoreIfExists = false)
+      catalog.createTable(prepareTable(sparkSession, analyzedPlan),
+        ignoreIfExists = false)


nit: put them in one line

cloud-fan · 2017-01-18T08:48:19Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala

+    }
+
+    val aliasedPlan = aliasPlan(session, analyzedPlan)
+    val newProperties = generateViewProperties(properties, session, analyzedPlan)

    CatalogTable(
      identifier = name,
      tableType = CatalogTableType.VIEW,
      storage = CatalogStorageFormat.empty,
      schema = aliasedPlan.schema,


nit: inline the aliasedPlan here, i.e. schema = aliasPlan(session, analyzedPlan).schema. Then it's clearer that we only alias the plan to get the schema

cloud-fan · 2017-01-18T08:50:04Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala

@@ -370,28 +370,35 @@ class HiveDDLSuite
      spark.range(10).write.saveAsTable(tabName)
      val viewName = "view1"
      withView(viewName) {
+        def checkProperties(
+            properties: Map[String, String],


nit: we can make this method better

    def checkProperties(expected: Map[String, String]): Boolean = {
      val properties = catalog.getTableMetadata(TableIdentifier(viewName)).properties
      ...
    }

cloud-fan · 2017-01-18T08:51:15Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLViewSuite.scala

-      """CREATE VIEW IF NOT EXISTS
+    withView("testView") {
+      sql(
+        """CREATE VIEW IF NOT EXISTS
        |default.testView (c1 COMMENT 'blabla', c2 COMMENT 'blabla')


nit: indentation is wrong

cloud-fan · 2017-01-18T09:26:44Z

LGTM, pending jenkins

SparkQA · 2017-01-18T11:03:14Z

Test build #71587 has finished for PR 16613 at commit e2ccdd5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-01-18T11:13:24Z

thanks, merging to master!

jiangxb1987 · 2017-01-18T11:14:39Z

Thank you @cloud-fan !

SparkQA · 2017-01-18T11:49:29Z

Test build #71589 has finished for PR 16613 at commit 7c5b6af.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2017-01-18T16:29:10Z

is there a feature flag that is used to determine if we use this new approach? I feel it will be good to have an internal feature flag to determine the code path. So, if there is something wrong that is hard to fix quickly before the release, we can still switch back to the old code path. Then, in the next release, we can remove the feature flag. What do you think?

Also, @jiangxb1987 can you take a look at the SQLViewSuite and see if we have enough test coverage?

yhuai · 2017-01-19T02:14:56Z

nvm. On second thought, the feature flag does not really buy us anything. We just store the original view definition and the column mapping in the metastore, so I think it is fine to just do the switch.

## What changes were proposed in this pull request?

When creating or altering a view, we no longer need to generate a SQL text string from the LogicalPlan; instead we store the SQL query text, the output column names of the query plan, and the current database in the CatalogTable. Permanent views created this way can be resolved by the current view resolution approach.

The main advantages are:

1. If you update an underlying view, the current view also gets updated;
2. This gives us a chance to get rid of SQL generation for operators.

Major changes of this PR:

1. Generate the view-specific properties (e.g. view default database, view query output column names) during permanent view creation and store them as properties in the CatalogTable;
2. Update the commands `CreateViewCommand` and `AlterViewAsCommand` to remove SQL generation from them.

## How was this patch tested?

Existing tests.

Author: jiangxingbo

Closes apache#16613 from jiangxb1987/view-write-path.

implement view write path for the new approach.

917ca04

update failed test cases in HiveDDLSuite.

9d582a4

cloud-fan reviewed Jan 18, 2017

code refactor.

2d49ef2

cloud-fan reviewed Jan 18, 2017

refactor aliasedPlan.

e2ccdd5

cloud-fan reviewed Jan 18, 2017

code refactor.

7c5b6af

asfgit closed this in f85f296 Jan 18, 2017

jiangxb1987 deleted the view-write-path branch March 16, 2017 06:42
