Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-19024][SQL] Implement new approach to write a permanent view #16613

Closed
wants to merge 5 commits into from

Conversation

jiangxb1987
Copy link
Contributor

What changes were proposed in this pull request?

On CREATE/ALTER a view, it's no longer needed to generate a SQL text string from the LogicalPlan, instead we store the SQL query text、the output column names of the query plan, and current database to CatalogTable. Permanent views created by this approach can be resolved by current view resolution approach.

The main advantage includes:

  1. If you update an underlying view, the current view also gets updated;
  2. That gives us a change to get ride of SQL generation for operators.

Major changes of this PR:

  1. Generate the view-specific properties(e.g. view default database, view query output column names) during permanent view creation and store them as properties in the CatalogTable;
  2. Update the commands CreateViewCommand and AlterViewAsCommand, get rid of SQL generation from them.

How was this patch tested?

Existing tests.

@SparkQA
Copy link

SparkQA commented Jan 17, 2017

Test build #71501 has finished for PR 16613 at commit 917ca04.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 17, 2017

Test build #71507 has finished for PR 16613 at commit 9d582a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor Author

properties: Map[String, String],
session: SparkSession,
viewText: String): Map[String, String] = {
// Try to analyze the viewText, throw an AnalysisException if the query is invalid.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to do this? the passed in query is already the parsed and analyzed plan of viewText, isn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! The query is not analyzed, perhaps we should use the analyzedPlan.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, we should use analyzedPlan

// Generate the query column names, throw an AnalysisException if there exists duplicate column
// names.
val queryOutput = queryPlan.schema.fieldNames
assert(queryOutput.toSet.size == queryOutput.size,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: queryOutput.distinct.size == queryOutput.size

viewOriginalText = originalText,
viewText = Some(viewSQL),
viewText = originalText,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something we can clean up: Hive will expand the view text, so it needs 2 fields: originalText and viewText. Since we don't expand the view text, but only add table properties, I think we only need a single field viewText in CatalogTable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid that would require changes of several tens of places, should we do that in a seprated PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, in a separated PR.

@SparkQA
Copy link

SparkQA commented Jan 18, 2017

Test build #71579 has started for PR 16613 at commit 2d49ef2.

* @param session the spark session.
* @param aliasedPlan if `userSpecifiedColumns` is defined, the aliased plan outputs the user
* specified columns, else it is the same as the `analyzedPlan`.
* @param analyzedPlan the analyzed logical plan that represents the child of a view.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we need both aliasedPlan and analyzedPlan?

Copy link
Contributor Author

@jiangxb1987 jiangxb1987 Jan 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generate the queryColumnNames by analyzedPlan, and we generate the view schema by aliasedPlan, they are not the same when userSpecifiedColumns is defined.
So we have to pass the both param in this function.

* @param analyzedPlan the analyzed logical plan that represents the child of a view.
* @return new view properties including view default database and query column names properties.
*/
def generateViewProperties(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like all other methods in this class can be private?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, will update that.

@@ -173,7 +165,8 @@ case class CreateViewCommand(
}
} else {
// Create the view if it doesn't exist.
catalog.createTable(prepareTable(sparkSession, aliasedPlan), ignoreIfExists = false)
catalog.createTable(prepareTable(sparkSession, analyzedPlan),
ignoreIfExists = false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: put them in one line

}

val aliasedPlan = aliasPlan(session, analyzedPlan)
val newProperties = generateViewProperties(properties, session, analyzedPlan)

CatalogTable(
identifier = name,
tableType = CatalogTableType.VIEW,
storage = CatalogStorageFormat.empty,
schema = aliasedPlan.schema,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: inline the aliasedPlan here, i.e. schema = aliasPlan(session, analyzedPlan).schema. Then it's clearer that we only alias the plan to get the schema

@@ -370,28 +370,35 @@ class HiveDDLSuite
spark.range(10).write.saveAsTable(tabName)
val viewName = "view1"
withView(viewName) {
def checkProperties(
properties: Map[String, String],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can make this method better

def checkProperties(expected: Map[String, String]): Boolean = {
  val properties = catalog.getTableMetadata(TableIdentifier(viewName)).properties
  ...
}

"""CREATE VIEW IF NOT EXISTS
withView("testView") {
sql(
"""CREATE VIEW IF NOT EXISTS
|default.testView (c1 COMMENT 'blabla', c2 COMMENT 'blabla')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indention is wrong

@cloud-fan
Copy link
Contributor

LGTM, pending jenkins

@SparkQA
Copy link

SparkQA commented Jan 18, 2017

Test build #71587 has finished for PR 16613 at commit e2ccdd5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@jiangxb1987
Copy link
Contributor Author

Thank you @cloud-fan !

@asfgit asfgit closed this in f85f296 Jan 18, 2017
@SparkQA
Copy link

SparkQA commented Jan 18, 2017

Test build #71589 has finished for PR 16613 at commit 7c5b6af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Copy link
Contributor

yhuai commented Jan 18, 2017

is there a feature flag that is used to determine if we use this new approach? I feel it will be good to have an internal feature flag to determine the code path. So, if there is something wrong that is hard to fix quickly before the release, we can still switch back to the old code path. Then, in the next release, we can remove the feature flag. What do you think?

Also, @jiangxb1987 can you take a look at the SQLViewSuite and see if we have enough test coverage?

@yhuai
Copy link
Contributor

yhuai commented Jan 19, 2017

nvm. After second thought, the feature flag does not really buy us anything. We just store the original view definition and the column mapping in the metastore. So, I think it is fine to just do the switch.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

On CREATE/ALTER a view, it's no longer needed to generate a SQL text string from the LogicalPlan, instead we store the SQL query text、the output column names of the query plan, and current database to CatalogTable. Permanent views created by this approach can be resolved by current view resolution approach.

The main advantage includes:
1. If you update an underlying view, the current view also gets updated;
2. That gives us a change to get ride of SQL generation for operators.

Major changes of this PR:
1. Generate the view-specific properties(e.g. view default database, view query output column names) during permanent view creation and store them as properties in the CatalogTable;
2. Update the commands `CreateViewCommand` and `AlterViewAsCommand`, get rid of SQL generation from them.

## How was this patch tested?
Existing tests.

Author: jiangxingbo <[email protected]>

Closes apache#16613 from jiangxb1987/view-write-path.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
## What changes were proposed in this pull request?

On CREATE/ALTER a view, it's no longer needed to generate a SQL text string from the LogicalPlan, instead we store the SQL query text、the output column names of the query plan, and current database to CatalogTable. Permanent views created by this approach can be resolved by current view resolution approach.

The main advantage includes:
1. If you update an underlying view, the current view also gets updated;
2. That gives us a change to get ride of SQL generation for operators.

Major changes of this PR:
1. Generate the view-specific properties(e.g. view default database, view query output column names) during permanent view creation and store them as properties in the CatalogTable;
2. Update the commands `CreateViewCommand` and `AlterViewAsCommand`, get rid of SQL generation from them.

## How was this patch tested?
Existing tests.

Author: jiangxingbo <[email protected]>

Closes apache#16613 from jiangxb1987/view-write-path.
@jiangxb1987 jiangxb1987 deleted the view-write-path branch March 16, 2017 06:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants