-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-19024][SQL] Implement new approach to write a permanent view #16613
Conversation
Test build #71501 has finished for PR 16613 at commit
|
Test build #71507 has finished for PR 16613 at commit
|
properties: Map[String, String], | ||
session: SparkSession, | ||
viewText: String): Map[String, String] = { | ||
// Try to analyze the viewText, throw an AnalysisException if the query is invalid. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to do this? the passed in query
is already the parsed and analyzed plan of viewText, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! The query
is not analyzed, perhaps we should use the analyzedPlan
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, we should use analyzedPlan
// Generate the query column names, throw an AnalysisException if there exists duplicate column | ||
// names. | ||
val queryOutput = queryPlan.schema.fieldNames | ||
assert(queryOutput.toSet.size == queryOutput.size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: queryOutput.distinct.size == queryOutput.size
viewOriginalText = originalText, | ||
viewText = Some(viewSQL), | ||
viewText = originalText, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something we can clean up: Hive will expand the view text, so it needs 2 fields: originalText and viewText. Since we don't expand the view text, but only add table properties, I think we only need a single field viewText
in CatalogTable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid that would require changes of several tens of places, should we do that in a seprated PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, in a separated PR.
Test build #71579 has started for PR 16613 at commit |
* @param session the spark session. | ||
* @param aliasedPlan if `userSpecifiedColumns` is defined, the aliased plan outputs the user | ||
* specified columns, else it is the same as the `analyzedPlan`. | ||
* @param analyzedPlan the analyzed logical plan that represents the child of a view. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we need both aliasedPlan
and analyzedPlan
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We generate the queryColumnNames
by analyzedPlan
, and we generate the view schema by aliasedPlan
, they are not the same when userSpecifiedColumns
is defined.
So we have to pass the both param in this function.
* @param analyzedPlan the analyzed logical plan that represents the child of a view. | ||
* @return new view properties including view default database and query column names properties. | ||
*/ | ||
def generateViewProperties( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like all other methods in this class can be private
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, will update that.
@@ -173,7 +165,8 @@ case class CreateViewCommand( | |||
} | |||
} else { | |||
// Create the view if it doesn't exist. | |||
catalog.createTable(prepareTable(sparkSession, aliasedPlan), ignoreIfExists = false) | |||
catalog.createTable(prepareTable(sparkSession, analyzedPlan), | |||
ignoreIfExists = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: put them in one line
} | ||
|
||
val aliasedPlan = aliasPlan(session, analyzedPlan) | ||
val newProperties = generateViewProperties(properties, session, analyzedPlan) | ||
|
||
CatalogTable( | ||
identifier = name, | ||
tableType = CatalogTableType.VIEW, | ||
storage = CatalogStorageFormat.empty, | ||
schema = aliasedPlan.schema, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: inline the aliasedPlan
here, i.e. schema = aliasPlan(session, analyzedPlan).schema
. Then it's clearer that we only alias the plan to get the schema
@@ -370,28 +370,35 @@ class HiveDDLSuite | |||
spark.range(10).write.saveAsTable(tabName) | |||
val viewName = "view1" | |||
withView(viewName) { | |||
def checkProperties( | |||
properties: Map[String, String], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can make this method better
def checkProperties(expected: Map[String, String]): Boolean = {
val properties = catalog.getTableMetadata(TableIdentifier(viewName)).properties
...
}
"""CREATE VIEW IF NOT EXISTS | ||
withView("testView") { | ||
sql( | ||
"""CREATE VIEW IF NOT EXISTS | ||
|default.testView (c1 COMMENT 'blabla', c2 COMMENT 'blabla') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indention is wrong
LGTM, pending jenkins |
Test build #71587 has finished for PR 16613 at commit
|
thanks, merging to master! |
Thank you @cloud-fan ! |
Test build #71589 has finished for PR 16613 at commit
|
is there a feature flag that is used to determine if we use this new approach? I feel it will be good to have an internal feature flag to determine the code path. So, if there is something wrong that is hard to fix quickly before the release, we can still switch back to the old code path. Then, in the next release, we can remove the feature flag. What do you think? Also, @jiangxb1987 can you take a look at the SQLViewSuite and see if we have enough test coverage? |
nvm. After second thought, the feature flag does not really buy us anything. We just store the original view definition and the column mapping in the metastore. So, I think it is fine to just do the switch. |
## What changes were proposed in this pull request? On CREATE/ALTER a view, it's no longer needed to generate a SQL text string from the LogicalPlan, instead we store the SQL query text、the output column names of the query plan, and current database to CatalogTable. Permanent views created by this approach can be resolved by current view resolution approach. The main advantage includes: 1. If you update an underlying view, the current view also gets updated; 2. That gives us a change to get ride of SQL generation for operators. Major changes of this PR: 1. Generate the view-specific properties(e.g. view default database, view query output column names) during permanent view creation and store them as properties in the CatalogTable; 2. Update the commands `CreateViewCommand` and `AlterViewAsCommand`, get rid of SQL generation from them. ## How was this patch tested? Existing tests. Author: jiangxingbo <[email protected]> Closes apache#16613 from jiangxb1987/view-write-path.
## What changes were proposed in this pull request? On CREATE/ALTER a view, it's no longer needed to generate a SQL text string from the LogicalPlan, instead we store the SQL query text、the output column names of the query plan, and current database to CatalogTable. Permanent views created by this approach can be resolved by current view resolution approach. The main advantage includes: 1. If you update an underlying view, the current view also gets updated; 2. That gives us a change to get ride of SQL generation for operators. Major changes of this PR: 1. Generate the view-specific properties(e.g. view default database, view query output column names) during permanent view creation and store them as properties in the CatalogTable; 2. Update the commands `CreateViewCommand` and `AlterViewAsCommand`, get rid of SQL generation from them. ## How was this patch tested? Existing tests. Author: jiangxingbo <[email protected]> Closes apache#16613 from jiangxb1987/view-write-path.
What changes were proposed in this pull request?
On CREATE/ALTER a view, it's no longer needed to generate a SQL text string from the LogicalPlan, instead we store the SQL query text、the output column names of the query plan, and current database to CatalogTable. Permanent views created by this approach can be resolved by current view resolution approach.
The main advantage includes:
Major changes of this PR:
CreateViewCommand
andAlterViewAsCommand
, get rid of SQL generation from them.How was this patch tested?
Existing tests.