[SPARK-39503][SQL] Add session catalog name for v1 database table and function #36936
Conversation
Force-pushed from 0eada9f to 0db8c67.
Force-pushed from db01710 to 47dfa76.
This is great! This will make it a lot clearer whether the session catalog was used or not. To that end, it is a bit weird to me that we do not include the session catalog name for views. That could give the wrong impression that views and tables live in different catalogs, which is not true: AFAIK Spark allows referring to a view as `spark_catalog.default.view_name`.

Displaying the catalog for views may also allow us to remove some of the additional code that had to be added (e.g. `...WithoutCatalog`). I understand we currently do not support views with the v2 plugins, but that just seems like a technicality to me, and something that we may want to change in the future, not something that we would like to "leak" to users like this.
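For illustration, a minimal spark-shell sketch of that claim (the view name and data are made up):

```scala
// Hypothetical spark-shell session: a view created in the default
// database of the session catalog also resolves under the fully
// qualified spark_catalog prefix.
spark.sql("CREATE OR REPLACE VIEW default.view_name AS SELECT 1 AS id")
spark.sql("SELECT * FROM spark_catalog.default.view_name").show()
```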
@tomvanbussel yeah, I'm fine with also adding the session catalog for views, but I want to clarify some differences and trade-offs for views.
@ulysses-you Fully agreed that we should leave local temporary views as they are. They also do not belong to a database AFAIK, so it would indeed be weird to include the session catalog in the identifier. I'm not sure if I'm the right person to discuss how views should be integrated with the v2 catalog, but it seems like a neat idea. Might be worth exploring in a follow-up.
Force-pushed from 499dcec to 2e95e2b.
```scala
@@ -34,13 +38,35 @@ sealed trait IdentifierWithDatabase {
  private def quoteIdentifier(name: String): String = name.replace("`", "``")

  def quotedString: String = {
    if (SQLConf.get.getConf(LEGACY_IDENTIFIER_OUTPUT_CATALOG_NAME) && database.isDefined) {
      val replacedId = quoteIdentifier(identifier)
      val replacedDb = database.map(quoteIdentifier(_))
```
Suggested change:

```scala
-      val replacedDb = database.map(quoteIdentifier(_))
+      val replacedDb = quoteIdentifier(database.get)
```
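For context, a standalone sketch of what the `quoteIdentifier` helper in the hunk above does (the sample name is made up): embedded backticks are doubled so the quoted form stays unambiguous.

```scala
// Same escaping as in the diff above: double any embedded backtick;
// the caller then wraps the result in backticks.
def quoteIdentifier(name: String): String = name.replace("`", "``")

val raw = "weird`db"
println(s"`${quoteIdentifier(raw)}`") // prints: `weird``db`
```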
buildConf("spark.sql.legacy.identifierOutputCatalogName") | ||
.internal() | ||
.doc("When set to true, the identifier will output catalog name if database is defined. " + | ||
"When set to false, it restores the legacy behavior that does not output catalog name.") |
nit: usually `true` means the legacy behavior; we probably need to rename the config a little bit.
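For illustration, a hedged sketch of what the rename could look like, using the `spark.sql.legacy.nonIdentifierOutputCatalogName` name that the follow-up eventually adopted, so that `true` restores the legacy output:

```scala
// Sketch only (inside SQLConf): rename so that `true` means the legacy
// behavior, matching the usual convention for spark.sql.legacy.* flags.
buildConf("spark.sql.legacy.nonIdentifierOutputCatalogName")
  .internal()
  .doc("When set to true, restores the legacy behavior in which identifiers " +
    "do not include the catalog name even when the database is defined.")
  .booleanConf
  .createWithDefault(false)
```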
Looks good in general. I'm taking a closer look at where we use …
Force-pushed from d6eccae to 5ff067e.
The change list, to help reviewers:

```
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala | 3 ++-
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala | 16 +++++++++++-----
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala | 12 ++++++++++--
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/identifiers.scala | 40 +++++++++++++++++++++++++++++++---------
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala | 47 ++++++++++++++++++++++++++++++-----------------
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 9 +++++++++
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala | 3 ++-
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala | 3 ++-
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala | 5 +++--
sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala | 9 ++++++---
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala | 21 ++++++++++++---------
sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala | 9 ++++++---
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala | 36 +++++++++++++++++++++++-------------
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala | 4 +++-
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala | 19 +++++++++++--------
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeNamespaceExec.scala | 1 +
sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala | 8 ++++----
sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala | 57 ++++++++++++++++++++++++++++++++++----------------------
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala | 19 ++++++++++---------
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala | 9 ++++++---
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala | 4 +++-
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala | 4 +++-
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala | 41 +++++++++++++++++++++++------------------
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ExpressionParserSuite.scala | 9 ++++++---
sql/core/src/test/scala/org/apache/spark/sql/DataFrameWriterV2Suite.scala | 10 +++++++---
sql/core/src/test/scala/org/apache/spark/sql/execution/QueryExecutionSuite.scala | 4 +++-
sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala | 24 +++++++++++++++---------
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/DescribeNamespaceSuite.scala | 12 +++++++-----
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/DescribeTableSuite.scala | 2 ++
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/DescribeNamespaceSuite.scala | 1 +
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala | 3 ++-
sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala | 24 ++++++++++++++----------
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala | 7 ++++---
sql/hive/src/test/scala/org/apache/spark/sql/hive/UDFSuite.scala | 9 +++++----
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala | 22 ++++++++++++++--------
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveExplainSuite.scala | 3 ++-
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveTableScanSuite.scala | 13 +++++++------
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala | 3 ++-
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala | 11 +++++++----
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/DescribeTableSuite.scala | 2 ++
sql/core/src/test/resources/sql-tests/results/charvarchar.sql.out | 16 +++++++++++++++-
sql/core/src/test/resources/sql-tests/results/describe-part-after-analyze.sql.out | 7 +++++++
sql/core/src/test/resources/sql-tests/results/describe-table-after-alter-table.sql.out | 5 +++++
sql/core/src/test/resources/sql-tests/results/describe.sql.out | 23 +++++++++++++++--------
sql/core/src/test/resources/sql-tests/results/explain-aqe.sql.out | 104 +++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------
sql/core/src/test/resources/sql-tests/results/explain-cbo.sql.out | 8 ++++----
sql/core/src/test/resources/sql-tests/results/explain.sql.out | 96 +++++++++++++++++++++++++++++++++++++++++++++++------------------------------------------------
sql/core/src/test/resources/sql-tests/results/postgreSQL/numeric.sql.out | 2 +-
sql/core/src/test/resources/sql-tests/results/show-tables.sql.out | 10 ++++++----
sql/core/src/test/resources/sql-tests/results/udaf.sql.out | 2 +-
sql/core/src/test/resources/sql-tests/results/udf/udf-udaf.sql.out | 2 +-
```
```scala
@@ -895,7 +905,7 @@ case class ShowTablesCommand(
     val normalizedSpec = PartitioningUtils.normalizePartitionSpec(
       partitionSpec.get,
       table.partitionSchema,
-      tableIdent.quotedString,
+      tableIdent.quotedString(SESSION_CATALOG_NAME),
       sparkSession.sessionState.conf.resolver)
     val partition = catalog.getPartition(tableIdent, normalizedSpec)
     val database = tableIdent.database.getOrElse("")
```
Do we also want to add the catalog name to the output of this function?
Thank you @tomvanbussel for pointing this out; it is actually out of scope for this PR. We can improve the output attributes of the related commands in the future.
The diff context around this comment:

```scala
      if (replacedDb.isDefined) s"`${replacedDb.get}`.`$replacedId`" else s"`$replacedId`"

  def quotedString(catalog: Option[String]): String = {
```
I don't think we should add these new methods. These methods make it easier to omit the catalog than to include it. I'm worried that this leads to most new code (and some existing code) not including the catalog. AFAIK it's always safe to include the catalog name if the database name is set (as in that case we know that it won't be a local temporary view), so the following should work:
```scala
def quotedString: String = {
  val replacedId = quoteIdentifier(identifier)
  val replacedDb = database.map(quoteIdentifier)
  if (database.isDefined) {
    if (SQLConf.get.getConf(LEGACY_NON_IDENTIFIER_OUTPUT_CATALOG_NAME) || isTempView) {
      s"`${replacedDb.get}`.`$replacedId`"
    } else {
      s"`$SESSION_CATALOG_NAME`.`${replacedDb.get}`.`$replacedId`"
    }
  } else {
    s"`$replacedId`"
  }
}
```
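To make the branching concrete, a self-contained sketch of the outputs the proposal above would produce (the legacy flag and temp-view check are stubbed as plain parameters):

```scala
// Simplified stand-in for the proposal above; `legacy` stubs the
// LEGACY_NON_IDENTIFIER_OUTPUT_CATALOG_NAME conf, `temp` stubs isTempView.
def quoted(id: String, db: Option[String], legacy: Boolean, temp: Boolean): String = {
  def q(s: String): String = s.replace("`", "``")
  db match {
    case Some(d) if legacy || temp => s"`${q(d)}`.`${q(id)}`"
    case Some(d)                   => s"`spark_catalog`.`${q(d)}`.`${q(id)}`"
    case None                      => s"`${q(id)}`"
  }
}

println(quoted("t", Some("db"), legacy = false, temp = false)) // `spark_catalog`.`db`.`t`
println(quoted("t", Some("db"), legacy = true,  temp = false)) // `db`.`t`
println(quoted("t", None,       legacy = false, temp = false)) // `t`
```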
It is hacky to access the conf inside the identifier, which may appear anywhere. I made a new PR that pulls the config out; see #37021.
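For illustration, a minimal sketch (not Spark's actual classes) of the alternative in #37021: carry an optional catalog in the identifier itself, injected when the identifier is created, so rendering never reads the session conf.

```scala
// Simplified model of "catalog as an identifier field": the catalog is
// set once at construction time and printed only when defined.
case class TableIdent(table: String, database: Option[String], catalog: Option[String]) {
  private def q(s: String): String = "`" + s.replace("`", "``") + "`"
  def quotedString: String =
    (catalog.toSeq ++ database.toSeq :+ table).map(q).mkString(".")
}

println(TableIdent("t", Some("db"), Some("spark_catalog")).quotedString)
// `spark_catalog`.`db`.`t`
println(TableIdent("t", Some("db"), None).quotedString)
// `db`.`t`
```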
```scala
@@ -234,7 +235,7 @@ case class UnresolvedGenerator(name: FunctionIdentifier, children: Seq[Expressio
   override def nullable: Boolean = throw new UnresolvedException("nullable")
   override lazy val resolved = false

-  override def prettyName: String = name.unquotedString
+  override def prettyName: String = name.unquotedString(SESSION_CATALOG_NAME)
   override def toString: String = s"'$name(${children.mkString(", ")})"
```
This will include the unquoted identifier without the catalog, as the `toString` method of `FunctionIdentifier` is implemented as `unquotedString`. Is this intentional?
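A simplified sketch (not Spark's actual `FunctionIdentifier`) of the inconsistency being pointed out:

```scala
// prettyName would go through the catalog-aware overload, while
// toString keeps delegating to the no-arg form without the catalog.
case class FunctionIdent(funcName: String, database: Option[String]) {
  def unquotedString: String = (database.toSeq :+ funcName).mkString(".")
  def unquotedString(catalog: String): String =
    if (database.isDefined) (catalog +: database.toSeq :+ funcName).mkString(".")
    else funcName
  override def toString: String = unquotedString // catalog is dropped here
}

val f = FunctionIdent("f", Some("default"))
println(f.unquotedString("spark_catalog")) // spark_catalog.default.f
println(f.toString)                        // default.f
```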
I created a new PR, #37021, for this issue with a new approach:
[SPARK-39503][SQL] Add session catalog name for v1 database table and function

### What changes were proposed in this pull request?
- Add the session catalog name in identifiers, so that all identifiers become 3-part names.

### Why are the changes needed?
To make it clearer which catalog a table or function comes from. It affects:
- the scanned table/permanent view in the query plan
- the target table of data writing
- desc database
- desc table
- desc function

Note that we do not support temporary views, since they do not belong to any database or catalog.

This is a new approach to #36936 that:
- adds a catalog field in the identifier, so the identifier prints the catalog only if it is defined
- injects the catalog at the beginning of the identifier's life

### Does this PR introduce _any_ user-facing change?
Maybe yes, so a new config `spark.sql.legacy.nonIdentifierOutputCatalogName` is added to restore the old behavior.

### How was this patch tested?
Change list:
```scala
docs/sql-migration-guide.md | 1 +
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala | 10 +++++++---
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala | 28 ++++++++++++++++++---------
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala | 1 +
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/identifiers.scala | 56 ++++++++++++++++++++++++++++++++++++++++++++---------
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala | 4 ++--
sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Implicits.scala | 13 +++++++++----
sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/LookupCatalog.scala | 8 ++++++--
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 9 +++++++++
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala | 7 ++++---
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala | 14 ++++++++------
sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala | 6 ++++--
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeNamespaceExec.scala | 1 +
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala | 5 +++--
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala | 5 +++--
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala | 5 +++--
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala | 58 ++++++++++++++++++++++++++++++-------------------------
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/LookupCatalogSuite.scala | 4 +++-
sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala | 4 +++-
sql/core/src/test/scala/org/apache/spark/sql/DataFrameWriterV2Suite.scala | 13 +++++++++----
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala | 7 +++++--
sql/core/src/test/scala/org/apache/spark/sql/execution/QueryExecutionSuite.scala | 4 +++-
sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewSuite.scala | 39 ++++++++++++++++++++++------------
sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewTestSuite.scala | 7 +++++--
sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala | 36 +++++++++++++++++++++--------------
sql/core/src/test/scala/org/apache/spark/sql/execution/command/PlanResolutionSuite.scala | 51 +++++++++++++++++++++++++++++----------------------
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/DescribeNamespaceSuite.scala | 12 +++++++-----
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/DescribeTableSuite.scala | 2 ++
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowTblPropertiesSuite.scala | 3 ++-
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/DescribeNamespaceSuite.scala | 1 +
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala | 3 ++-
sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala | 25 +++++++++++++-----------
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala | 7 ++++---
sql/hive/src/test/scala/org/apache/spark/sql/hive/UDFSuite.scala | 9 +++++----
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala | 27 ++++++++++++++----------
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveExplainSuite.scala | 3 ++-
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveSQLViewSuite.scala | 4 +++-
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveTableScanSuite.scala | 13 +++++++------
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala | 3 ++-
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala | 8 +++++---
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/DescribeTableSuite.scala | 2 ++
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/ShowFunctionsSuite
sql/core/src/test/resources/sql-tests/results/charvarchar.sql.out | 20 ++++++++++++++++++-
sql/core/src/test/resources/sql-tests/results/describe-table-after-alter-table.sql.out | 5 +++++
sql/core/src/test/resources/sql-tests/results/describe.sql.out | 22 +++++++++++++--------
sql/core/src/test/resources/sql-tests/results/explain-aqe.sql.out | 108 +++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------
sql/core/src/test/resources/sql-tests/results/explain-cbo.sql.out | 8 ++++----
sql/core/src/test/resources/sql-tests/results/explain.sql.out | 100 +++++++++++++++++++++++++++++++++++++++++++------------------------------------------------
sql/core/src/test/resources/sql-tests/results/join-lateral.sql.out | 2 +-
sql/core/src/test/resources/sql-tests/results/postgreSQL/create_view.sql.out | 40 +++++++++++++++++++++++-------------
sql/core/src/test/resources/sql-tests/results/postgreSQL/numeric.sql.out | 2 +-
sql/core/src/test/resources/sql-tests/results/show-tables.sql.out | 10 ++++++----
sql/core/src/test/resources/sql-tests/results/show-tblproperties.sql.out | 2 +-
sql/core/src/test/resources/sql-tests/results/udaf.sql.out | 4 ++--
sql/core/src/test/resources/sql-tests/results/udf/udf-udaf.sql.out | 4 ++--
```

Closes #37021 from ulysses-you/output-catalog-2.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Closing this PR in favor of #37021.
What changes were proposed in this pull request?

- `CatalogImpl#refreshTable` supports 3-part names. This is required since a lot of commands use this method.

Why are the changes needed?

To make it clearer which catalog a table or function comes from. It affects:

- the scanned table/permanent view in the query plan
- the target table of data writing
- desc database
- desc table
- desc function

Note that we do not support view catalogs, so we do not add the catalog for views in this PR.

Does this PR introduce any user-facing change?

Maybe yes, so a new config `spark.sql.legacy.identifierOutputCatalogName` is added to restore the old behavior.

How was this patch tested?

Pass CI.
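For illustration, a hedged sketch of the `CatalogImpl#refreshTable` behavior described above (the table name `t` is hypothetical):

```scala
// Assumes a spark-shell session. With this change, refreshTable should
// accept a fully qualified 3-part name as well as the 2-part form.
spark.catalog.refreshTable("default.t")
spark.catalog.refreshTable("spark_catalog.default.t")
```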