[SPARK-28265][SQL] Add renameTable to TableCatalog API #25206
Conversation
PTAL @rdblue
Are we going to support this from the SQL side? I think it's hard to evaluate the semantics of the API without seeing how it will be used in query planning and execution.
@mccheah, this is supported by Spark for v1: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L130-L131. I think it makes sense to add it. The plan is simple and just has two identifiers, but we do need to decide whether to require that both identifiers start with the same catalog. For example, what should happen when the source and target identifiers name different catalogs? Should the catalog name be implied? I think this should throw an error because the catalogs for the identifiers do not match.
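The catalog-mismatch check being discussed could be sketched roughly as follows (a toy illustration, not Spark's actual resolution logic; the helper names and the simplified catalog-detection rule are assumptions made for the example):

```java
import java.util.Arrays;
import java.util.List;

// Toy resolution helper: treat the head of a 3-part identifier as the
// catalog name and reject a rename whose two identifiers disagree.
class RenameResolution {
    static String catalogOf(List<String> nameParts, String defaultCatalog) {
        // Simplification: assume any 3-part name starts with a catalog.
        return nameParts.size() >= 3 ? nameParts.get(0) : defaultCatalog;
    }

    static void validateSameCatalog(List<String> from, List<String> to, String defaultCatalog) {
        String fromCat = catalogOf(from, defaultCatalog);
        String toCat = catalogOf(to, defaultCatalog);
        if (!fromCat.equals(toCat)) {
            throw new IllegalArgumentException(
                "RENAME TO must stay within catalog '" + fromCat + "', got '" + toCat + "'");
        }
    }
}
```

Under this sketch, renaming `cat.db.t` to `cat2.db.t` is rejected, while `cat.db.t` to `cat.db.t2` passes.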
A similar problem was already discussed before, and the decision was: RENAME TABLE should only do rename, not things like moving tables between databases. We can also enforce it at the parser level.
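A parser-level restriction could look roughly like this (a hypothetical rule, loosely in the style of SqlBase.g4; rule and token names here are illustrative and Spark's actual grammar differs):

```antlr
// Allow a multi-part name for the source table, but only a single-part
// name for the target, so a rename can never change catalog or namespace.
renameTableStatement
    : ALTER TABLE from=multipartIdentifier RENAME TO to=identifier
    ;
```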
@cloud-fan @rdblue Originally I had committed a version enforcing the same namespace for the two table identifiers. Since the spec does not define the behavior in that case, I tried to be consistent with the existing behavior in Spark; the enforcement is already there in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala (line 641, commit 24e1e41).
Also, I think there could be cases where the catalog could rename a table across databases without requiring any move of data. However, if there has already been agreement that RENAME should not cross namespaces, I'm fine keeping the check. Please let me know if there are any more comments on this.
Let's enforce that to be consistent with the existing behavior.
@cloud-fan @rdblue I've added back the namespace checks, raising AnalysisException when the namespaces of the two identifiers do not match.
Sorry I missed @cloud-fan's comments before now. I disagree with the decision to throw AnalysisException when the namespaces differ.
Also I'm -1 on the changes suggested by @mgaido91. The original documentation matched the javadoc for the other methods by using "rename" instead of "renames". In addition, the phrase "must throw" is used specifically to tell implementers the behavior that is required by this API. Using "throws" is appropriate for an API consumer, but the audience for this documentation is the API implementer. The use of "must throw" should be added back.
Just to be clear here, @cloud-fan, as mentioned before, something like:

ALTER TABLE cat.db.table RENAME TO db.table2

would still fail, since the catalog is not the same and I don't think we should imply it; this can be handled in the SQL operation implementation. However, to @rdblue's point, if the catalog supports renames across databases I don't see why Spark should block it either, other than existing behavior, which I don't think this change would affect if the catalog does not support such an operation.
+1 for failing when the catalog is not the same. Spark should validate that the same catalog is responsible for both table identifiers.
@rdblue IMO, we should first think about what behavior Spark should provide to end-users, and then design the Data Source API to enforce that behavior. RENAME TABLE should only rename a table, not move it between namespaces. This is intuitive and is the behavior of Hive, Postgres and other mainstream SQL systems. This is the decision we made in #14955. If you want to revisit it, please open a new PR instead of mixing it with adding a new API. If a catalog can move tables between namespaces, that's good, and we can probably introduce a new command like MOVE TABLE to expose this ability to end-users. But we shouldn't extend the behavior of RENAME TABLE.
@cloud-fan, I'm not convinced that a decision for v1 applies to v2, considering all of the challenges with v1 and its very narrow focus on what works with the Hive metastore. @brkyvz, @mccheah, and @RussellSpitzer, what do you think about allowing rename across namespaces or not?
There is no value for Cassandra in renames across namespaces, but I don't have strong feelings about Spark enforcing this. I'm fine with rejecting this at the catalog level outside of the Analyzer. Looking closer at the commit, I'm fine with rename being able to throw AnalysisException; I don't think every catalog has to implement this in the same way though, so perhaps we just change the javadoc notes?
I think we have to be more flexible here because of how loose the definition of a namespace is. Sure, many traditional database systems don't inherently support moving tables from one database to another, but the definition of a namespace in the context of DSv2 is broader than a database. A namespace could map to a user, for example, in which case moving a table from one namespace to another is effectively transferring ownership of the table; the underlying catalog could then, for example, check ACLs as part of validation and potentially throw an authorization error instead of an analysis one.

It depends on the goal at the end of the day: is the goal that the same SQL statements produce exactly the same behavior across all catalogs, or is the goal that the catalog implementation is given maximum optionality and flexibility? If we want SQL to behave exactly the same across all catalogs, then yes, we would throw the analysis exception, but then catalog implementations would never be given the opportunity to move tables from one namespace to another, effectively forcing the user to find a workaround. (We could add a flag for catalogs to opt into this, perhaps.)

But I will note that we've been aiming for consistency in other parts of the DSv2 project. I think it's a fine balance and something we have to call on a case by case basis.
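The namespace-as-ownership idea above might look like this in a toy catalog (entirely hypothetical; class and method names such as `OwnershipCatalog` are illustrative and not part of any Spark API):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;

// Hypothetical catalog that permits cross-namespace renames but treats the
// namespace as an owner, rejecting moves into namespaces the caller
// does not own (an authorization error, not an analysis error).
class OwnershipCatalog {
    private final Map<String, String> tables = new HashMap<>(); // "ns.table" -> metadata
    private final Set<String> callerNamespaces = new HashSet<>();

    OwnershipCatalog(String... ownedNamespaces) {
        callerNamespaces.addAll(Arrays.asList(ownedNamespaces));
    }

    void createTable(String ns, String name) {
        tables.put(ns + "." + name, "table-metadata");
    }

    void renameTable(String oldNs, String oldName, String newNs, String newName) {
        if (!callerNamespaces.contains(newNs)) {
            throw new SecurityException("caller does not own namespace " + newNs);
        }
        String data = tables.remove(oldNs + "." + oldName);
        if (data == null) {
            throw new NoSuchElementException("no such table: " + oldNs + "." + oldName);
        }
        tables.put(newNs + "." + newName, data);
    }

    boolean exists(String ns, String name) {
        return tables.containsKey(ns + "." + name);
    }
}
```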
I think consistency in this case means that table renames behave the same way, not that all sources have the same level of support. A similar example is the effort to add DELETE FROM: JDBC sources can delete any record matching a filter, while file sources can only implement filter deletes that use partition columns. Sources aren't required to have uniform capabilities; they are just required to have consistent behavior, even if that behavior is what to do when something is not supported.
Hey, I'm not against the ability to move tables between namespaces; I'm against doing this via RENAME TABLE. I agree with @rdblue about the definition of consistency here. It's OK if JDBC always supports DELETE FROM while a file source may throw an exception and say the delete condition can only refer to partition columns. But the problem here is not consistency; it's about whether RENAME TABLE should be allowed to move a table between namespaces. FYI, in other RDBMSs, the parser doesn't even allow you to put a multi-part table identifier after RENAME TO.
@cloud-fan, so you're suggesting Spark should create a new SQL command instead of using ALTER TABLE ... RENAME TO for moves across namespaces?
I'm in favor of being able to rename tables across namespaces. The V2SessionCatalog can reject such an operation (validate that the namespaces match), since the HiveMetaStore wouldn't allow it. The kind of use cases I see are:

```sql
ALTER TABLE a RENAME TO b;                     -- regular RDBMS behavior
ALTER TABLE dev.a RENAME TO prod.a;            -- fail
ALTER TABLE hdfs.dev.a RENAME TO hdfs.prod.a;  -- ensure prod is a namespace
```
If by definition the namespace is part of the table name, then it's OK. I thought the table name and the namespace together form the table identifier. For the parser rule, I'd expect the name after RENAME TO to be restricted so that it cannot change the namespace.
That said, I think this command should never fail on the Spark side. The table identifier after RENAME TO should be resolved within the same catalog and namespace as the source table.
Thanks everyone for your input. Based on the points we seem to agree on, I've pushed the changes back to how I originally submitted this PR, where the catalog does not throw when the namespaces of the two identifiers differ; whether to support renames across namespaces is left to the catalog implementation. @rdblue @cloud-fan please let me know if there are any other concerns. Thank you.
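The rename contract that emerged from this discussion (rename may cross namespaces if the catalog allows it, but must fail on a missing source or an existing target) can be sketched with a toy in-memory catalog. This is illustrative only; Spark's real API uses `Identifier` and dedicated exception types such as `NoSuchTableException` and `TableAlreadyExistsException`, while this sketch substitutes standard-library exceptions:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Toy in-memory catalog modelling the rename contract: fail if the source
// is missing or the target already exists; otherwise move the entry, even
// across namespaces, since this toy catalog chooses to support that.
class ToyCatalog {
    private final Map<List<String>, String> tables = new HashMap<>();

    void create(List<String> ident) {
        tables.put(ident, "table-metadata");
    }

    void renameTable(List<String> oldIdent, List<String> newIdent) {
        if (!tables.containsKey(oldIdent)) {
            throw new NoSuchElementException("no such table: " + oldIdent);
        }
        if (tables.containsKey(newIdent)) {
            throw new IllegalStateException("table already exists: " + newIdent);
        }
        tables.put(newIdent, tables.remove(oldIdent));
    }

    boolean exists(List<String> ident) {
        return tables.containsKey(ident);
    }
}
```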
```diff
@@ -153,6 +153,14 @@ class V2SessionCatalog(sessionState: SessionState) extends TableCatalog {
     }
   }

+  override def renameTable(oldIdent: Identifier, newIdent: Identifier): Unit = {
```
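A rough sketch of what such a session-catalog override might do (hypothetical Java rendering; the real implementation is Scala and delegates to the underlying v1 SessionCatalog):

```java
import java.util.List;

// Illustrative session-catalog-style rename: keep the existing v1 behavior
// by rejecting renames whose source and target namespaces differ before
// delegating to the underlying catalog.
class SessionCatalogSketch {
    void renameTable(List<String> oldIdent, List<String> newIdent) {
        List<String> oldNs = oldIdent.subList(0, oldIdent.size() - 1);
        List<String> newNs = newIdent.subList(0, newIdent.size() - 1);
        if (!oldNs.equals(newNs)) {
            throw new UnsupportedOperationException(
                "RENAME TABLE cannot move a table between namespaces: "
                + oldNs + " -> " + newNs);
        }
        // delegate the single-namespace rename to the underlying catalog here
    }
}
```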
Can you also add test cases for v2 session catalog?
Yeah, it'd be nice to see tests for renaming across namespaces (failing), etc.
I just added the tests.
Looks good to me! I'd like tests for the v2 session catalog change, but otherwise I'm +1.
Yeah, once tests for the V2 session catalog are added, I can merge this.
ok to test
LGTM. Shall we also update the RENAME TABLE command? I'm fine if we create a ticket and do it in a follow-up.
Test build #109011 has finished for PR 25206.
@cloud-fan I've created ticket https://issues.apache.org/jira/browse/SPARK-28717 for the follow-up work to update the RENAME TABLE SQL command.
Test build #109053 has finished for PR 25206.
thanks, merging to master!
What changes were proposed in this pull request?
This PR adds the renameTable call to the TableCatalog API, as described in the Table Metadata API SPIP. This PR is related to: #24246

How was this patch tested?
Added unit tests and contract tests.