
[SPARK-28265][SQL] Add renameTable to TableCatalog API #25206

Closed

Conversation

edgarRd
Contributor

@edgarRd edgarRd commented Jul 19, 2019

What changes were proposed in this pull request?

This PR adds the renameTable call to the TableCatalog API, as described in the Table Metadata API SPIP.

This PR is related to: #24246

How was this patch tested?

Added unit tests and contract tests.
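For context, the semantics of the new call can be sketched with a toy in-memory catalog. The `InMemoryCatalog` below is illustrative only (the real API uses `Identifier` objects and Spark's exception types); it shows rename as a pure metadata re-key within one catalog:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of TableCatalog#renameTable semantics: the rename is a
// metadata-only operation that re-keys the table entry under the new
// identifier. Identifiers are simplified to dot-separated strings.
class InMemoryCatalog {
    private final Map<String, String> tables = new HashMap<>();

    void createTable(String ident, String metadata) {
        tables.put(ident, metadata);
    }

    boolean tableExists(String ident) {
        return tables.containsKey(ident);
    }

    void renameTable(String oldIdent, String newIdent) {
        if (!tables.containsKey(oldIdent)) {
            throw new IllegalStateException("no such table: " + oldIdent);
        }
        if (tables.containsKey(newIdent)) {
            throw new IllegalStateException("table already exists: " + newIdent);
        }
        tables.put(newIdent, tables.remove(oldIdent));
    }
}
```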

@edgarRd
Contributor Author

edgarRd commented Jul 19, 2019

PTAL @rdblue
@mccheah, @cloud-fan since you originally reviewed #24246

@mccheah
Contributor

mccheah commented Jul 19, 2019

Are we going to support this from the SQL side? I think it's hard to evaluate the semantics of the API without seeing how it will be used in query planning and execution.

@rdblue
Contributor

rdblue commented Jul 19, 2019

@mccheah, this is supported by Spark for v1: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L130-L131

I think it makes sense to add it. The plan is simple and just has two identifiers, but we do need to decide whether to require that both identifiers start with the same catalog. For example, what should happen in this case?

ALTER TABLE cat.db.table RENAME TO db.table2

Should the catalog name be implied? I think this should throw an error because the catalogs for the identifiers do not match.
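One way to express that check, as a hypothetical helper rather than code from this PR: treat a three-part name as catalog.namespace.table, and fail when the explicit catalogs of the two identifiers don't match:

```java
import java.util.Objects;

// Hypothetical validation for ALTER TABLE ... RENAME TO: if either side
// names a catalog explicitly, both sides must resolve to the same catalog.
// A three-part identifier is read as catalog.namespace.table; shorter
// identifiers carry no explicit catalog (null here), so the example in
// the comment above (cat.db.table -> db.table2) is rejected.
class RenameCatalogCheck {
    static String explicitCatalog(String[] parts) {
        return parts.length == 3 ? parts[0] : null;
    }

    static void requireSameCatalog(String[] source, String[] target) {
        String src = explicitCatalog(source);
        String dst = explicitCatalog(target);
        if (!Objects.equals(src, dst)) {
            throw new IllegalArgumentException(
                "cannot rename across catalogs: " + src + " vs " + dst);
        }
    }
}
```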

@cloud-fan
Contributor

but we do need to decide whether to require that both identifiers start with the same catalog

A similar problem was already discussed before and the decision was: RENAME TABLE should only do rename, not things like moving databases. As a result, the ExternalCatalog.renameTable is defined as def renameTable(db: String, oldName: String, newName: String): Unit.

We can also enforce it at the parser level: ALTER TABLE multiPartName RENAME TO singlePartName. We didn't do it before for backward compatibility reasons, it's a good chance to do it for DS v2 now.
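A parser-level rule of that shape could look roughly like the following ANTLR fragment; it is illustrative only, written in the style of SqlBase.g4 rather than copied from it:

```antlr
// Hypothetical rule: the name after RENAME TO is single-part, so a
// cross-namespace move cannot even be expressed in the syntax.
renameTableStatement
    : ALTER TABLE from=multipartIdentifier RENAME TO to=identifier
    ;
```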

@edgarRd
Contributor Author

edgarRd commented Jul 22, 2019

@cloud-fan @rdblue Originally I had committed a version enforcing the same namespace for the table identifiers, since the spec does not define the behavior in that case; I tried to be consistent with the existing behavior in Spark, where the enforcement is already in place.

Also, I think there could be cases where the catalog could rename a table across databases without requiring any move of data. However, if there has already been agreement that the renameTable operation only works within the same DB, then I can revert the change I had enforcing the same namespace. If we want to add the enforcement at the parser level, maybe we can do it in another PR.

Please let me know if there are any more comments on this.

@cloud-fan
Contributor

Let's enforce that to be consistent with the existing behaviors.

@edgarRd edgarRd force-pushed the SPARK-28265-add-rename-table-catalog-api branch from 0a64ab0 to c37202d on July 25, 2019 00:59
@edgarRd
Contributor Author

edgarRd commented Jul 30, 2019

@cloud-fan @rdblue I've added back the namespace checks, raising AnalysisException to match the existing behavior. Please let me know if there's anything else to address. Thanks.

@rdblue
Contributor

rdblue commented Jul 30, 2019

Sorry I missed @cloud-fan's comments before now.

I disagree with the decision to throw AnalysisException if the namespace doesn't match. What is the added benefit here? If a catalog can rename a table to a different namespace, Spark should allow it. I see no reason why Spark should artificially limit a catalog's ability to rename.

@rdblue
Contributor

rdblue commented Jul 30, 2019

Also I'm -1 on the changes suggested by @mgaido91. The original documentation matched the javadoc for the other methods by using "rename" instead of "renames".

In addition, the phrase "must throw" is used specifically to tell implementers the behavior that is required by this API. Using "throws" is appropriate for an API consumer, but the audience for this documentation is the API implementer. The use of "must throw" should be added back.

@edgarRd edgarRd force-pushed the SPARK-28265-add-rename-table-catalog-api branch from c37202d to 5d2dc8f on July 30, 2019 21:41
@edgarRd
Contributor Author

edgarRd commented Jul 30, 2019

Just to be clear here, @cloud-fan, as mentioned before something like:

ALTER TABLE cat.db.table RENAME TO db.table2

Would still fail, since the catalog is not the same and I don't think we should imply it; this check can live in the SQL operation implementation. However, to @rdblue's point, if the catalog supports renames across DBs, I don't see why Spark should block it either; existing behavior is not affected by this change if the catalog does not support such an operation.

@rdblue
Contributor

rdblue commented Jul 30, 2019

+1 for failing when the catalog is not the same. Spark should validate that the same catalog is responsible for both table identifiers.

@cloud-fan
Contributor

@rdblue IMO, we should first think about what behavior Spark should provide to end-users, and then design the Data Source API to enforce the behavior.

RENAME TABLE should only rename a table, not move it between namespaces. This is intuitive and matches the behavior of Hive, Postgres, and other mainstream SQL systems. This is the decision we made in #14955. If you want to revisit it, please open a new PR instead of mixing it with adding a new API.

If a catalog can move tables between namespaces, that's good and we can probably introduce a new command like MOVE TABLE to expose this ability to end-users. But we shouldn't extend the behavior of RENAME TABLE to expose this ability.

@rdblue
Contributor

rdblue commented Jul 31, 2019

@cloud-fan, I'm not convinced that a decision for v1 applies to v2, considering all of the challenges with v1 and its very narrow focus on what works with the Hive metastore.

@brkyvz, @mccheah, and @RussellSpitzer, what do you think about allowing rename across namespaces or not?

@RussellSpitzer
Member

RussellSpitzer commented Jul 31, 2019

There is no value for Cassandra in renames across namespaces, but I don't have strong feelings about Spark enforcing this. I'm fine with rejecting this at the catalog level, outside of the Analyzer.

Looking closer at the commit, I'm fine with rename being able to throw AnalysisException. I don't think every catalog has to implement this in the same way, though, so perhaps we just change the javadoc notes?

@mccheah
Contributor

mccheah commented Jul 31, 2019

I think we have to be more flexible here because of how loose the definition of a namespace is. Sure, many traditional database systems don't inherently support moving tables from one database to another - but the definition of a namespace in the context of DSv2 is more broad than a database. A namespace could map to a user, for example, in which case moving a table from one namespace to another is effectively transferring ownership of the table; then the underlying catalog could, for example, check ACLs as part of validation and potentially throw an authorization error instead of an analysis one.

It depends on the goal at the end of the day - is the goal that the same SQL statements produce exactly the same behavior across all catalogs, or is the goal that the catalog implementation is given maximum optionality and flexibility? If we want SQL to behave exactly the same across all catalogs, then yes we would throw the analysis exception, but then catalog implementations would never be given the opportunity to move tables from one namespace to another, effectively forcing the user to find a workaround. (We could add a flag to ALTER TABLE... RENAME, but if a given database doesn't support moving across namespaces anyways, then this SQL statement would suddenly have different behavior across catalogs anyways, negating the original goal.)

But I will note that we've been aiming for consistency in other parts of the DSv2 project. I think it's a fine balance and something we have to call on a case by case basis.

@rdblue
Contributor

rdblue commented Jul 31, 2019

I think consistency in this case means that table renames behave the same way, not that all sources have the same level of support. A similar example is the effort to add DELETE FROM. JDBC sources can delete any record from a filter, while file sources can only implement filter deletes that use only partition columns. Sources aren't required to have uniform capabilities, they are just required to have consistent behavior -- even if that behavior is what to do when something is not supported.

@cloud-fan
Contributor

Hey, I'm not against the ability to move tables between namespaces, I'm against doing this via RENAME TABLE. The example of mapping namespaces to users and transferring table ownership does make sense, but it's super weird to accomplish it by RENAME TABLE. I'd expect to see something like MOVE TABLE t FROM ns1 TO ns2.

I agree with @rdblue about the definition of consistency here. It's OK if JDBC always supports DELETE FROM while file source may throw an exception and say the delete condition can only refer to partition columns. But the problem here is not consistency. It's about if RENAME TABLE can change the namespace of the table while renaming it. IMO namespace is part of the table identifier but not part of the table name.

FYI, in other RDBMS, the parser doesn't even allow you to put a multi-part table identifier after ALTER TABLE ... RENAME TO. Spark allows it only for backward compatibility.

@rdblue
Contributor

rdblue commented Aug 2, 2019

@cloud-fan, so you're suggesting Spark should create a new SQL command instead of using RENAME because RENAME would be confusing?

I think RENAME fits this operation. We're just changing the full table name, including namespace, instead of just the short name. In contrast, MOVE implies a more expensive operation like a copy of data.

Last, MOVE would require new parser rules, plans, etc. where using RENAME would require only small changes. I see no real benefit to using MOVE instead of RENAME, and I think that the catalog API would be identical to what is proposed here because Spark doesn't intend to actually run a job to copy and move a table. In other words, if this is implemented as a catalog operation, it should be rename because the catalog isn't going to run a job to rewrite data.

@brkyvz
Contributor

brkyvz commented Aug 2, 2019

I'm in favor of being able to rename tables across namespaces. The V2SessionCatalog can reject such an operation (validate that the namespaces match), since the HiveMetaStore wouldn't allow it.
Thinking from a use cases standpoint, I like the example that @mccheah gave around ACLs. Also imagine having an object store catalog. Moving an entire table from dir1/dir2/abc to dir3/abc would be a valid operation. On HDFS or Azure Data Lake Gen2, these would be very quick operations.
Spark is not an RDBMS, so I don't see a reason for constraining the functionality to RDBMS behavior, as long as the new behavior we're allowing isn't more confusing and doesn't let users shoot themselves in the foot.

The kind of use cases I see are:

ALTER TABLE a RENAME TO b;                      -- regular RDBMS behavior 
ALTER TABLE dev.a RENAME TO prod.               -- fail
ALTER TABLE hdfs.dev.a RENAME TO hdfs.prod.a;   -- ensure prod is a namespace

@cloud-fan
Contributor

We're just changing the full table name, including namespace, instead of just the short name.

If by definition namespace is part of the table name, then it's OK. I thought table name and namespace together are the table identifier.

For the parser rule, what I expect is

ALTER TABLE [catalog_name] [namespace_name] table_name
RENAME TO [new_namespace_name] new_table_name

That said, I think this command should never fail on the Spark side. The table identifier after TO should always be interpreted as namespace + table_name.
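Interpreting everything after TO as namespace plus table name is mechanically simple; a hypothetical helper (not code from this PR) would just split off the last part:

```java
import java.util.Arrays;

// Hypothetical helper: for the identifier after RENAME TO, the last part
// is always the table name and everything before it is the namespace.
class PostToIdentifier {
    static String[] namespaceOf(String[] parts) {
        return Arrays.copyOf(parts, parts.length - 1);
    }

    static String nameOf(String[] parts) {
        return parts[parts.length - 1];
    }
}
```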

@edgarRd
Contributor Author

edgarRd commented Aug 12, 2019

Thanks everyone for your input. I think it looks like we agree on the following points for this PR:

  1. The behavior of the catalog operation renameTable is to change the table name, including namespace, as defined by the table identifier, from oldIdentifier to newIdentifier.
  2. It’s up to the catalog implementation to support this operation or not. The TableCatalog#renameTable operation will throw an appropriate exception based on the catalog-specific implementation.
  3. Spark does not constrain the behavior to particularities of the underlying catalog, nor does it aim to define how the catalog must implement this operation, e.g. table move, metadata change, directory relocation, etc.
  4. Rename table is only defined for tables within the same catalog.

Based on the previous points, I’ve pushed the changes back to how I originally submitted this PR, where the catalog does not throw AnalysisException and instead the more general UnsupportedOperationException is optionally thrown.
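Point 2 can be sketched with a hypothetical catalog (not the PR's V2SessionCatalog) that only supports same-namespace renames and signals anything else with UnsupportedOperationException:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative catalog: cannot move tables across namespaces, so it
// rejects such a rename with UnsupportedOperationException, while a
// same-namespace rename proceeds as a metadata re-key.
class SameNamespaceCatalog {
    private final Map<String, String> tables = new HashMap<>();

    private static String key(String[] ident) {
        return String.join(".", ident);
    }

    void createTable(String[] ident, String metadata) {
        tables.put(key(ident), metadata);
    }

    boolean tableExists(String[] ident) {
        return tables.containsKey(key(ident));
    }

    void renameTable(String[] oldIdent, String[] newIdent) {
        String[] oldNs = Arrays.copyOf(oldIdent, oldIdent.length - 1);
        String[] newNs = Arrays.copyOf(newIdent, newIdent.length - 1);
        if (!Arrays.equals(oldNs, newNs)) {
            throw new UnsupportedOperationException(
                "this catalog only renames tables within a namespace");
        }
        tables.put(key(newIdent), tables.remove(key(oldIdent)));
    }
}
```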

@rdblue @cloud-fan please let me know if there are any other concerns. Thank you.

@edgarRd edgarRd force-pushed the SPARK-28265-add-rename-table-catalog-api branch from 5d2dc8f to 4c0d834 on August 12, 2019 19:23
@@ -153,6 +153,14 @@ class V2SessionCatalog(sessionState: SessionState) extends TableCatalog {
}
}

override def renameTable(oldIdent: Identifier, newIdent: Identifier): Unit = {
Contributor

Can you also add test cases for v2 session catalog?

Contributor

Yeah, it'd be nice to see tests for renaming across namespaces (failing), etc.

Contributor Author

I just added the tests.

@rdblue
Contributor

rdblue commented Aug 12, 2019

Looks good to me! I'd like tests for the v2 session catalog change, but otherwise I'm +1.

@brkyvz
Contributor

brkyvz commented Aug 12, 2019

Yeah, once tests for V2 session catalog are added, I can merge this

@cloud-fan
Contributor

ok to test

@cloud-fan
Contributor

LGTM. Shall we also update the RENAME TABLE command? I'm fine if we create a ticket and do it in a follow-up.

@SparkQA

SparkQA commented Aug 13, 2019

Test build #109011 has finished for PR 25206 at commit e95c603.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@edgarRd
Contributor Author

edgarRd commented Aug 13, 2019

@cloud-fan I've created ticket https://issues.apache.org/jira/browse/SPARK-28717 for the follow-up work to update the SQL command ALTER TABLE RENAME.

@SparkQA

SparkQA commented Aug 13, 2019

Test build #109053 has finished for PR 25206 at commit 71d9ad1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 598fcbe Aug 14, 2019