
[SPARK-28265][SQL] Add renameTable to TableCatalog API #25206

Closed

Conversation

edgarRd
Contributor

@edgarRd edgarRd commented Jul 19, 2019

What changes were proposed in this pull request?

This PR adds the renameTable call to the TableCatalog API, as described in the Table Metadata API SPIP.

This PR is related to: #24246

How was this patch tested?

Added unit tests and contract tests.
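For context, the semantics of the new call can be sketched with a toy in-memory catalog. The `InMemoryCatalog` below is illustrative only (the real API uses `Identifier` objects and Spark's exception types); it shows rename as a pure metadata re-key within one catalog:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of TableCatalog#renameTable semantics: the rename is a
// metadata-only operation that re-keys the table entry under the new
// identifier. Identifiers are simplified to dot-separated strings.
class InMemoryCatalog {
    private final Map<String, String> tables = new HashMap<>();

    void createTable(String ident, String metadata) {
        tables.put(ident, metadata);
    }

    boolean tableExists(String ident) {
        return tables.containsKey(ident);
    }

    void renameTable(String oldIdent, String newIdent) {
        if (!tables.containsKey(oldIdent)) {
            throw new IllegalStateException("no such table: " + oldIdent);
        }
        if (tables.containsKey(newIdent)) {
            throw new IllegalStateException("table already exists: " + newIdent);
        }
        tables.put(newIdent, tables.remove(oldIdent));
    }
}
```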

@edgarRd
Contributor Author

edgarRd commented Jul 19, 2019

PTAL @rdblue
@mccheah, @cloud-fan since you originally reviewed #24246

@mccheah
Contributor

mccheah commented Jul 19, 2019

Are we going to support this from the SQL side? I think it's hard to evaluate the semantics of the API without seeing how it will be used in query planning and execution.

@rdblue
Contributor

rdblue commented Jul 19, 2019

@mccheah, this is supported by Spark for v1: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L130-L131

I think it makes sense to add it. The plan is simple and just has two identifiers, but we do need to decide whether to require that both identifiers start with the same catalog. For example, what should happen in this case?

ALTER TABLE cat.db.table RENAME TO db.table2

Should the catalog name be implied? I think this should throw an error because the catalogs for the identifiers do not match.
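One way to express that check, as a hypothetical helper rather than code from this PR: treat a three-part name as catalog.namespace.table, and fail when the explicit catalogs of the two identifiers don't match:

```java
import java.util.Objects;

// Hypothetical validation for ALTER TABLE ... RENAME TO: if either side
// names a catalog explicitly, both sides must resolve to the same catalog.
// A three-part identifier is read as catalog.namespace.table; shorter
// identifiers carry no explicit catalog (null here), so the example in
// the comment above (cat.db.table -> db.table2) is rejected.
class RenameCatalogCheck {
    static String explicitCatalog(String[] parts) {
        return parts.length == 3 ? parts[0] : null;
    }

    static void requireSameCatalog(String[] source, String[] target) {
        String src = explicitCatalog(source);
        String dst = explicitCatalog(target);
        if (!Objects.equals(src, dst)) {
            throw new IllegalArgumentException(
                "cannot rename across catalogs: " + src + " vs " + dst);
        }
    }
}
```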

@cloud-fan
Contributor

but we do need to decide whether to require that both identifiers start with the same catalog

A similar problem was already discussed before and the decision was: RENAME TABLE should only do rename, not things like moving databases. As a result, the ExternalCatalog.renameTable is defined as def renameTable(db: String, oldName: String, newName: String): Unit.

We can also enforce it at the parser level: ALTER TABLE multiPartName RENAME TO singlePartName. We didn't do it before for backward compatibility reasons, it's a good chance to do it for DS v2 now.
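A parser-level rule of that shape could look roughly like the following ANTLR fragment; it is illustrative only, written in the style of SqlBase.g4 rather than copied from it:

```antlr
// Hypothetical rule: the name after RENAME TO is single-part, so a
// cross-namespace move cannot even be expressed in the syntax.
renameTableStatement
    : ALTER TABLE from=multipartIdentifier RENAME TO to=identifier
    ;
```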

@edgarRd
Contributor Author

edgarRd commented Jul 22, 2019

@cloud-fan @rdblue Originally I had committed a version enforcing the same namespace for the table identifiers, since the spec does not define the behavior in that case; I tried to be consistent with the existing behavior in Spark, where the enforcement is already in place.

Also, I think there could be cases where the catalog could rename a table across databases without requiring any move of data. However, if there has already been agreement that the renameTable operation only works within the same DB, then I can revert the change I had enforcing the same namespace. If we want to add the enforcement at the parser level, maybe we can do it in another PR.

Please let me know if there are any more comments on this.

@cloud-fan
Contributor

Let's enforce that to be consistent with the existing behaviors.

@edgarRd edgarRd force-pushed the SPARK-28265-add-rename-table-catalog-api branch from 0a64ab0 to c37202d on July 25, 2019 00:59
@edgarRd
Contributor Author

edgarRd commented Jul 30, 2019

@cloud-fan @rdblue I've added back the namespace checks, raising AnalysisException to match the existing behavior. Please let me know if there's anything else to address. Thanks.

@rdblue
Contributor

rdblue commented Jul 30, 2019

Sorry I missed @cloud-fan's comments before now.

I disagree with the decision to throw AnalysisException if the namespace doesn't match. What is the added benefit here? If a catalog can rename a table to a different namespace, Spark should allow it. I see no reason why Spark should artificially limit a catalog's ability to rename.

@rdblue
Contributor

rdblue commented Jul 30, 2019

Also I'm -1 on the changes suggested by @mgaido91. The original documentation matched the javadoc for the other methods by using "rename" instead of "renames".

In addition, the phrase "must throw" is used specifically to tell implementers the behavior that is required by this API. Using "throws" is appropriate for an API consumer, but the audience for this documentation is the API implementer. The use of "must throw" should be added back.

@edgarRd edgarRd force-pushed the SPARK-28265-add-rename-table-catalog-api branch from c37202d to 5d2dc8f on July 30, 2019 21:41
@edgarRd
Contributor Author

edgarRd commented Jul 30, 2019

Just to be clear here, @cloud-fan, as mentioned before something like:

ALTER TABLE cat.db.table RENAME TO db.table2

Would still fail, since the catalog is not the same and I don't think we should imply it; this check can live in the SQL operation implementation. However, to @rdblue's point, if the catalog supports renames across DBs, I don't see why Spark should block it either; existing behavior is not affected by this change if the catalog does not support such an operation.

@rdblue
Contributor

rdblue commented Jul 30, 2019

+1 for failing when the catalog is not the same. Spark should validate that the same catalog is responsible for both table identifiers.

@cloud-fan
Contributor

@rdblue IMO, we should first think about what behavior Spark should provide to end-users, and then design the Data Source API to enforce the behavior.

RENAME TABLE should only rename a table, not move it between namespaces. This is intuitive and matches the behavior of Hive, Postgres, and other mainstream SQL systems. This is the decision we made in #14955. If you want to revisit it, please open a new PR instead of mixing it with adding a new API.

If a catalog can move tables between namespaces, that's good and we can probably introduce a new command like MOVE TABLE to expose this ability to end-users. But we shouldn't extend the behavior of RENAME TABLE to expose this ability.

@rdblue
Contributor

rdblue commented Jul 31, 2019

@cloud-fan, I'm not convinced that a decision for v1 applies to v2, considering all of the challenges with v1 and its very narrow focus on what works with the Hive metastore.

@brkyvz, @mccheah, and @RussellSpitzer, what do you think about allowing rename across namespaces or not?

@RussellSpitzer
Member

RussellSpitzer commented Jul 31, 2019

There is no value for Cassandra in renames across namespaces, but I don't have strong feelings about Spark enforcing this. I'm fine with rejecting this at the catalog level, outside of the Analyzer.

Looking closer at the commit, I'm fine with rename being able to throw AnalysisException. I don't think every catalog has to implement this in the same way, though, so perhaps we just change the javadoc notes?

@mccheah
Contributor

mccheah commented Jul 31, 2019

I think we have to be more flexible here because of how loose the definition of a namespace is. Sure, many traditional database systems don't inherently support moving tables from one database to another - but the definition of a namespace in the context of DSv2 is more broad than a database. A namespace could map to a user, for example, in which case moving a table from one namespace to another is effectively transferring ownership of the table; then the underlying catalog could, for example, check ACLs as part of validation and potentially throw an authorization error instead of an analysis one.

It depends on the goal at the end of the day - is the goal that the same SQL statements produce exactly the same behavior across all catalogs, or is the goal that the catalog implementation is given maximum optionality and flexibility? If we want SQL to behave exactly the same across all catalogs, then yes we would throw the analysis exception, but then catalog implementations would never be given the opportunity to move tables from one namespace to another, effectively forcing the user to find a workaround. (We could add a flag to ALTER TABLE... RENAME, but if a given database doesn't support moving across namespaces anyways, then this SQL statement would suddenly have different behavior across catalogs anyways, negating the original goal.)

But I will note that we've been aiming for consistency in other parts of the DSv2 project. I think it's a fine balance and something we have to call on a case by case basis.

@rdblue
Contributor

rdblue commented Jul 31, 2019

I think consistency in this case means that table renames behave the same way, not that all sources have the same level of support. A similar example is the effort to add DELETE FROM. JDBC sources can delete any record from a filter, while file sources can only implement filter deletes that use only partition columns. Sources aren't required to have uniform capabilities, they are just required to have consistent behavior -- even if that behavior is what to do when something is not supported.

@cloud-fan
Contributor

Hey, I'm not against the ability to move tables between namespaces, I'm against doing this via RENAME TABLE. The example of mapping namespaces to users and transferring table ownership does make sense, but it's super weird to accomplish it by RENAME TABLE. I'd expect to see something like MOVE TABLE t FROM ns1 TO ns2.

I agree with @rdblue about the definition of consistency here. It's OK if JDBC always supports DELETE FROM while file source may throw an exception and say the delete condition can only refer to partition columns. But the problem here is not consistency. It's about if RENAME TABLE can change the namespace of the table while renaming it. IMO namespace is part of the table identifier but not part of the table name.

FYI, in other RDBMS, the parser doesn't even allow you to put a multi-part table identifier after ALTER TABLE ... RENAME TO. Spark allows it only for backward compatibility.

@rdblue
Contributor

rdblue commented Aug 2, 2019

@cloud-fan, so you're suggesting Spark should create a new SQL command instead of using RENAME because RENAME would be confusing?

I think RENAME fits this operation. We're just changing the full table name, including namespace, instead of just the short name. In contrast, MOVE implies a more expensive operation like a copy of data.

Last, MOVE would require new parser rules, plans, etc. where using RENAME would require only small changes. I see no real benefit to using MOVE instead of RENAME, and I think that the catalog API would be identical to what is proposed here because Spark doesn't intend to actually run a job to copy and move a table. In other words, if this is implemented as a catalog operation, it should be rename because the catalog isn't going to run a job to rewrite data.

@brkyvz
Contributor

brkyvz commented Aug 2, 2019

I'm in favor of being able to rename tables across namespaces. The V2SessionCatalog can reject such an operation (validate that the namespaces match), since the HiveMetaStore wouldn't allow it.
Thinking from a use cases standpoint, I like the example that @mccheah gave around ACLs. Also imagine having an object store catalog. Moving an entire table from dir1/dir2/abc to dir3/abc would be a valid operation. On HDFS or Azure Data Lake Gen2, these would be very quick operations.
Spark is not an RDBMS, so I don't see a reason for constraining the functionality to RDBMS behavior, as long as the new behavior we're allowing isn't more confusing and doesn't let users shoot themselves in the foot.

The kind of use cases I see are:

ALTER TABLE a RENAME TO b;                      -- regular RDBMS behavior 
ALTER TABLE dev.a RENAME TO prod.               -- fail
ALTER TABLE hdfs.dev.a RENAME TO hdfs.prod.a;   -- ensure prod is a namespace

@cloud-fan
Contributor

We're just changing the full table name, including namespace, instead of just the short name.

If by definition namespace is part of the table name, then it's OK. I thought table name and namespace together are the table identifier.

For the parser rule, what I expect is

ALTER TABLE [catalog_name] [namespace_name] table_name
RENAME TO [new_namespace_name] new_table_name

That said, I think this command should never fail on the Spark side. The table identifier after TO should always be interpreted as namespace + table_name.
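Interpreting everything after TO as namespace plus table name is mechanically simple; a hypothetical helper (not code from this PR) would just split off the last part:

```java
import java.util.Arrays;

// Hypothetical helper: for the identifier after RENAME TO, the last part
// is always the table name and everything before it is the namespace.
class PostToIdentifier {
    static String[] namespaceOf(String[] parts) {
        return Arrays.copyOf(parts, parts.length - 1);
    }

    static String nameOf(String[] parts) {
        return parts[parts.length - 1];
    }
}
```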

@edgarRd
Contributor Author

edgarRd commented Aug 12, 2019

Thanks everyone for your input. I think it looks like we agree on the following points for this PR:

  1. The behavior of the catalog operation renameTable is to change the table name, including namespace, as defined by the table identifier, from oldIdentifier to newIdentifier.
  2. It’s up to the catalog implementation to support this operation or not. The TableCatalog#renameTable operation will throw an appropriate exception based on the catalog-specific implementation.
  3. Spark does not constrain the behavior to particularities of the underlying catalog, nor does it aim to define how the catalog must implement this operation, e.g. table move, metadata change, directory relocation, etc.
  4. Rename table is only defined for tables within the same catalog.

Based on the previous points, I’ve pushed the changes back to how I originally submitted this PR, where the catalog does not throw AnalysisException and instead the more general UnsupportedOperationException is optionally thrown.
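Point 2 can be sketched with a hypothetical catalog (not the PR's V2SessionCatalog) that only supports same-namespace renames and signals anything else with UnsupportedOperationException:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative catalog: cannot move tables across namespaces, so it
// rejects such a rename with UnsupportedOperationException, while a
// same-namespace rename proceeds as a metadata re-key.
class SameNamespaceCatalog {
    private final Map<String, String> tables = new HashMap<>();

    private static String key(String[] ident) {
        return String.join(".", ident);
    }

    void createTable(String[] ident, String metadata) {
        tables.put(key(ident), metadata);
    }

    boolean tableExists(String[] ident) {
        return tables.containsKey(key(ident));
    }

    void renameTable(String[] oldIdent, String[] newIdent) {
        String[] oldNs = Arrays.copyOf(oldIdent, oldIdent.length - 1);
        String[] newNs = Arrays.copyOf(newIdent, newIdent.length - 1);
        if (!Arrays.equals(oldNs, newNs)) {
            throw new UnsupportedOperationException(
                "this catalog only renames tables within a namespace");
        }
        tables.put(key(newIdent), tables.remove(key(oldIdent)));
    }
}
```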

@rdblue @cloud-fan please let me know if there are any other concerns. Thank you.

@edgarRd edgarRd force-pushed the SPARK-28265-add-rename-table-catalog-api branch from 5d2dc8f to 4c0d834 on August 12, 2019 19:23
@@ -153,6 +153,14 @@ class V2SessionCatalog(sessionState: SessionState) extends TableCatalog {
}
}

override def renameTable(oldIdent: Identifier, newIdent: Identifier): Unit = {
Contributor

Can you also add test cases for v2 session catalog?

Contributor

Yeah, it'd be nice to see tests for renaming across namespaces (failing), etc.

Contributor Author

I just added the tests.

@rdblue
Contributor

rdblue commented Aug 12, 2019

Looks good to me! I'd like tests for the v2 session catalog change, but otherwise I'm +1.

@brkyvz
Contributor

brkyvz commented Aug 12, 2019

Yeah, once tests for V2 session catalog are added, I can merge this

@cloud-fan
Contributor

ok to test

@cloud-fan
Contributor

LGTM. Shall we also update the RENAME TABLE command? I'm fine if we create a ticket and do it in a follow-up.

@SparkQA

SparkQA commented Aug 13, 2019

Test build #109011 has finished for PR 25206 at commit e95c603.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@edgarRd
Contributor Author

edgarRd commented Aug 13, 2019

@cloud-fan I've created ticket https://issues.apache.org/jira/browse/SPARK-28717 for the follow-up work to update the SQL command ALTER TABLE RENAME.

@SparkQA

SparkQA commented Aug 13, 2019

Test build #109053 has finished for PR 25206 at commit 71d9ad1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 598fcbe Aug 14, 2019