
[SPARK-48413][SQL] ALTER COLUMN with collation #46734

Closed
wants to merge 4 commits

Conversation

nikolamand-db
Contributor

What changes were proposed in this pull request?

Add support for changing the collation of a column with the `ALTER COLUMN` command, reusing the existing `ALTER COLUMN ... TYPE` support. Syntax example:

```
ALTER TABLE t1 ALTER COLUMN col TYPE STRING COLLATE UTF8_BINARY_LCASE
```

Why are the changes needed?

Enables changing the collation of a column.

Does this PR introduce any user-facing change?

Yes, it adds support for changing the collation of a column.

How was this patch tested?

Added tests to `DDLSuite` and `DataTypeSuite`.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label May 24, 2024
@nikolamand-db nikolamand-db changed the title [SPARK-48413] ALTER COLUMN with collation [SPARK-48413][SQL] ALTER COLUMN with collation May 24, 2024
```
@@ -445,7 +446,8 @@ case class AlterTableChangeColumnCommand(
   // name(by resolver) and dataType.
   private def columnEqual(
       field: StructField, other: StructField, resolver: Resolver): Boolean = {
-    resolver(field.name, other.name) && field.dataType == other.dataType
+    resolver(field.name, other.name) &&
+      DataType.equalsIgnoreCompatibleCollationAndNullability(field.dataType, other.dataType)
```
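The new check walks both types recursively and treats string types as equal regardless of collation. A minimal, self-contained sketch of the idea (using a hypothetical stand-in type hierarchy, not Spark's actual `DataType` classes):

```scala
// Hypothetical mini type hierarchy standing in for Spark's DataType.
sealed trait MiniType
case class MiniString(collation: String) extends MiniType
case object MiniInt extends MiniType
case class MiniArray(element: MiniType, containsNull: Boolean) extends MiniType

// Equal up to collation: string types match regardless of collation;
// everything else (including the array's containsNull flag) must match exactly.
def equalsIgnoreCompatibleCollation(from: MiniType, to: MiniType): Boolean =
  (from, to) match {
    case (_: MiniString, _: MiniString) => true
    case (MiniArray(fromElem, fromNull), MiniArray(toElem, toNull)) =>
      fromNull == toNull && equalsIgnoreCompatibleCollation(fromElem, toElem)
    case _ => from == to
  }
```

Under this sketch, `MiniString("UTF8_BINARY")` and `MiniString("UTF8_BINARY_LCASE")` compare equal, while any non-collation difference is rejected.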
Contributor

The comment of the command at line 359 is now outdated.

Contributor Author

Fixed, please check.

```
 * Check if `from` is equal to `to` type except for collations and nullability, which are
 * both checked to be compatible so that data of type `from` can be interpreted as of type `to`.
 */
private[sql] def equalsIgnoreCompatibleCollationAndNullability(
```
Contributor

Let's give this a name that reflects that it returns true if the type is allowed to evolve. This will hold for more cases than collations in the future.

Contributor Author

Please check #46734 (comment). Part of the effort was to change the name of this method to `equalsIgnoreCompatibleCollation`.

```
        equalsIgnoreCompatibleCollationAndNullability(fromValue, toValue)

      case (StructType(fromFields), StructType(toFields)) =>
        fromFields.length == toFields.length &&
```
Contributor

or do we use a map to not depend on the order of fields?

Contributor Author

Please check #46734 (comment).


```
      case (StructType(fromFields), StructType(toFields)) =>
        fromFields.length == toFields.length &&
          fromFields.zip(toFields).forall { case (fromField, toField) =>
```
Contributor

I would rather do a `forall` on `fromFields` and look up each field by name in `toFields`. That way we do not depend on the order.
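An order-independent variant of the struct-field comparison could look roughly like this (a sketch with hypothetical `Field` and `fieldsMatchIgnoringOrder` names, not the code that was ultimately merged):

```scala
case class Field(name: String, dataType: String)

// Compare two field lists by name rather than by position: build a
// name -> field map for the target side and look each source field up.
def fieldsMatchIgnoringOrder(
    fromFields: Seq[Field],
    toFields: Seq[Field],
    typesCompatible: (String, String) => Boolean): Boolean = {
  val toByName = toFields.map(f => f.name -> f).toMap
  fromFields.length == toFields.length &&
    fromFields.forall { fromField =>
      toByName.get(fromField.name)
        .exists(toField => typesCompatible(fromField.dataType, toField.dataType))
    }
}
```

The length check still guards against duplicate or missing names on either side.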

Contributor Author

Please check #46734 (comment).

```
      case (_: StringType, _: StringType) => true

      case (ArrayType(fromElement, fn), ArrayType(toElement, tn)) =>
        (tn || !fn) && equalsIgnoreCompatibleCollationAndNullability(fromElement, toElement)
```
Contributor

Why do we want to allow setting the type nullable?

Contributor Author

Changed the approach: we now only check for a possible collation difference, and nullability must remain the same. By checking only for collation differences we scope down the change, whereas previously we required complete data type equality (only comment changes were allowed in the ALTER COLUMN command).

```
@@ -396,8 +396,9 @@ case class AlterTableChangeColumnCommand(
     val newDataSchema = table.dataSchema.fields.map { field =>
       if (field.name == originColumn.name) {
         // Create a new column from the origin column with the new comment.
+        val newField = field.copy(dataType = newColumn.dataType)
```
Contributor

Can you make it more explicit that we now effectively allow the data type to change, but only for type changes that differ by collation?
E.g. by splitting columnEqual into checking that the names match on one side and that the type change is supported on the other, plus adding a withNewType method.

Contributor Author

Clarified in comment, split the columnEqual to two functions and added withNewType in StructField, please check.
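The suggested split might look roughly like the following (a simplified sketch with stand-in types and a placeholder compatibility check; the merged code operates on Spark's `StructField` and `Resolver`):

```scala
case class Column(name: String, dataType: String)

// Name matching and type-change validation become separate concerns.
def columnsMatch(field: Column, other: Column,
                 resolver: (String, String) => Boolean): Boolean =
  resolver(field.name, other.name)

// Placeholder for the real check, which only permits collation changes:
// here we just compare the base type name before any COLLATE clause.
def canEvolveType(from: String, to: String): Boolean =
  from.takeWhile(_ != ' ') == to.takeWhile(_ != ' ')

// Keep everything about the column except its data type.
def withNewType(column: Column, dataType: String): Column =
  column.copy(dataType = dataType)
```

Splitting the checks keeps the error messages distinct: a name mismatch means the column was not found, while a failed `canEvolveType` means the requested type change is unsupported.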

```
@@ -396,8 +396,9 @@ case class AlterTableChangeColumnCommand(
     val newDataSchema = table.dataSchema.fields.map { field =>
       if (field.name == originColumn.name) {
         // Create a new column from the origin column with the new comment.
+        val newField = field.copy(dataType = newColumn.dataType)
```
Contributor

Lines 359-367: the comment needs to be updated to make clear that we now allow changing the collation.

Side note: it gives the hive-style syntax for ALTER COLUMN but that's only one of the two syntaxes, see:
https://github.com/apache/spark/blob/master/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L130

The other one is (partially) documented and is the more capable/preferred syntax: https://spark.apache.org/docs/3.5.1/sql-ref-syntax-ddl-alter-table.html#parameters-4

Contributor Author

Updated the comment, please check.

Contributor

@olaky olaky left a comment

Looks good, but I would limit the scope of the withNewType method to make sure it is not used by accident.

@stefankandic
Contributor

stefankandic commented May 27, 2024

We can address this in a followup, but we also shouldn't allow altering from string -> collated string if the table is partitioned or bucketed by that column.

@nikolamand-db
Contributor Author

This PR is currently blocked by #46758; it needs to be fixed for the tests to pass.

@olaky
Contributor

olaky commented May 28, 2024

We can address this in a followup, but we also shouldn't allow altering from string -> collated string if the table is partitioned or bucketed by that column.

I actually think that this is possible: Collations do not affect how we store files (we keep partitioning by UTF8_BINARY), and hence it would be possible to allow this. Am I missing something?

```
@@ -432,6 +445,10 @@ case class AlterTableChangeColumnCommand(
     }.getOrElse(throw QueryCompilationErrors.cannotFindColumnError(name, schema.fieldNames))
   }

+  // Change the dataType of the column.
+  private def withNewType(column: StructField, dataType: DataType): StructField =
+    column.copy(dataType = dataType)
```
Contributor

nit: Inlining `column.copy(dataType = dataType)` looks more concise than calling `withNewType`; do we really need to create this function?

Contributor Author

This was suggested by @johanl-db, please check #46734 (comment).

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in f9542d0 Jun 3, 2024
riyaverm-db pushed a commit to riyaverm-db/spark that referenced this pull request Jun 7, 2024
Closes apache#46734 from nikolamand-db/SPARK-48413.

Authored-by: Nikola Mandic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
MaxGekk pushed a commit that referenced this pull request Oct 22, 2024
### What changes were proposed in this pull request?
Fixes the analysis check error introduced in [this](#46734) PR, where the `ALTER COLUMN` command would fail on tables whose names include a catalog, ensuring consistent behavior across all table naming conventions.

### Why are the changes needed?
The change is needed to enable altering a column with all data sources and identifier variations.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added tests to `CollationSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48582 from jovanm-db/alterColumn.

Authored-by: Jovan Markovic <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Oct 22, 2024