
[SPARK-48413][SQL] ALTER COLUMN with collation #46734

Closed
wants to merge 4 commits

Conversation

nikolamand-db
Contributor

What changes were proposed in this pull request?

Add support for changing the collation of a column with the `ALTER COLUMN` command, reusing the existing `ALTER COLUMN ... TYPE` support. Syntax example:

```
ALTER TABLE t1 ALTER COLUMN col TYPE STRING COLLATE UTF8_BINARY_LCASE
```

Why are the changes needed?

Enables changing the collation of a column.

Does this PR introduce any user-facing change?

Yes, it adds support for changing the collation of a column.

How was this patch tested?

Added tests to `DDLSuite` and `DataTypeSuite`.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label May 24, 2024
@nikolamand-db nikolamand-db changed the title [SPARK-48413] ALTER COLUMN with collation [SPARK-48413][SQL] ALTER COLUMN with collation May 24, 2024
```
@@ -445,7 +446,8 @@ case class AlterTableChangeColumnCommand(
   // name(by resolver) and dataType.
   private def columnEqual(
       field: StructField, other: StructField, resolver: Resolver): Boolean = {
-    resolver(field.name, other.name) && field.dataType == other.dataType
+    resolver(field.name, other.name) &&
+      DataType.equalsIgnoreCompatibleCollationAndNullability(field.dataType, other.dataType)
```
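The new check walks both types recursively and treats string types as equal regardless of collation. A minimal, self-contained sketch of the idea (using a hypothetical stand-in type hierarchy, not Spark's actual `DataType` classes):

```scala
// Hypothetical mini type hierarchy standing in for Spark's DataType.
sealed trait MiniType
case class MiniString(collation: String) extends MiniType
case object MiniInt extends MiniType
case class MiniArray(element: MiniType, containsNull: Boolean) extends MiniType

// Equal up to collation: string types match regardless of collation;
// everything else (including the array's containsNull flag) must match exactly.
def equalsIgnoreCompatibleCollation(from: MiniType, to: MiniType): Boolean =
  (from, to) match {
    case (_: MiniString, _: MiniString) => true
    case (MiniArray(fromElem, fromNull), MiniArray(toElem, toNull)) =>
      fromNull == toNull && equalsIgnoreCompatibleCollation(fromElem, toElem)
    case _ => from == to
  }
```

Under this sketch, `MiniString("UTF8_BINARY")` and `MiniString("UTF8_BINARY_LCASE")` compare equal, while any non-collation difference is rejected.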
Contributor

The comment of the command at line 359 is now outdated.

Contributor Author

Fixed, please check.

```
 * Check if `from` is equal to `to` type except for collations and nullability, which are
 * both checked to be compatible so that data of type `from` can be interpreted as of type `to`.
 */
private[sql] def equalsIgnoreCompatibleCollationAndNullability(
```
Contributor

Let's give this a name that reflects that it returns true if the type is allowed to evolve. This will hold for more cases than collations in the future.

Contributor Author

Please check #46734 (comment). Part of the effort was to change the name of this method to `equalsIgnoreCompatibleCollation`.

```
        equalsIgnoreCompatibleCollationAndNullability(fromValue, toValue)

      case (StructType(fromFields), StructType(toFields)) =>
        fromFields.length == toFields.length &&
```
Contributor

or do we use a map to not depend on the order of fields?

Contributor Author

Please check #46734 (comment).


```
      case (StructType(fromFields), StructType(toFields)) =>
        fromFields.length == toFields.length &&
          fromFields.zip(toFields).forall { case (fromField, toField) =>
```
Contributor

I would rather do a `forall` on `fromFields` and look up each field by name in `toFields`. That way we do not depend on the order.
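An order-independent variant of the struct-field comparison could look roughly like this (a sketch with hypothetical `Field` and `fieldsMatchIgnoringOrder` names, not the code that was ultimately merged):

```scala
case class Field(name: String, dataType: String)

// Compare two field lists by name rather than by position: build a
// name -> field map for the target side and look each source field up.
def fieldsMatchIgnoringOrder(
    fromFields: Seq[Field],
    toFields: Seq[Field],
    typesCompatible: (String, String) => Boolean): Boolean = {
  val toByName = toFields.map(f => f.name -> f).toMap
  fromFields.length == toFields.length &&
    fromFields.forall { fromField =>
      toByName.get(fromField.name)
        .exists(toField => typesCompatible(fromField.dataType, toField.dataType))
    }
}
```

The length check still guards against duplicate or missing names on either side.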

Contributor Author

Please check #46734 (comment).

```
      case (_: StringType, _: StringType) => true

      case (ArrayType(fromElement, fn), ArrayType(toElement, tn)) =>
        (tn || !fn) && equalsIgnoreCompatibleCollationAndNullability(fromElement, toElement)
```
Contributor

Why do we want to allow setting the type nullable?

Contributor Author

Changed the approach: we now only check for a possible collation difference, and nullability must remain the same. By checking only for collation differences we scope down the change, whereas previously we required complete data type equality (only comment changes were allowed in the ALTER COLUMN command).

```
@@ -396,8 +396,9 @@ case class AlterTableChangeColumnCommand(
     val newDataSchema = table.dataSchema.fields.map { field =>
       if (field.name == originColumn.name) {
         // Create a new column from the origin column with the new comment.
+        val newField = field.copy(dataType = newColumn.dataType)
```
Contributor

Can you make it more explicit that we now effectively allow the data type to change, but only for type changes that differ by collation?
E.g. by splitting columnEqual into checking that the names match on one side and that the type change is supported on the other, plus adding a withNewType method.

Contributor Author

Clarified in comment, split the columnEqual to two functions and added withNewType in StructField, please check.
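The suggested split might look roughly like the following (a simplified sketch with stand-in types and a placeholder compatibility check; the merged code operates on Spark's `StructField` and `Resolver`):

```scala
case class Column(name: String, dataType: String)

// Name matching and type-change validation become separate concerns.
def columnsMatch(field: Column, other: Column,
                 resolver: (String, String) => Boolean): Boolean =
  resolver(field.name, other.name)

// Placeholder for the real check, which only permits collation changes:
// here we just compare the base type name before any COLLATE clause.
def canEvolveType(from: String, to: String): Boolean =
  from.takeWhile(_ != ' ') == to.takeWhile(_ != ' ')

// Keep everything about the column except its data type.
def withNewType(column: Column, dataType: String): Column =
  column.copy(dataType = dataType)
```

Splitting the checks keeps the error messages distinct: a name mismatch means the column was not found, while a failed `canEvolveType` means the requested type change is unsupported.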

```
@@ -396,8 +396,9 @@ case class AlterTableChangeColumnCommand(
     val newDataSchema = table.dataSchema.fields.map { field =>
       if (field.name == originColumn.name) {
         // Create a new column from the origin column with the new comment.
+        val newField = field.copy(dataType = newColumn.dataType)
```
Contributor

Lines 359-367: the comment needs to be updated to make clear that we now allow changing the collation.

Side note: it gives the hive-style syntax for ALTER COLUMN but that's only one of the two syntaxes, see:
https://github.com/apache/spark/blob/master/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L130

The other one is (partially) documented and is the more capable/preferred syntax: https://spark.apache.org/docs/3.5.1/sql-ref-syntax-ddl-alter-table.html#parameters-4

Contributor Author

Updated the comment, please check.

Contributor

@olaky olaky left a comment

Looks good, but I would limit the scope of the withNewType method to make sure it is not used by accident.

@stefankandic
Contributor

stefankandic commented May 27, 2024

We can address this in a followup, but we also shouldn't allow altering from string -> collated string if the table is partitioned or bucketed by that column.

@nikolamand-db
Contributor Author

This PR is currently blocked by #46758; it needs to be fixed for the tests to pass.

@olaky
Contributor

olaky commented May 28, 2024

We can address this in a followup, but we also shouldn't allow altering from string -> collated string if the table is partitioned or bucketed by that column.

I actually think that this is possible: Collations do not affect how we store files (we keep partitioning by UTF8_BINARY), and hence it would be possible to allow this. Am I missing something?

```
@@ -432,6 +445,10 @@ case class AlterTableChangeColumnCommand(
     }.getOrElse(throw QueryCompilationErrors.cannotFindColumnError(name, schema.fieldNames))
   }

+  // Change the dataType of the column.
+  private def withNewType(column: StructField, dataType: DataType): StructField =
+    column.copy(dataType = dataType)
```
Contributor

nit: Inlining `column.copy(dataType = dataType)` looks more concise than calling `withNewType`; do we really need to create this function?

Contributor Author

This was suggested by @johanl-db, please check #46734 (comment).

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in f9542d0 Jun 3, 2024
riyaverm-db pushed a commit to riyaverm-db/spark that referenced this pull request Jun 7, 2024
Closes apache#46734 from nikolamand-db/SPARK-48413.

Authored-by: Nikola Mandic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
MaxGekk pushed a commit that referenced this pull request Oct 22, 2024
### What changes were proposed in this pull request?
Fixes the analysis check error introduced in [this](#46734) PR, where the `ALTER COLUMN` command would fail on tables whose names include a catalog, ensuring consistent behavior across all table naming conventions.

### Why are the changes needed?
The change is needed to enable altering a column with all data sources and identifier variations.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added tests to `CollationSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48582 from jovanm-db/alterColumn.

Authored-by: Jovan Markovic <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Oct 22, 2024