[SPARK-35290][SQL] Append new nested struct fields rather than sort for unionByName with null filling #32448

Kimahriman · 2021-05-06T01:33:50Z

What changes were proposed in this pull request?

This PR changes the unionByName null-filling logic to append new nested struct fields from the right side of the union to the schema rather than sorting fields alphabetically. It removes the need for UpdateField expressions and instead directly projects new nested structs from each side of the union with the correct schema. The union'd schema, previously sorted alphabetically, is now "left dominant": the fields from the left side of the union come first, and the missing ones from the right are then appended in the order in which they originally appear.

It also adds a resolver to the StructType merging to handle case insensitivity, since the resulting union's logical and physical expressions use StructType.merge to provide the resulting schema of the union.
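The "left dominant" ordering described above can be illustrated with a small plain-Python model. This is only a sketch, not the Spark implementation: schemas are modelled as lists of `(name, type)` pairs, a nested struct is a nested list, and the function name `merge_left_dominant` and the sample schemas are hypothetical.

```python
# Toy model: a struct is a list of (name, type) pairs; a nested struct is a nested list.
# Names are matched exactly here; case handling is deliberately left out of this sketch.

def merge_left_dominant(left, right):
    """Keep the left fields in their order, then append fields found only on the right."""
    right_by_name = dict(right)
    merged = []
    for name, left_type in left:
        right_type = right_by_name.get(name)
        if isinstance(left_type, list) and isinstance(right_type, list):
            # Both sides have a struct under this name: merge the nested fields too.
            merged.append((name, merge_left_dominant(left_type, right_type)))
        else:
            merged.append((name, left_type))
    left_names = {name for name, _ in left}
    # Fields missing from the left are appended in the order they appear on the right.
    merged.extend((name, typ) for name, typ in right if name not in left_names)
    return merged


left = [("id", "int"), ("nested", [("b", "int"), ("a", "int")])]
right = [("id", "int"), ("nested", [("c", "string"), ("a", "int")]), ("extra", "string")]

merged = merge_left_dominant(left, right)
nested_names = [name for name, _ in dict(merged)["nested"]]
print(nested_names)          # appended order
print(sorted(nested_names))  # what alphabetical sorting would give instead
```

In this model the nested struct keeps `b, a` from the left and gains `c` at the end, whereas sorting would reorder the left side's own fields.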

Why are the changes needed?

Certain nested structs caused unionByName with null filling to error out because of part of the logic that rewrites the expression tree to sort the structs.

Does this PR introduce any user-facing change?

It shouldn't, other than fixing certain cases that caused errors. I don't know whether adding the resolver to the StructType merging has any unintended side effects, so I would definitely like some thoughts on that. Also, the order of the StructFields is slightly different now, though that shouldn't have much of an effect.

How was this patch tested?

Updated existing tests based on the new StructField ordering and added a new test for the case that was broken originally.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala

Kimahriman · 2021-05-06T01:36:50Z

...re/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala

@@ -429,7 +429,7 @@ object ParquetFileFormat extends Logging {
    }

    finalSchemas.reduceOption { (left, right) =>
-      try left.merge(right) catch { case e: Throwable =>
+      try left.merge(right, sparkSession.sessionState.conf.resolver) catch { case e: Throwable =>


Also don't know enough about this code to know what the impact is

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala

HyukjinKwon · 2021-05-06T02:52:51Z

cc @viirya FYI

viirya · 2021-05-07T07:15:08Z

ok to test

SparkQA · 2021-05-07T08:01:31Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42762/

SparkQA · 2021-05-07T08:01:32Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42762/

SparkQA · 2021-05-07T11:43:52Z

Test build #138240 has finished for PR 32448 at commit 3c0d3d0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2021-05-07T16:31:16Z

Thanks for working on this! It looks like a better approach. Let me take a closer look in the next few days.

HyukjinKwon · 2021-05-10T06:13:50Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala

      assert(unionDf.schema.toDDL ==
        "`id` INT,`a` STRUCT<`a`: INT, `b`: BIGINT, " +
-          "`nested`: STRUCT<`A`: INT, `a`: INT, `b`: BIGINT, `c`: STRING>>")
+          "`nested`: STRUCT<`a`: INT, `c`: STRING, `A`: INT, `b`: BIGINT>>")


Can we update migration guide (https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md)?

Is this an expected behavior change? And why do we prefer the new behavior?

Yeah, the nested fields don't necessarily have to be sorted. The behaviour should be the same as, or at least similar to, that of the outermost schema. Sorting is somewhat unexpected, I think.

You want a 3.1 -> 3.2 migration message saying fields are no longer sorted but instead kept in order with new fields added to the end?

Following on to that, how is it determined when things are backported to previous releases (3.1 in this case) versus saved for the next minor release? Is that up to submitters or do maintainers make that call? There's a little bit of a "behavior change" here, though it's mostly a bug fix. Similar with #32338, where it's a bug fix that could be useful in a 3.1 patch release.

HyukjinKwon · 2021-05-10T06:17:53Z

Can you also update https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2078-L2082

sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala

viirya

Overall looks good, with a few comments.

This approach looks more straightforward for adding missing fields. Previously I hadn't thought about this approach and took a more complicated one.

Let me further look at the test portion.

Kimahriman · 2021-05-10T19:42:02Z

Sounds good, wanted to make sure things looked sane before cleaning things up. I'll try and get those comments finished up in the next day or so.

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala

Kimahriman · 2021-05-11T01:57:53Z

Should I remove the findMissingFields function that was added for the original method?

viirya · 2021-05-11T01:59:15Z

Should I remove the findMissingFields function that was added for the original method?

Yea, please remove it as it is useless now.

SparkQA · 2021-05-13T02:16:42Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43001/

SparkQA · 2021-05-13T02:16:43Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43001/

SparkQA · 2021-05-13T05:56:30Z

Test build #138481 has finished for PR 32448 at commit 93b47d3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Kimahriman · 2021-05-13T11:34:04Z

The GitHub test failure looks like it is due to #32533

cloud-fan · 2021-05-13T17:58:09Z

#32533 is merged, can you rebase to get the fix?

docs/sql-migration-guide.md

SparkQA · 2021-06-12T15:08:11Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44264/

SparkQA · 2021-06-12T18:48:09Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44268/

SparkQA · 2021-06-12T20:04:11Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44268/

SparkQA · 2021-06-12T22:29:44Z

Test build #139743 has finished for PR 32448 at commit 7bca531.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2021-06-13T00:05:40Z

retest this please

SparkQA · 2021-06-13T01:30:11Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44270/

SparkQA · 2021-06-13T02:03:45Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44270/

SparkQA · 2021-06-13T04:19:36Z

Test build #139745 has finished for PR 32448 at commit 7bca531.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2021-06-13T06:41:39Z

sql/catalyst/src/test/scala/org/apache/spark/sql/types/StructTypeSuite.scala

-    assert(StructType.findMissingFields(source4, schema, resolver)
-      .exists(_.sameType(missing4)))
+    assert(schema2.merge(schema1, resolver).sameType(StructType.fromDDL(
+      "a2 STRING, a3 DOUBLE, nested STRUCT<b2: STRING, b3: DOUBLE, b1: INT>, a1 INT"


When schema2 merges schema1, don't we keep its original case? E.g. "A2 STRING, a3 DOUBLE, nested STRUCT<B2: STRING, b3: DOUBLE, b1: INT>, a1 INT"

Ah yeah, it does. sameType is just doing a case-insensitive comparison, so it didn't matter that my manually written type was wrong. I'll update the check to use === instead, which seems to fix it.
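To make the point in this exchange concrete, here is a plain-Python sketch of why a case-insensitive comparison can hide a wrong expected value while strict equality catches it. It is a toy model rather than Spark code: `merge_with_resolver`, `same_type_ignoring_case`, and the two input schemas are hypothetical, chosen only so that the merged result has the shape quoted in the review comment.

```python
# Toy model: a struct is a list of (name, type) pairs; a nested struct is a nested list.

def case_insensitive(a, b):
    """Stand-in for a case-insensitive name resolver."""
    return a.lower() == b.lower()


def merge_with_resolver(left, right, resolver):
    """Left-dominant merge that matches names via `resolver` and keeps the left side's casing."""
    merged = []
    for left_name, left_type in left:
        match = next(((n, t) for n, t in right if resolver(left_name, n)), None)
        if match and isinstance(left_type, list) and isinstance(match[1], list):
            merged.append((left_name, merge_with_resolver(left_type, match[1], resolver)))
        else:
            merged.append((left_name, left_type))
    for right_name, right_type in right:
        if not any(resolver(left_name, right_name) for left_name, _ in left):
            merged.append((right_name, right_type))
    return merged


def same_type_ignoring_case(x, y):
    """Structural comparison that ignores the case of field names."""
    if isinstance(x, list) and isinstance(y, list):
        return len(x) == len(y) and all(
            xn.lower() == yn.lower() and same_type_ignoring_case(xt, yt)
            for (xn, xt), (yn, yt) in zip(x, y)
        )
    return x == y


# Hypothetical inputs for illustration only.
schema1 = [("a1", "int"), ("a2", "string"), ("nested", [("b1", "int"), ("b2", "string")])]
schema2 = [("A2", "string"), ("a3", "double"), ("nested", [("B2", "string"), ("b3", "double")])]

merged = merge_with_resolver(schema2, schema1, case_insensitive)

lowercase_expected = [("a2", "string"), ("a3", "double"),
                      ("nested", [("b2", "string"), ("b3", "double"), ("b1", "int")]),
                      ("a1", "int")]
cased_expected = [("A2", "string"), ("a3", "double"),
                  ("nested", [("B2", "string"), ("b3", "double"), ("b1", "int")]),
                  ("a1", "int")]

# The case-insensitive check accepts the wrongly cased expectation...
loose_ok = same_type_ignoring_case(merged, lowercase_expected)
# ...while strict equality only accepts the one that preserves the left side's casing.
strict_wrong = merged == lowercase_expected
strict_right = merged == cased_expected
```

In this model `loose_ok` is true for the all-lowercase expectation even though the merge kept `A2` and `B2`, which is the kind of mismatch a strict comparison surfaces.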

viirya

Looks okay, with one minor question.

viirya · 2021-06-13T06:53:53Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala

@@ -181,7 +100,8 @@ object ResolveUnion extends Rule[LogicalPlan] {
            // like that. We will sort columns in the struct expression to make sure two sides of
            // union have consistent schema.
            aliased += foundAttr
-            Alias(addFields(foundAttr, target), foundAttr.name)()
+            val targetType = target.merge(source, conf.resolver)


BTW, merge will throw an exception if the two schemas conflict. I recall that a union of conflicting schemas doesn't fail in ResolveUnion, but in CheckAnalysis. Could we follow the original behavior?

Hmmm good question, I'm not sure exactly how that would work without adding extra logic to StructType.merge to ignore conflicts. And now that you bring that up I'm starting to think using StructType.merge isn't the best method since it does care about DataType. I just noticed it doesn't handle similar types, so you get errors if you try to merge a float and a double, whereas the normal union just handles that. I might try to rework this again to not use the StructType.merge after all...

Kimahriman · 2021-06-14T02:45:10Z

So @viirya's comment made me realize that StructType.merge isn't quite the right solution since it immediately fails on exact type mismatch and can't handle similar types like float/double. I updated things locally to not use it anymore and just resolve things based on name and let the types figure themselves out later like other unions.

Separately, the StructType.merge not considering case sensitivity is still a bug that can crop up with unions (not even just unionByName with null filling).

So I can either just push the minor update I have here to address the comment, or I can close this and open two separate PRs to address each individually.
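The distinction drawn in this comment, between a merge that fails on any exact type mismatch and a name-based resolution that leaves type reconciliation to a later step, can be sketched in plain Python. This is a toy model under stated assumptions: `strict_merge_type`, `resolve_by_name`, `widen_later`, and the one-entry widening table are hypothetical and do not reflect Spark's actual type coercion rules.

```python
# Toy model of two strategies for a field that is float on one side and double on the other.

def strict_merge_type(left_type, right_type):
    """Fail immediately unless the two types are exactly equal."""
    if left_type != right_type:
        raise ValueError(f"cannot merge incompatible types {left_type} and {right_type}")
    return left_type


def resolve_by_name(left, right):
    """Pair fields purely by name and record both types without judging them."""
    right_by_name = dict(right)
    return [(name, typ, right_by_name.get(name)) for name, typ in left]


# Hypothetical widening table used by the later step; real rules are far richer.
WIDER = {frozenset(["float", "double"]): "double"}


def widen_later(left_type, right_type):
    """Later step: pick a common type, or None if the pair is not reconcilable."""
    if right_type is None or left_type == right_type:
        return left_type
    return WIDER.get(frozenset([left_type, right_type]))


left = [("x", "float"), ("y", "int")]
right = [("x", "double"), ("y", "int")]

try:
    strict_merge_type("float", "double")
    strict_failed = False
except ValueError:
    strict_failed = True

paired = resolve_by_name(left, right)
resolved = [(name, widen_later(lt, rt)) for name, lt, rt in paired]
```

Here the strict merge raises on `float` versus `double`, while the name-only pairing succeeds and the later step settles on `double`.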

SparkQA · 2021-06-15T19:09:32Z

Test build #139825 has finished for PR 32448 at commit 4a14101.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

…e test for case insensitivity

Kimahriman · 2021-06-15T19:54:52Z

I pushed the small fix that moves the compatibility analysis to after the union resolution, so ResolveUnion no longer uses StructType.merge. I updated the title and description as well. Let me know if you want me to create separate PRs for each now instead.

SparkQA · 2021-06-15T19:59:44Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44353/

SparkQA · 2021-06-15T20:35:44Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44353/

SparkQA · 2021-06-15T21:03:40Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44355/

SparkQA · 2021-06-15T21:38:38Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44355/

SparkQA · 2021-06-16T00:39:40Z

Test build #139827 has finished for PR 32448 at commit 5762ddc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Kimahriman · 2021-06-22T12:36:29Z

Any thoughts? I think it definitely makes sense to split this into two PRs now, since there are two separate bugs being fixed. I could either back out all of the StructType.merge changes in this PR so that it only has the unionByName fix (which doesn't require StructType.merge anymore, and I'm annoyed I didn't figure that out sooner), or close this PR and start fresh with two new PRs (and a new JIRA for the separate bug that the StructType.merge update fixes).

cloud-fan · 2021-06-22T14:02:00Z

I'm +1 to open 2 fresh PRs, thanks!

Kimahriman · 2021-06-23T11:18:47Z

Closing in favor of #33040

github-actions bot added the SQL label May 6, 2021

Kimahriman force-pushed the union-by-name-struct-merge branch from 50e45d1 to 3c0d3d0 Compare May 7, 2021 00:44
