-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-34720][SQL] MERGE ... UPDATE/INSERT * should do by-name resolution #32192
Conversation
Test build #752552040 for PR 32192 at commit |
UpdateAction( | ||
updateCondition.map(resolveExpressionByPlanChildren(_, m)), | ||
resolveAssignments(assignments = None, m, resolveValuesWithSourceOnly = false)) | ||
// For UPDATE *, the value must from source table. | ||
resolveAssignments(assignments, m, resolveValuesWithSourceOnly = true)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid resolving UpdateAction
again with resolveValuesWithSourceOnly = false
in the next analysis round, I added resolveMergeExprOrFail
to fail earlier if an attribute can't be resolved.
cc @aokolnychyi and @RussellSpitzer |
I think this is a great idea, I see folks hitting this extremely frequently. It does feel like it would be a rather large change to the current behavior though. |
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #137424 has finished for PR 32192 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #138256 has finished for PR 32192 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this is not ANSI, there exists some references which the users might expect. It would be great if you add a comparison section to compare your proposal with them in your PR description officially for a record. At least, I guess we can find the followings.
- Snowflake: https://docs.snowflake.com/en/sql-reference/sql/merge.html
- Oracle: https://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_9016.htm
- SQL Server: https://docs.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-ver15
- Databricks Delta: https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-merge-into.html
I've made it clear at the beginning of the PR description: INSERT/UPDATE * is an extension and I can't find it in other mainstream databases. If you open the docs you posted, none of them (except for Delta lake) supports INSERT/UPDATE *. Actually the original Spark PR to add MERGE SQL syntax was following https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-merge-into.html . It's a mistake that the INSERT/UPDATE * behavior is different and the current behavior is super confusing that we'd better fix it. |
Got it for the further explanation. |
thanks for the review, merging to master! |
…tion In Spark, we have an extension in the MERGE syntax: INSERT/UPDATE *. This is not from ANSI standard or any other mainstream databases, so we need to define the behaviors by our own. The behavior today is very weird: assume the source table has `n1` columns, target table has `n2` columns. We generate the assignments by taking the first `min(n1, n2)` columns from source & target tables and pairing them by ordinal. This PR proposes a more reasonable behavior: take all the columns from target table as keys, and find the corresponding columns from source table by name as values. Fix the MEREG INSERT/UPDATE * to be more user-friendly and easy to do schema evolution. Yes, but MERGE is only supported by very few data sources. new tests Closes apache#32192 from cloud-fan/merge. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
In Spark, we have an extension in the MERGE syntax: INSERT/UPDATE *. This is not from ANSI standard or any other mainstream databases, so we need to define the behaviors by our own.
The behavior today is very weird: assume the source table has
n1
columns, target table hasn2
columns. We generate the assignments by taking the firstmin(n1, n2)
columns from source & target tables and pairing them by ordinal.This PR proposes a more reasonable behavior: take all the columns from target table as keys, and find the corresponding columns from source table by name as values.
Why are the changes needed?
Fix the MEREG INSERT/UPDATE * to be more user-friendly and easy to do schema evolution.
Does this PR introduce any user-facing change?
Yes, but MERGE is only supported by very few data sources.
How was this patch tested?
new tests