[SPARK-47217][SQL] Fix deduplicated expression resolution #45552
Conversation
Force-pushed from 333e530 to c0bde7e
Thank you, @peter-toth.
cc @Hisoka-X and @cloud-fan
```diff
@@ -227,6 +227,11 @@ class Dataset[T] private[sql](
       dsIds.add(id)
       plan.setTagValue(Dataset.DATASET_ID_TAG, dsIds)
     }
+    // A plan might get its PLAN_ID_TAG via connect or belong to multiple dataframes so only assign
+    // an id to a plan if it doesn't have any
+    if (plan.getTagValue(LogicalPlan.PLAN_ID_TAG).isEmpty) {
```
IIUC, this PR aims to fix a vanilla (non-Connect) Spark SQL issue with `PLAN_ID_TAG`. However, `PLAN_ID_TAG` is currently dedicated to Spark Connect only.
Yes, that idea came up here: #45446 (comment)

The column reference in the classic Spark SQL DataFrame API is very broken and I really don't want to add more hacks here and there to fix certain cases. In Spark Connect, we've redesigned the column reference and it's much more reliable and reasonable. How about adding a config to let the classic column reference use the same implementation as Spark Connect's? Ideally, users should update their DataFrame queries to always use named columns, like in the SQL API. But if users really want to stick with the old style, they can turn on the config.
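As an aside, a minimal illustration of the "named column" style referred to above (aliases plus string-based column references instead of `Dataset` column handles); the data and alias names here are made up for the example:

```scala
import org.apache.spark.sql.functions.col

val df = Seq((1, 2)).toDF("a", "b")
// Instead of df("a")-style handles, alias each side of the self-join and refer to
// columns by qualified name, as one would in SQL:
val joined = df.as("l").join(df.as("r"), col("l.b") === col("r.b"))
joined.select(col("l.a")).show()
```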
I think unifying the resolution logic in Spark 4.0 is a good idea, and the above example works well via Connect. But I don't fully get the last sentence:

Is there any known downside of the new Connect-style resolution? Why would someone switch back to the old style? Also, what shall we do with Spark 3.5? It would be great to fix this regression without requiring users to change their code.
There is no known downside of the Connect-style resolution, but it sounds scary to completely replace an existing implementation (which has been there for many years) with a new one, especially since the old implementation is quite hacky and may work for some corner cases unintentionally. I'd prefer to add a config to enable Connect-style resolution, but not by default.
I had some time to play with Connect today and, as I said, it does work well with the query in the PR description, but it doesn't seem to support even the most basic self-joins when the join condition is ambiguous:
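(The snippet from this comment isn't preserved in this excerpt; the following is only a hypothetical example of the kind of ambiguous self-join condition meant here, not the original code.)

```scala
// Hypothetical example, not the original snippet: both join-condition operands
// come from the same Dataset handle, so the condition is ambiguous in a self-join.
val df = Seq((1, 2)).toDF("a", "b")
df.join(df, df("a") === df("a")).show()
```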
Am I missing something?
@cloud-fan, I wonder if the above test should work in Connect, or does Connect always require aliasing conflicting columns, as in SQL?
@cloud-fan, since Connect-style resolution doesn't seem to work in all cases, I wonder if we can fix non-Connect resolution with this PR?
Force-pushed from c0bde7e to a4db113
@zhengruifeng can you take a look? I think we should make this work. We can find …
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
There seems to be a regression from Spark 3.5 to 4.0, caused by https://issues.apache.org/jira/browse/SPARK-43838 / #41347, as the following code no longer succeeds:
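(The failing snippet itself isn't preserved in this excerpt; the code below is only a hypothetical sketch of the shape described further down — a `Dataset` whose plan occurs on both sides of a join, followed by a `.select()` that reuses a column handle taken from one side — not the PR's actual repro.)

```scala
// Hypothetical sketch of the regression shape, NOT the PR's actual repro.
val df = Seq((1, 2)).toDF("a", "b")    // plans roughly as Project [_1 AS a#7, _2 AS b#8]
val other = df.filter(df("b") === 2)   // a second occurrence of df's plan
// After DeduplicateRelations re-instances one occurrence, the df("a") handle (a#7)
// may no longer match the side it was taken from and fails to resolve:
other.join(df, other("b") === df("b")).select(df("a"))
```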
Please note that if we dig deeper, it turns out that if we omit the extra project from `df`, then the following is actually a bug that came in with the very first version of `DeduplicateRelations`, which deduplicated `MultiInstanceRelation`s only:

The root cause seems to be `DeduplicateRelations`, as it changes `df("a")` (`a#7`), coming from the right side, when it runs on the join:

and then, when the `.select()` API adds a `Project` node containing `df("a")` above the join, it can't be resolved.

This is because `DeduplicateRelations` always keeps the attributes of the first occurrence of a node (`Project [_1#2 AS a#7, _2#3 AS b#8]` in this case) and creates new instances for other occurrences. The rule doesn't (and can't) take into account whether a top-level attribute can actually come from a node or not.

If `spark.sql.analyzer.failAmbiguousSelfJoin` is enabled, then `DetectAmbiguousSelfJoin` catches the issue as:

If it is not enabled, then an "`a#7` can't be resolved" error is thrown.

To solve the above regression this PR:

- assigns `LogicalPlan.PLAN_ID_TAG`s to logical plans of `Dataset`s that don't have any id yet (the Connect planner already does this), and
- converts `AttributeReference`s to `UnresolvedAttribute`s in certain `Dataset` APIs if an attribute doesn't seem valid based on the output of the underlying logical plan; the `UnresolvedAttribute`s also get the necessary tags to be resolved later by the `ResolveReferences` rule (the `ColumnResolutionHelper.tryResolveDataFrameColumns()` logic), as sketched after this list.
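A minimal sketch of the conversion idea in the second bullet, not the PR's actual code: the helper below is hypothetical, while `UnresolvedAttribute`, `LogicalPlan.PLAN_ID_TAG`, and the tree-node tag accessors are existing Catalyst APIs.

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical helper, for illustration only: if the attribute is no longer part of
// the plan's output (e.g. because DeduplicateRelations re-instanced it), fall back to
// an UnresolvedAttribute carrying the plan id, so that ResolveReferences
// (ColumnResolutionHelper.tryResolveDataFrameColumns()) can resolve it later.
def toColumnExpression(attr: AttributeReference, plan: LogicalPlan): Expression = {
  if (plan.outputSet.contains(attr)) {
    attr // the reference is still valid, keep it as-is
  } else {
    val u = UnresolvedAttribute(attr.qualifier :+ attr.name)
    // Propagate the originating plan's id tag (assigned by the first part of the fix).
    plan.getTagValue(LogicalPlan.PLAN_ID_TAG)
      .foreach(id => u.setTagValue(LogicalPlan.PLAN_ID_TAG, id))
    u
  }
}
```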
Why are the changes needed?
To fix the regression.
Does this PR introduce any user-facing change?
Yes, it fixes the regression.
How was this patch tested?
New and existing UTs.
Was this patch authored or co-authored using generative AI tooling?
No.