-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-7871][SQL]Improve the outputPartitioning for HashOuterJoin #6413
Conversation
Test build #33519 has finished for PR 6413 at commit
|
Test build #33589 has finished for PR 6413 at commit
|
.queryExecution.executedPlan | ||
val exchanges = planned.collect { case n: Exchange => n } | ||
|
||
assert(exchanges.size === 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is these changs doesn't effect to
testData
.join(testData2, testData("key") === testData2("a"), "outer")
.join(testData2, testData("a") === testData3("a"), "outer")
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This requires some further change I think, @yhuai should have some idea on this.
Test build #33593 has finished for PR 6413 at commit
|
* as a valid value if `nullKeysSensitive` == true. | ||
* | ||
* For examples: | ||
* JOIN KEYS: values contains null will be considered as invalid values, which means |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here values
means the original value of the table or the intermediate value of the join?
is the null
in original data of table also considered as invalid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be the input value. (Either the original data from table or the intermediate result(e.g. join outputs)).
Validity of the null in the original table, depends on the semantics, in Join
, it's should also be invalid, but it's valid for Group BY
.
It would an other optimization for repartition, contains null
in the join keys.
sorry, it will causes performance regression for case like
|
Test build #35457 has finished for PR 6413 at commit
|
retest this please |
Test build #35498 has finished for PR 6413 at commit
|
Test build #35961 has finished for PR 6413 at commit
|
Test build #35965 has finished for PR 6413 at commit
|
retest this please |
Test build #35975 has finished for PR 6413 at commit
|
@yhuai sorry it's a big change. :) can you review this for me? |
Sorry, I found another bug, will solve it soon. |
Test build #36368 has finished for PR 6413 at commit
|
Test build #36379 has finished for PR 6413 at commit
|
Test build #36392 has finished for PR 6413 at commit
|
Test build #36499 has finished for PR 6413 at commit
|
clusterKeys, | ||
sortKeys, | ||
globalOrdered, | ||
additionalNullClusterKeyGenerated) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may use the copy
method comes with all case classes:
this.copy(numPartitions = Some(num))
@JoshRosen Right, #7773 and the null-unsafe/safe parts of #7685 will address the issue. |
ok, I am closing this pr. |
https://issues.apache.org/jira/browse/SPARK-7871
Optimize the Full outer join followed by another join with the same equi-join key.
For example:
As we probably involve more factors to determine if data shuffle needed, I've also refactor the entire code for
EnsureRequirements
, by introducing theGap
of therequiredDistribution
and theoutputPartitioning
.This contains the code refactor for
exchange
in Spark SQL. It's a WIP PR and still need to do:requiredDistribution
andoutputPartitioning
A known issue will be solved in another PR once this PR merged. (https://issues.apache.org/jira/browse/SPARK-2205)