-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-22042] [SQL] ReorderJoinPredicates can break when child's partitioning is not decided #19257
[SPARK-22042] [SQL] ReorderJoinPredicates can break when child's partitioning is not decided #19257
Conversation
Jenkins test this please |
Test build #81847 has finished for PR 19257 at commit
|
cc @cloud-fan @gatorsmile @sameeragarwal for review |
Maybe we need to rethink about the planning phase for adding shuffles. How about we add a placeholder for shuffle node and then replace the placeholder with actual shuffle node in |
By "placeholder shuffle nodes" you mean dummy ones ? We need to know the exact partitioning of the children which dummy nodes won't give (maybe I didn't get what you meant). My fear is that I agree that we need to rethink about the planning phase for adding shuffles. |
We only add the dummy shuffle node when it's necessary, e.g.
Let's say
Now we still keep exact partitioning, i.e. left child is partitioned by a, right child is partitioned by a,b |
Hi, All. |
@dongjoon-hyun : It will take me time to get back to this. Having said that , its not ideal to have master is bad state. How about disabling the rule by default (using a config) ? |
Or we could move forward with the current approach and defer the refactoring around how shuffles are added in planning phase. |
Thank you for your update, @tejasapatil . @cloud-fan and @gatorsmile . Could you give us some direction? |
how is this coming? it will be good to fix this in 2.2? |
@felixcheung Don't worry, the bug only exists in the master branch, so it won't block the 2.2.1 release. I have corrected the JIRA ticket's affected version to 2.3 . Also I'm looking into this issue |
|) c | ||
|JOIN table2 | ||
|ON c.i = table2.i | ||
|""".stripMargin).explain() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use checkAnswer instead of explain in the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
After some more thoughts, I think the best choice is to do planning bottom up. That requires a lot of refactoring and I'm fine to merge this workaround first. LGTM except one minor comment for the test. |
Thank you for the decision, @cloud-fan . It's great to see the progress on this! |
@@ -31,6 +32,8 @@ import org.apache.spark.sql.internal.SQLConf | |||
* input partition ordering requirements are met. | |||
*/ | |||
case class EnsureRequirements(conf: SQLConf) extends Rule[SparkPlan] { | |||
private val reorderJoinPredicates = new ReorderJoinPredicates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change class ReorderJoinPredicates
to object ReorderJoinPredicates
?
@@ -265,6 +268,7 @@ case class EnsureRequirements(conf: SQLConf) extends Rule[SparkPlan] { | |||
if (childPartitioning.guarantees(partitioning)) child else operator | |||
case _ => operator | |||
} | |||
case operator: SparkPlan => ensureDistributionAndOrdering(operator) | |||
case operator: SparkPlan => | |||
ensureDistributionAndOrdering(reorderJoinPredicates.apply(operator)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then, do something like
ensureDistributionAndOrdering(ReorderJoinPredicates(operator))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment to explain why we do it here? It is hard for new comers to understand the assumptions we made here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like having a rule invoked in such fashion is in-consistent compared to rest of the codebase .... from point of view of someone new to codebase, it will look odd. I removed the rule and instead moved these methods inside EnsureRequirements
. Let me know how you feel about the changed version
6ff4ed0
to
d9620ef
Compare
Test build #84232 has finished for PR 19257 at commit
|
@cloud-fan @gatorsmile any more changes needed on this PR before merging? I don't see any un-addressed comments left. |
Gentle ping~ |
retest this please |
Test build #84806 has finished for PR 19257 at commit
|
* partitioning of the join nodes' children. | ||
*/ | ||
def reorderJoinPredicates(plan: SparkPlan): SparkPlan = { | ||
def reorderJoinKeys( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not prefer the embedded function.
* introduced). This rule will change the ordering of the join keys to match with the | ||
* partitioning of the join nodes' children. | ||
*/ | ||
def reorderJoinPredicates(plan: SparkPlan): SparkPlan = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private
rightPartitioning: Partitioning): (Seq[Expression], Seq[Expression]) = { | ||
|
||
def reorder(expectedOrderOfKeys: Seq[Expression], | ||
currentOrderOfKeys: Seq[Expression]): (Seq[Expression], Seq[Expression]) = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indents.
|ON a.i = b.i | ||
|JOIN table2 c | ||
|ON a.i = c.i | ||
|""".stripMargin)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please follow the other test cases.
"""
| xyz
"""
LGTM except a few style comments. We can merge it and fix it in the follow-up PR. Thanks! |
Thanks! Merged to master. |
Created #20041 for addressing the follow-up comments by @gatorsmile |
…ild's partitioning is not decided ## What changes were proposed in this pull request? This is a followup PR of #19257 where gatorsmile had left couple comments wrt code style. ## How was this patch tested? Doesn't change any functionality. Will depend on build to see if no checkstyle rules are violated. Author: Tejas Patil <[email protected]> Closes #20041 from tejasapatil/followup_19257.
What changes were proposed in this pull request?
See jira description for the bug : https://issues.apache.org/jira/browse/SPARK-22042
Fix done in this PR is: In
EnsureRequirements
, applyReorderJoinPredicates
over the input tree before doing its core logic. Since the tree is transformed bottom-up, we can assure that the children are resolved before doingReorderJoinPredicates
.Theoretically this will guarantee to cover all such cases while keeping the code simple. My small grudge is for cosmetic reasons. This PR will look weird given that we don't call rules from other rules (not to my knowledge). I could have moved all the logic for
ReorderJoinPredicates
intoEnsureRequirements
but that will make it a but crowded. I am happy to discuss if there are better options.How was this patch tested?
Added a new test case