[SPARK-22042] [SQL] ReorderJoinPredicates can break when child's partitioning is not decided #19257

tejasapatil · 2017-09-17T01:15:29Z

What changes were proposed in this pull request?

See the JIRA description for the bug: https://issues.apache.org/jira/browse/SPARK-22042

The fix in this PR: in EnsureRequirements, apply ReorderJoinPredicates over the input tree before running its core logic. Since the tree is transformed bottom-up, we can be sure that the children are resolved before ReorderJoinPredicates runs.

In theory this covers all such cases while keeping the code simple. My one small grudge is cosmetic: this PR will look odd because, to my knowledge, we don't call rules from other rules. I could have moved all the logic for ReorderJoinPredicates into EnsureRequirements, but that would make it a bit crowded. I am happy to discuss if there are better options.
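To illustrate why bottom-up ordering matters here, the following is a minimal sketch on a toy plan tree. It is not the code from this PR: `Plan`, `Scan`, `Join`, `transformUp` and `reorderJoinKeys` are hypothetical stand-ins written for this example, partitioning is modelled as a plain list of column names, and a join is assumed to be partitioned by its left keys.

```scala
object BottomUpReorderSketch {
  // Toy stand-ins for physical plan nodes; these are NOT Spark classes.
  sealed trait Plan { def partitioning: Seq[String] }
  case class Scan(table: String, partitioning: Seq[String]) extends Plan
  case class Join(
      left: Plan,
      right: Plan,
      leftKeys: Seq[String],
      rightKeys: Seq[String]) extends Plan {
    // Simplifying assumption: a join's output is partitioned by its left keys.
    def partitioning: Seq[String] = leftKeys
  }

  // Rewrite the children first, then the node itself (bottom-up traversal).
  def transformUp(plan: Plan)(rule: Plan => Plan): Plan = plan match {
    case j: Join =>
      rule(j.copy(left = transformUp(j.left)(rule), right = transformUp(j.right)(rule)))
    case leaf => rule(leaf)
  }

  // If the left child's partitioning is a permutation of the join's left keys,
  // reorder both key lists to follow it, keeping each left/right pair together.
  def reorderJoinKeys(plan: Plan): Plan = plan match {
    case j @ Join(left, _, leftKeys, rightKeys)
        if leftKeys.length == rightKeys.length &&
          leftKeys.distinct.length == leftKeys.length &&
          left.partitioning.length == leftKeys.length &&
          left.partitioning.toSet == leftKeys.toSet =>
      val rightKeyFor = leftKeys.zip(rightKeys).toMap
      j.copy(leftKeys = left.partitioning, rightKeys = left.partitioning.map(rightKeyFor))
    case other => other
  }
}
```

Because `transformUp` applies the rule to a join only after its children have been rewritten, an outer join always sees the final key order (and hence the modelled partitioning) of an inner join beneath it.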

How was this patch tested?

Added a new test case

tejasapatil · 2017-09-17T01:15:38Z

Jenkins test this please

SparkQA · 2017-09-17T03:54:19Z

Test build #81847 has finished for PR 19257 at commit 6ff4ed0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tejasapatil · 2017-09-20T15:31:10Z

cc @cloud-fan @gatorsmile @sameeragarwal for review

cloud-fan · 2017-09-21T05:30:28Z

Maybe we need to rethink the planning phase for adding shuffles. How about we add a placeholder for the shuffle node and then replace the placeholder with the actual shuffle node in EnsureRequirements? Then we can make sure the plan tree is always resolved.

tejasapatil · 2017-09-21T06:39:34Z

By "placeholder shuffle nodes" do you mean dummy ones? We need to know the exact partitioning of the children, which dummy nodes won't give (maybe I didn't get what you meant). My fear is that EnsureRequirements might make some decisions about shuffles that would affect the results of the ReorderJoinPredicates rule.

I agree that we need to rethink the planning phase for adding shuffles.

cloud-fan · 2017-09-21T06:57:02Z

"We need to know the exact partitioning of the children which dummy nodes won't give"

We only add the dummy shuffle node when it's necessary, e.g.

         hash-join
          /      \
    child1   child2

Let's say hash-join needs children to be clustered by a, b, and child1 is already partitioned by a, and child2 has no partitioning. After adding the dummy nodes:

         hash-join
          /      \
         /     dummy-shuffle
        /            |
    child1       child2

Now we still keep the exact partitioning, i.e. the left child is partitioned by a and the right child is partitioned by a, b.
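The placeholder idea can be sketched on a toy model. This is only an illustration under stated assumptions, not Spark's planner: `Child`, `DummyShuffle`, `satisfies` and `addPlaceholder` are hypothetical names, partitioning is a list of column names, and a child is taken to satisfy a "clustered by" requirement when it is partitioned by a non-empty subset of the required keys.

```scala
object PlaceholderShuffleSketch {
  // Toy stand-ins for plan nodes; these are NOT Spark classes.
  sealed trait Plan { def partitionedBy: Seq[String] }
  case class Child(name: String, partitionedBy: Seq[String]) extends Plan

  // Hypothetical placeholder: it only records the partitioning that a real
  // shuffle, substituted later, would produce.
  case class DummyShuffle(keys: Seq[String], child: Plan) extends Plan {
    def partitionedBy: Seq[String] = keys
  }

  // Assumed rule: being partitioned by a non-empty subset of the required
  // keys is enough to be "clustered by" those keys.
  def satisfies(child: Plan, required: Seq[String]): Boolean =
    child.partitionedBy.nonEmpty && child.partitionedBy.forall(required.contains)

  // Insert a placeholder only where the requirement is not already met.
  def addPlaceholder(child: Plan, required: Seq[String]): Plan =
    if (satisfies(child, required)) child else DummyShuffle(required, child)
}
```

With the requirement `Seq("a", "b")`, a child partitioned by `a` is returned unchanged, while a child with no partitioning is wrapped in a `DummyShuffle` that reports partitioning by `a, b`, so every node in the tree still exposes a definite partitioning.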

dongjoon-hyun · 2017-11-10T07:46:06Z

Hi, All.
The master branch still has this problem. Can we proceed with this?

tejasapatil · 2017-11-10T17:11:48Z

@dongjoon-hyun: It will take me some time to get back to this. Having said that, it's not ideal to have master in a bad state. How about disabling the rule by default (using a config)?

tejasapatil · 2017-11-10T17:13:15Z

Or we could move forward with the current approach and defer the refactoring of how shuffles are added in the planning phase.

dongjoon-hyun · 2017-11-10T17:46:54Z

Thank you for your update, @tejasapatil .

@cloud-fan and @gatorsmile . Could you give us some direction?

felixcheung · 2017-11-17T02:36:41Z

How is this coming along? It would be good to fix this in 2.2.

cloud-fan · 2017-11-17T12:39:05Z

@felixcheung Don't worry, the bug only exists in the master branch, so it won't block the 2.2.1 release. I have corrected the JIRA ticket's affected version to 2.3. Also, I'm looking into this issue.

cloud-fan · 2017-11-17T15:38:46Z

sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala

+              |) c
+              |JOIN table2
+              |ON c.i = table2.i
+              |""".stripMargin).explain()


Use checkAnswer instead of explain in the test.

cloud-fan · 2017-11-17T15:40:31Z

After some more thought, I think the best choice is to do planning bottom-up. That requires a lot of refactoring, so I'm fine with merging this workaround first.

LGTM except one minor comment for the test.

dongjoon-hyun · 2017-11-17T17:55:05Z

Thank you for the decision, @cloud-fan . It's great to see the progress on this!

gatorsmile · 2017-11-22T07:20:18Z

sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala

@@ -31,6 +32,8 @@ import org.apache.spark.sql.internal.SQLConf
 * input partition ordering requirements are met.
 */
 case class EnsureRequirements(conf: SQLConf) extends Rule[SparkPlan] {
+  private val reorderJoinPredicates = new ReorderJoinPredicates


Change class ReorderJoinPredicates to object ReorderJoinPredicates ?

gatorsmile · 2017-11-22T07:21:05Z

sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala

@@ -265,6 +268,7 @@ case class EnsureRequirements(conf: SQLConf) extends Rule[SparkPlan] {
          if (childPartitioning.guarantees(partitioning)) child else operator
        case _ => operator
      }
-    case operator: SparkPlan => ensureDistributionAndOrdering(operator)
+    case operator: SparkPlan =>
+      ensureDistributionAndOrdering(reorderJoinPredicates.apply(operator))


Then, do something like

ensureDistributionAndOrdering(ReorderJoinPredicates(operator))

Could you add a comment to explain why we do it here? It is hard for newcomers to understand the assumptions we made here.
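For readers less familiar with Scala, the difference between the two call styles suggested above can be shown with a toy rule. This is a sketch only: `ToyPlan`, `ToyRule`, `SortKeysClass` and `SortKeysObject` are hypothetical names, and the "rule" merely sorts a list of key names.

```scala
object RuleInvocationSketch {
  // Toy stand-ins; these are NOT Spark's SparkPlan or Rule.
  case class ToyPlan(joinKeys: Seq[String])
  trait ToyRule { def apply(plan: ToyPlan): ToyPlan }

  // As a class, the rule must be instantiated before it can be applied.
  class SortKeysClass extends ToyRule {
    def apply(plan: ToyPlan): ToyPlan = plan.copy(joinKeys = plan.joinKeys.sorted)
  }

  // As an object, the rule is a singleton and can be called like a function.
  object SortKeysObject extends ToyRule {
    def apply(plan: ToyPlan): ToyPlan = plan.copy(joinKeys = plan.joinKeys.sorted)
  }

  val plan: ToyPlan = ToyPlan(Seq("b", "a"))
  val viaClass: ToyPlan = new SortKeysClass().apply(plan) // explicit instance
  val viaObject: ToyPlan = SortKeysObject(plan)           // sugar for SortKeysObject.apply(plan)
}
```

Both calls produce the same result; the object form just avoids holding an instance in a field of the enclosing rule.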

tejasapatil · 2017-11-27

Having a rule invoked in this fashion feels inconsistent with the rest of the codebase; to someone new to the codebase it will look odd. I removed the rule and instead moved these methods inside EnsureRequirements. Let me know how you feel about the changed version.

SparkQA · 2017-11-28T01:15:36Z

Test build #84232 has finished for PR 19257 at commit d9620ef.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ash211 · 2017-11-29T16:08:26Z

@cloud-fan @gatorsmile any more changes needed on this PR before merging? I don't see any un-addressed comments left.

dongjoon-hyun · 2017-12-12T20:15:21Z

Gentle ping~

gatorsmile · 2017-12-12T23:23:39Z

retest this please

SparkQA · 2017-12-13T02:05:40Z

Test build #84806 has finished for PR 19257 at commit d9620ef.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-12-13T07:27:10Z

sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala

+   * partitioning of the join nodes' children.
+   */
+  def reorderJoinPredicates(plan: SparkPlan): SparkPlan = {
+    def reorderJoinKeys(


We prefer not to use embedded (nested) functions.

gatorsmile · 2017-12-13T07:27:15Z

sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala

+   * introduced). This rule will change the ordering of the join keys to match with the
+   * partitioning of the join nodes' children.
+   */
+  def reorderJoinPredicates(plan: SparkPlan): SparkPlan = {


gatorsmile · 2017-12-13T07:27:21Z

sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala

+        rightPartitioning: Partitioning): (Seq[Expression], Seq[Expression]) = {
+
+      def reorder(expectedOrderOfKeys: Seq[Expression],
+                  currentOrderOfKeys: Seq[Expression]): (Seq[Expression], Seq[Expression]) = {


gatorsmile · 2017-12-13T07:28:28Z

sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala

+                |ON a.i = b.i
+                |JOIN table2 c
+                |ON a.i = c.i
+                |""".stripMargin))


Please follow the other test cases.

""" | xyz """

gatorsmile · 2017-12-13T07:29:50Z

LGTM except a few style comments. We can merge it and fix it in the follow-up PR. Thanks!

gatorsmile · 2017-12-13T07:30:17Z

Thanks! Merged to master.

tejasapatil · 2017-12-21T00:42:16Z

Created #20041 for addressing the follow-up comments by @gatorsmile

…ild's partitioning is not decided

## What changes were proposed in this pull request?

This is a followup PR of #19257 where gatorsmile had left couple comments wrt code style.

## How was this patch tested?

Doesn't change any functionality. Will depend on build to see if no checkstyle rules are violated.

Author: Tejas Patil

Closes #20041 from tejasapatil/followup_19257.

tejasapatil mentioned this pull request Nov 12, 2017

[DO NOT REVIEW][SPARK-22042] [SQL] Insert shuffle nodes in entire tree before applying ReorderJoinPredicates #19725

Closed

cloud-fan reviewed Nov 17, 2017

gatorsmile reviewed Nov 22, 2017

tejasapatil added 3 commits November 27, 2017 14:27

a3c1ec9 [SPARK-22042] [SQL] ReorderJoinPredicates can break when child's partitioning is not decided
d218fc3 review comments
d9620ef review comments

tejasapatil force-pushed the SPARK-22042_ReorderJoinPredicates branch from 6ff4ed0 to d9620ef Compare November 27, 2017 22:29

gatorsmile reviewed Dec 13, 2017

asfgit closed this in 682eb4f Dec 13, 2017

tejasapatil added a commit to tejasapatil/spark that referenced this pull request Dec 21, 2017

Followup of comments over PR apache#19257

8c37d46

tejasapatil mentioned this pull request Dec 21, 2017

[SPARK-22042] [FOLLOW-UP] [SQL] ReorderJoinPredicates can break when child's partitioning is not decided #20041

Closed

tejasapatil deleted the SPARK-22042_ReorderJoinPredicates branch December 21, 2017 00:42

maropu mentioned this pull request Jan 29, 2019

[SPARK-25401] [SQL] Reorder join predicates to match child outputOrdering #23267

Closed
