[SPARK-25401] [SQL] Reorder join predicates to match child outputOrdering #23267

davidvrba · 2018-12-09T21:30:09Z

What changes were proposed in this pull request?

For a SortMergeJoin, if the tables are bucketed by keys (k1, k2) but sorted by keys (k2, k1), EnsureRequirements adds an unnecessary SortExec. This PR reorders the join predicate keys once more, this time to align them with the child outputOrdering.
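
To make the scenario concrete, here is a minimal, self-contained sketch of the idea in plain Scala. It is not the code from this PR: keys are modeled as plain strings rather than Catalyst expressions, and the names `KeyReorderSketch` and `alignWithOrdering` are hypothetical, chosen only for illustration.

```scala
object KeyReorderSketch {
  // Hypothetical helper: permute the (left, right) join key pairs so that the
  // left keys follow the child's sort order. Returns None when the sort order
  // is not a permutation of the left join keys, in which case nothing changes.
  def alignWithOrdering(
      leftKeys: Seq[String],
      rightKeys: Seq[String],
      leftOrdering: Seq[String]): Option[(Seq[String], Seq[String])] = {
    val pairs = leftKeys.zip(rightKeys)
    val coversKeys =
      leftOrdering.length == leftKeys.length &&
        leftOrdering.distinct.length == leftOrdering.length &&
        leftOrdering.forall(col => leftKeys.contains(col))
    if (!coversKeys) None
    else {
      // For each column in the sort order, pick the matching key pair.
      val reordered = leftOrdering.map(col => pairs.find(_._1 == col).get)
      Some(reordered.unzip)
    }
  }
}
```

With join keys (k1, k2) on the left, (j1, j2) on the right, and a left child sorted by (k2, k1), the helper yields the pairs in the order (k2, j2), (k1, j1), so the join keys line up with the existing sort order. If the sort order covers only some of the keys, it returns None and the keys are left alone.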

How was this patch tested?

Adding a new test.

dongjoon-hyun · 2018-12-09T22:32:59Z

ok to test

SparkQA · 2018-12-10T02:11:39Z

Test build #99889 has finished for PR 23267 at commit 6022e77.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davidvrba · 2018-12-11T05:24:18Z

cc @gatorsmile @cloud-fan @dongjoon-hyun Could I ask for a review, please?

davidvrba · 2019-01-22T08:39:29Z

Could someone please take a look at this PR? @gatorsmile @cloud-fan @dongjoon-hyun @holdenk

mn-mikke · 2019-01-29T10:39:27Z

cc @maropu @mgaido91

mgaido91

I am not sure about this change. If we reorder the keys to match the sort order, that order no longer matches the one used for the partitioning. This may introduce problems.

I am leaving some comments on the code anyway, in case my general concern above is addressed.

mgaido91 · 2019-01-29T10:59:39Z

sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala

@@ -293,6 +314,21 @@ case class EnsureRequirements(conf: SQLConf) extends Rule[SparkPlan] {
    }
  }

+  private def reorderJoinPredicatesForOrdering(plan: SparkPlan): SparkPlan = {


I think we can avoid this and fold the transformation into the existing reorderJoinPredicates method, after the reorder for partitioning. I'd rather have something like a reorderJoinKeysForOrderings called there.

davidvrba · 2019-01-29

I am not sure this would work. The point here was to first reorder the join predicates for partitioning, then check the child outputPartitioning (which happens in the method ensureDistributionAndOrdering) to decide whether we need an Exchange, and only AFTER that reorder the join predicates again to satisfy the child outputOrdering and avoid the extra SortExec.

mgaido91 · 2019-01-29T11:04:58Z

sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala

+              rightOrders.length == leftKeys.length &&
+              leftKeys.forall { x =>
+                (rightOrders.map(_.asInstanceOf[Expression])).exists(_.semanticEquals(x))} =>
+            reorder(leftKeys, rightKeys, rightOrders.map(_.asInstanceOf[Expression]), leftKeys)


This should be:

reorder(leftKeys, rightKeys, rightOrders.map(_.asInstanceOf[Expression]), rightKeys)

Please also add a unit test that fails before this correction and passes after it.

davidvrba · 2019-01-29

Should this unit test cover the reorderJoinKeys function, or do you have something else in mind?

mgaido91 · 2019-01-29

I meant a test like the one you added. But please first prove that the current solution is fine (I doubt it is, see #23267 (comment)). Once we are sure that the current change is safe, you can go ahead and address these comments. Thanks.
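
For reference, the argument mismatch discussed in this thread can be illustrated with a simplified model of a four-argument reorder helper. This is a sketch under assumptions, not Spark's implementation: keys are plain strings, lookups use string equality instead of semanticEquals, and a missing key is reported as None. The names `ReorderArgsSketch`, `expectedOrder`, and `currentOrder` are hypothetical.

```scala
object ReorderArgsSketch {
  // Simplified model of a reorder helper: for each expected key, find its
  // position in currentOrder and pick the left and right key at that position.
  // Returns None if an expected key does not occur in currentOrder.
  def reorder(
      leftKeys: Seq[String],
      rightKeys: Seq[String],
      expectedOrder: Seq[String],
      currentOrder: Seq[String]): Option[(Seq[String], Seq[String])] = {
    val indexes = expectedOrder.map(k => currentOrder.indexOf(k))
    if (indexes.contains(-1)) None
    else Some((indexes.map(i => leftKeys(i)), indexes.map(i => rightKeys(i))))
  }
}
```

In this model, with left keys (l.k1, l.k2), right keys (r.k1, r.k2), and a right-side ordering of (r.k2, r.k1), passing the right keys as the current order yields (l.k2, l.k1) and (r.k2, r.k1). Passing the left keys instead finds none of the right-side ordering columns and returns None, which is the kind of case a regression test could pin down.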

davidvrba · 2019-01-29T12:30:00Z

@mgaido91 Thank you very much for your comments. We change the key order at the end of the method ensureDistributionAndOrdering. At that point the child outputPartitioning has already been checked, so it is OK to change the order of the join keys: the mismatch is not going to add an unnecessary Exchange.

mgaido91 · 2019-01-29T12:51:02Z

@davidvrba my point is: is it safe to do so? We are now potentially changing the plan twice: the first time we reorder the keys to comply with the partitioning, the second time to comply with the ordering.

So with the second change we are basically "undoing" part of the previous change, which consists of:

- changing the order of the keys so that they match the partitioning;
- adding or not adding the Exchange according to the modified plan.

Hence I am not sure the change you are introducing is fine in general. My understanding is that it is not safe in all conditions, in particular when both reorderings occur. If you can show that all possible cases are safe, then it is fine, but my feeling is that it is not.
I hope this more verbose comment explains more clearly what I meant. Thanks.
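
The concern about two successive reorderings can be sketched with a toy model. The assumptions here are deliberately strict and are not taken from Spark's code: a partitioning or an ordering "matches" only when its columns equal the join keys in the same order. All names in `TwoStepReorderSketch` are hypothetical.

```scala
object TwoStepReorderSketch {
  // Assumption of this model: the partitioning matches only if its columns
  // equal the join keys in the same order.
  def partitioningMatches(partitionCols: Seq[String], joinKeys: Seq[String]): Boolean =
    partitionCols == joinKeys

  // Assumption of this model: no extra sort is needed only if the child's
  // sort columns equal the join keys in the same order.
  def orderingMatches(sortCols: Seq[String], joinKeys: Seq[String]): Boolean =
    sortCols == joinKeys

  // Permute the keys to follow target when target is a permutation of them;
  // otherwise leave the keys unchanged.
  def reorderTo(keys: Seq[String], target: Seq[String]): Seq[String] =
    if (target.sorted == keys.sorted) target else keys
}
```

For a table bucketed by (k1, k2) and sorted by (k2, k1), the first reorder gives join keys (k1, k2), which match the partitioning but not the ordering. The second reorder gives (k2, k1), which match the ordering but no longer the partitioning. Whether that second mismatch is harmless once the Exchange decision has already been made is exactly the open question in this thread; the sketch only shows that, under these assumptions, no single key order satisfies both.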

maropu · 2019-01-29T13:13:57Z

sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala

@@ -276,7 +297,7 @@ case class EnsureRequirements(conf: SQLConf) extends Rule[SparkPlan] {
   * introduced). This rule will change the ordering of the join keys to match with the
   * partitioning of the join nodes' children.
   */
-  private def reorderJoinPredicates(plan: SparkPlan): SparkPlan = {


For historical reasons (#19257 (comment)), this method was added as a workaround, so I feel it is complicated to extend it for this case. IMO we need more general logic here that covers this case and others. cc: @cloud-fan

AmplabJenkins · 2019-09-16T18:17:15Z

Can one of the admins verify this patch?

HyukjinKwon · 2019-09-17T00:31:38Z

Closing this due to author's inactivity.

Commit 6022e77: spark-25401 reorder join predicates to match child outputOrdering

mgaido91 reviewed Jan 29, 2019

maropu reviewed Jan 29, 2019

dongjoon-hyun added the SQL label Jun 14, 2019

HyukjinKwon closed this Sep 17, 2019
