[SPARK-12503][SPARK-12505] Limit pushdown in UNION ALL and OUTER JOIN #11121
Conversation
Test build #50940 has finished for PR 11121 at commit …
```scala
      case RightOuter => join.copy(right = maybePushLimit(exp, right))
      case LeftOuter => join.copy(left = maybePushLimit(exp, left))
      case FullOuter =>
        join.copy(left = maybePushLimit(exp, left), right = maybePushLimit(exp, right))
```
This is not right. Please check the original PR. @yhuai and I had a discussion about this issue.
Since a full outer join matches rows from both sides, we are unable to add the extra limit to both sides. As long as we ensure the completeness of one child, the generated results will still be correct, as with the left/right outer joins.
Thanks!
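To make this concrete, here is a minimal, self-contained Scala sketch (plain collections, not Catalyst; `fullOuter` is a toy helper introduced only for illustration) of how limiting both inputs of a full outer join manufactures rows the original query could never return:

```scala
// Toy full outer join over Seq[Int]: matched pairs plus null-padded leftovers.
def fullOuter(l: Seq[Int], r: Seq[Int]): Seq[(Option[Int], Option[Int])] = {
  val matched   = for (x <- l; y <- r if x == y) yield (Some(x), Some(y))
  val leftOnly  = l.filterNot(x => r.contains(x)).map(x => (Some(x), None))
  val rightOnly = r.filterNot(y => l.contains(y)).map(y => (None, Some(y)))
  matched ++ leftOnly ++ rightOnly
}

val left  = Seq(1, 2, 3)
val right = Seq(3, 2, 1)

// Correct: join the complete inputs, then take 2. Every key matches here,
// so no result row is null-padded.
val correct = fullOuter(left, right).take(2)

// Wrong: push LIMIT 2 into both sides first. Now left = Seq(1, 2) and
// right = Seq(3, 2), so key 1 loses its match on the right and key 3 loses
// its match on the left. The join emits (Some(1), None) and (None, Some(3)),
// null-padded rows that a limit over the unpushed join could never produce.
val wrong = fullOuter(left.take(2), right.take(2))
```

Keeping one side complete, as with left/right outer joins, avoids manufacturing such rows.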
I'll fix this up now and will add a brief comment summarizing this.
Great! Please also update the description of this PR. Thanks!
I have one concern about the rule as implemented in your PR:
If we have a full outer join that initially has neither of its children limited, and we then push a limit to the side with larger statistics, a second firing of the `LimitPushDown` rule would match one of the cases where only a single side is limited and would push a limit to the other side, leading to a wrong answer because we would have limited both sides.
Therefore, I think we should restrict this rule to fire only when neither side of the full outer join has a pre-existing limit.
Also, I wonder whether we should check whether `maxRows` is defined rather than checking whether the outer join's children are `Limit`s, since that frees us from having to reason about whether the limit could be pushed further. On the other hand, if we always leave the original `LocalLimit` in place, then I don't think we currently need to worry about the limit being pushed down to a point where the child is no longer a limit.
Yeah, you are right. If one side has a pre-existing limit, we just need to tighten the limit on that side. Of course, two adjacent limits can be combined.
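Putting the pieces of this thread together, the `FullOuter` case could plausibly end up shaped like this; a sketch only, in the style of the diff fragment above, not the PR's verbatim code. It assumes Catalyst's `statistics.sizeInBytes` estimate, and `maybePushLimit` is the PR's helper that wraps a child in `LocalLimit` only when its row bound is unknown or exceeds the limit:

```scala
case FullOuter =>
  (left.maxRows, right.maxRows) match {
    // Neither side is bounded yet: push to at most one side, choosing the
    // side that statistics estimate to be larger.
    case (None, None) =>
      if (left.statistics.sizeInBytes >= right.statistics.sizeInBytes) {
        join.copy(left = maybePushLimit(exp, left))
      } else {
        join.copy(right = maybePushLimit(exp, right))
      }
    // Exactly one side is already bounded: it is safe to tighten that side.
    case (Some(_), None) => join.copy(left = maybePushLimit(exp, left))
    case (None, Some(_)) => join.copy(right = maybePushLimit(exp, right))
    // Both sides bounded: do nothing, so a second firing of the rule cannot
    // limit the remaining side and produce the wrong answer described above.
    case _ => join
  }
```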
Yeah, `maxRows` was added for this purpose. The original idea is from @marmbrus.
…ds to be idempotent.
Alright, updated to address the …
```scala
    val optimized = Optimize.execute(originalQuery.analyze)
    val correctAnswer = Limit(1, x.join(LocalLimit(1, yBig), FullOuter)).analyze
    comparePlans(optimized, correctAnswer)
  }
```
I'll also add tests for the cases where both inputs are limited.
I've added more tests now.
Test build #50947 has finished for PR 11121 at commit …
Note: when we merge this, we should remove the triggering of the rule from the optimizer, and only add it back once we have whole-stage codegen for Limit.
Test build #51045 has finished for PR 11121 at commit …
Test build #51052 has finished for PR 11121 at commit …
I've updated this to disable the optimizer rule for now (it's still tested in the LimitPushdownSuite, though).
cc @cloud-fan for review
```scala
   * Any operator that a Limit can be pushed past should override this function (e.g., Union).
   * Any operator that can push through a Limit should override this function (e.g., Project).
   */
  def maxRows: Option[Expression] = None
```
Are we going to handle a non-literal `maxRows` in the future? If not, maybe defining it as `Option[Long]` is simpler and better?
I experimented with this but ran into problems because the argument of `LIMIT` can be an expression.
How about returning `None` if the argument of `Limit` is a non-literal expression?
+1
It feels to me that this is only useful when we know the value, not if it is some subquery.
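A sketch of that suggestion, shown as a method-body fragment in the style of the diffs above (assumed shape using Catalyst's `IntegerLiteral` extractor, not the PR's final code): a limit operator reports a row bound only when its argument is an integer literal.

```scala
// Inside a limit operator that holds `limitExpr: Expression`:
override def maxRows: Option[Long] = limitExpr match {
  case IntegerLiteral(n) => Some(n)  // constant limit: the bound is known
  case _                 => None     // non-literal (e.g., subquery): unknown
}
```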
How does this interact with the constant-folding rules? Will expressions be maximally constant-folded before this is invoked? Just trying to reason about whether there are any ordering issues here.
It's a `def`, not a `lazy val`, so I think it's fine with the constant-folding rules.
BTW, using `Option[Expression]` may be sub-optimal for something like `Union`, whose `maxRows` would be `Some(children.flatMap(_.maxRows).reduce { (a, b) => Add(a, b) })`: that can't be constant-folded (`maxRows` is a method) and will always be a non-literal expression, so we couldn't push a limit through it.
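To illustrate this point, here is a self-contained toy (hypothetical `Plan`/`Leaf`/`Union` classes, not Catalyst) showing how `Option[Long]` composes through a union as a plain number, with no constant folding needed before a pushdown rule can compare it to the limit:

```scala
sealed trait Plan { def maxRows: Option[Long] }

case class Leaf(rows: Long) extends Plan {
  def maxRows: Option[Long] = Some(rows)
}

case class Union(children: Seq[Plan]) extends Plan {
  // The union's bound is the sum of its children's bounds, or unknown if any
  // child's bound is unknown. No Add expression to fold: it's just a Long.
  def maxRows: Option[Long] =
    if (children.exists(_.maxRows.isEmpty)) None
    else Some(children.flatMap(_.maxRows).sum)
}

// A pushdown guard can compare numbers directly: push only when the child
// might produce more rows than the limit allows.
def shouldPush(limit: Long, child: Plan): Boolean =
  child.maxRows.forall(_ > limit)

val u = Union(Seq(Leaf(10), Leaf(5)))
assert(u.maxRows == Some(15L))
assert(shouldPush(7, u))  // 15 > 7: worth pushing a local limit of 7
```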
LGTM except one comment
I've updated this patch to change `maxRows` to return `Option[Long]`.
Jenkins, retest this please.
Test build #51227 has finished for PR 11121 at commit …
#11171 made a few changes to the conversion from logical plan to SQL. I believe this is the cause of the build failure.
Test build #51234 has finished for PR 11121 at commit …
I've updated this to fix compilation after that SQLBuilder change. One quick question: do we have to worry about the logical plan -> SQL conversion being applied to optimized plans? If so, things might get tricky because we now have separate GlobalLimit and LocalLimit logical plan nodes. Because of the …
IMO, if we need to convert optimized plans to SQL in the future, we will need to add a few rules to SQLBuilder to revert the changes made by some optimization rules; otherwise, the parser will be unable to parse the generated SQL. I already hit a couple of issues caused by Analyzer rules. Also cc @liancheng
We could consider adding a partition-local limit, or some hint, in the parser at some point. cc @hvanhovell
To confirm: is merging this patch blocked on anything, or can concerns related to converting optimized plans to SQL be addressed in a followup patch? Would you like me to add explicit test cases for plan -> SQL generation for a variety of queries involving limits?
I think this is good to go; we can defer the "converting optimized plans to SQL" work to a followup patch.
Yea - I don't think we ever turn an optimized plan into SQL right now.
@cloud-fan I agree. For SQL generation, we can currently focus only on resolved plans parsed from HiveQL. @gatorsmile I think after finishing that part, we may gain better knowledge about how to handle arbitrary resolved logical plans. What do you think?
I'm going to merge this into master.
@liancheng I agree. : ) |
This patch adds a new optimizer rule for performing limit pushdown. Limits will now be pushed down in two cases:

- If the child of a limit is a `UNION ALL` operator, then a partition-local limit operator will be pushed to each of the union operator's children.
- If the child of a limit is an `OUTER JOIN`, then a partition-local limit will be pushed to one side of the join. For `LEFT OUTER` and `RIGHT OUTER` joins, the limit will be pushed to the left and right side, respectively. For `FULL OUTER` joins, we will only push limits when at most one of the inputs is already limited: if one input is limited, we will push a smaller limit on top of it, and if neither input is limited, then we will limit the input which is estimated to be larger.

These optimizations were proposed previously by @gatorsmile in #10451 and #10454, but those earlier PRs were closed and deferred for later because, at the time, Spark's physical `Limit` operator would trigger a full shuffle to perform global limits, so there was a chance that pushdowns could actually harm performance by causing additional shuffles/stages. In #7334, we split the `Limit` operator into separate `LocalLimit` and `GlobalLimit` operators, so we can now push down only local limits (which don't require extra shuffles). This patch is based on both of @gatorsmile's patches, with changes and simplifications due to partition-local limiting.

When we push down the limit, we still keep the original limit in place, so we need a mechanism to ensure that the optimizer rule doesn't keep pattern-matching once the limit has been pushed down. To handle this, this patch adds a `maxRows` method to `SparkPlan` which returns the maximum number of rows that the plan can compute, then defines the pushdown rules to push limits to children only if the children's `maxRows` exceed the limit's `maxRows`. This idea is carried over from #10451; see that patch for additional discussion.
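For reference, the two rewrites in test-DSL style, as in the LimitPushdownSuite snippet earlier in this thread (`x` and `y` are hypothetical local relations; this illustrates the rule's output shape, not a verbatim test):

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.plans.LeftOuter
import org.apache.spark.sql.catalyst.plans.logical.{Limit, LocalLimit, LocalRelation}

val x = LocalRelation('a.int)
val y = LocalRelation('b.int)

// UNION ALL: Limit(1, x.union(y)) becomes a global limit over a union whose
// children each carry a partition-local limit.
val unionAfter = Limit(1, LocalLimit(1, x).union(LocalLimit(1, y)))

// LEFT OUTER JOIN: only the left (preserved) side gets the local limit; the
// right side must stay complete so that matches are not lost.
val leftOuterAfter = Limit(1, LocalLimit(1, x).join(y, LeftOuter))
```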