Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4937][SQL] Adding optimization to simplify the And, Or condition in spark sql #3778

Closed
wants to merge 9 commits into from

Conversation

scwf
Copy link
Contributor

@scwf scwf commented Dec 23, 2014

Adding optimization to simplify the And/Or condition in spark sql.

There are two kinds of Optimization
1 Numeric condition optimization, such as:
a < 3 && a > 5 ---- False
a < 1 || a > 0 ---- True
a > 3 && a > 5 => a > 5
(a < 2 || b > 5) && a < 2 => a < 2

2 optimizing the some query from a cartesian product into equi-join, such as this sql (one of hive-testbench):

select
sum(l_extendedprice* (1 - l_discount)) as revenue
from
lineitem,
part
where
(
p_partkey = l_partkey
and p_brand = 'Brand#32'
and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
and l_quantity >= 7 and l_quantity <= 7 + 10
and p_size between 1 and 5
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#35'
and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
and l_quantity >= 15 and l_quantity <= 15 + 10
and p_size between 1 and 10
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#24'
and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
and l_quantity >= 26 and l_quantity <= 26 + 10
and p_size between 1 and 15
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)

It has a repeated expression in Or, so we can optimize it by (a && b) || (a && c) = a && (b || c)
Before optimization, this sql hang in my locally test, and the physical plan is:
image

After optimization, this sql run successfully in 20+ seconds, and its physical plan is:
image

This PR focus on the second optimization and some simple ones of the first. For complex Numeric condition optimization, I will make a follow up PR.

@SparkQA
Copy link

SparkQA commented Dec 23, 2014

Test build #24738 has started for PR 3778 at commit 2f4a35d.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 23, 2014

Test build #24738 has finished for PR 3778 at commit 2f4a35d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class CombinePredicate extends BinaryPredicate
    • case class And(left: Expression, right: Expression) extends CombinePredicate
    • case class Or(left: Expression, right: Expression) extends CombinePredicate
    • case class SplitFragment(unit: Expression, remain: Option[Expression])
    • implicit class CombinePredicateExtension(source: CombinePredicate)
    • implicit class ExpressionCookies(expression: Expression)

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24738/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Dec 23, 2014

Test build #24739 has started for PR 3778 at commit 6ec1aa1.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 23, 2014

Test build #24739 has finished for PR 3778 at commit 6ec1aa1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class CombinePredicate extends BinaryPredicate
    • case class And(left: Expression, right: Expression) extends CombinePredicate
    • case class Or(left: Expression, right: Expression) extends CombinePredicate
    • case class SplitFragment(unit: Expression, remain: Option[Expression])
    • implicit class CombinePredicateExtension(source: CombinePredicate)
    • implicit class ExpressionCookies(expression: Expression)

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24739/
Test FAILed.

@scwf scwf changed the title [SPARK-4938][SQL] Adding optimization to simplify the filter condition [SPARK-4937][SQL] Adding optimization to simplify the filter condition Dec 23, 2014
@liancheng
Copy link
Contributor

Hey @scwf, I'm kinda lost in all the predicates provided in the PR description... Would you mind to provide several simpler cases to illustrate what kind of optimizations this PR enables?

I guess essentially the second case is this:

  1. (a && b) || (a && c) is turned into a && (b || c),
  2. a happens to be a join key equality comparison, thus forms an equi-join
  3. the cartesian product is replaced by a shuffle join

Is it correct?

@scwf
Copy link
Contributor Author

scwf commented Dec 24, 2014

Yes, correct, later i will give a detail description for this:)

@scwf
Copy link
Contributor Author

scwf commented Dec 24, 2014

change to WIP, todos:
1 fix test case
2 optimize and simplify the code

@scwf scwf changed the title [SPARK-4937][SQL] Adding optimization to simplify the filter condition [WIP][SPARK-4937][SQL] Adding optimization to simplify the filter condition Dec 24, 2014
@liancheng
Copy link
Contributor

@scwf These optimizations are useful, particularly the one that eliminates common predicates. Thanks for bringing them up! However, the implementation in this PR is really over complicated... I opened #3784 for a simpler version.

For the numeric comparisons, I think we can first cast all numeric comparisons to Double type and derive a Double range from each predicate and then eliminate unnecessary comparisons by comparing ranges. Double is preferred because Double.PositiveInfinity and Double.NegativeInfinity can be handy when dealing with predicates like a < 10 and a > 100.

@SparkQA
Copy link

SparkQA commented Dec 24, 2014

Test build #24768 has started for PR 3778 at commit f1a487f.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 24, 2014

Test build #24773 has started for PR 3778 at commit 8c0316f.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 24, 2014

Test build #24768 has finished for PR 3778 at commit f1a487f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class CombinePredicate extends BinaryPredicate
    • case class And(left: Expression, right: Expression) extends CombinePredicate
    • case class Or(left: Expression, right: Expression) extends CombinePredicate
    • implicit class CombinePredicateExtension(source: CombinePredicate)
    • implicit class ExpressionCookies(expression: Expression)

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24768/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Dec 24, 2014

Test build #24773 has finished for PR 3778 at commit 8c0316f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class CombinePredicate extends BinaryPredicate
    • case class And(left: Expression, right: Expression) extends CombinePredicate
    • case class Or(left: Expression, right: Expression) extends CombinePredicate
    • implicit class CombinePredicateExtension(source: CombinePredicate)
    • implicit class ExpressionCookies(expression: Expression)

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24773/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Dec 25, 2014

Test build #24804 has started for PR 3778 at commit 8733027.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 25, 2014

Test build #24804 has finished for PR 3778 at commit 8733027.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class CombinePredicate extends BinaryPredicate
    • case class And(left: Expression, right: Expression) extends CombinePredicate
    • case class Or(left: Expression, right: Expression) extends CombinePredicate
    • implicit class CombinePredicateExtension(source: CombinePredicate)
    • implicit class ExpressionCookies(expression: Expression)

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24804/
Test PASSed.

@scwf scwf changed the title [WIP][SPARK-4937][SQL] Adding optimization to simplify the filter condition [WIP][SPARK-4937][SQL] Adding optimization to simplify the And, Or condition in spark sql Dec 25, 2014
val result = compare(vLeft, vRight)
result.filter(_ > 0).map(c => {
if (((left.isLess || left.isLessEquals)
&& (right.isLess || right.isLessEquals || right.isEquals))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually similar deeply nested if statements can be refactored into carefully organized pattern matches. Using pattern matching also frees you from helper methods like .isLess and isLessEquals, thus ExpressionCookies won't be necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get it, actually i have recognized this. but the case here really complex...:)
I am trying to refactory this

@scwf scwf changed the title [WIP][SPARK-4937][SQL] Adding optimization to simplify the And, Or condition in spark sql [SPARK-4937][SQL] Adding optimization to simplify the And, Or condition in spark sql Dec 30, 2014
@scwf
Copy link
Contributor Author

scwf commented Dec 30, 2014

Hi, @marmbrus this is a more complete version than #3784 and @liancheng told me he will close #3784 in favor of this PR, but now you have merged #3784. So can you revert that one or Maybe should i change this with newly master branch?

@marmbrus
Copy link
Contributor

Oh I see, sorry that was not clear from the discussion. I think it would be best to do this as an augmentation to what is already merged in.

@scwf
Copy link
Contributor Author

scwf commented Dec 30, 2014

Sure, updated!

@SparkQA
Copy link

SparkQA commented Dec 30, 2014

Test build #24917 has started for PR 3778 at commit 9570211.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 31, 2014

Test build #24917 has finished for PR 3778 at commit 9570211.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24917/
Test PASSed.

@scwf
Copy link
Contributor Author

scwf commented Jan 10, 2015

@marmbrus, any comments here? i think this is ok to go

val batches =
Batch("AnalysisNodes", Once,
EliminateAnalysisOperators) ::
Batch("Constant Folding", FixedPoint(50),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the indenting is off here.

@marmbrus
Copy link
Contributor

This looks good to me. @liancheng have you looked this over too?

@SparkQA
Copy link

SparkQA commented Jan 11, 2015

Test build #25368 has started for PR 3778 at commit 58bcbc2.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 11, 2015

Test build #25368 has finished for PR 3778 at commit 58bcbc2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25368/
Test PASSed.

@scwf
Copy link
Contributor Author

scwf commented Jan 13, 2015

hey @liancheng, any other comments here?

@liancheng
Copy link
Contributor

Sorry for the late reply, this LGTM, thanks!

@scwf
Copy link
Contributor Author

scwf commented Jan 14, 2015

Thanks, @marmbrus this should be ready to go.

@marmbrus
Copy link
Contributor

Thanks! Merged to master.

@asfgit asfgit closed this in ee1c1f3 Jan 16, 2015
@scwf scwf deleted the filter1 branch January 16, 2015 22:03
// a && a => a
case (l, r) if l fastEquals r => l
case (_, _) =>
val lhsSet = splitDisjunctivePredicates(left).toSet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this has been merged, but can you submit another PR to add comments explaining how this works? I'm afraid this block of code is beyond a normal engineer's capability to understand without spending hours staring at it. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, later i can do this

asfgit pushed a commit that referenced this pull request Jan 17, 2015
…nSimplification`

Follow up of #3778
/cc rxin

Author: scwf <[email protected]>

Closes #4086 from scwf/commentforspark-4937 and squashes the following commits:

aaf89f6 [scwf] code style issue
2d3406e [scwf] added comment for spark-4937
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants