[SPARK-13473][SQL] Don't push predicate through project with nondeterministic field(s) #11348
Conversation
cc @mengxr
```scala
@@ -156,6 +156,17 @@ class FilterPushdownSuite extends PlanTest {
     comparePlans(optimized, originalQuery)
   }

+  test("nondeterministic: can't push down filter through project with nondeterministic field") {
+    val originalQuery = testRelation
+      .select(Rand(10).as('rand), 'a)
```
minor: indentation
LGTM

We should also revise nondeterminism handling in
Force-pushed from 0f3175a to 863c5ec
test this please

test this please
Test build #51890 has finished for PR 11348 at commit

Test build #51888 has finished for PR 11348 at commit
We can probably further simplify this. For example, we can rewrite:

```scala
// From:
sqlContext.range(3).select('id as 'a, 'id as 'b).filter(rand(42) > 0.5)
// To:
sqlContext.range(3).filter(rand(42) > 0.5).select('id as 'a, 'id as 'b)
```

This means that we can push down a filter predicate through a project if and only if all fields of the project are deterministic. That's why those two test cases are considered outdated and removed. cc @cloud-fan

(To be safe, I won't do the above update in this PR since it also targets 1.6 and 1.5.)
Test build #51957 has finished for PR 11348 at commit
```scala
@@ -804,7 +804,9 @@ object SimplifyFilters extends Rule[LogicalPlan] {
  */
 object PushPredicateThroughProject extends Rule[LogicalPlan] with PredicateHelper {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
-    case filter @ Filter(condition, project @ Project(fields, grandChild)) =>
+    case filter @ Filter(condition, project @ Project(fields, grandChild))
+        if fields.forall(_.deterministic) =>
```
how about we add some comments to explain why we can't push down filter through project with non-deterministic fields? e.g. number of input rows is also an implicit input for non-deterministic expressions, push down filter will break it.
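The point about the number of input rows being an implicit input can be demonstrated without Spark. Below is a small, self-contained Python sketch (illustrative only, not Spark internals): a seeded `rand`-like column draws one value per input row, so evaluating it after the filter consumes a different prefix of the random stream and attaches different values to the surviving rows.

```python
import random

rows = [1, 2, 3]

def rand_column(seed, n):
    """Simulates rand(seed) evaluated once per input row: the values drawn
    depend on how many rows reach the expression."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Original plan: project rand(42) over ALL rows, then filter on 'a'.
projected = list(zip(rand_column(42, len(rows)), rows))
original = [(r, a) for r, a in projected if a > 1]

# "Optimized" plan: filter first, then project rand(42) over the SURVIVORS.
survivors = [a for a in rows if a > 1]
pushed = list(zip(rand_column(42, len(survivors)), survivors))

# The surviving rows get different rand values in the two plans, so the
# rewrite is not semantics-preserving.
print(original[-1][0] != pushed[-1][0])
```

The same rows survive in both plans, but their `rand` values come from different positions in the stream, which is exactly the breakage the comment describes.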
LGTM except one comment

Thanks for the review, comment added.
…ministic field(s)

## What changes were proposed in this pull request?

Predicates shouldn't be pushed through project with nondeterministic field(s). See graphframes/graphframes#23 and SPARK-13473 for more details.

This PR targets master, branch-1.6, and branch-1.5.

## How was this patch tested?

A test case is added in `FilterPushdownSuite`. It constructs a query plan where a filter is over a project with a nondeterministic field. The optimized query plan shouldn't change in this case.

Author: Cheng Lian <[email protected]>

Closes #11348 from liancheng/spark-13473-no-ppd-through-nondeterministic-project-field.

(cherry picked from commit 3fa6491)

Signed-off-by: Wenchen Fan <[email protected]>
The last commit only adds comments, so it's safe to merge once the style check passes.
Test build #51969 has finished for PR 11348 at commit
## What changes were proposed in this pull request?

This is a follow-up of PR #11348. After PR #11348, a predicate is never pushed through a project as long as the project contains any non-deterministic fields. Thus, it's impossible that the candidate filter condition can reference any non-deterministic projected fields, and related logic can be safely cleaned up.

To be more specific, the following optimization is allowed:

```scala
// From:
df.select('a, 'b).filter('c > rand(42))
// To:
df.filter('c > rand(42)).select('a, 'b)
```

while this isn't:

```scala
// From:
df.select('a, rand('b) as 'rb, 'c).filter('c > 'rb)
// To:
df.filter('c > rand('b)).select('a, rand('b) as 'rb, 'c)
```

## How was this patch tested?

Existing test cases should do the work.

Author: Cheng Lian <[email protected]>

Closes #11864 from liancheng/spark-13473-cleanup.
Hi @liancheng, @mengxr, @cloud-fan, I'm trying to understand why pushing down filters with nondeterministic fields is considered a bug. How would the different nondeterministic results impact the query? For instance, other engines like Hive do push down filters in these cases. Could this change lead to performance regressions in our queries?
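A toy example may make the hazard concrete. In the sketch below (plain Python; `fresh_ids` is a made-up stand-in for an expression like `monotonically_increasing_id()`, and all names are illustrative), the filter condition references the projected nondeterministic column. Pushing the filter down forces the expression to be evaluated at two separate sites, so the emitted column values can violate the very condition that was supposed to select them.

```python
from itertools import count

rows = ['a', 'b', 'c', 'd', 'e']

# Stand-in for a nondeterministic expression: each evaluation site gets
# its own independent stream of values.
def fresh_ids():
    return count()

# Original plan: project the id column once, then filter on those SAME values.
ids = fresh_ids()
projected = [(row, next(ids)) for row in rows]
original = [(row, i) for row, i in projected if i > 2]
# original == [('d', 3), ('e', 4)] — every output id satisfies the filter.

# Pushed-down plan: the filter and the project each re-evaluate the
# expression, drawing from independent streams.
filter_ids, project_ids = fresh_ids(), fresh_ids()
survivors = [row for row in rows if next(filter_ids) > 2]
pushed = [(row, next(project_ids)) for row in survivors]
# pushed == [('d', 0), ('e', 1)] — the output ids now violate 'i > 2'.
```

So the concern is correctness rather than performance: the pushed-down plan can produce rows whose projected values contradict the filter, which is why Spark opts out of the rewrite here.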
## What changes were proposed in this pull request?

Predicates shouldn't be pushed through project with nondeterministic field(s). See graphframes/graphframes#23 and SPARK-13473 for more details.

This PR targets master, branch-1.6, and branch-1.5.

## How was this patch tested?

A test case is added in `FilterPushdownSuite`. It constructs a query plan where a filter is over a project with a nondeterministic field. The optimized query plan shouldn't change in this case.
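The guard this patch adds can be sketched as a standalone toy model (hedged: `Field`, `Project`, `Filter`, and `push_predicate_through_project` below are made-up stand-ins, not Catalyst classes): swap `Filter` and `Project` only when every projected field is deterministic, and otherwise leave the plan untouched.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Field:
    name: str
    deterministic: bool = True

@dataclass
class Project:
    fields: List[Field]
    child: object

@dataclass
class Filter:
    condition: str
    child: object

def push_predicate_through_project(plan):
    """Mirrors the patched rule: push the filter below the project only
    when every projected field is deterministic."""
    if (isinstance(plan, Filter)
            and isinstance(plan.child, Project)
            and all(f.deterministic for f in plan.child.fields)):
        project = plan.child
        return Project(project.fields,
                       Filter(plan.condition, project.child))
    return plan

# Deterministic project: the filter is pushed below it.
det = Filter("a > 1", Project([Field("a")], "relation"))
pushed_plan = push_predicate_through_project(det)

# Nondeterministic field (e.g. a rand() column): the plan is left unchanged.
nondet = Filter("a > 1",
                Project([Field("rand", deterministic=False), Field("a")],
                        "relation"))
unchanged = push_predicate_through_project(nondet)
```

The real Catalyst rule additionally rewrites the filter condition by substituting project aliases before pushing it down; that step is omitted here for brevity.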