[SPARK-13473][SQL] Don't push predicate through project with nondeterministic field(s) #11348
Conversation
cc @mengxr
```scala
@@ -156,6 +156,17 @@ class FilterPushdownSuite extends PlanTest {
     comparePlans(optimized, originalQuery)
   }

+  test("nondeterministic: can't push down filter through project with nondeterministic field") {
+    val originalQuery = testRelation
+      .select(Rand(10).as('rand), 'a)
```
minor: indentation
LGTM

We should also revise nondeterminism handling in
Force-pushed from 0f3175a to 863c5ec
test this please

test this please
Test build #51890 has finished for PR 11348 at commit

Test build #51888 has finished for PR 11348 at commit
We can probably further simplify this. For example, we can rewrite:

```scala
// From:
sqlContext.range(3).select('id as 'a, 'id as 'b).filter(rand(42) > 0.5)
// To:
sqlContext.range(3).filter(rand(42) > 0.5).select('id as 'a, 'id as 'b)
```

This means that we can push down a filter predicate through a project if and only if all fields of the project are deterministic. That's why those two test cases are considered outdated and removed. cc @cloud-fan

(To be safe, I won't do the above update in this PR since it also targets 1.6 and 1.5.)
Test build #51957 has finished for PR 11348 at commit
```scala
@@ -804,7 +804,9 @@ object SimplifyFilters extends Rule[LogicalPlan] {
  */
 object PushPredicateThroughProject extends Rule[LogicalPlan] with PredicateHelper {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
-    case filter @ Filter(condition, project @ Project(fields, grandChild)) =>
+    case filter @ Filter(condition, project @ Project(fields, grandChild))
+        if fields.forall(_.deterministic) =>
```
how about we add some comments to explain why we can't push down filter through project with non-deterministic fields? e.g. number of input rows is also an implicit input for non-deterministic expressions, push down filter will break it.
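The point about the number of input rows being an implicit input can be demonstrated without Spark. Below is a small, self-contained Python sketch (illustrative only, not Spark internals): a seeded `rand`-like column draws one value per input row, so evaluating it after the filter consumes a different prefix of the random stream and attaches different values to the surviving rows.

```python
import random

rows = [1, 2, 3]

def rand_column(seed, n):
    """Simulates rand(seed) evaluated once per input row: the values drawn
    depend on how many rows reach the expression."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Original plan: project rand(42) over ALL rows, then filter on 'a'.
projected = list(zip(rand_column(42, len(rows)), rows))
original = [(r, a) for r, a in projected if a > 1]

# "Optimized" plan: filter first, then project rand(42) over the SURVIVORS.
survivors = [a for a in rows if a > 1]
pushed = list(zip(rand_column(42, len(survivors)), survivors))

# The surviving rows get different rand values in the two plans, so the
# rewrite is not semantics-preserving.
print(original[-1][0] != pushed[-1][0])
```

The same rows survive in both plans, but their `rand` values come from different positions in the stream, which is exactly the breakage the comment describes.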
LGTM except one comment

Thanks for the review, comment added.
…ministic field(s)

## What changes were proposed in this pull request?

Predicates shouldn't be pushed through project with nondeterministic field(s). See graphframes/graphframes#23 and SPARK-13473 for more details.

This PR targets master, branch-1.6, and branch-1.5.

## How was this patch tested?

A test case is added in `FilterPushdownSuite`. It constructs a query plan where a filter is over a project with a nondeterministic field. The optimized query plan shouldn't change in this case.

Author: Cheng Lian <[email protected]>

Closes #11348 from liancheng/spark-13473-no-ppd-through-nondeterministic-project-field.

(cherry picked from commit 3fa6491)

Signed-off-by: Wenchen Fan <[email protected]>
The last commit only adds comments, so it's safe to merge once the style check passes.
Test build #51969 has finished for PR 11348 at commit
## What changes were proposed in this pull request?

This is a follow-up of PR #11348. After PR #11348, a predicate is never pushed through a project as long as the project contains any non-deterministic fields. Thus, it's impossible that the candidate filter condition can reference any non-deterministic projected fields, and related logic can be safely cleaned up.

To be more specific, the following optimization is allowed:

```scala
// From:
df.select('a, 'b).filter('c > rand(42))
// To:
df.filter('c > rand(42)).select('a, 'b)
```

while this isn't:

```scala
// From:
df.select('a, rand('b) as 'rb, 'c).filter('c > 'rb)
// To:
df.filter('c > rand('b)).select('a, rand('b) as 'rb, 'c)
```

## How was this patch tested?

Existing test cases should do the work.

Author: Cheng Lian <[email protected]>

Closes #11864 from liancheng/spark-13473-cleanup.
Hi @liancheng, @mengxr, @cloud-fan, I'm trying to understand why pushing down filters with nondeterministic fields is considered a bug. How would the different nondeterministic results impact the query? For instance, other engines like Hive do push down filters in these cases. Could this change lead to performance regressions in our queries?
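A toy example may make the hazard concrete. In the sketch below (plain Python; `fresh_ids` is a made-up stand-in for an expression like `monotonically_increasing_id()`, and all names are illustrative), the filter condition references the projected nondeterministic column. Pushing the filter down forces the expression to be evaluated at two separate sites, so the emitted column values can violate the very condition that was supposed to select them.

```python
from itertools import count

rows = ['a', 'b', 'c', 'd', 'e']

# Stand-in for a nondeterministic expression: each evaluation site gets
# its own independent stream of values.
def fresh_ids():
    return count()

# Original plan: project the id column once, then filter on those SAME values.
ids = fresh_ids()
projected = [(row, next(ids)) for row in rows]
original = [(row, i) for row, i in projected if i > 2]
# original == [('d', 3), ('e', 4)] — every output id satisfies the filter.

# Pushed-down plan: the filter and the project each re-evaluate the
# expression, drawing from independent streams.
filter_ids, project_ids = fresh_ids(), fresh_ids()
survivors = [row for row in rows if next(filter_ids) > 2]
pushed = [(row, next(project_ids)) for row in survivors]
# pushed == [('d', 0), ('e', 1)] — the output ids now violate 'i > 2'.
```

So the concern is correctness rather than performance: the pushed-down plan can produce rows whose projected values contradict the filter, which is why Spark opts out of the rewrite here.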
## What changes were proposed in this pull request?

Predicates shouldn't be pushed through project with nondeterministic field(s). See graphframes/graphframes#23 and SPARK-13473 for more details.

This PR targets master, branch-1.6, and branch-1.5.

## How was this patch tested?

A test case is added in `FilterPushdownSuite`. It constructs a query plan where a filter is over a project with a nondeterministic field. The optimized query plan shouldn't change in this case.
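The guard this patch adds can be sketched as a standalone toy model (hedged: `Field`, `Project`, `Filter`, and `push_predicate_through_project` below are made-up stand-ins, not Catalyst classes): swap `Filter` and `Project` only when every projected field is deterministic, and otherwise leave the plan untouched.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Field:
    name: str
    deterministic: bool = True

@dataclass
class Project:
    fields: List[Field]
    child: object

@dataclass
class Filter:
    condition: str
    child: object

def push_predicate_through_project(plan):
    """Mirrors the patched rule: push the filter below the project only
    when every projected field is deterministic."""
    if (isinstance(plan, Filter)
            and isinstance(plan.child, Project)
            and all(f.deterministic for f in plan.child.fields)):
        project = plan.child
        return Project(project.fields,
                       Filter(plan.condition, project.child))
    return plan

# Deterministic project: the filter is pushed below it.
det = Filter("a > 1", Project([Field("a")], "relation"))
pushed_plan = push_predicate_through_project(det)

# Nondeterministic field (e.g. a rand() column): the plan is left unchanged.
nondet = Filter("a > 1",
                Project([Field("rand", deterministic=False), Field("a")],
                        "relation"))
unchanged = push_predicate_through_project(nondet)
```

The real Catalyst rule additionally rewrites the filter condition by substituting project aliases before pushing it down; that step is omitted here for brevity.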