[SPARK-13473][SQL] Simplifies PushPredicateThroughProject #11864
Conversation
cc @cloud-fan @yhuai
It would be good to also have an example explaining the reason.
Test build #53684 has finished for PR 11864 at commit
Is the code cleanup based on the following condition? This just checks whether all the elements in the
LGTM. @gatorsmile, when we run into this branch, the condition won't contain any non-deterministic expressions; see https://github.com/apache/spark/pull/11864/files#diff-a636a87d8843eeccca90140be91d4fafR886
@gatorsmile You're right. Note that the code removed in this PR is already a dead path, since the predicate can no longer reference any non-deterministic fields. On the other hand, it's OK to push down a predicate containing non-deterministic expressions. For example, `df.select('a, 'b).filter('c > rand(42))` is equivalent to `df.filter('c > rand(42)).select('a, 'b)`.
@yhuai Updated the PR description with examples.
I see. Thank you! @liancheng @cloud-fan LGTM
Just a minor issue in the description: `df.select('a, 'b).filter('c > rand(42))`. Actually, since
@gatorsmile Thanks for pointing this out! Fixed the PR description.
Merging to master. |
## What changes were proposed in this pull request?

This is a follow-up of PR apache#11348. After that PR, a predicate is never pushed through a project as long as the project contains any non-deterministic fields. Thus, the candidate filter condition can never reference any non-deterministic projected fields, and the related logic can be safely cleaned up.

To be more specific, the following optimization is allowed:

```scala
// From: df.select('a, 'b).filter('c > rand(42))
// To:   df.filter('c > rand(42)).select('a, 'b)
```

while this isn't:

```scala
// From: df.select('a, rand('b) as 'rb, 'c).filter('c > 'rb)
// To:   df.filter('c > rand('b)).select('a, rand('b) as 'rb, 'c)
```

## How was this patch tested?

Existing test cases should do the work.

Author: Cheng Lian <[email protected]>

Closes apache#11864 from liancheng/spark-13473-cleanup.
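The pushdown rule discussed above can be illustrated with a toy rewrite. The sketch below is a hypothetical simplification, not actual Catalyst code (all class and method names here are illustrative): it pushes a filter below a project by substituting project-list aliases into the filter condition, and, matching the rule after this PR, it only fires when every projected expression is deterministic.

```scala
// Simplified, hypothetical sketch of predicate pushdown through a project.
// These are toy stand-ins for Catalyst's Expression and LogicalPlan trees.
sealed trait Expr { def deterministic: Boolean = true }
case class Attr(name: String) extends Expr
case class Alias(child: Expr, name: String) extends Expr {
  override def deterministic: Boolean = child.deterministic
}
case class Rand(seed: Long) extends Expr {
  override def deterministic: Boolean = false
}
case class GreaterThan(left: Expr, right: Expr) extends Expr {
  override def deterministic: Boolean = left.deterministic && right.deterministic
}

sealed trait Plan
case class Relation(name: String) extends Plan
case class Project(projectList: Seq[Expr], child: Plan) extends Plan
case class Filter(condition: Expr, child: Plan) extends Plan

// Replace attribute references in `cond` with the expressions they alias.
def substitute(cond: Expr, aliases: Map[String, Expr]): Expr = cond match {
  case Attr(n)           => aliases.getOrElse(n, Attr(n))
  case GreaterThan(l, r) => GreaterThan(substitute(l, aliases), substitute(r, aliases))
  case other             => other
}

def pushDown(plan: Plan): Plan = plan match {
  // After SPARK-13473 the non-deterministic branch is gone: the rule only
  // fires when every projected expression is deterministic, so the filter
  // condition cannot reference a non-deterministic projected field.
  case Filter(cond, Project(projectList, child)) if projectList.forall(_.deterministic) =>
    val aliases = projectList.collect { case Alias(e, n) => n -> e }.toMap
    Project(projectList, Filter(substitute(cond, aliases), child))
  case other => other
}
```

Note that the filter condition itself may still be non-deterministic (as in `'c > rand(42)`); only non-determinism in the project list blocks the rewrite, because substituting a non-deterministic alias would duplicate its evaluation.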