
[SPARK-13473][SQL] Simplifies PushPredicateThroughProject #11864

Closed

Conversation

liancheng
Contributor

What changes were proposed in this pull request?

This is a follow-up of PR #11348.

After PR #11348, a predicate is never pushed through a project whose project list contains any non-deterministic expressions. Thus the candidate filter condition can never reference a non-deterministic projected field, and the related logic can be safely removed.

To be more specific, the following optimization is allowed:

```scala
// From:
df.select('a, 'b).filter('a > rand(42))
// To:
df.filter('a > rand(42)).select('a, 'b)
```

while this isn't:

```scala
// From:
df.select('a, rand('b) as 'rb, 'c).filter('c > 'rb)
// To:
df.filter('c > rand('b)).select('a, rand('b) as 'rb, 'c)
```

How was this patch tested?

Existing test cases should cover this change.
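For context, the rule this PR simplifies can be sketched as a toy Catalyst-style rewrite in plain Scala. This is a minimal sketch, not Spark's actual API: `Attr`, `Rand`, `Gt`, and `pushPredicateThroughProject` are illustrative names, and the guard mirrors the condition the PR relies on (all project-list expressions deterministic).

```scala
// Toy expression tree -- illustrative stand-ins for Catalyst expressions.
sealed trait Expr { def deterministic: Boolean }
case class Attr(name: String) extends Expr { val deterministic = true }
case class Rand(seed: Long) extends Expr { val deterministic = false }
case class Gt(left: Expr, right: Expr) extends Expr {
  def deterministic: Boolean = left.deterministic && right.deterministic
}
case class Alias(child: Expr, name: String) extends Expr {
  def deterministic: Boolean = child.deterministic
}

// Toy logical plan nodes.
sealed trait Plan
case class Relation(output: Seq[String]) extends Plan
case class Project(projectList: Seq[Expr], child: Plan) extends Plan
case class Filter(condition: Expr, child: Plan) extends Plan

// Replace references to aliased project outputs with the aliased expressions,
// so the pushed-down condition only mentions the child's columns.
def substituteAliases(cond: Expr, projectList: Seq[Expr]): Expr = {
  val aliases: Map[String, Expr] =
    projectList.collect { case Alias(c, n) => n -> c }.toMap
  def rewrite(e: Expr): Expr = e match {
    case Attr(n) if aliases.contains(n) => aliases(n)
    case Gt(l, r)                       => Gt(rewrite(l), rewrite(r))
    case other                          => other
  }
  rewrite(cond)
}

// The rule only fires when every projected expression is deterministic, so
// after substitution the condition can never contain a non-deterministic
// projected field -- which is why the old substitution-time checks were dead.
def pushPredicateThroughProject(plan: Plan): Plan = plan match {
  case Filter(cond, Project(projectList, child))
      if projectList.forall(_.deterministic) =>
    Project(projectList, Filter(substituteAliases(cond, projectList), child))
  case other => other
}
```

With these toy nodes, `filter('a > rand(42))` over a deterministic `select('a, 'b)` is pushed below the project, while a project containing `rand('b) as 'rb` leaves the plan untouched because the determinism guard fails.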

@liancheng
Contributor Author

cc @cloud-fan @yhuai

@yhuai
Contributor

yhuai commented Mar 21, 2016

It would be good to also have an example explaining the reason.

@SparkQA

SparkQA commented Mar 21, 2016

Test build #53684 has finished for PR 11864 at commit d5460fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Is the code cleanup based on the following condition?
https://github.com/apache/spark/pull/11864/files#diff-a636a87d8843eeccca90140be91d4fafR886

Does this just check whether all the elements in the projectList of the Project are deterministic? Part of the condition in the Filter could still be non-deterministic, though. Please correct me if my understanding is wrong. Thanks!

@cloud-fan
Contributor

LGTM. @gatorsmile, when we reach this branch, the condition won't contain any non-deterministic expressions; see https://github.com/apache/spark/pull/11864/files#diff-a636a87d8843eeccca90140be91d4fafR886

@liancheng
Contributor Author

@gatorsmile You're right. Note that the code removed in this PR was already dead, since the predicate can no longer reference any non-deterministic projected fields. On the other hand, it's fine to push down a predicate that contains non-deterministic expressions. For example:

df.select('a, 'b).filter('c > rand(42))

is equivalent to

df.filter('c > rand(42)).select('a, 'b)
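This point can be illustrated with a small row-level simulation in plain Scala (no Spark). It is a sketch under two stated assumptions: `rand(42)` is modeled as a seeded `scala.util.Random` drawn once per row in row order, and the predicate filters on a column ("a") that the projection keeps, so the example is well-formed.

```scala
import scala.util.Random

// Toy rows standing in for DataFrame rows; column "c" is dropped by the projection.
val rows = Seq(
  Map("a" -> 0.9, "b" -> 0.1, "c" -> 1.0),
  Map("a" -> 0.5, "b" -> 0.7, "c" -> 2.0),
  Map("a" -> 0.2, "b" -> 0.3, "c" -> 3.0)
)

// select('a, 'b): keep only the projected columns.
def project(r: Map[String, Double]): Map[String, Double] =
  Map("a" -> r("a"), "b" -> r("b"))

// filter('a > rand(42)) evaluated BELOW the project (predicate pushed down).
def filterThenProject(rs: Seq[Map[String, Double]]): Seq[Map[String, Double]] = {
  val rng = new Random(42)
  rs.filter(r => r("a") > rng.nextDouble()).map(project)
}

// filter('a > rand(42)) evaluated ABOVE the project (original plan).
def projectThenFilter(rs: Seq[Map[String, Double]]): Seq[Map[String, Double]] = {
  val rng = new Random(42)
  rs.map(project).filter(r => r("a") > rng.nextDouble())
}
```

Because the projection is deterministic, the non-deterministic predicate sees the same row sequence either way, so both orders keep exactly the same rows: `filterThenProject(rows) == projectThenFilter(rows)`.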

@liancheng
Contributor Author

@yhuai Updated PR description with examples.

@gatorsmile
Member

I see. Thank you! @liancheng @cloud-fan

LGTM

@gatorsmile
Member

Just a minor issue in the description:

df.select('a, 'b).filter('c > rand(42))

Actually, since 'c is not selected, we need to change the filter to filter('a > rand(42)) or filter('b > rand(42)).

@liancheng
Contributor Author

@gatorsmile Thanks for pointing this out! Fixed the PR description.

@liancheng
Contributor Author

Merging to master.

asfgit closed this in f2e855f on Mar 22, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
Author: Cheng Lian <[email protected]>

Closes apache#11864 from liancheng/spark-13473-cleanup.