
[SPARK-13473][SQL] Simplifies PushPredicateThroughProject #11864

Closed

Conversation

liancheng
Contributor

What changes were proposed in this pull request?

This is a follow-up of PR #11348.

After PR #11348, a predicate is never pushed through a project whose project list contains any non-deterministic expressions. Thus the candidate filter condition can never reference a non-deterministic projected field, and the related logic can be safely removed.

To be more specific, the following optimization is allowed:

```scala
// From:
df.select('a, 'b).filter('a > rand(42))
// To:
df.filter('a > rand(42)).select('a, 'b)
```

while this isn't:

```scala
// From:
df.select('a, rand('b) as 'rb, 'c).filter('c > 'rb)
// To:
df.filter('c > rand('b)).select('a, rand('b) as 'rb, 'c)
```

How was this patch tested?

Existing test cases should cover this change.
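For context, the rule this PR simplifies can be sketched as a toy Catalyst-style rewrite in plain Scala. This is a minimal sketch, not Spark's actual API: `Attr`, `Rand`, `Gt`, and `pushPredicateThroughProject` are illustrative names, and the guard mirrors the condition the PR relies on (all project-list expressions deterministic).

```scala
// Toy expression tree -- illustrative stand-ins for Catalyst expressions.
sealed trait Expr { def deterministic: Boolean }
case class Attr(name: String) extends Expr { val deterministic = true }
case class Rand(seed: Long) extends Expr { val deterministic = false }
case class Gt(left: Expr, right: Expr) extends Expr {
  def deterministic: Boolean = left.deterministic && right.deterministic
}
case class Alias(child: Expr, name: String) extends Expr {
  def deterministic: Boolean = child.deterministic
}

// Toy logical plan nodes.
sealed trait Plan
case class Relation(output: Seq[String]) extends Plan
case class Project(projectList: Seq[Expr], child: Plan) extends Plan
case class Filter(condition: Expr, child: Plan) extends Plan

// Replace references to aliased project outputs with the aliased expressions,
// so the pushed-down condition only mentions the child's columns.
def substituteAliases(cond: Expr, projectList: Seq[Expr]): Expr = {
  val aliases: Map[String, Expr] =
    projectList.collect { case Alias(c, n) => n -> c }.toMap
  def rewrite(e: Expr): Expr = e match {
    case Attr(n) if aliases.contains(n) => aliases(n)
    case Gt(l, r)                       => Gt(rewrite(l), rewrite(r))
    case other                          => other
  }
  rewrite(cond)
}

// The rule only fires when every projected expression is deterministic, so
// after substitution the condition can never contain a non-deterministic
// projected field -- which is why the old substitution-time checks were dead.
def pushPredicateThroughProject(plan: Plan): Plan = plan match {
  case Filter(cond, Project(projectList, child))
      if projectList.forall(_.deterministic) =>
    Project(projectList, Filter(substituteAliases(cond, projectList), child))
  case other => other
}
```

With these toy nodes, `filter('a > rand(42))` over a deterministic `select('a, 'b)` is pushed below the project, while a project containing `rand('b) as 'rb` leaves the plan untouched because the determinism guard fails.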

@liancheng
Contributor Author

cc @cloud-fan @yhuai

@yhuai
Contributor

yhuai commented Mar 21, 2016

It would be good to also have an example explaining the reason.

@SparkQA

SparkQA commented Mar 21, 2016

Test build #53684 has finished for PR 11864 at commit d5460fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Is the code cleanup based on the following condition?
https://github.com/apache/spark/pull/11864/files#diff-a636a87d8843eeccca90140be91d4fafR886

Does this just check whether all the elements in the projectList of the Project are deterministic? Part of the condition in the Filter could still be non-deterministic, though. Please correct me if my understanding is wrong. Thanks!

@cloud-fan
Contributor

LGTM. @gatorsmile, when we reach this branch, the condition won't contain any non-deterministic expressions; see https://github.com/apache/spark/pull/11864/files#diff-a636a87d8843eeccca90140be91d4fafR886

@liancheng
Contributor Author

@gatorsmile You're right. Note that the code removed in this PR was already dead, since the predicate can no longer reference any non-deterministic projected fields. On the other hand, it's fine to push down a predicate that contains non-deterministic expressions. For example:

df.select('a, 'b).filter('c > rand(42))

is equivalent to

df.filter('c > rand(42)).select('a, 'b)
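This point can be illustrated with a small row-level simulation in plain Scala (no Spark). It is a sketch under two stated assumptions: `rand(42)` is modeled as a seeded `scala.util.Random` drawn once per row in row order, and the predicate filters on a column ("a") that the projection keeps, so the example is well-formed.

```scala
import scala.util.Random

// Toy rows standing in for DataFrame rows; column "c" is dropped by the projection.
val rows = Seq(
  Map("a" -> 0.9, "b" -> 0.1, "c" -> 1.0),
  Map("a" -> 0.5, "b" -> 0.7, "c" -> 2.0),
  Map("a" -> 0.2, "b" -> 0.3, "c" -> 3.0)
)

// select('a, 'b): keep only the projected columns.
def project(r: Map[String, Double]): Map[String, Double] =
  Map("a" -> r("a"), "b" -> r("b"))

// filter('a > rand(42)) evaluated BELOW the project (predicate pushed down).
def filterThenProject(rs: Seq[Map[String, Double]]): Seq[Map[String, Double]] = {
  val rng = new Random(42)
  rs.filter(r => r("a") > rng.nextDouble()).map(project)
}

// filter('a > rand(42)) evaluated ABOVE the project (original plan).
def projectThenFilter(rs: Seq[Map[String, Double]]): Seq[Map[String, Double]] = {
  val rng = new Random(42)
  rs.map(project).filter(r => r("a") > rng.nextDouble())
}
```

Because the projection is deterministic, the non-deterministic predicate sees the same row sequence either way, so both orders keep exactly the same rows: `filterThenProject(rows) == projectThenFilter(rows)`.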

@liancheng
Contributor Author

@yhuai Updated PR description with examples.

@gatorsmile
Member

I see. Thank you! @liancheng @cloud-fan

LGTM

@gatorsmile
Member

Just a minor issue in the description:

df.select('a, 'b).filter('c > rand(42))

Actually, since 'c is not selected, we need to change the filter to filter('a > rand(42)) or filter('b > rand(42)).

@liancheng
Contributor Author

@gatorsmile Thanks for pointing this out! Fixed the PR description.

@liancheng
Contributor Author

Merging to master.

asfgit closed this in f2e855f on Mar 22, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
Author: Cheng Lian <[email protected]>

Closes apache#11864 from liancheng/spark-13473-cleanup.