-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-15122][SQL] Fix TPC-DS 41 - Normalize predicates before pulling them out #12954
Conversation
Test build #57994 has finished for PR 12954 at commit
|
@@ -26,6 +26,7 @@ import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog} | |||
import org.apache.spark.sql.catalyst.encoders.OuterScopes | |||
import org.apache.spark.sql.catalyst.expressions._ | |||
import org.apache.spark.sql.catalyst.expressions.aggregate._ | |||
import org.apache.spark.sql.catalyst.optimizer.BooleanSimplification |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It weird to use optimizer rule in analyzer. We could have this workaround for now, could you create a JIRA for do a proper fix later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally agree on that. I could move the rule from optimizer
to analysis
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the rule in the latest commit.
I still can't run Q41 after this PR:
|
@hvanhovell never mind, I made a mistake in the query |
* correlated [[ScalarSubquery]] expressions are wrapped in a some Predicate expression, we rewrite | ||
* them into [[PredicateSubquery]] expressions. | ||
*/ | ||
object RewriteScalarSubqueriesInFilter extends Rule[LogicalPlan] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This rule is not necessary for this fix, right?
Also, we have a rule to turn outer join into semi join if possible, does that one not work in case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not nessecary for this fix. It is left over from the initial attempt. I do think it adds value, since it allows us to use queries which might return multiple values for one key.
We do have a rule that promotes and outer join into an inner join, but we do not have one that turns a outer/inner join into a semi join.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove it if you want to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With inner join, the predicate could be pushed down into subquery actually, but it was not with semi join. So it's not that obvious that which one is better.
After looking it for a long time, I'm not convinced it's correct ( or the best way to do it). Could you keep it out of this PR for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The advantage of Semi Join is that a subquery can actually return multiple results for one row without causing correctness problems. My initial approach was to relax the rules for scalar subquery (allow disjunctive predicates) and prevent possible duplicates by using left semis. This didn't work because I was also pulling non-correlated predicates through the aggregate (which makes its results invalid).
I am not sure which predicates you want to push through the left semi. Since all predicates that should be pushed down the left hand side are already in the predicate condition. But I might be missing something here.
I have remove this in my last commit. I do feel that this might be a small improvement over the current situation. Let's revisit this after Spark 2.0.
Test build #58031 has finished for PR 12954 at commit
|
* 3. Merge same expressions | ||
* 4. Removes `Not` operator. | ||
*/ | ||
object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving this one into analysis is even worse ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, what do you suggest then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import it into analyzer as temporary fix, we could think of a proper way later.
Test build #58047 has finished for PR 12954 at commit
|
LGTM, |
…ng them out ## What changes were proposed in this pull request? The official TPC-DS 41 query currently fails because it contains a scalar subquery with a disjunctive correlated predicate (the correlated predicates were nested in ORs). This makes the `Analyzer` pull out the entire predicate which is wrong and causes the following (correct) analysis exception: `The correlated scalar subquery can only contain equality predicates` This PR fixes this by first simplifing (or normalizing) the correlated predicates before pulling them out of the subquery. ## How was this patch tested? Manual testing on TPC-DS 41, and added a test to SubquerySuite. Author: Herman van Hovell <[email protected]> Closes #12954 from hvanhovell/SPARK-15122. (cherry picked from commit df89f1d) Signed-off-by: Davies Liu <[email protected]>
What changes were proposed in this pull request?
The official TPC-DS 41 query currently fails because it contains a scalar subquery with a disjunctive correlated predicate (the correlated predicates were nested in ORs). This makes the
Analyzer
pull out the entire predicate which is wrong and causes the following (correct) analysis exception:The correlated scalar subquery can only contain equality predicates
This PR fixes this by first simplifing (or normalizing) the correlated predicates before pulling them out of the subquery.
How was this patch tested?
Manual testing on TPC-DS 41, and added a test to SubquerySuite.