Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-17897] [SQL] [BACKPORT-2.0] Fixed IsNotNull Constraint Inference Rule #16894

Closed
wants to merge 1 commit into from

Conversation

gatorsmile
Copy link
Member

What changes were proposed in this pull request?

This PR is to backport #16067 to Spark 2.0


The constraints of an operator is the expressions that evaluate to true for all the rows produced. That means, the expression result should be neither false nor unknown (NULL). Thus, we can conclude that IsNotNull on all the constraints, which are generated by its own predicates or propagated from the children. The constraint can be a complex expression. For better usage of these constraints, we try to push down IsNotNull to the lowest-level expressions (i.e., Attribute). IsNotNull can be pushed through an expression when it is null intolerant. (When the input is NULL, the null-intolerant expression always evaluates to NULL.)

Below is the existing code we have for IsNotNull pushdown.

  private def scanNullIntolerantExpr(expr: Expression): Seq[Attribute] = expr match {
    case a: Attribute => Seq(a)
    case _: NullIntolerant | IsNotNull(_: NullIntolerant) =>
      expr.children.flatMap(scanNullIntolerantExpr)
    case _ => Seq.empty[Attribute]
  }

IsNotNull itself is not null-intolerant. It converts null to false. If the expression does not include any Not-like expression, it works; otherwise, it could generate a wrong result. This PR is to fix the above function by removing the IsNotNull from the inference. After the fix, when a constraint has a IsNotNull expression, we infer new attribute-specific IsNotNull constraints if and only if IsNotNull appears in the root.

Without the fix, the following test case will return empty.

val data = Seq[java.lang.Integer](1, null).toDF("key")
data.filter("not key is not null").show()

Before the fix, the optimized plan is like

== Optimized Logical Plan ==
Project [value#1 AS key#3]
+- Filter (isnotnull(value#1) && NOT isnotnull(value#1))
   +- LocalRelation [value#1]

After the fix, the optimized plan is like

== Optimized Logical Plan ==
Project [value#1 AS key#3]
+- Filter NOT isnotnull(value#1)
   +- LocalRelation [value#1]

How was this patch tested?

Added a test

@SparkQA
Copy link

SparkQA commented Feb 11, 2017

Test build #72734 has started for PR 16894 at commit 11d2684.

@hvanhovell
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Feb 11, 2017

Test build #72735 has finished for PR 16894 at commit 11d2684.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Feb 12, 2017

Test build #72743 has finished for PR 16894 at commit 11d2684.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Feb 12, 2017
… Rule

### What changes were proposed in this pull request?

This PR is to backport #16067 to Spark 2.0

----

The `constraints` of an operator is the expressions that evaluate to `true` for all the rows produced. That means, the expression result should be neither `false` nor `unknown` (NULL). Thus, we can conclude that `IsNotNull` on all the constraints, which are generated by its own predicates or propagated from the children. The constraint can be a complex expression. For better usage of these constraints, we try to push down `IsNotNull` to the lowest-level expressions (i.e., `Attribute`). `IsNotNull` can be pushed through an expression when it is null intolerant. (When the input is NULL, the null-intolerant expression always evaluates to NULL.)

Below is the existing code we have for `IsNotNull` pushdown.
```Scala
  private def scanNullIntolerantExpr(expr: Expression): Seq[Attribute] = expr match {
    case a: Attribute => Seq(a)
    case _: NullIntolerant | IsNotNull(_: NullIntolerant) =>
      expr.children.flatMap(scanNullIntolerantExpr)
    case _ => Seq.empty[Attribute]
  }
```

**`IsNotNull` itself is not null-intolerant.** It converts `null` to `false`. If the expression does not include any `Not`-like expression, it works; otherwise, it could generate a wrong result. This PR is to fix the above function by removing the `IsNotNull` from the inference. After the fix, when a constraint has a `IsNotNull` expression, we infer new attribute-specific `IsNotNull` constraints if and only if `IsNotNull` appears in the root.

Without the fix, the following test case will return empty.
```Scala
val data = Seq[java.lang.Integer](1, null).toDF("key")
data.filter("not key is not null").show()
```
Before the fix, the optimized plan is like
```
== Optimized Logical Plan ==
Project [value#1 AS key#3]
+- Filter (isnotnull(value#1) && NOT isnotnull(value#1))
   +- LocalRelation [value#1]
```

After the fix, the optimized plan is like
```
== Optimized Logical Plan ==
Project [value#1 AS key#3]
+- Filter NOT isnotnull(value#1)
   +- LocalRelation [value#1]
```

### How was this patch tested?
Added a test

Author: Xiao Li <[email protected]>

Closes #16894 from gatorsmile/isNotNull2.0.
@cloud-fan
Copy link
Contributor

thanks, merging to 2.0

@gatorsmile gatorsmile closed this Feb 12, 2017
@yhuai
Copy link
Contributor

yhuai commented Feb 12, 2017

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants