[SPARK-23799][SQL] FilterEstimation.evaluateInSet produces division by zero in a case of empty table with analyzed statistics

>What changes were proposed in this pull request?

During evaluation of IN conditions, if the source data frame is backed by a plan that reads a Hive table whose columns have previously been analyzed, and the plan contains conditions on these columns that cannot be satisfied (which leaves us with an empty data frame), the FilterEstimation.evaluateInSet method throws NumberFormatException and ClassCastException.
To fix this bug, FilterEstimation.evaluateInSet first checks that the distinct count is not zero and that colStat.min and colStat.max are defined, and only then proceeds with the calculation. If at least one of these conditions is not satisfied, zero is returned.
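
For illustration, here is a minimal, self-contained sketch of that guard. The names SimpleColumnStat and estimateInSet are simplified stand-ins invented for this sketch, not Spark's API; the actual change lives in FilterEstimation.evaluateInSet and is shown in the diff below.

    // Simplified stand-in for Spark's ColumnStat: only the fields the guard needs.
    case class SimpleColumnStat(distinctCount: BigInt, min: Option[Any], max: Option[Any])

    def estimateInSet(colStat: SimpleColumnStat, hSet: Set[Any]): Option[Double] = {
      val ndv = colStat.distinctCount
      // An analyzed but empty table yields ndv == 0 and undefined min/max; without
      // this guard the division by ndv below (and, in the real implementation, the
      // casts of min/max) would fail.
      if (ndv.toDouble == 0 || colStat.min.isEmpty || colStat.max.isEmpty) {
        return Some(0.0)
      }
      // Normal path (greatly simplified here): treat the hSet values as the
      // surviving distinct values and derive a selectivity bounded by 1.0.
      Some(math.min(hSet.size.toDouble / ndv.toDouble, 1.0))
    }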

>How was this patch tested?

Two tests were implemented for this PR: one in FilterEstimationSuite, which exercises a plan whose statistics violate the conditions mentioned above, and another in StatisticsCollectionSuite, which exercises the whole analysis/optimization process for a query that leads to the problems described in the first section.
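
For reference, the end-to-end scenario those tests cover can be reproduced from a Spark shell roughly as follows. This is a condensed sketch based on the StatisticsCollectionSuite test added below; it assumes an existing SparkSession named spark and omits the bucketing, sorting, and self-join used by the test.

    // Enable cost-based optimization so filter estimation is exercised.
    spark.conf.set("spark.sql.cbo.enabled", "true")

    // Create and analyze a table; fld3 mirrors the test's lit("aaa") + 'id column.
    spark.range(1000L)
      .selectExpr("id", "id * 2 AS FLD1", "id * 12 AS FLD2", "'aaa' + id AS fld3")
      .write.mode("overwrite").saveAsTable("TBL")
    spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS")
    spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS FOR COLUMNS id, FLD1, FLD2, FLD3")

    // The IN predicate matches no rows, so the view's estimated statistics carry
    // zero distinct counts and undefined min/max for its columns.
    spark.sql("SELECT id, fld1, fld2, fld3 FROM TBL WHERE fld3 IN (-123.23, 321.23)")
      .createTempView("TBL2")

    // Before the fix, estimating this IN filter over those degenerate statistics
    // failed with NumberFormatException / ClassCastException during planning.
    spark.sql("SELECT * FROM TBL2 WHERE fld3 IN ('qqq', 'qwe')").queryExecution.executedPlan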

Author: Mykhailo Shtelma <[email protected]>
Author: smikesh <[email protected]>

Closes apache#21052 from mshtelma/filter_estimation_evaluateInSet_Bugs.
Mykhailo Shtelma authored and gatorsmile committed Apr 22, 2018
1 parent 7bc853d commit c48085a
Showing 3 changed files with 43 additions and 0 deletions.
@@ -392,6 +392,10 @@ case class FilterEstimation(plan: Filter) extends Logging {
    val dataType = attr.dataType
    var newNdv = ndv

    if (ndv.toDouble == 0 || colStat.min.isEmpty || colStat.max.isEmpty) {
      return Some(0.0)
    }

    // use [min, max] to filter the original hSet
    dataType match {
      case _: NumericType | BooleanType | DateType | TimestampType =>
@@ -357,6 +357,17 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
      expectedRowCount = 3)
  }

  test("evaluateInSet with all zeros") {
    validateEstimatedStats(
      Filter(InSet(attrString, Set(3, 4, 5)),
        StatsTestPlan(Seq(attrString), 0,
          AttributeMap(Seq(attrString ->
            ColumnStat(distinctCount = Some(0), min = None, max = None,
              nullCount = Some(0), avgLen = Some(0), maxLen = Some(0)))))),
      Seq(attrString -> ColumnStat(distinctCount = Some(0))),
      expectedRowCount = 0)
  }

  test("cint NOT IN (3, 4, 5)") {
    validateEstimatedStats(
      Filter(Not(InSet(attrInt, Set(3, 4, 5))), childStatsTestPlan(Seq(attrInt), 10L)),
@@ -382,4 +382,32 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
      }
    }
  }

  test("Simple queries must be working, if CBO is turned on") {
    withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
      withTable("TBL1", "TBL") {
        import org.apache.spark.sql.functions._
        val df = spark.range(1000L).select('id,
          'id * 2 as "FLD1",
          'id * 12 as "FLD2",
          lit("aaa") + 'id as "fld3")
        df.write
          .mode(SaveMode.Overwrite)
          .bucketBy(10, "id", "FLD1", "FLD2")
          .sortBy("id", "FLD1", "FLD2")
          .saveAsTable("TBL")
        sql("ANALYZE TABLE TBL COMPUTE STATISTICS ")
        sql("ANALYZE TABLE TBL COMPUTE STATISTICS FOR COLUMNS ID, FLD1, FLD2, FLD3")
        val df2 = spark.sql(
          """
             |SELECT t1.id, t1.fld1, t1.fld2, t1.fld3
             |FROM tbl t1
             |JOIN tbl t2 on t1.id=t2.id
             |WHERE t1.fld3 IN (-123.23,321.23)
          """.stripMargin)
        df2.createTempView("TBL2")
        sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe') ").queryExecution.executedPlan
      }
    }
  }
}
