
[SPARK-23799] FilterEstimation.evaluateInSet produces division by zero in a case of empty table with analyzed statistics #20913

Closed

Changes from all commits (42 commits)
5c8fe0e
During evaluation of IN conditions, if the source table is empty, div…
Mar 26, 2018
67597fd
Added test case for the the following situation: During evaluation of…
Apr 3, 2018
989d4cc
[SPARK-23572][DOCS] Bring "security.md" up to date.
Mar 26, 2018
d49c6dd
[SPARK-23162][PYSPARK][ML] Add r2adj into Python API in LinearRegress…
kevinyu98 Mar 26, 2018
1dfa74b
[SPARK-23794][SQL] Make UUID as stateful expression
viirya Mar 27, 2018
10d6ec1
[SPARK-23096][SS] Migrate rate source to V2
jerryshao Mar 27, 2018
440549c
[SPARK-23699][PYTHON][SQL] Raise same type of error caught with Arrow…
BryanCutler Mar 28, 2018
c1b407c
[SPARK-23765][SQL] Supports custom line separator for json datasource
HyukjinKwon Mar 28, 2018
923ab78
Revert "[SPARK-23096][SS] Migrate rate source to V2"
gatorsmile Mar 28, 2018
ac4a3c3
[SPARK-23675][WEB-UI] Title add spark logo, use spark logo image
Mar 29, 2018
0826158
[SPARK-23806] Broadcast.unpersist can cause fatal exception when used…
Mar 29, 2018
96dc23b
[SPARK-23770][R] Exposes repartitionByRange in SparkR
HyukjinKwon Mar 29, 2018
b8882b8
[SPARK-23785][LAUNCHER] LauncherBackend doesn't check state of connec…
Mar 29, 2018
7c5ca63
[SPARK-23639][SQL] Obtain token before init metastore client in Spark…
yaooqinn Mar 29, 2018
132afa7
[SPARK-23808][SQL] Set default Spark session in test-only spark sessi…
jose-torres Mar 30, 2018
542f0ba
[SPARK-23743][SQL] Changed a comparison logic from containing 'slf4j'…
jongyoul Mar 30, 2018
082919f
[SPARK-23727][SQL] Support for pushing down filters for DateType in p…
yucai Mar 30, 2018
6bd199b
Roll forward "[SPARK-23096][SS] Migrate rate source to V2"
jose-torres Mar 30, 2018
96ba7af
[SPARK-23500][SQL][FOLLOWUP] Fix complex type simplification rules to…
gatorsmile Mar 30, 2018
32382fc
[SPARK-23640][CORE] Fix hadoop config may override spark config
wangyum Mar 30, 2018
14a48e3
[SPARK-23827][SS] StreamingJoinExec should ensure that input data is …
tdas Mar 30, 2018
5152b3c
[SPARK-23040][CORE][FOLLOW-UP] Avoid double wrap result Iterator.
jiangxb1987 Mar 31, 2018
1f9284a
[SPARK-15009][PYTHON][FOLLOWUP] Add default param checks for CountVec…
BryanCutler Apr 2, 2018
55f3371
[SPARK-23825][K8S] Requesting memory + memory overhead for pod memory
dvogelbacher Apr 2, 2018
f86b703
[SPARK-23285][K8S] Add a config property for specifying physical exec…
liyinan926 Apr 2, 2018
54d3dd0
[SPARK-23713][SQL] Cleanup UnsafeWriter and BufferHolder classes
kiszk Apr 2, 2018
782d7af
[SPARK-23834][TEST] Wait for connection before disconnect in Launcher…
Apr 2, 2018
2934384
[SPARK-23690][ML] Add handleinvalid to VectorAssembler
Apr 2, 2018
9b81269
[SPARK-19964][CORE] Avoid reading from remote repos in SparkSubmitSuite.
Apr 3, 2018
36a1607
[MINOR][DOC] Fix a few markdown typos
Apr 3, 2018
e29fc9c
[MINOR][CORE] Show block manager id when remove RDD/Broadcast fails.
jiangxb1987 Apr 3, 2018
4efdf84
[SPARK-23099][SS] Migrate foreach sink to DataSourceV2
jose-torres Apr 3, 2018
02bc4f6
[SPARK-23587][SQL] Add interpreted execution for MapObjects expression
viirya Apr 3, 2018
f8094fb
[SPARK-23809][SQL] Active SparkSession should be set by getOrCreate
ericl Apr 4, 2018
b55e39f
[SPARK-23802][SQL] PropagateEmptyRelation can leave query plan in unr…
Apr 4, 2018
699cc97
[SPARK-23826][TEST] TestHiveSparkSession should set default session
gatorsmile Apr 4, 2018
7561a62
[SPARK-21351][SQL] Update nullability based on children's output
maropu Apr 4, 2018
a85535b
[SPARK-23583][SQL] Invoke should support interpreted execution
kiszk Apr 4, 2018
295d11f
[SPARK-23668][K8S] Add config option for passing through k8s Pod.spec…
Apr 4, 2018
80cab07
[SPARK-23838][WEBUI] Running SQL query is displayed as "completed" in…
gengliangwang Apr 4, 2018
984faf5
[SPARK-23637][YARN] Yarn might allocate more resource if a same execu…
Apr 4, 2018
5883c3c
Merge branch 'master' into filter_estimation_devision_by_zero
Apr 10, 2018
@@ -427,7 +427,11 @@ case class FilterEstimation(plan: Filter) extends Logging {

// return the filter selectivity. Without advanced statistics such as histograms,
// we have to assume uniform distribution.
Some(math.min(newNdv.toDouble / ndv.toDouble, 1.0))
if (ndv.toDouble != 0) {
@maropu (Member), Apr 4, 2018:
What's the concrete example query when ndv.toDouble == 0?
Also, is this the only place where we need this check?
For example, don't we need it here?:

Author:
I have run into this problem with a sub-condition containing an IN clause, something like "FLD in ('value')".
As far as I can tell, this happens when the table is empty; in that case ndv will be 0.
I think it makes sense to add this check everywhere ndv is used in this way.
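For reference, the arithmetic behind the guard can be sketched in isolation (a hedged sketch mirroring the branch in the diff below; the helper name `inSetSelectivity` is ours, not Spark's):

```scala
// Sketch of the guarded selectivity computation.
// `ndv` is the column's distinct count from the analyzed statistics;
// on an empty table ANALYZE records ndv = 0, which previously caused
// a division by zero. With the guard, selectivity is reported as 0.0.
def inSetSelectivity(newNdv: BigInt, ndv: BigInt): Option[Double] =
  if (ndv.toDouble != 0) {
    Some(math.min(newNdv.toDouble / ndv.toDouble, 1.0))
  } else {
    Some(0.0)
  }
```

Clamping with `math.min(_, 1.0)` keeps the estimate a valid selectivity even when the new distinct count exceeds the old one.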

Member:

Can you add a test for the empty table case?
I think we need to fix the other places if they have the same issue. cc: @wzhfy

Author:

I have written a test that illustrates the problem; right now it is just a small standalone Scala app.
It currently fails after the last select. Adding REFRESH TABLE TBL does not fix the problem in this particular case.
Should I add it to some particular test suite?

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object SparkCBOStatisticsTest {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .master("local[*]")
          .config("spark.sql.adaptive.enabled", "true")
          .config("spark.sql.cbo.enabled", "true")
          .config("spark.sql.cbo.joinReorder.enabled", "true")
          .config("spark.sql.sources.bucketing.enabled", "true")
          .getOrCreate()
        import spark.implicits._
        import org.apache.spark.sql.functions._
        // Note: lit("aaa") + 'id is numeric addition, so fld3/fld4 end up null;
        // the repro only needs columns that have analyzed statistics.
        val df = spark.range(10000L).select('id,
          'id * 2 as "fld1",
          'id * 12 as "fld2",
          lit("aaa") + 'id as "fld3",
          lit("bbb") + 'id * 'id as "fld4")
        df.write
          .mode(SaveMode.Overwrite)
          .bucketBy(10, "id", "fld1", "fld2")
          .sortBy("id", "fld1", "fld2")
          .saveAsTable("TBL")
        spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS")
        spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS FOR COLUMNS ID, FLD1, FLD2, FLD3")
        val df2 = spark.sql("select t1.fld1*t2.fld2 as fld1, t1.fld2, t1.fld3 from tbl t1 join tbl t2 on t1.id=t2.id where t1.fld1>100 and t1.fld3 in (-123.23,321.23)")
        df2.explain(true)
        df2.show(100)
        df2.createTempView("tbl2")
        spark.sql("DESC FORMATTED TBL").show(false)
        spark.sql("select * from tbl2 where fld3 in ('qqq', 'qwe') and fld1<100").show(100, false)
      }
    }

Some(math.min(newNdv.toDouble / ndv.toDouble, 1.0))
} else {
Some(0.0)
}
}

/**
@@ -357,6 +357,17 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
expectedRowCount = 3)
}

test("evaluateInSet with all zeros") {
validateEstimatedStats(
Filter(InSet(attrString, Set(3, 4, 5)),
StatsTestPlan(Seq(attrString), 10,
AttributeMap(Seq(attrString ->
ColumnStat(distinctCount = Some(0), min = Some(0), max = Some(0),
nullCount = Some(0), avgLen = Some(0), maxLen = Some(0)))))),
Seq(attrString -> ColumnStat(distinctCount = Some(0))),
expectedRowCount = 0)
}

test("cint NOT IN (3, 4, 5)") {
validateEstimatedStats(
Filter(Not(InSet(attrInt, Set(3, 4, 5))), childStatsTestPlan(Seq(attrInt), 10L)),