[SPARK-18111][SQL] Wrong approximate quantile answer when multiple records have the minimum value (for branch 2.0)

## What changes were proposed in this pull request?
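When multiple records have the minimum value, `StatFunctions.multipleApproxQuantiles` returns a wrong answer. This patch changes the condition that decides whether the minimum element is prepended to the result: the comparison becomes `<=` instead of `<`, and the element is skipped when `currentSamples` has only one element.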
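
The effect of the changed condition can be illustrated with a small standalone sketch. Only the final `if` is taken from this commit; the `Sample` class, the `patched` flag, and the shape of the backwards-merging loop are assumptions made for illustration and are not Spark's actual code.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a summary entry; not Spark's actual class.
final case class Sample(value: Double, g: Int, delta: Int)

object MinimumElementSketch {
  // Assumed loop shape: walk from the tail down to index 1, merging a sample
  // into `head` while the combined weight stays below `mergeThreshold`.
  // Index 0 (the minimum) is never visited by the loop.
  def compress(
      currentSamples: IndexedSeq[Sample],
      mergeThreshold: Double,
      patched: Boolean): Array[Sample] = {
    if (currentSamples.isEmpty) return Array.empty[Sample]
    val res = ArrayBuffer.empty[Sample]
    var head = currentSamples.last
    var i = currentSamples.size - 2
    while (i >= 1) {
      val s = currentSamples(i)
      if (s.g + head.g + head.delta < mergeThreshold) {
        head = head.copy(g = head.g + s.g)
      } else {
        res.prepend(head)
        head = s
      }
      i -= 1
    }
    res.prepend(head)
    // The condition changed by this commit (old form when `patched` is false).
    val currHead = currentSamples.head
    val addMinimum =
      if (patched) currHead.value <= head.value && currentSamples.length > 1
      else currHead.value < head.value
    if (addMinimum) res.prepend(currHead)
    res.toArray
  }
}
```

With two samples that share the minimum value, the old condition drops the entry at index 0 and its weight `g`, which shifts every rank computed afterwards. The `length > 1` guard is needed because, with a single sample, `currHead` and `head` are the same element and `<=` alone would add it twice.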

## How was this patch tested?
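Added a test case to `DataFrameStatSuite` that computes the approximate median of a dataset in which multiple records within a partition have the minimum value.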
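
The expected value and tolerance used by the test can be checked without Spark. The sketch below is illustrative only: the object name, the helper names, and the rank convention for the exact quantile are assumptions, while the data and the `2 * n * epsilon` tolerance come from the test in this commit.

```scala
object ExpectedMedianSketch {
  // Same data as the new test case.
  val data: Seq[Int] = Seq(1, 1, 2, 1, 1, 3, 1, 1, 4, 1, 1, 5)

  // Exact quantile by rank: the element at 1-based position ceil(q * n)
  // in sorted order. This rank convention is an assumption.
  def exactQuantile(xs: Seq[Int], q: Double): Int = {
    val sorted = xs.sorted
    val rank = math.max(1, math.ceil(q * sorted.length).toInt)
    sorted(rank - 1)
  }

  // Tolerance used by the test: 2 * n * epsilon.
  def tolerance(n: Int, epsilon: Double): Double = 2 * n * epsilon
}
```

Eight of the twelve values are 1, so the exact median is 1. For the smallest `epsilon` in the test, 0.001, the tolerance is about 0.024, which means the approximate answer has to be 1 for the assertion to pass.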

Author: wangzhenhua <[email protected]>

Closes #15732 from wzhfy/percentile2.
wzhfy authored and rxin committed Nov 2, 2016
1 parent 1696bcf commit 3253ae7
Showing 2 changed files with 16 additions and 1 deletion.
@@ -337,7 +337,9 @@ object StatFunctions extends Logging {
     res.prepend(head)
     // If necessary, add the minimum element:
     val currHead = currentSamples.head
-    if (currHead.value < head.value) {
+    // don't add the minimum element if `currentSamples` has only one element (both `currHead` and
+    // `head` point to the same element)
+    if (currHead.value <= head.value && currentSamples.length > 1) {
       res.prepend(currentSamples.head)
     }
     res.toArray
@@ -152,6 +152,19 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
     }
   }
 
+  test("approximate quantile, multiple records with the minimum value in a partition") {
+    val data = Seq(1, 1, 2, 1, 1, 3, 1, 1, 4, 1, 1, 5)
+    val df = spark.sparkContext.makeRDD(data, 4).toDF("col")
+    val epsilons = List(0.1, 0.05, 0.001)
+    val quantile = 0.5
+    val expected = 1
+    for (epsilon <- epsilons) {
+      val Array(answer) = df.stat.approxQuantile("col", Array(quantile), epsilon)
+      val error = 2 * data.length * epsilon
+      assert(math.abs(answer - expected) < error)
+    }
+  }
+
   test("crosstab") {
     val rng = new Random()
     val data = Seq.tabulate(25)(i => (rng.nextInt(5), rng.nextInt(10)))
