[SPARK-2145] Add lower bound on sampling rate
to guarantee sampling performance
dorx committed Jun 19, 2014
1 parent 0214a76 commit 944a10c
Showing 2 changed files with 10 additions and 1 deletion.
@@ -217,6 +217,9 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
* the RDD to guarantee sample size with a 99.99% confidence; when sampling with replacement, we
* need two additional passes over the RDD to guarantee sample size with a 99.99% confidence.
*
+ * Note that if the sampling rate for any stratum is < 1e-10, we will throw an exception to
+ * avoid not being able to ever create the sample as an artifact of the RNG's quality.
+ *
* @param withReplacement whether to sample with or without replacement
* @param fractionByKey function mapping key to sampling rate
* @param seed seed for the random number generator
@@ -227,6 +230,10 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
fractionByKey: K => Double,
seed: Long = Utils.random.nextLong,
exact: Boolean = true): RDD[(K, V)]= {

+ require(fractionByKey.asInstanceOf[Map[K, Double]].forall({case(k, v) => v >= 1e-10}),
+ "Unable to support sampling rates < 1e-10.")

if (withReplacement) {
val counts = if (exact) Some(this.countByKey()) else None
val samplingFunc =
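
Note on the added `require` in the hunk above: `fractionByKey` is declared as `K => Double`, so the `asInstanceOf[Map[K, Double]]` cast can only succeed when the caller happens to pass a `Map`. The following is a minimal sketch, not part of this commit, of one way the same lower bound could be enforced for any key-to-rate function by checking each rate as it is looked up. The names `FractionCheck`, `MinRate`, and `checked` are hypothetical and exist only for illustration.

```scala
import scala.util.Try

object FractionCheck {
  // Hypothetical constant mirroring the 1e-10 lower bound used in the diff.
  val MinRate: Double = 1e-10

  // Wraps a key-to-rate function so each rate is validated when it is looked up.
  // Works for a Map (which is also a function) and for a plain lambda.
  def checked[K](fractionByKey: K => Double): K => Double = { key =>
    val rate = fractionByKey(key)
    require(rate >= MinRate, s"Unable to support sampling rates < $MinRate (key: $key).")
    rate
  }

  def main(args: Array[String]): Unit = {
    val asMap: String => Double = Map("a" -> 0.5, "b" -> 0.1)
    val asLambda: String => Double = k => if (k == "a") 0.5 else 1e-12

    // The cast-based check from the diff fails for a lambda, whatever its rates are.
    val castCheck = Try(
      asLambda.asInstanceOf[Map[String, Double]].forall { case (_, v) => v >= MinRate })
    println(s"cast-based check on a lambda failed: ${castCheck.isFailure}")

    // The wrapper accepts both forms and rejects only the rates that are too small.
    println(s"map, key a: ${checked(asMap)("a")}")
    println(s"lambda, key b rejected: ${Try(checked(asLambda)("b")).isFailure}")
  }
}
```

The trade-off in this sketch is that a rate below the bound is reported only when its key is first seen, rather than up front before any pass over the data.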
4 changes: 3 additions & 1 deletion core/src/main/scala/org/apache/spark/rdd/RDD.scala
@@ -350,11 +350,13 @@ abstract class RDD[T: ClassTag](

/**
* Return a sampled subset of this RDD.
+ *
+ * fraction < 1e-10 not supported.
*/
def sample(withReplacement: Boolean,
fraction: Double,
seed: Long = Utils.random.nextLong): RDD[T] = {
- require(fraction >= 0.0, "Invalid fraction value: " + fraction)
+ require(fraction >= 1e-10, "Invalid fraction value: " + fraction)
if (withReplacement) {
new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), seed)
} else {
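
Note on the `require` change in `sample`: tightening the bound from `0.0` to `1e-10` means a fraction of exactly `0.0`, which the earlier check accepted, is now rejected along with negative values and `NaN`. The following standalone sketch reproduces only that guard outside of Spark. The names `SampleFractionGuard`, `MinFraction`, and `validate` are hypothetical.

```scala
import scala.util.Try

object SampleFractionGuard {
  // Hypothetical constant mirroring the lower bound introduced by this commit.
  val MinFraction: Double = 1e-10

  // Returns the fraction unchanged if it meets the lower bound,
  // otherwise throws IllegalArgumentException via require.
  def validate(fraction: Double): Double = {
    require(fraction >= MinFraction, "Invalid fraction value: " + fraction)
    fraction
  }

  def main(args: Array[String]): Unit = {
    // The boundary value, a typical value, zero, a negative value, and NaN.
    val candidates = Seq(1e-10, 0.25, 0.0, -0.1, Double.NaN)
    candidates.foreach { f =>
      val outcome = if (Try(validate(f)).isSuccess) "accepted" else "rejected"
      println(s"fraction $f: $outcome")
    }
  }
}
```

`NaN` is rejected because every ordered comparison involving `NaN` evaluates to false, so no separate check is needed for it.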