rand::sample's API should be fallible #194

vitiral · 2017-11-21T16:06:56Z

Most rust operations are fallible if they don't make sense, either through a panic or by returning an Option or Result. However, rand::sample has the following API:

pub fn sample<T, I, R>(rng: &mut R, iterable: I, amount: usize) -> Vec<T> where
    I: IntoIterator<Item = T>,
    R: Rng,

Randomly sample up to amount elements from a finite iterator.

This function will silently return less than amount if the size of I is less than amount. I would say this is very odd, and returning Option<Vec<T>> would have been a better design choice. You can imagine the confusion if you did rand::sample(rng, v, 100) but then only had 2 elements!

If it were changed to returning None if I.len() < amount then the current behavior could still be accomplished with:

rand::sample(&mut rng, vec.iter(), usize::min(amount, vec.len())).unwrap()

With the current behavior, user has to put guards around their code (which is not required by the compiler) to prevent getting less than amount, which is unusual in rust.

Another argument is that trying to sample when amount > len() will just return an "inadequately shuffled" version of the iterator. If someone wanted that behavior it would be more correct to just collect() the iterator and call shuffle() on it.

The text was updated successfully, but these errors were encountered:

dhardy · 2017-11-21T17:17:10Z

Do you have a context where you're using this function? I'm not sure what the main use-cases are which makes it hard to say what it should do.

vitiral · 2017-11-21T17:30:29Z

The major point is how this behaves compared to functions like slice or even Vec::get which do not permit you to attempt to access slices that are outside of the range.

#[test]
fn it_works() {
    let v = vec![1,2,3,4];
    assert_eq!(&v[2..6], &[3,4]);
}

panics with:

        thread 'it_works' panicked at 'index 6 out of range for slice of length 4', /checkout/src/libcore/slice/mod.rs:745:4

The way this currently behaves is more like python's slice mechanics, where doing list[5:100] will raise an error only if the min-index does not exist, but doesn't care if it returns less than 95 elements.

The counterpoint is something like Iterator::take, but that actually returns an iterator, so you still get None values when it runs out.

Edit: I consider sample to be akin to a "random slice of values" -- obviously not in order, etc but in that vein. Therefore it should always return exactly amount values.

vitiral · 2017-11-21T17:33:39Z

Also, I a requesting this change as part of the rand crate stabilization: rust-lang/rfcs#2106

dhardy · 2017-11-21T17:39:05Z

Is it? I don't remember writing that. Do you mean it was requested in a comment?

Regarding the previous post: true. But another point related to sample is that the algorithm could be more efficient if the input is given as a slice (or something allowing random access), roughly when input.len() > 2 * amount, as well as providing true sorting.

So without use cases I still don't know what this function should do (I'd be tempted just not to stabilise it, e.g. put it behind a feature gate).

vitiral · 2017-11-21T17:46:41Z

sorry, I meant that I am requesting it as part of that effort :)

vitiral · 2017-11-21T17:49:50Z

I might be wrong, but I think rust allows you to overload to provide different implementations for different input types, so such efficiency could be provided.

Also, correct me if I'm wrong but the function already requires an iterator of a known length right?

... Hmm, I'm looking at it and I think it just requires the input to be an iterator, so it doesn't necessarily know anything about the length. I think there are really two use cases here that should mabe be two separate functions. At the very least, one that knows about the length and one that doesn't.

dhardy · 2017-11-21T18:37:45Z

Iterators never have a known length in Rust. At best, they have a size-hint, but that's too unreliable for this use-case.

You're right, two different functions are called for to cover this functionality fully. But I wonder which one(s) would actually be used?

bluss · 2017-11-21T20:31:34Z

I think the current interface of sample kind of makes sense and is self-consistent, but it's a very tricky to use function. This is reservoir sampling, and the function includes a note that the order of elements in the output is not random!

The interface is that you input a sequence of n elements and then you pick at most m of those elements, where each element has equal chance of ending up in the output. This generalizes quite nicely into supporting situations where n <= m; you simply get the whole sequence back.

vitiral · 2017-11-21T21:14:14Z

So I can see three major use cases and am starting to think that these should be implemented as traits

This one is the same as what exists and would be implemented just for any iterator

trait SampleIter<T: Clone>: IntoIter<Item=T> {
    /// Randomly sample up to amount elements from a finite iterator.
    /// The order of elements in the sample is not random.
    ///
    /// This consumes the entire iterator.
    pub fn sample<R: Rng>(&mut self, rng: &mut R, amount: usize) -> Vec<T> 
}

This is similar, but is for slices.

Edit: I've split it into two methods so that sample has identical behavior to its iterator friend.

trait SampleSlice<T> {
    /// Randomly sample up to amount elements from a a slice.
    ///
    /// This takes at most `O(amount)` time.
    ///
    /// returns a clone of `self` if `amount <= self.len()`
    pub fn sample<R: Rng>(&self, rng: &mut R, amount: usize) -> Vec<T> 

    /// Randomly sample up to amount elements from a a slice.
    ///
    /// This takes at most `O(amount)` time.
    ///
    /// Panics if self.len() < amount.
    pub fn sample_exact<R: Rng>(&self, rng: &mut R, amount: usize) -> Vec<T> 
}

Another implementation could hopefully be done on HashSet, HashMap, BTreeMap, etc -- I'm not sure if there is a way to reduce those to O(amount) work, but if there were it would be very nice!

The point here is that each is implementation specific and I can see use cases for all of them:

Randomly select 20 lines from a file (or really any long stream): use SampleIter trait, it will not take up any extra memory!
Randomly select 20 items from a Vec/Slice: use SampleSlice trait, it will only take `O(amount) time!
Randomly select 20 possiblilities from a HashSet/Map: use SampleHash and (hopefully) get speedup over SampleIter, hopefully on-par with SampleSlice. Having these even be deterministic with the same seed + items would be very nice!

vitiral · 2017-11-21T21:15:46Z

@bluss I actually didn't realize that if n <= m it will actually be the elements in the order they were consumed. That is quite an interesting property that I think a SampleIter would want to preserve (and document).

vitiral · 2017-11-21T21:19:45Z

I edited SampleSlice slightly to add to methods to the trait.

bluss · 2017-11-21T21:20:50Z

Sample is suggestive of a random order, so I would prefer something matter of fact - like naming it reservoir sample, explicitly after the algorithm it implements.

vitiral · 2017-11-21T21:22:12Z

Also, it might be possible to have only the traits Sample and SampleExact, but my trait-foo isn't good enough to get rid of the &mut self for the iterator definition.

@bluss that sounds fine to me. Actually, sample_exact could be named just sample then (and panic for the incorrect amount) because you can implement a random sample on a slice with no time penalty right?

Edit: and calling it when n == m would be equivalent to shuffle()

vitiral · 2017-11-21T21:27:33Z

Also, an implementation detail: for SampleExact we should probably return Vec<&T> as the references should be sufficient and have size/time savings (as well as not require Clone to be implemented).

burdges · 2017-11-22T01:48:51Z

This function makes sense only when your sample includes a significant fraction of the output of the source iterator. You could build a stable version of this too, maybe by using a heap, or maybe through indexing:

struct StableSample<T> {
    index: usize,
    reservoir: Vec<(isize,isize,T)>,
}
impl Iterator for StableSample<T> {
    type Item = T;
    fn next() -> Option<T> {
        reservoir.get(start).map(|spot| {
            index = spot.1;
            spot.2
        })
    }
}

pub fn stable_sample<T, I, R>(rng: &mut R, iterable: I, amount: usize) -> StableSample<T>
    where I: IntoIterator<Item=T>,
          R: Rng,
{
    let mut iter = iterable.into_iter();
    let mut reservoir: Vec<(isize,isize,T)> = iter.by_ref().take(amount).enumerate().map(|(x,y)| (x-1,x+1,y)).collect();
    // continue unless the iterator was exhausted
    let mut index = 0;
    if reservoir.len() == amount {
        let mut old_k = amount-1;
        for (i, elem) in iter.enumerate() {
            let k = rng.gen_range(0, i + 1 + amount);
            if let Some(spot) = reservoir.get_mut(k) {
                spot.2 = elem;
                if let Some(prev) = reservoir.get_mut(spot.0) { prev.1 = spot.1; } else { index = k; }
                if let Some(next) = reservoir.get_mut(spot.1) { next.0 = spot.0; }
                if let Some(old) = reservoir.get_mut(old_k) { old.1 = k; old_k = k; } else { unreachable!() }
                spot.0 = old_k;
                spot.1 = amount;
            }
        }
    }
    StableSample { index, reservoir }
}

An aborted Fisher–Yates shuffle makes more sense when you do not mind permuting your source. Also Robert Floyd's combination algorithm makes more sense when sampling from a much larger Range.

vitiral · 2017-11-22T23:17:05Z

@burdges just to be clear when you say "this function" you mean rand::sample as it exists today right?

vitiral · 2017-11-22T23:23:51Z

I think the request now boils down to these three things:

Rename rand::sample -> rand::sample_reservoir: this is detailed by @bluss and would be a good change for clarity's sake. The documentation could also be improved to discuss the implementation details and what happens if amount >= iter.len().
Add a Sample trait with one method:
- fn sample<T: Clone>(&self, amount: usize) -> Option<Vec<T>>: samples exactly amount items or returns None. The output is completely randomized (equivalent to calling shuffle).
Add a SampleRef trait with one method:
- fn sample_ref<T>(&'a self, amount: usize) -> Option<Vec<&T>>: sampe as sample() above except returns references to the values instead of clones.

Sample and SampleRef could then be implemented for any types which want to implement them, with a first candidate being &[T]

Edit: Another possiblity would be a SampleOwned that consumes self, but I don't see a good use case for that.

Edit2: Someone could convince me that Sample::sample and SampleRef::sample_ref should panic instead of returning Option... maybe a try_sample as well that returns Option while sample panics?

dhardy · 2017-11-23T14:37:05Z

@vitiral would you make a PR, either here or against my branch? (Not really sure what's best; since this is fairly independent of the other stuff in my branch it could just land here.)

The other question: should sample be moved? I already put it in a submodule in my branch.

vitiral · 2017-11-23T14:43:34Z

I can see what I can do!

I'm not sure I like the name sequences since I'm hoping to be able to randomly sample from HashSets, etc in O(amount) as well -- which aren't really sequences. I'm still not sure how possible that is.

dhardy · 2017-11-23T14:45:21Z

I wasn't at all sure how to organise this functionality so counter-proposals are welcome.

Another option would be to put all this in another crate?

burdges · 2017-11-23T14:52:20Z

Yes I was only pointing out that different situations make different sampling schemes optimal.

vitiral · 2017-11-24T16:09:01Z

see #195 for an initial stab

…eservoir

This was referenced Nov 21, 2017

consider for new epoch: rand::sample's API should not be infalible rust-lang/rust#46163

Closed

sample() and select() functions are missing proptest-rs/proptest#13

Closed

vitiral changed the title ~~rand::sample's API should not be infalible~~ rand::sample's API should be fallible Nov 21, 2017

vitiral mentioned this issue Nov 21, 2017

Rand crate revision (pre-stabilisation) rust-lang/rfcs#2106

Closed

dhardy mentioned this issue Nov 24, 2017

Re-export rand-core members in rand? dhardy/rand#15

Closed

vitiral mentioned this issue Nov 24, 2017

Sample #195

Merged

vitiral added a commit to vitiral/rand that referenced this issue Nov 25, 2017

close rust-random#194: add Sample trait and rename sample -> sample_r…

c75f356

…eservoir

vitiral added a commit to vitiral/rand that referenced this issue Nov 25, 2017

close rust-random#194: add Sample trait and rename sample -> sample_r…

566fbaf

…eservoir

dhardy closed this as completed in 3e9eac0 Dec 5, 2017

vitiral mentioned this issue Dec 5, 2017

investigate generic sample_indicies #202

Closed

pitdicker pushed a commit to pitdicker/rand that referenced this issue Apr 4, 2018

close rust-random#194: add Sample trait and rename sample -> sample_r…

9165b0e

…eservoir

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rand::sample's API should be fallible #194

rand::sample's API should be fallible #194

vitiral commented Nov 21, 2017 •

edited

Loading

dhardy commented Nov 21, 2017

vitiral commented Nov 21, 2017 •

edited

Loading

vitiral commented Nov 21, 2017 •

edited

Loading

dhardy commented Nov 21, 2017

vitiral commented Nov 21, 2017

vitiral commented Nov 21, 2017

dhardy commented Nov 21, 2017

bluss commented Nov 21, 2017 •

edited

Loading

vitiral commented Nov 21, 2017 •

edited

Loading

vitiral commented Nov 21, 2017 •

edited

Loading

vitiral commented Nov 21, 2017

bluss commented Nov 21, 2017

vitiral commented Nov 21, 2017 •

edited

Loading

vitiral commented Nov 21, 2017 •

edited

Loading

burdges commented Nov 22, 2017 •

edited

Loading

vitiral commented Nov 22, 2017

vitiral commented Nov 22, 2017 •

edited

Loading

dhardy commented Nov 23, 2017 •

edited

Loading

vitiral commented Nov 23, 2017 •

edited

Loading

dhardy commented Nov 23, 2017 •

edited

Loading

burdges commented Nov 23, 2017

vitiral commented Nov 24, 2017

rand::sample's API should be fallible #194

rand::sample's API should be fallible #194

Comments

vitiral commented Nov 21, 2017 • edited Loading

dhardy commented Nov 21, 2017

vitiral commented Nov 21, 2017 • edited Loading

vitiral commented Nov 21, 2017 • edited Loading

dhardy commented Nov 21, 2017

vitiral commented Nov 21, 2017

vitiral commented Nov 21, 2017

dhardy commented Nov 21, 2017

bluss commented Nov 21, 2017 • edited Loading

vitiral commented Nov 21, 2017 • edited Loading

vitiral commented Nov 21, 2017 • edited Loading

vitiral commented Nov 21, 2017

bluss commented Nov 21, 2017

vitiral commented Nov 21, 2017 • edited Loading

vitiral commented Nov 21, 2017 • edited Loading

burdges commented Nov 22, 2017 • edited Loading

vitiral commented Nov 22, 2017

vitiral commented Nov 22, 2017 • edited Loading

dhardy commented Nov 23, 2017 • edited Loading

vitiral commented Nov 23, 2017 • edited Loading

dhardy commented Nov 23, 2017 • edited Loading

burdges commented Nov 23, 2017

vitiral commented Nov 24, 2017

vitiral commented Nov 21, 2017 •

edited

Loading

vitiral commented Nov 21, 2017 •

edited

Loading

vitiral commented Nov 21, 2017 •

edited

Loading

bluss commented Nov 21, 2017 •

edited

Loading

vitiral commented Nov 21, 2017 •

edited

Loading

vitiral commented Nov 21, 2017 •

edited

Loading

vitiral commented Nov 21, 2017 •

edited

Loading

vitiral commented Nov 21, 2017 •

edited

Loading

burdges commented Nov 22, 2017 •

edited

Loading

vitiral commented Nov 22, 2017 •

edited

Loading

dhardy commented Nov 23, 2017 •

edited

Loading

vitiral commented Nov 23, 2017 •

edited

Loading

dhardy commented Nov 23, 2017 •

edited

Loading