Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rand::sample's API should be fallible #194

Closed
vitiral opened this issue Nov 21, 2017 · 22 comments
Closed

rand::sample's API should be fallible #194

vitiral opened this issue Nov 21, 2017 · 22 comments

Comments

@vitiral
Copy link
Contributor

vitiral commented Nov 21, 2017

Most rust operations are fallible if they don't make sense, either through a panic or by returning an Option or Result. However, rand::sample has the following API:

pub fn sample<T, I, R>(rng: &mut R, iterable: I, amount: usize) -> Vec<T> where
    I: IntoIterator<Item = T>,
    R: Rng, 

Randomly sample up to amount elements from a finite iterator.

This function will silently return less than amount if the size of I is less than amount. I would say this is very odd, and returning Option<Vec<T>> would have been a better design choice. You can imagine the confusion if you did rand::sample(rng, v, 100) but then only had 2 elements!

If it were changed to returning None if I.len() < amount then the current behavior could still be accomplished with:

rand::sample(&mut rng, vec.iter(), usize::min(amount, vec.len())).unwrap()

With the current behavior, user has to put guards around their code (which is not required by the compiler) to prevent getting less than amount, which is unusual in rust.

Another argument is that trying to sample when amount > len() will just return an "inadequately shuffled" version of the iterator. If someone wanted that behavior it would be more correct to just collect() the iterator and call shuffle() on it.

@vitiral vitiral changed the title rand::sample's API should not be infalible rand::sample's API should be fallible Nov 21, 2017
@dhardy
Copy link
Member

dhardy commented Nov 21, 2017

Do you have a context where you're using this function? I'm not sure what the main use-cases are which makes it hard to say what it should do.

@vitiral
Copy link
Contributor Author

vitiral commented Nov 21, 2017

The major point is how this behaves compared to functions like slice or even Vec::get which do not permit you to attempt to access slices that are outside of the range.

#[test]
fn it_works() {
    let v = vec![1,2,3,4];
    assert_eq!(&v[2..6], &[3,4]);
}

panics with:

        thread 'it_works' panicked at 'index 6 out of range for slice of length 4', /checkout/src/libcore/slice/mod.rs:745:4

The way this currently behaves is more like python's slice mechanics, where doing list[5:100] will raise an error only if the min-index does not exist, but doesn't care if it returns less than 95 elements.

The counterpoint is something like Iterator::take, but that actually returns an iterator, so you still get None values when it runs out.

Edit: I consider sample to be akin to a "random slice of values" -- obviously not in order, etc but in that vein. Therefore it should always return exactly amount values.

@vitiral
Copy link
Contributor Author

vitiral commented Nov 21, 2017

Also, I a requesting this change as part of the rand crate stabilization: rust-lang/rfcs#2106

@dhardy
Copy link
Member

dhardy commented Nov 21, 2017

Is it? I don't remember writing that. Do you mean it was requested in a comment?

Regarding the previous post: true. But another point related to sample is that the algorithm could be more efficient if the input is given as a slice (or something allowing random access), roughly when input.len() > 2 * amount, as well as providing true sorting.

So without use cases I still don't know what this function should do (I'd be tempted just not to stabilise it, e.g. put it behind a feature gate).

@vitiral
Copy link
Contributor Author

vitiral commented Nov 21, 2017

sorry, I meant that I am requesting it as part of that effort :)

@vitiral
Copy link
Contributor Author

vitiral commented Nov 21, 2017

I might be wrong, but I think rust allows you to overload to provide different implementations for different input types, so such efficiency could be provided.

Also, correct me if I'm wrong but the function already requires an iterator of a known length right?

... Hmm, I'm looking at it and I think it just requires the input to be an iterator, so it doesn't necessarily know anything about the length. I think there are really two use cases here that should mabe be two separate functions. At the very least, one that knows about the length and one that doesn't.

@dhardy
Copy link
Member

dhardy commented Nov 21, 2017

Iterators never have a known length in Rust. At best, they have a size-hint, but that's too unreliable for this use-case.

You're right, two different functions are called for to cover this functionality fully. But I wonder which one(s) would actually be used?

@bluss
Copy link
Contributor

bluss commented Nov 21, 2017

I think the current interface of sample kind of makes sense and is self-consistent, but it's a very tricky to use function. This is reservoir sampling, and the function includes a note that the order of elements in the output is not random!

The interface is that you input a sequence of n elements and then you pick at most m of those elements, where each element has equal chance of ending up in the output. This generalizes quite nicely into supporting situations where n <= m; you simply get the whole sequence back.

@vitiral
Copy link
Contributor Author

vitiral commented Nov 21, 2017

So I can see three major use cases and am starting to think that these should be implemented as traits

This one is the same as what exists and would be implemented just for any iterator

trait SampleIter<T: Clone>: IntoIter<Item=T> {
    /// Randomly sample up to amount elements from a finite iterator.
    /// The order of elements in the sample is not random.
    ///
    /// This consumes the entire iterator.
    pub fn sample<R: Rng>(&mut self, rng: &mut R, amount: usize) -> Vec<T> 
}

This is similar, but is for slices.

Edit: I've split it into two methods so that sample has identical behavior to its iterator friend.

trait SampleSlice<T> {
    /// Randomly sample up to amount elements from a a slice.
    ///
    /// This takes at most `O(amount)` time.
    ///
    /// returns a clone of `self` if `amount <= self.len()`
    pub fn sample<R: Rng>(&self, rng: &mut R, amount: usize) -> Vec<T> 

    /// Randomly sample up to amount elements from a a slice.
    ///
    /// This takes at most `O(amount)` time.
    ///
    /// Panics if self.len() < amount.
    pub fn sample_exact<R: Rng>(&self, rng: &mut R, amount: usize) -> Vec<T> 
}

Another implementation could hopefully be done on HashSet, HashMap, BTreeMap, etc -- I'm not sure if there is a way to reduce those to O(amount) work, but if there were it would be very nice!

The point here is that each is implementation specific and I can see use cases for all of them:

  • Randomly select 20 lines from a file (or really any long stream): use SampleIter trait, it will not take up any extra memory!
  • Randomly select 20 items from a Vec/Slice: use SampleSlice trait, it will only take `O(amount) time!
  • Randomly select 20 possiblilities from a HashSet/Map: use SampleHash and (hopefully) get speedup over SampleIter, hopefully on-par with SampleSlice. Having these even be deterministic with the same seed + items would be very nice!

@vitiral
Copy link
Contributor Author

vitiral commented Nov 21, 2017

@bluss I actually didn't realize that if n <= m it will actually be the elements in the order they were consumed. That is quite an interesting property that I think a SampleIter would want to preserve (and document).

@vitiral
Copy link
Contributor Author

vitiral commented Nov 21, 2017

I edited SampleSlice slightly to add to methods to the trait.

@bluss
Copy link
Contributor

bluss commented Nov 21, 2017

Sample is suggestive of a random order, so I would prefer something matter of fact - like naming it reservoir sample, explicitly after the algorithm it implements.

@vitiral
Copy link
Contributor Author

vitiral commented Nov 21, 2017

Also, it might be possible to have only the traits Sample and SampleExact, but my trait-foo isn't good enough to get rid of the &mut self for the iterator definition.

@bluss that sounds fine to me. Actually, sample_exact could be named just sample then (and panic for the incorrect amount) because you can implement a random sample on a slice with no time penalty right?

Edit: and calling it when n == m would be equivalent to shuffle()

@vitiral
Copy link
Contributor Author

vitiral commented Nov 21, 2017

Also, an implementation detail: for SampleExact we should probably return Vec<&T> as the references should be sufficient and have size/time savings (as well as not require Clone to be implemented).

@burdges
Copy link
Contributor

burdges commented Nov 22, 2017

This function makes sense only when your sample includes a significant fraction of the output of the source iterator. You could build a stable version of this too, maybe by using a heap, or maybe through indexing:

struct StableSample<T> {
    index: usize,
    reservoir: Vec<(isize,isize,T)>,
}
impl Iterator for StableSample<T> {
    type Item = T;
    fn next() -> Option<T> {
        reservoir.get(start).map(|spot| {
            index = spot.1;
            spot.2
        })
    }
}

pub fn stable_sample<T, I, R>(rng: &mut R, iterable: I, amount: usize) -> StableSample<T>
    where I: IntoIterator<Item=T>,
          R: Rng,
{
    let mut iter = iterable.into_iter();
    let mut reservoir: Vec<(isize,isize,T)> = iter.by_ref().take(amount).enumerate().map(|(x,y)| (x-1,x+1,y)).collect();
    // continue unless the iterator was exhausted
    let mut index = 0;
    if reservoir.len() == amount {
        let mut old_k = amount-1;
        for (i, elem) in iter.enumerate() {
            let k = rng.gen_range(0, i + 1 + amount);
            if let Some(spot) = reservoir.get_mut(k) {
                spot.2 = elem;
                if let Some(prev) = reservoir.get_mut(spot.0) { prev.1 = spot.1; } else { index = k; }
                if let Some(next) = reservoir.get_mut(spot.1) { next.0 = spot.0; }
                if let Some(old) = reservoir.get_mut(old_k) { old.1 = k; old_k = k; } else { unreachable!() }
                spot.0 = old_k;
                spot.1 = amount;
            }
        }
    }
    StableSample { index, reservoir }
}

An aborted Fisher–Yates shuffle makes more sense when you do not mind permuting your source. Also Robert Floyd's combination algorithm makes more sense when sampling from a much larger Range.

@vitiral
Copy link
Contributor Author

vitiral commented Nov 22, 2017

@burdges just to be clear when you say "this function" you mean rand::sample as it exists today right?

@vitiral
Copy link
Contributor Author

vitiral commented Nov 22, 2017

I think the request now boils down to these three things:

  1. Rename rand::sample -> rand::sample_reservoir: this is detailed by @bluss and would be a good change for clarity's sake. The documentation could also be improved to discuss the implementation details and what happens if amount >= iter.len().
  2. Add a Sample trait with one method:
    • fn sample<T: Clone>(&self, amount: usize) -> Option<Vec<T>>: samples exactly amount items or returns None. The output is completely randomized (equivalent to calling shuffle).
  3. Add a SampleRef trait with one method:
    • fn sample_ref<T>(&'a self, amount: usize) -> Option<Vec<&T>>: sampe as sample() above except returns references to the values instead of clones.

Sample and SampleRef could then be implemented for any types which want to implement them, with a first candidate being &[T]

Edit: Another possiblity would be a SampleOwned that consumes self, but I don't see a good use case for that.

Edit2: Someone could convince me that Sample::sample and SampleRef::sample_ref should panic instead of returning Option... maybe a try_sample as well that returns Option while sample panics?

@dhardy
Copy link
Member

dhardy commented Nov 23, 2017

@vitiral would you make a PR, either here or against my branch? (Not really sure what's best; since this is fairly independent of the other stuff in my branch it could just land here.)

The other question: should sample be moved? I already put it in a submodule in my branch.

@vitiral
Copy link
Contributor Author

vitiral commented Nov 23, 2017

I can see what I can do!

I'm not sure I like the name sequences since I'm hoping to be able to randomly sample from HashSets, etc in O(amount) as well -- which aren't really sequences. I'm still not sure how possible that is.

@dhardy
Copy link
Member

dhardy commented Nov 23, 2017

I wasn't at all sure how to organise this functionality so counter-proposals are welcome.

Another option would be to put all this in another crate?

@burdges
Copy link
Contributor

burdges commented Nov 23, 2017

Yes I was only pointing out that different situations make different sampling schemes optimal.

@vitiral
Copy link
Contributor Author

vitiral commented Nov 24, 2017

see #195 for an initial stab

vitiral added a commit to vitiral/rand that referenced this issue Nov 25, 2017
vitiral added a commit to vitiral/rand that referenced this issue Nov 25, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants