Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iterator: remove rewind. Implement cycle by storing elements in an array #7440

Merged

Conversation

asterite
Copy link
Member

@asterite asterite commented Feb 15, 2019

This is a controversial PR. It removes the #rewind method from all existing Iterators.

The reasons for this PR are explained in this comment but I'll explain it again here.

The reason why I originally added Iterator#rewind was:

  • I saw that Ruby has Enumerable#cycle
  • I thought of a way to implement it and the first idea I had was to be able to rewind iterators

Later on I found our that cycle was actually implemented in Ruby by building an internal Array of the elements in the Enumerable.

I think the advantages of implementing cycle like this, and removing rewind are:

  • some iterators can't be rewinded (like Iterator.of)
  • some iterators can be rewinded, but they actually don't end up rewinding anything (this is case of STDIN.each_line, which doesn't give an error, we can't detect the error and it simply works wrong)
  • implementing Iterator is simple: just implement next, no need to think about rewinding

This is a breaking change. However, I seriously doubt anyone out there is using rewind on Iterator (but maybe they are using cycle). I'm not sure, of course, but in any case I'd prefer to remove it.

@Sija
Copy link
Contributor

Sija commented Feb 16, 2019

This is a breaking change. However, I seriously doubt anyone out there is using rewind on Iterator (but maybe they are using cycle). I'm not sure, of course, but in any case I'd prefer to remove it.

https://github.com/search?q=rewind+-filename%3Aiterator_spec.cr&l=Crystal&type=Code

There's a bit of 'em, although it's debatable how much of the results are meaningful.

@asterite
Copy link
Member Author

asterite commented Feb 16, 2019

Thanks Sija. All the result I saw seem to be IO#rewind or implementations of rewind (sometimes called in initialize, but that's easy to fix), but none of them seem to be actual usages of the rewind method. This further confirms my belief that rewind is there just because it needs to be there, not because it's actually useful (other than for implementing cycle in a different way).

@asterite
Copy link
Member Author

asterite commented Feb 16, 2019

Let's wait Brian before merging this.

@asterite asterite added this to the 0.28.0 milestone Feb 16, 2019
@jhass jhass requested a review from bcardiff February 18, 2019 08:57
Copy link
Member

@bcardiff bcardiff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this change. The implementation is more composable, there is less code and the results are more consistent despite the nature of the iterator. 💯

@asterite asterite merged commit c3f676d into crystal-lang:master Feb 26, 2019
@bcardiff
Copy link
Member

A final note :-) Buffered#rewind still exists and is fine to be used as long as unbuffered_rewind is implemented. So some cases of IO based rewing can still work.

@asterite asterite deleted the feature/iterator-no-more-rewind branch March 30, 2019 16:34
@baelter
Copy link

baelter commented May 6, 2019

I was using this to get the size of an iterator without having to allocate an array for the elements.
E.g.

size = itr.size
itr.rewind
...

Can this be done in any other way?

@bcardiff
Copy link
Member

bcardiff commented May 7, 2019

@baelter You could see if the specific iterator support #rewind (or #dup maybe) and but have a fallback when you don't/can't know the size.

Something like:

if itr.responds_to?(:rewind)
  # acurate size hint 
  size = itr.size
  itr.rewind
  a = Array(T).new(size)
else
  # no size hint
  a = Array(T).new
end

# use `a` and and `itr`

But depending on the use case just collection the items in the array might be enough. If you definitely needs the whole set of data there is probably no improvement by using an iterator.

@baelter
Copy link

baelter commented May 8, 2019

I don't need all the items, so I would prefer to not have to allocate memory for new array for them only to get the size of the collection. I suppose I could iterate everything and count and only emit objects where needed.

#duped iterators have the same state as the original it seems like.

@asterite
Copy link
Member Author

asterite commented May 8, 2019

Maybe you could describe your entire use case? It seems pretty weird to me.

@Sija
Copy link
Contributor

Sija commented May 8, 2019

@asterite btw, I had a similar issue recently, could you please take a look at my approach in Sija/retriable.cr@8cd9986 - does it make sense / is there a room for improvement?

@asterite
Copy link
Member Author

asterite commented May 8, 2019

Are the items there like thousands or millions? Is it an infinite list? Because if not, will the code work equally well with an Array? I think all of this might be a case of premature optimization.

@baelter
Copy link

baelter commented May 9, 2019

Use case is a huge collection, potentially millions, of largeish objects that are being exposed in a paged json api. I need the have the totalt count but only give a small subset to the json builder.

I want to only use an iterator all the way to avoid allocating an array for the entire collection.

I've found a working solution in iterating the entire collection and counting the total while only building json for the page window. Code was nicer with rewind though.

@asterite
Copy link
Member Author

asterite commented May 9, 2019

One way to do it is:

def results_and_total(input)
  iter = input.each
  results = iter.first(10).to_a
  total = results.size + iter.size
  p! input, results, total
end

results_and_total(1..100)
results_and_total(1..8)

That is, you take what you need, and then you get the size of the remaining stuff.

This isn't longer code than using rewind, and it's also faster because you consume the iterator just once.

@felixbuenemann
Copy link
Contributor

felixbuenemann commented May 27, 2019

One thing I miss with #cycle compared to #rewind is that I can no longer detect, when the end of a cycle is reached.

I previously had code that cycled a list of languages from inside an iterator of documents and generated each document in each language.

This means that each call to the iterator returned something like this:

  • doc#1.en
  • doc#1.de
  • doc#2.en
  • doc#2.de

(The nested iterator is streaming an unknown amount of documents that are way too large to keep in memory, thus the use of the Iterator protocol.)

In this specific case I could just replace #rewind with reinitialising the languages iterator, but I couldn't use #cycle even though that's exactly what I'm doing.

This only works if you have all the state you need to reinitialise the iterator that needs to be cycled, while previously a reference to the iterator itself was sufficient.

I think overall the implementation of #rewind was a bit flawed compared to Ruby, since in Ruby calling Enumerable methods like #map always rewinds the Enumerator, but int Crystal that wasn't the case.

With #rewind gone, that behaviour is less surprising.

Here's an example of the #rewind in Ruby:

e = [1, 2, 3, 4, 5, 6].each
e.first(3).to_a => [1, 2, 3]
e.first(3).to_a => [1, 2, 3]

Compared to Crystal:

e = [1, 2, 3, 4, 5, 6].each
e.first(3).to_a => [1, 2, 3]
e.first(3).to_a => [4, 5, 6]

This difference has bitten me a few times when porting over code from Ruby to Crystal.

This specific case is also more surprising since the docs for #first(n : Int) say:

Returns an iterator that only returns the first n elements of the initial iterator.

But actually it returns the next n elements of the iterator.

@asterite
Copy link
Member Author

@felixbuenemann I agree that Ruby seems to be more intuitive. I have no idea how they implement that, though.

Also, try this in Ruby:

# with stdin having one \n two \n three \n four \n five
a = STDIN.each_line
p a.next # => "one"
p a.next # => "two"
p a.first(1) # => ["three"]
p a.first(2) # => ["four", "five"]

Now with an array:

a = %w(one two three four five).each
p a.next # => "one"
p a.next # => "two"
p a.first(1) # => ["one"]
p a.first(2) # => ["one", "two"]

Now do the same in Crystal (adding to_a after first(...)) and you get the same results in both cases.

I think Crystal is a bit more consistent here.

@felixbuenemann
Copy link
Contributor

Yeah, I agree Crystal is more consistent here, but Ruby is more convenient.

I think in Ruby it would be nicer if it threw an exception if an Enumerable method is called and the underlying Enumerator cannot be rewound. Currently it throws an Exception when calling #rewind explicitly, but somehow swallows the error for automatic rewinding…

@RX14
Copy link
Member

RX14 commented May 29, 2019

I think #first should just become #next(n) then. I don't think this functionality is worth implementing, but the name should change.

@asterite
Copy link
Member Author

Note that first in Ruby always returns from the beginning of the enumeration, unlike Crystal.

@felixbuenemann
Copy link
Contributor

Internally Ruby's #first(n) delegates to #take(n), which would be a less confusing name for Crystal.

I'm not sure that overloading #next is a good idea since it's also the name of the primary method used to implement iterators and that could lead to confusion.

@RX14
Copy link
Member

RX14 commented Jun 7, 2019

I agree, take is a better name here.

However, while trying to make a PR to rename it, I found that iterator inherits enumerable, which means it inherits the #first method from there.

Iterator then overrides enumerable to make #first return an iterator, which changes the return type of #first

Iterator really shouldn't be enumerable, the enumerable class makes assumptions that the enumberable #each always starts at the start, which isn't the case for iterator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants