Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Implement Ruby's Enumerator Class #6357

Open
felixbuenemann opened this issue Jul 9, 2018 · 71 comments
Open

[RFC] Implement Ruby's Enumerator Class #6357

felixbuenemann opened this issue Jul 9, 2018 · 71 comments

Comments

@felixbuenemann
Copy link
Contributor

I really like Ruby's Enumerator class and I think it would be great to have something similar in Crystal.

The Enumerator class can be used to implement generators.

Here's an example from the Ruby docs:

fib = Enumerator.new do |y|
  a = b = 1
  loop do
    y << a
    a, b = b, a + b
  end
end

p fib.take(10) # => [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

The y block parameter in the example is called the yielder and the iterator returned from new blocks until the yielder receives a value or the block ends.

Since I was porting some Ruby code that depended on Enumerator to Crystal, I wen't ahead and implemented my own generic Enumerator class using the Iterator module and Channel to implement the concurrent behavior:

class Enumerator(T)
  include Iterator(T)

  def initialize(&block : Yielder(T) -> Nil)
    @channel = Channel(T).new
    @block = block
    @block_called = false
    spawn_block
  end

  def next
    @channel.receive
  rescue Channel::ClosedError
    stop
  end

  def rewind
    if @block_called
      @channel = Channel(T).new
      @block_called = false
      spawn_block
    end
    self
  end

  private def spawn_block : Nil
    spawn do
      begin
        @block.call(Yielder(T).new(@channel))
      ensure
        @channel.close
        @block_called = true
      end
    end
  end

  private struct Yielder(T)
    def initialize(@channel : Channel(T))
    end

    def <<(data : T)
      @channel.send(data)
      self
    end
  end
end

The Fibonacci generator could now be implemented similar to the Ruby example:

fib = Enumerator(UInt32).new do |y|
  a = b = 1_u32
  loop do
    y << a
    a, b = b, a + b
  end
end

p fib.first(10).to_a # => [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

Please let me know if you think this is a useful addition to the Crystal Standard Library.

If you do, I would be happy to polish the code up into a PR along with documentation and specs.

I've also implemented Enumerator#peek locally, which allows to look at the next iterator value without moving it forward by buffering the value on peek, but left it out of the example for simplicity.

@asterite
Copy link
Member

asterite commented Jul 9, 2018

Yup, it's been discussed in the past:

https://groups.google.com/forum/#!searchin/crystal-lang/Enumerator|sort:date/crystal-lang/L08Hz8oXuIU/WqIDLxzKDgAJ

https://play.crystal-lang.org/#/r/aob

The main issue is that it's super slow. Like, 1000x times slower than Iterator. And we don't want to encourage slow code, much less in the standard library...

Well, at least that was the last time I benchmarked it, maybe it should be benchmarked again.

By the way, fibonacci can be currently implemented with Iterator, and it's super fast, no need for fibers nor channels:

a = b = 1
fib = Iterator.of do
  a.tap { a, b = b, a + b }
end

p fib.first(10).to_a

https://play.crystal-lang.org/#/r/4gz8

But, of course, more complex enumerators which yield in the middle of a method and continue from there can't be implemented with Iterator.of.

@bew
Copy link
Contributor

bew commented Jul 9, 2018

maybe similar to #4438 ?

@felixbuenemann fun fact: your implementation reminds me #4198 ;)

@felixbuenemann
Copy link
Contributor Author

@asterite Yeah, my actual use case is much more complicated. I have a nested while loop with XML::Reader that builds records from expanded sub nodes and yields them into the iterator.

It can certainly implemented with a simple yield without an iterator, but it makes it harder to do stuff like each_slice which you get for free with iterators.

In my case the performance of the enumerator shouldn't really matter, since most of the processing is in libxml2 anyways.

I'm not partial to the ruby specific implementation, but some form of a generator abstraction that doesn't have Iterators limitations would be great.

@bew Thanks for the link, I searched the existing issues for "enumerator" but not for "generator".

@asterite
Copy link
Member

asterite commented Jul 9, 2018

@felixbuenemann You can implement Iterator and use a stack. That's a way to remove recursion (for example to implement Iterator for a binary tree).

I just tried to implement Enumerator with just fibers, no channels, like in Ruby. It's just a bad implementation because I'm confused about where a fiber should resume, but this basic example works:

class Enumerator(T)
  include Iterator(T)

  struct Yielder(T)
    getter! enumerator

    def initialize(@enumerator : Enumerator(T))
    end

    def <<(value : T)
      @enumerator.value = value
    end
  end

  getter! fiber : Fiber
  getter! yielder : Yielder(T)

  def initialize(&@block : Yielder(T) ->)
    @current_fiber = Fiber.current
    @yielder = Yielder(T).new(self)
    @fiber = Fiber.new do
      @block.call(@yielder.not_nil!)
    end
  end

  @value : T?

  def value=(@value : T)
    @current_fiber.resume
  end

  def next
    # This is probably not correct
    @current_fiber = Fiber.current

    stop unless fiber.alive?

    fiber.resume

    fiber.alive? ? @value.as(T) : stop
  end
end

fib = Enumerator(Int32).new do |y|
  3.times do |i|
    y << i
  end
end

p fib.next
p fib.next
p fib.next

That is, using some sort of cooperative fibers, instead of the non-cooperative fibers we have now. But that @current_fiber.resume should probably be Ruby's Fiber.yield.

But still, doing a simple benchmark of that implementation using Fibers vs Iterator gives 0.37s for Enumerator vs. 0.005s for Iterator. It's a dramatic difference. The code is something like this:

e = Enumerator.new do |y|
  loop do
    y << 1
  end
end

time = Time.now
10_000_000.times do
  e.next
end
puts Time.now - time

But then in Ruby that same example takes 12 seconds... so maybe it's not that bad? :-)

If we can figure out how to implement it with fibers and context switching, without channels, it might be a good fit for the standard library, even if it's slower than a manually written Iterator (a hand-written Iterator can always be written if Enumerator becomes the bottleneck of your program).

@asterite
Copy link
Member

asterite commented Jul 9, 2018

Oh, and in the example above I defined Fiber#@alive and Fiber#alive? for it to work (just a boolean that's true when the fiber starts, and false when it ends). But for some reason once the fiber finishes, everything crashes with a segfault... probably because the fiber is not registered in the Scheduler.

@asterite
Copy link
Member

asterite commented Jul 9, 2018

(using the implementation in the first comment in this issue, using channels, the benchmark takes 0.6 seconds)

@felixbuenemann
Copy link
Contributor Author

Hey, I was also looking for a way to find out if a fiber was still running, because at first I tried to save the fiber from spawn to an instance variable and then restart the fiber if needed during rewind.

So I think adding such a flag to Fiber would be generally useful.

I'll see if I can get your improved version running with my code.

Can you post the patch you did to the Fiber class?

@asterite
Copy link
Member

asterite commented Jul 9, 2018

It's just setting @alive = true in Fiber, then in run ensure it's set to false.

@felixbuenemann
Copy link
Contributor Author

felixbuenemann commented Jul 11, 2018

@asterite I tried you implementation, but it just hangs.

Here's the test suite I used (quick conversion from minitest.cr to crystal spec):

require "spec"
require "../src/enumerator"

describe Enumerator do

  describe "#initialize" do
    it "returns an instance of Enumerator" do
      instance = Enumerator(String).new {}
      instance.should be_a(Enumerator(String))
    end
  end

  describe "#next" do
    it "returns the values in the order they were yielded" do
      enumerator = Enumerator(String?).new do |yielder|
        yielder << "hello"
        yielder << nil
        yielder << "world"
      end

      enumerator.next.should eq("hello")
      enumerator.next.should be_nil
      enumerator.next.should eq("world")
      enumerator.next.should be_a(Iterator::Stop)
    end
  end

  describe "#peek" do
    it "peeks at the next value without affecting next" do
      next # Not implemented
      enumerator = Enumerator(String?).new do |yielder|
        yielder << "hello"
        yielder << "world"
      end

      enumerator.peek.should eq("hello")
      enumerator.peek.should eq("hello")
      enumerator.next.should eq("hello")
      enumerator.peek.should eq("world")
      enumerator.peek.should eq("world")
      enumerator.next.should eq("world")
      enumerator.peek.should be_a(Iterator::Stop)
      enumerator.next.should be_a(Iterator::Stop)
      enumerator.rewind
      enumerator.peek.should eq("hello")
    end
  end

  describe "#each" do
    it "iterates the yielded values" do
      enumerator = Enumerator(String).new do |yielder|
        yielder << "hello"
        yielder << "world"
      end

      values = [] of String
      enumerator.each { |value| values << value }
      values.should eq(["hello", "world"])
    end
  end

  describe "Enumerable" do
    it "responds to Enumerable methods" do
      enumerator = Enumerator(String).new do |yielder|
        yielder << "hello"
        yielder << "world"
      end

      enumerator.to_a.should eq(["hello", "world"])
    end
  end

  describe "#rewind" do
    it "rewinds the iterator" do
      next # Not implemented
      enumerator = Enumerator(String).new do |yielder|
        yielder << "hello"
        yielder << "world"
      end
      # We rewind first to check that block isn't called multiple
      # times if it was not enumerated before calling #rewind.
      enumerator.rewind.should be_a(Enumerator(String))

      enumerator.to_a.should eq(["hello", "world"])

      # Must be rewound, should be empty.
      enumerator.to_a.should eq([] of String)
      enumerator.rewind.should be_a(Enumerator(String))
      enumerator.to_a.should eq(["hello", "world"])

      # And make sure rewind properly resets state.
      enumerator.to_a.should eq([] of String)
      enumerator.rewind.should be_a(Enumerator(String))
      enumerator.rewind.should be_a(Enumerator(String))
      enumerator.to_a.should eq(["hello", "world"])
    end
  end

  describe "Nil type support" do
    it "works when the type is Nil" do
      enumerator = Enumerator(Nil).new do |yielder|
        yielder << nil
        yielder << nil
      end

      enumerator.to_a.should eq([nil, nil])
    end
  end

  describe "yielder chaining" do
    it "allows to chain yielder calls" do
      enumerator = Enumerator(String).new do |yielder|
        yielder << "hello" << "world"
      end

      enumerator.to_a.should eq(["hello", "world"])
    end
  end

  describe "Fibonacci enumerator" do
    it "generates the first 10 numbers in the Fibonacci sequence" do
      # Example adapted from Ruby Enumerator docs.
      fib = Enumerator(UInt32).new do |y|
        a = b = 1_u32
        loop do
          y << a
          a, b = b, a + b
        end
      end

      fib.first(10).to_a.should eq([1, 1, 2, 3, 5, 8, 13, 21, 34, 55])
    end
  end

end

@asterite
Copy link
Member

I know it hangs. I'm not sure cooperative and non-cooperative fibers can coexist. I believe maybe @waj will know how to implement this, but I don't think he'll have time.

@felixbuenemann
Copy link
Contributor Author

felixbuenemann commented Jul 12, 2018

@asterite I got it working!

The problem with your implementation was that it didin't resume the @current_fiber after the fiber had finished.

I came up with the following implementation based on you code, but using a Deque for storage, so that it works with nil values. I also added #rewind and #peek and replaced the Fiber#alive? checks with a simple @done flag.

class Enumerator(T)
  include Iterator(T)

  def initialize(&@block : Yielder(T) ->)
    @current_fiber = Fiber.current
    @stack = Deque(T).new(1)
    @yielder = Yielder(T).new(self)
    run
  end

  def next
    fetch_next
    @stack.shift { stop }
  end

  def rewind
    run if @done
    self
  end

  def peek
    fetch_next
    if @stack.empty?
      stop
    else
      @stack[0]
    end
  end

  protected def <<(value : T)
    @stack.push(value)
    @current_fiber.resume
  end

  private def fetch_next : Nil
    @current_fiber = Fiber.current
    @done || @fiber.not_nil!.resume if @stack.empty?
  end

  private def run
    @stack.clear
    @done = false
    @fiber = Fiber.new do
      begin
        @block.call(@yielder.not_nil!)
      ensure
        @done = true
        @current_fiber.resume
      end
    end
  end

  private struct Yielder(T)
    def initialize(@enumerator : Enumerator(T))
    end

    def <<(value : T)
      @enumerator << value
      self
    end
  end
end

Results:

Iterator 333.58M (   3.0ns) (± 3.15%)  0 B/op        fastest
   Fiber  31.65M (  31.6ns) (± 4.48%)  0 B/op  10.54× slower
 Channel  19.21M ( 52.07ns) (± 5.16%)  0 B/op  17.37× slower

Let me know what you think!

@asterite
Copy link
Member

Awesome! If it's just 10 times slower than an Iterator I think it's a really good candidate for the std.

By the way, this doesn't work as expected:

e = Enumerator(Int32).new do |y|
  y << 1
end

spawn do
  p e.next
end

Fiber.yield

The problem is that the current fiber is saved in initialized, but when calling next we are in another fiber. The solution is simple: set @current_fiber = Fiber.current upon entering next. I think that will always work, but I'm not sure.

I'd say let's go for this, if it's solid. Sometimes writing an Iterator is very boring, and the 10x slowness of some parts of the code will probably not be the bottleneck of a program, and this boosts productivity a lot.

@felixbuenemann
Copy link
Contributor Author

Thanks for the explanation. I was wondering why you were setting @current_fiber inside #next.

I think I'll have some time this weekend to package it all up into a PR and add some polish to the specs.

@felixbuenemann
Copy link
Contributor Author

I've updated the code above to set @current_fiber = Fiber.current inside #fetch_next so your example now works with both #peek and #next.

@asterite
Copy link
Member

By the way, maybe a Deque is a bit too overkill for this. We can use a union type for @value, or maybe even uninitialized if we really need too. I'm also not sure what's the use of peek, or why we'd like to add it, given that Iterator doesn't have a peek method. We could always have a PeekIterator(T) type that peeks a value and stores it.

Then, rewind doesn't seem to work completely well. It works only if you already consumed the entire iterator. For example this is how it works in Ruby:

e = Enumerator.new do |y|
  y << 1
  y << 2
  y << 3
end

p e.next # => 1
p e.next # => 2

e.rewind

p e.next # => 1
p e.next # => 2

But I'm not sure how we would implement it. So maybe for now rewind can raise until we figure it out.

@asterite
Copy link
Member

There's another problem. This code doesn't free the memory:

loop do
  e1 = Enumerator(Int32).new do |y|
    3.times do |i|
      y << i
    end
  end
  e1.next
end

Fibers are created and maintained in a stack. I think these particular fibers should not be placed in that stack, so when they are not referenced anymore they are GCed. The source code in fiber.cr should probably be modified for this...

@felixbuenemann
Copy link
Contributor Author

Peek is useful, when you don't yet know what's the next value is and you want to do different stuff based on that. I agree that it would make sense to have this for all Iterators.

Regarding the rewind edge case: I think we can handle it by adding another flag that signals to the yielder that it should abort execution.

I didn't know that you can check if an instance variable is initialized. Can you post an example?

@RX14
Copy link
Contributor

RX14 commented Jul 12, 2018

The fiber linked list is required for the GC, you can't just remove a fiber from it: https://github.com/crystal-lang/crystal/blob/master/src/fiber.cr#L325

You need to do this with def finalize.

@felixbuenemann
Copy link
Contributor Author

felixbuenemann commented Jul 12, 2018

The leak described by @asterite is actually the same problem as #rewind when the block isn't @done.

I plan to fix it by raising a YielderStopped exception in Yielder#<< which is captured in the Fiber.new block. The method that stops the yielder could then also be called from Enumerator#finalize.

The user could still do something stupid like rescuing from that exception inside the yielder block, but we could at least raise an exception in that case, since it can be detected by checking if the yielder transitioned to done after being told to stop by the enumerator.

@ysbaddaden
Copy link
Contributor

ysbaddaden commented Jul 12, 2018

@RX14 But the GC will find the fiber reference in that linked list, thus never try to collect it, and will never invoke finalize. A fiber that may never finish will never be collected.

We can't just keep a list of active stack roots, instead of fibers, because the GC could then collect all fibers (unless I'm wrong?).

Maybe we need 2 lists: one for "attached" fibers (default), and one to keep the stack roots of "detached" fibers? Once the fiber is no longer referenced, the GC collects it and invokes its finalizer that should return the stack root back to the pool.

@asterite
Copy link
Member

@ysbaddaden I think he meant a finalize on Enumerator: once the enumerator isn't referenced anymore, it will delete the Fiber and remove it from that linked list.

@RX14
Copy link
Contributor

RX14 commented Jul 12, 2018

You just need to make sure the fiber doesn't have any references to the enumerator, which is... difficult. You'll probably need to employ WeakRef(T)

@asterite
Copy link
Member

Why would the fiber have a reference to the Enumerator?

@felixbuenemann
Copy link
Contributor Author

Probably because the @yielder is passed to the @block that the @fiber wraps and the yielder has a reference to the @enumerator.

@asterite
Copy link
Member

I'm also not sure what happens if you call an IO method inside Enumerator. That would make the fiber be rescheduled, not sure when and how will that resume. I did a few tests and it seems to work, but mixing cooperative and non-cooperative fibers will be tough.

@RX14
Copy link
Contributor

RX14 commented Jul 12, 2018

@asterite but all fibers are by definition cooperative because there's no method to preempt them. I'm not sure what you mean. I don't see much special about the way this fiber is used or scheduled.

@asterite
Copy link
Member

@RX14 In Ruby, Fiber#resume stars/resumes a fiber, and Fiber.yield yields control back to the fiber that resumed that fiber. That's known as cooperative fibers. In Crystal it's different because there's no Fiber#resume. Well, there is, but it's for implementation purposes, you are not supposed to use it like that. And Fiber.yield just tries to find a fiber that can run, it's not resuming the one that called resume.

Anyway, Enumerator has indeed a reference to the fiber, which in turn has a reference to the enumerator through a closure, so finalize is never invoked. I'm not sure how it can be implemented, because the reference from fiber to enumerator happens through a closure, not something we can change to be WeakRef. Or maybe it needs to be a special Fiber, I don't know.

@RX14
Copy link
Contributor

RX14 commented Jul 12, 2018

the fiber has the reference to the yielder through the closure, the weakref should be from the yielder to the enumerator, surely?

And to me, anything that's not preemptible is cooperative. Making crystal's fibers cooperative, just not the same implementation as ruby. It's just a terminology thing though, I get what you mean.

@asterite
Copy link
Member

@RX14 The fiber is in the linked list, so it can't be GCed. Then the fiber has a reference to the Enumerator through the closure, which can't be turned into a WeakRef. So I can't see how to implement this.

I guess it would be a nice to have, but it seems it's kind of impossible to implement this right now...

@RX14
Copy link
Contributor

RX14 commented Jul 12, 2018

@asterite why does the fiber have a reference to the enumerator and not just the yielder?

@felixbuenemann
Copy link
Contributor Author

To make discussion of the implementation easier, I pushed a first version in #6385.

Let me know what you think!

I would also be grateful for a short example that can only be implemented with Enumerator.

I tried to describe the use cases for both Iterator and Enumerator as best as I can, but the current Fibonacci generator example is not so great, since it can also be implemented with Iterator.

@asterite
Copy link
Member

@felixbuenemann There's no such example. Anything can be implemented with Iterator, and without needing channels at all. That's what C# does with IEnumerable and automatic rewriting of yield. So Enumerator is just a convenience when you don't want to write that Iterator. So... I'm not sure about the documentation you used for Enumerator. Channels, blocking, etc., shouldn't be mentioned at all. It should just mentioned that it uses a fiber under the hood, and context switching, which is the reason of the slowness.

@straight-shoota
Copy link
Member

straight-shoota commented Jul 15, 2018

@asterite With your patch, this ensure block won't execute:

loop do
  e = Enumerator(Int32).new do |y|
    y << 1
  ensure
    puts "ensure"
  end
  e.next
  sleep 0.001
end

Such ensures are often used for cleaning up open resource handled, such as closing files, network connections etc.

The enumerating fiber shouldn't just be terminated without unwinding it's stack. It should work similarly like an exception was raised in Yielder#<< if the fiber is cancelled.

@asterite
Copy link
Member

@straight-shoota I just tried it in Ruby, and the ensure is never called either. I don't think what you want can be implemented (otherwise, why Ruby doesn't implement it?)

For reference:

loop do
  e = Enumerator.new do |y|
    begin
      y << 1
    ensure
      puts "BYE"
    end
  end
  e.next
end

Or:

loop do
  e = Enumerator.new do |y|
    begin
      y << 1
    ensure
      puts "BYE"
    end
  end
end

@asterite
Copy link
Member

Well, someone could create an issue in Ruby and see what they answer :-)

@straight-shoota
Copy link
Member

There's at least already an issue for Crystal: #3561 =)

@asterite
Copy link
Member

asterite commented Jul 15, 2018

Not gonna happen. There's no way to do it in Go, so I doubt we'll manage to do it.

@felixbuenemann
Copy link
Contributor Author

After all the discussion, I was unsure how useful an addition to the Crystal language this would be.

Since most simple examples are not useful for that I wen't ahead and converted the Enumerator code in the ETL app I'm currently working on to use Iterator instead.

I must say looking at the code side-by-side I much prefer the Enumerator version – it is much easier to comprehend.

It also took me about 4.5 hours to convert the code until the tests were green and another 2.5 hours to reimplement the early abort logic in the underlying XML pull parser to get the Iterator implementation to the same performance level as the Enumerator one (these took maybe 15 minutes to implement initially in the Enumerator version).

Here's an example of one of the Enumerators I replaced. The other one not shown here is powering the commands method in the example, but would be much more verbose to show, since I had to split it up from one into several methods.

Enumerator Version:

def documents(languages = ["de"])
  Enumerator(Document).new do |documents|
    attributes = {} of String => Attribute
    classification_groups = {} of String => ClassificationGroup
    commands(["Attribute", "ClassificationGroup", "Product"]).each do |reader|
      case reader.name
      when "Attribute"
        attribute = AttributeParser.new(reader, languages)
        attributes[attribute.id] = attribute.parse
      when "ClassificationGroup"
        classification_group = ClassificationGroupParser.new(reader, languages)
        classification_groups[classification_group.id] = classification_group.parse
      when "Product"
        product = ProductParser.new(reader, languages).parse
        languages.each do |language|
          document = DocumentBuilder.new(
            attributes,
            classification_groups,
            language,
            product
          ).build
          documents << document
        end
      end
    end
  end
end

Iterator Version:

def documents(languages = ["de"])
  return Iterator.of(Iterator.stop) if languages.empty?
  attributes = {} of String => Attribute
  classification_groups = {} of String => ClassificationGroup
  _commands = commands(["Attribute", "ClassificationGroup", "Product"])
  _commands.each do |reader|
    case reader.name
    when "Attribute"
      attribute = AttributeParser.new(reader, languages)
      attributes[attribute.id] = attribute.parse
    when "ClassificationGroup"
      classification_group = ClassificationGroupParser.new(reader, languages)
      classification_groups[classification_group.id] = classification_group.parse
    when "Product"
      product = ProductParser.new(reader, languages).parse
      products = Iterator.of do
        reader = _commands.next
        if reader.is_a?(Iterator::Stop)
          Iterator.stop
        else
          ProductParser.new(reader.as(XML::Reader), languages).parse
        end
      end
      _languages = languages.each
      return Iterator.of do
        language = _languages.next
        if language.is_a?(Iterator::Stop)
          language = _languages.rewind.next
          product = products.next
        end
        if product.is_a?(Iterator::Stop)
          Iterator.stop
        else
          DocumentBuilder.new(
            attributes,
            classification_groups,
            language.as(String),
            product.as(Product)
          ).build
        end
      end
    end
  end
  return Iterator.of(Iterator.stop)
end

As can be seen the iterator version involves several internal Iterator::SingletonProc instances and hard to follow logic to switch the different parsing phases (Attributes / ClassificationGroups / Products). It also might be interesting that the underlying XML::Reader parses through several gigabytes of XML that is uncompressed on the fly, so it cannot rewind the IO and start over.

So overall, I think it is very worthwhile to have this in the standard library, if only for the productivity gains (and me keeping my sanity ;-).

@rdp
Copy link
Contributor

rdp commented Feb 1, 2022

You could make your own Enumerator class perhaps? Make it a library?

@eregon
Copy link

eregon commented Nov 22, 2022

A small note that unless one needs next/peek, Enumerator doesn't use or need a Fiber (and it still has all the Enumerable methods of course).

@straight-shoota
Copy link
Member

@eregon Are you sure about that?
How would this work without a separate fiber to keep track of the iteration cursor between both calls to Enumerator#take?:

top_ten = enumerator.take(10)
top_twenty = top_ten + enumerator.take(10)

In essence, all Enumerator methods are constructed on top of #next. #take(n) is basically n.times { next }.

Only if all you want to do is a single (partial) iteration, there is no need to keep track of iteration status and thus no need for a fiber.

@eregon
Copy link

eregon commented Nov 22, 2022

Yes I'm sure:

enumerator = Enumerator.new do |y|
  i = 0
  loop {
    y << i += 1
  }
end
p enumerator.take(10) # => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
p enumerator.take(10) # => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

So the "suspend" and the need for a Fiber is only for next/peek (and next_values/peek_values) which are stateful, all other Enumerator/Enumerable methods are not stateful and always "restart from the start" of the Enumerator.
As a note #zip uses Enumerator#next internally when given non-array arguments.

@asterite
Copy link
Member

I think the difference is that Enumerator in Ruby is not lazy. If you write take(10) it consumes the entire thing.

In Crystal we'd like take(10) to return another "Enumerator" (it would be an Iterator) that will only consume the data once you call to_a, etc.

@eregon
Copy link

eregon commented Nov 22, 2022

Enumerator::Lazy also doesn't create Fibers:

p Fiber.current # => #<Fiber:0x00007fb8fc423328 (resumed)>
enumerator = Enumerator.new do |y|
  i = 0
  loop {
    p Fiber.current # => #<Fiber:0x00007fb8fc423328 (resumed)>
    y << i += 1
  }
end.lazy
p enumerator.take(3) # => #<Enumerator::Lazy: #<Enumerator::Lazy: #<Enumerator: #<Enumerator::Generator:0x00007fb8fc3ecee0>:each>>:take(3)>

p enumerator.take(3).to_a # => [1, 2, 3]

@asterite
Copy link
Member

@eregon So how is this implemented in Ruby if it's not using fibers under the hood, or having the VM suspend the execution at some places?

@eregon
Copy link

eregon commented Nov 22, 2022

Basically it just remembers what it has to do in the Enumerator::Lazy object(s) and then finally eager evaluate and possibly using (non-local) break/return.
You can see it at https://github.com/oracle/truffleruby/blob/ddaa0759a460fd2a23655c3f5fa5d03bdcd05207/src/main/ruby/truffleruby/core/enumerator.rb#L282

@asterite
Copy link
Member

@eregon Would you mind trying to implement this for Crystal? I'm almost sure it's impossible.

This is how far I got:

class Enumerator(T)
  class Yielder(T)
    def <<(value : T)
    end
  end

  def initialize(&@block : Yielder(T) ->)
  end

  def each
    yielder = Yielder(T).new
    # @block.call(yielder)
  end
end

e = Enumerator(Int32).new do |y|
  a = b = 1
  loop do
    y << a
    a, b = b, a + b
  end
end

e.each do |x|
  puts x
  break if x > 10
end

How do you implement Enumerator#each? It somehow needs to call the block given to the initialize method, and when a value is appended to the yielder it somehow needs to know to yield something back to the caller.

@eregon
Copy link

eregon commented Nov 22, 2022

This works, I wrote it in Ruby because it's easier for me.
It's basically/exactly what you had, except the Yielder saves the block given to each and calls it on <<.

class MyEnumerator
  class Yielder
    def initialize(&block)
      @block = block
    end
    
    def << value
      @block.call(value)
    end
  end

  def initialize(&block)
    @block = block
  end

  def each(&block)
    yielder = Yielder.new(&block)
    @block.call(yielder)
  end
end

e = MyEnumerator.new do |y|
  a = b = 1
  loop do
    y << a
    a, b = b, a + b
  end
end

e.each do |x|
  p x
  break if x > 10
end

@asterite
Copy link
Member

Nice, thanks!

This works:

class Enumerator(T)
  include Enumerable(T)

  class Yielder(T)
    def initialize(@block : T ->)
    end

    def <<(value : T)
      @block.call(value)
    end
  end

  def initialize(&@block : Yielder(T) ->)
  end

  def each(&block : T ->)
    yielder = Yielder(T).new(block)
    @block.call(yielder)
  end
end

e = Enumerator(Int32).new do |y|
  a = b = 1
  10.times do
    y << a
    a, b = b, a + b
  end
end

e.each do |x|
  p x
end

I changed the loop to 10.times because... break doesn't exist in Crystal inside captured blocks. So that functionality will be missing.

But then we can't make Enumerator be Enumerable because that relies on yield, not on captured blocks.

So in the end Enumerator done this way isn't very useful (I think)... we really need to be able to call next on Enumerator and have each do yield.

@straight-shoota
Copy link
Member

straight-shoota commented Nov 22, 2022

Just to clarify: In Crystal there is a clear separation between captured blocks (which becomes procs and can be stored) and yielded blocks (which are inlined into the yielding method and don't exist as a separate item at runtime).

@asterite
Copy link
Member

That said, when we were originally thinking about Crystal we really wanted to be able to break from a captured block. The way that would work is that procs always return two values: the actual value and a second value that determines whether you want to break, next or return. Then the method calling the proc could check that second value, etc.

But we never got to do it. It's probably a lot of work.

@eregon
Copy link

eregon commented Nov 22, 2022

I changed the loop to 10.times because... break doesn't exist in Crystal inside captured blocks. So that functionality will be missing.

Does non-local return exist in Crystal? That would be a workaround, just wrapping the e.each do |x| in a method and use return to escape it.

Regarding non-local break/return they are typically implemented with an exception, unwind or a similar mechanism (and some ID to know how far to unwind), and that doesn't need to check the return value on every return.

But then we can't make Enumerator be Enumerable because that relies on yield, not on captured blocks.

I don't know Crystal, but Ruby Enumerable only depends on each, so the code above should work without problems with Ruby Enumerable's semantics. There are some additional complications for a full Enumerator implementation, that's captured in the TruffleRuby code (notably for Enumerator#size and there are some additional indirections).
Maybe Crystal could have a way to capture a "yielded" block into a Proc object?

@eregon
Copy link

eregon commented Nov 22, 2022

have each do yield.

Would this work maybe?

class Enumerator(T)
  def each(&block : T ->)
    yielder = Yielder(T).new(Proc.new { |x| yield x })
    @block.call(yielder)
  end
end

@asterite
Copy link
Member

Does non-local return exist in Crystal?

No. Or at least not yet.

Regarding non-local break/return they are typically implemented with an exception, unwind or a similar mechanism (and some ID to know how far to unwind), and that doesn't need to check the return value on every return.

We could eventually do that in Crystal! See #6357 (comment) . But I think exceptions are slow so I'm not sure I'd like to introduce slow features.

Would this work maybe?

No because you can't yield from a captured block.

To be honest, I'm not exactly happy about the yield vs. "captured block" semantic we have in Crystal, but then Crystal isn't interpreted and doing those things in a compiled-to-native-code language is very hard, and maybe also slow.

@straight-shoota
Copy link
Member

Rust is adding gen fn which seems to serve a similar purpose. https://github.com/oli-obk/rfcs/blob/iter_fn/text/0000-gen-fn.md

rust-lang/rfcs#3513

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants