Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add yielding line number to {String,IO}#each_line #7275

Conversation

straight-shoota
Copy link
Member

@straight-shoota straight-shoota commented Jan 5, 2019

When iterating through lines you often need a line number. This PR adds them as a second (optional) argument to the block provided to String#each_line, IO#each_line and File.each_line.
In the updated API docs I removed unrelated features upcase and reverse from the examples.

Alternatively, it could be a separate method called #each_line_with_number but I think the proposed option is better.

A second commit in this PR refactors related specs in io_spec.cr which I noticed could be simplified while adding the specs for yielding a line number.

src/io.cr Outdated Show resolved Hide resolved
src/io.cr Show resolved Hide resolved
src/string.cr Outdated Show resolved Hide resolved
@straight-shoota straight-shoota force-pushed the jm/feature/each-line-lino branch from c2c1a53 to 76e77ef Compare January 5, 2019 15:15
src/io.cr Show resolved Hide resolved
src/io.cr Outdated Show resolved Hide resolved
src/io.cr Outdated Show resolved Hide resolved
@straight-shoota straight-shoota force-pushed the jm/feature/each-line-lino branch from 76e77ef to eb6aa57 Compare January 5, 2019 16:45
# haiku = "the first cold shower
# even the monkey seems to want
# a little coat of straw"
# haiku.each_line do |stanza, lino|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# haiku.each_line do |stanza, lino|
# haiku.each_line do |stanza, line_number|

There's also still some lino left in the specs and in other examples.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, conventially this shortcut is written as lineno which I'd suggest using instead of lino.

Copy link
Contributor

@wooster0 wooster0 Jan 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index is also an option. Let's please use fully-written names.

@ysbaddaden
Copy link
Contributor

ysbaddaden commented Jan 5, 2019

What happens if you have a file with more than Int32::MAX lines? Or connect a TCP socket to a server that will eventually stream more than this number of lines?

I guess overflow exception, whatever if we actually use the line_number variable.

@Sija
Copy link
Contributor

Sija commented Jan 5, 2019

Having unified Integer type would be a bliss...

@asterite
Copy link
Member

asterite commented Jan 5, 2019

I played a bit with implementing a unified Integer type, it's super slow, or at least like 1000 times slower than primitives. Maybe it could be further optimized, I don't know, but I think it's very unlikely we'll have something like that unless someone starts to really play with the idea and finds a way to make it fast (the gist is a type that holds either an int or a big int under the hood)

@RX14
Copy link
Contributor

RX14 commented Jan 5, 2019

you can always use offset: 1_i64 to stop any overflows.

@ysbaddaden
Copy link
Contributor

@RX14 It makes it less likely, but can still happen, if I create a connection and write messages for some weeks. I can specify a u128, but then io.read_line(offset: 0_u128) becomes more complex/confusing than while line = io.gets; end which is simpler.

I'm not saying it's a bad idea to pass the iteration line number, but... aren't we repeating the loop yields the iteration count idea (not bad) but that we eventually dropped because of this very reason (overflow)?

@straight-shoota
Copy link
Member Author

We could also make this a separate method #each_line_with_number.

Another option is to simply count on the offending counter being optimized out by LLVM when the line number block arg is not used 😏

@asterite
Copy link
Member

asterite commented Jan 6, 2019

I don't think overflowing is a problem. I'd probably use the line number when iterating a file's lines, and file lines are super unlikely to overflow.

That said, iterating and using an extra local variable for this not-that-common use case is fine too. Otherwise, I'd name it each_line_with_index like we have in other places.

@straight-shoota
Copy link
Member Author

straight-shoota commented Jan 6, 2019

Yes, when using files you're probably safe, but @ysbaddaden's example of a long running server connection doesn't sound too far-fetched. And it's definitely a use case to count the number of received lines.

@RX14
Copy link
Contributor

RX14 commented Jan 6, 2019

@ysbaddaden if you managed to overflow a 64-bit signed integer of lines, you've received more than 2^63 bytes, or 9.22 exabytes.

@bcardiff
Copy link
Member

bcardiff commented Jan 7, 2019

Adding #each_line_with_number sounds good enough for me.

@straight-shoota straight-shoota force-pushed the jm/feature/each-line-lino branch from eb6aa57 to caa0466 Compare January 7, 2019 16:48
@straight-shoota
Copy link
Member Author

Separated into each_line_with_number

@RX14
Copy link
Contributor

RX14 commented Jan 7, 2019

Then I don't see the point. Is this really used commonly enough to justify a new method? At least each_with_index, however stupid I think it is, is commonly used.

@straight-shoota
Copy link
Member Author

I know I have used this quite a few times, that's why I'd welcome it as a helpful addition.
There is one use case in stdlib, #7252 shows another one.

If that's enough to justify adding another method (or actually three), I don't know.

This method accompanies String#each_line, IO#each_line and
File.each_line and yields a second argument to the block containing
the line number.
It is implemented as a separate method instead of an optional block
argument to each_line to avoid potential (although unlikely) overflow
when reading large amounts of lines from an IO.
@straight-shoota straight-shoota force-pushed the jm/feature/each-line-lino branch from caa0466 to c415fc6 Compare February 12, 2019 16:11
@straight-shoota
Copy link
Member Author

I have extracted the uncontested parts of this PR to #7419.

@ysbaddaden
Copy link
Contributor

No need to add another method.

I wanted to emphasize that maybe it's not a good idea to introduce a counter in a loop function, which we did once and eventually reverted.

If we expect #each_line to be mostly for files, then it's fine to have a counter, even an Int32 should be fine. I just wish it wouldn't raise on index overflow for other cases, especially if they don't use the index.

@j8r
Copy link
Contributor

j8r commented Feb 13, 2019

Is it that complicate to create our own counter? This extends the API to do a simple thing, I don't see the need.

Having the control of the counter, one can decide to be Int32 or even UInt64 to reduce the chance of overflow, depending of the use case.

@sudo-nice
Copy link

This works already:

File.open("some_file").each_line.with_index(1) do |line, idx|
  puts "#{idx}: #{line}"
end

@straight-shoota
Copy link
Member Author

@sudo-nice Using an iterator is less efficient than #each_line with a block.

@ysbaddaden I agree that adding a separate method is not a great idea. Either we integrate line numbers into #each_line or don't.

How about using a wrapping counter? This way it won't raise but simply overflow. I figure this would be fine because when using line numbers makes any sense, they're gonna be far smaller than Int32::MAX. But this leaves the option to use #each_line without line numbers for infinite lines and it won't raise.

@ysbaddaden
Copy link
Contributor

A wrapping on overflow counter would kill the whole point of having integer overflow checks in the first place. IMO the iterator is the best solution: no need for yet-another-method, no need for always having a counter.

@ysbaddaden ysbaddaden closed this Feb 13, 2019
@ysbaddaden ysbaddaden reopened this Feb 13, 2019
@asterite
Copy link
Member

We have many variants of each, each_with_index in the std, why not have each_line and each_line_with_index then? The "problem" is that the index starts at zero, but it's just about doing each_line_with_index(1) and that's it. Then it's consistent with the rest of the std.

@jhass
Copy link
Member

jhass commented Apr 19, 2020

I have to agree here that the usecases are too few to really warrant the extra API. Using the iterator or a closured var is simple enough. Also there hasn't been any interest in this anymore for over a year, so the pain can't be big. I vote to close this :)

@straight-shoota straight-shoota deleted the jm/feature/each-line-lino branch April 19, 2020 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants