-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
encoding/csv: add the ability to get the line number of a record #44221
Comments
Your proposal
You seem to be using Your straw-person implementation reflects the confusion.
The method documentation should be precise. Add indexing information from the Revised method
|
I actually suggested renaming the |
As @mvdan said, my initial version of the proposal used |
Maybe it should be // Pos returns the position of the most recently read record. It seems like if you want one you might want both? |
This proposal has been added to the active column of the proposals project |
I've never seen any requirement for finding the last line in particular. What would be more useful however would be the ability to find the line and column for any given field on the current line. I'm not sure: is the CSV encoding reversible? If so, maybe it would be possible to determine that given the current line number and the current field values, which could provide the last line number as a natural consequence. |
Even if you don't need the endLine normally, having Pos return both makes clear that they might be different, which could be useful. |
How about something like this?
Then the equivalent of my original |
@rogpeppe, can you confirm that the API you suggested in the previous comment is implementable? Does anyone object to the FieldPos API? |
I'm pretty sure it's implementable. My main concern would be whether it can be implemented while adding negligible runtime overhead. |
It occurred to me at some point over the past week that the Reader just needs a []int that it reuses across each decoding. That can't possibly be much overhead. |
Doesn't it need to be a slice of (line, column) pairs? Still not much overhead, but it might be non-negligible for some. |
Based on the discussion above, this proposal seems like a likely accept. |
One question here: what is the correct unit to use for the column: bytes or runes? Currently the error counts columns in runes, but as a counter-example, the Go compiler's positions count columns in bytes. Counting in bytes is cheaper and easier, and I'd be inclined to do that except for the currently documented and exported semantics of the Would it be too bad if I was writing some code that calculates rune offsets lazily when |
Runes is pointless; please do bytes. :-) |
What do you think about changing the existing behaviour of the column in the |
I did a quick scan of my Go corpus from spring 2020. I found zero uses of the Column field, so I think it is safe to change.
If we made the change I think it would be easy to approach the biogo and dolt folks to get their implementations fixed. Here's the actual instances: Correctly computing rune-based Column (would be affected)
Has tests checking Column but they're ASCII only, and the implementation is wrapping encoding/csv (would be unaffected):
Hard-coded Column: 1 (unaffected):
Thinks Column is CSV field index (unaffected):
|
Change https://golang.org/cl/291290 mentions this issue: |
OK, I've updated https://go-review.googlesource.com/c/go/+/291290 to implement the above discussed semantics, and it also changes the semantics of Unfortunately the performance difference is noticeable (I think that's probably because we're appending 128 bits, not 64 bits to the |
No change in consensus, so accepted. 🎉 |
[This comment](golang/go#44221 (comment)) related to columns in the encoding/csv package made me realise that we should do byte offset, not rune offset.
I am struggling to avoid a noticeable performance overhead from making this change. |
Roger and I briefly discussed this offline. Rebased on the latest master, I could only reproduce a smaller perf loss, mainly for large fields:
The single commit includes a refactor moving some code to parseField, so it's hard to tell where the perf loss is, or if it's to blame on the extra func refactor itself. I think it would be a good idea to split the work into two commits, where the first commit does the extra func refactor with no behavioral changes, to get a better idea. That aside, a 2-5% perf hit on some edge cases seems reasonable to me. Though I'm still curious why large fields in particular get slower, since that intuitively doesn't make sense - keeping track of positions shouldn't be slower for large fields. |
This slowdown seems fine. Perhaps we will find more optimizations later. |
Change https://golang.org/cl/314774 mentions this issue: |
Change https://golang.org/cl/323349 mentions this issue: |
For #44221 For #44513 Change-Id: I2d2d1c55255f4411c11fd51f0f3ae726cbf4d136 Reviewed-on: https://go-review.googlesource.com/c/go/+/323349 Trust: Ian Lance Taylor <[email protected]> Reviewed-by: roger peppe <[email protected]>
Issue #26679 was closed as being a niche feature, but it's actually quite common to need this information.
For example, after a record has been read, the syntax of its fields might be checked, and could be invalid. In this case, it's useful to be able to produce a good error message that points, at least roughly, to where the problem is. When CSV files can be millions of lines long, this is very important.
The workaround proposed there is awkward to implement (it's easy to get an
io.Reader
implementation wrong) and doesn't work currently anyway. The only other current alternative is to wrap the underlying reader with a reader that counts lines and only reads a line at a time, which is again awkward to implement, and introduces a second unecessary layer of buffering to slow things down.As
csv.Reader
already keeps track of line numbers, providing that information in the API seems natural.We propose a new method:
There's a straw-man implementation here.
Thanks to @mvdan for consultation on this issue.
The text was updated successfully, but these errors were encountered: