read_csv fails in readr 2.0.1 if a comment contains an un-terminated quote #1307

Closed
a-hurst opened this issue Sep 23, 2021 · 3 comments
Labels
bug an unexpected problem or unintended behavior

a-hurst commented Sep 23, 2021

Hi there,

I think I've run into a pretty weird edge case bug: after updating to readr 2.0.1, I can no longer import files where the header comments contain an un-terminated quote.

For example, running read_csv(filename, comment = "#") on the following example file returns 0 rows and 0 columns:

# This is a header
# This is a quote: "

col_a,col_b,col_c
1,2,3
4,5,6

However, removing the quote character from the second line (or adding a second one) allows the file to import just fine. This works perfectly fine if I use the edition functions to use the old readr backend.
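
A minimal sketch of the reproduction, for anyone who wants to compare the two parsers side by side. It assumes readr 2.x is installed; the temporary file path and the variable names `second` and `first` are placeholders, and the dimensions you see will depend on the installed readr and vroom versions.

```r
library(readr)

# Recreate the example file from this report in a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "# This is a header",
  "# This is a quote: \"",
  "",
  "col_a,col_b,col_c",
  "1,2,3",
  "4,5,6"
), path)

# Second-edition parser (the default in readr 2.x). This is the call that
# is reported to come back empty on readr 2.0.1.
second <- read_csv(path, comment = "#")

# First-edition parser, selected for a single call with with_edition().
first <- with_edition(1, read_csv(path, comment = "#"))

# Compare the shapes of the two results.
dim(second)
dim(first)
```

`local_edition(1)` is the other edition function; it switches the parser for the rest of the calling scope rather than for one expression.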

The reason this is an issue is that I have a ton of data files with headers specifying the runtime information of the program that collected the data, including the computer's screen size in inches:

# EXPERIMENT SETTINGS
#  > Trials Per Block: 20
#  > Blocks Per Experiment: 5
#
# SYSTEM INFO
#  > Operating System: macOS 10.14.6
#  > Python Version: 2.7.15
#
# DISPLAY INFO
#  > Screen Size: 24" diagonal
#  > Resolution: 1920x1080 @ 60Hz
#  > View Distance: 57 cm

Thanks in advance, and thank you for maintaining this excellent package!

@jimhester jimhester added the bug an unexpected problem or unintended behavior label Sep 24, 2021
@jimhester
Collaborator

Thank you for opening the issue and including a reproducible example!

I can confirm the issue. A possible workaround is to turn off quoting with quote = "" when reading these files, assuming the data itself doesn't use quoting.

We may not fix this bug immediately, as we are currently shifting focus to other packages, but I assure you we will get to it in time.
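
The workaround above could look like the following sketch. The file contents are a shortened stand-in for the reporter's header, and the second approach (dropping comment lines in base R before parsing) is an assumption added here as an alternative; it was not suggested in the thread.

```r
library(readr)

# Shortened stand-in for a data file whose header contains an unpaired quote.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "# DISPLAY INFO",
  "#  > Screen Size: 24\" diagonal",
  "",
  "col_a,col_b,col_c",
  "1,2,3",
  "4,5,6"
), path)

# Workaround from the comment above: disable quote handling entirely.
# Only safe when no data field relies on quotes to protect a delimiter.
unquoted <- read_csv(path, comment = "#", quote = "", show_col_types = FALSE)

# Alternative (assumption, not from the thread): remove comment and blank
# lines in base R, then pass the remaining text to readr as literal data.
lines <- readLines(path)
body <- lines[nzchar(lines) & !startsWith(lines, "#")]
filtered <- read_csv(
  I(paste0(paste(body, collapse = "\n"), "\n")),
  show_col_types = FALSE
)
```

The second approach keeps quote handling active for the data rows, at the cost of reading the file twice.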

untergeekDE commented Sep 28, 2021

Can confirm, came here to report that very issue. Sample file with German election data here, this stops after line 932 without any error message or warning - had a very interesting night on Sunday. :) File read with

read_delim(fname, delim = ";",
           escape_double = FALSE,
           locale = locale(date_names = "de",
                           decimal_mark = ",",
                           grouping_mark = ".",
                           encoding = "WINDOWS-1252"),
           trim_ws = TRUE,
           skip = 1)

Side issue: If you import a text cell containing matching quotes, the first will go missing:
Dieses Wahlergebnis enthält Stimmen aus: Hirschhagen, Gasthaus "Zum Rohrbachtal"
is returned as
"Dieses Wahlergebnis enth\xe4lt Stimmen aus: Hirschhagen, Gasthaus Zum Rohrbachtal\"\""

@jimhester
Collaborator

Should be fixed in the next release version of vroom.

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue May 1, 2022
# vroom 1.5.7

* Jenny Bryan is now the official maintainer.

* Fix uninitialized bool detected by CRAN's UBSAN check
  (tidyverse/vroom#386)

* Fix buffer overflow when trying to parse an integer field that is
  over 64 characters long
  (tidyverse/readr#1326)

* Fix subset indexing when indexes span a file boundary multiple times
  (#383)

# vroom 1.5.6

* `vroom(col_select=)` now works if `col_names = FALSE` as intended (#381)

* `vroom(n_max=)` now correctly handles cases when reading from a
  connection and the file does _not_ end with a newline
  (tidyverse/readr#1321)

* `vroom()` no longer issues a spurious warning when the parsing needs
  to be restarted due to the presence of embedded newlines
  (tidyverse/readr#1313)

* `vroom_format()` now uses the same internal multi-threaded code as
  `vroom_write()`, improving its performance in most cases (#377)

* `vroom_fwf()` no longer omits the last line if it does _not_ end
  with a newline (tidyverse/readr#1293)

* Empty files or files with only a header line and no data no longer
  cause a crash if read with multiple files
  (tidyverse/readr#1297)

* Files with a header but no contents, or an empty file if `col_names =
  FALSE`, no longer cause a hang when `progress = TRUE`
  (tidyverse/readr#1297)

* Commented lines with comments at the end of lines no longer hang R
  (tidyverse/readr#1309)

* Comment lines containing unpaired quotes are no longer treated as
  unterminated quotations
  (tidyverse/readr#1307)

* Values with only an `Inf` or `NaN` prefix but additional data
  afterwards, like `Inform`, are no longer inappropriately guessed as
  doubles (tidyverse/readr#1319)

* Time types now support `%h` format to denote hour durations greater
  than 24, like readr (tidyverse/readr#1312)

* Fix performance issue when materializing subsetted vectors (#378)


# vroom 1.5.5

* `vroom()` now supports files with only carriage return newlines
  (`\r`). (#360, tidyverse/readr#1236)

* `vroom()` now parses single digit datetimes more consistently as
  readr has done (tidyverse/readr#1276)

* `vroom()` now parses `Inf` values as doubles
  (tidyverse/readr#1283)

* `vroom()` now parses `NaN` values as doubles
  (tidyverse/readr#1277)

* `VROOM_CONNECTION_SIZE` is now parsed as a double, which supports
  scientific notation (#364)

* `vroom()` now works around specifying a `\n` as the delimiter (#365,
  tidyverse/dplyr#5977)

* `vroom()` no longer crashes if given a `col_name` and `col_type`
  both less than the number of columns
  (tidyverse/readr#1271)

* `vroom()` no longer hangs if given an empty value for
  `locale(grouping_mark=)`
  (tidyverse/readr#1241)

* Fix performance regression when guessing with large numbers of rows
  (tidyverse/readr#1267)