read_csv fails in readr 2.0.1 if a comment contains an un-terminated quote #1307

a-hurst · 2021-09-23T19:04:17Z

Hi there,

I think I've run into a pretty weird edge case bug: after updating to readr 2.0.1, I can no longer import files where the header comments contain an un-terminated quote.

For example, running read_csv(filename, comment = "#") on the following example file returns 0 rows and 0 columns:

# This is a header
# This is a quote: "

col_a,col_b,col_c
1,2,3
4,5,6

However, removing the quote character from the second line (or adding a second one) allows the file to import just fine. The import also works if I use the edition functions (e.g. with_edition(1, ...)) to switch back to the old readr backend.
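For anyone who wants to reproduce this, here is a minimal sketch. It writes the example file above to a temporary path and reads it once with the default (second edition) parser and once with the first edition via `with_edition()`. The names `path`, `second`, and `first` are only illustrative, and what the second-edition call returns will depend on the installed readr/vroom versions.

```r
library(readr)

# Write the example file from this report to a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "# This is a header",
  "# This is a quote: \"",
  "",
  "col_a,col_b,col_c",
  "1,2,3",
  "4,5,6"
), path)

# Second edition: the default parser since readr 2.0.0.
second <- read_csv(path, comment = "#", show_col_types = FALSE)

# First edition: the legacy parser, selected for this one call only.
first <- with_edition(1, read_csv(path, comment = "#"))

# Compare the shapes of the two results.
dim(second)
dim(first)
```

Comparing the two `dim()` results shows whether the installed version is affected.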

The reason this is an issue is that I have a ton of data files with headers specifying the runtime information of the program that collected the data, including the computer's screen size in inches:

# EXPERIMENT SETTINGS
#  > Trials Per Block: 20
#  > Blocks Per Experiment: 5
#
# SYSTEM INFO
#  > Operating System: macOS 10.14.6
#  > Python Version: 2.7.15
#
# DISPLAY INFO
#  > Screen Size: 24" diagonal
#  > Resolution: 1920x1080 @ 60Hz
#  > View Distance: 57 cm

Thanks in advance, and thank you for maintaining this excellent package!


jimhester · 2021-09-24T13:06:03Z

Thank you for opening the issue and including a reproducible example!

I can confirm the issue. A possible workaround is to turn off quoting with quote = "" when reading these files, assuming the data itself doesn't use quoting.
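A minimal sketch of that workaround, using a shortened version of the header from the report. The second half is an alternative that was not proposed in this thread and is shown only as an assumption-laden option: it drops whole-line comments and blank lines before parsing, so quoting can stay enabled for the data. It does not handle trailing comments at the end of data lines. The names `path`, `dat`, `body`, and `dat2` are illustrative.

```r
library(readr)

# A small file whose comment header contains an unpaired quote.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "# DISPLAY INFO",
  "#  > Screen Size: 24\" diagonal",
  "",
  "col_a,col_b,col_c",
  "1,2,3",
  "4,5,6"
), path)

# Workaround from this thread: disable quote handling entirely.
# Only safe when the data fields themselves are not quoted.
dat <- read_csv(path, comment = "#", quote = "", show_col_types = FALSE)

# Alternative (an assumption, not from the thread): remove comment and
# blank lines first, then parse the remaining text as literal data.
lines <- readLines(path)
body <- lines[nzchar(lines) & !startsWith(lines, "#")]
dat2 <- read_csv(I(paste(body, collapse = "\n")), show_col_types = FALSE)
```

The pre-filtering variant reads the whole file into memory as text first, so for very large files `quote = ""` is probably the better choice.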

We may not fix this bug immediately, as we are currently shifting focus to other packages, but I assure you we will get to it in time.

untergeekDE · 2021-09-28T14:06:56Z

Can confirm; I came here to report this very issue. Sample file with German election data here: reading stops after line 932 without any error message or warning. I had a very interesting night on Sunday. :) The file was read with:

read_delim(fname, delim = ";",
           escape_double = FALSE,
           locale = locale(date_names = "de",
                           decimal_mark = ",",
                           grouping_mark = ".",
                           encoding = "WINDOWS-1252"),
           trim_ws = TRUE,
           skip = 1)

Side issue: If you import a text cell containing matching quotes, the first will go missing:
Dieses Wahlergebnis enthält Stimmen aus: Hirschhagen, Gasthaus "Zum Rohrbachtal"
is returned as
"Dieses Wahlergebnis enth\xe4lt Stimmen aus: Hirschhagen, Gasthaus Zum Rohrbachtal\"\""

jimhester · 2021-11-05T15:37:32Z

Should be fixed in the next release version of vroom.

# vroom 1.5.7

* Jenny Bryan is now the official maintainer.
* Fix uninitialized bool detected by CRAN's UBSAN check (tidyverse/vroom#386)
* Fix buffer overflow when trying to parse an integer field that is over 64 characters long (tidyverse/readr#1326)
* Fix subset indexing when indexes span a file boundary multiple times (#383)

# vroom 1.5.6

* `vroom(col_select=)` now works if `col_names = FALSE` as intended (#381)
* `vroom(n_max=)` now correctly handles cases when reading from a connection and the file does _not_ end with a newline (tidyverse/readr#1321)
* `vroom()` no longer issues a spurious warning when the parsing needs to be restarted due to the presence of embedded newlines (tidyverse/readr#1313)
* Fix performance issue when materializing subsetted vectors (#378)
* `vroom_format()` now uses the same internal multi-threaded code as `vroom_write()`, improving its performance in most cases (#377)
* `vroom_fwf()` no longer omits the last line if it does _not_ end with a newline (tidyverse/readr#1293)
* Empty files or files with only a header line and no data no longer cause a crash if read with multiple files (tidyverse/readr#1297)
* Files with a header but no contents, or a empty file if `col_names = FALSE` no longer cause a hang when `progress = TRUE` (tidyverse/readr#1297)
* Commented lines with comments at the end of lines no longer hang R (tidyverse/readr#1309)
* Comment lines containing unpaired quotes are no longer treated as unterminated quotations (tidyverse/readr#1307)
* Values with only a `Inf` or `NaN` prefix but additional data afterwards, like `Inform` or no longer inappropriately guessed as doubles (tidyverse/readr#1319)
* Time types now support `%h` format to denote hour durations greater than 24, like readr (tidyverse/readr#1312)

# vroom 1.5.5

* `vroom()` now supports files with only carriage return newlines (`\r`). (#360, tidyverse/readr#1236)
* `vroom()` now parses single digit datetimes more consistently as readr has done (tidyverse/readr#1276)
* `vroom()` now parses `Inf` values as doubles (tidyverse/readr#1283)
* `vroom()` now parses `NaN` values as doubles (tidyverse/readr#1277)
* `VROOM_CONNECTION_SIZE` is now parsed as a double, which supports scientific notation (#364)
* `vroom()` now works around specifying a `\n` as the delimiter (#365, tidyverse/dplyr#5977)
* `vroom()` no longer crashes if given a `col_name` and `col_type` both less than the number of columns (tidyverse/readr#1271)
* `vroom()` no longer hangs if given an empty value for `locale(grouping_mark=)` (tidyverse/readr#1241)
* Fix performance regression when guessing with large numbers of rows (tidyverse/readr#1267)
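Going by the changelog above, the entry for this issue (tidyverse/readr#1307) is listed under vroom 1.5.6. A rough way to check whether an installation should already include the fix is to compare the installed vroom version against that number. `has_comment_quote_fix()` is a hypothetical helper written for this sketch, not part of readr or vroom.

```r
# Hypothetical helper: TRUE if the given vroom version is at least 1.5.6,
# the release whose changelog lists the fix for unpaired quotes in comments.
# By default it inspects the installed vroom (errors if vroom is missing).
has_comment_quote_fix <- function(installed = utils::packageVersion("vroom")) {
  installed >= "1.5.6"
}

# Example with explicit versions, so this runs without vroom installed.
has_comment_quote_fix(package_version("1.5.5"))
has_comment_quote_fix(package_version("1.5.7"))
```

If the check returns `FALSE`, updating vroom (for example with `install.packages("vroom")`) should pick up the fix without needing a new readr release, since readr delegates second-edition parsing to vroom.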

jimhester added the bug (an unexpected problem or unintended behavior) label Sep 24, 2021

jimhester closed this as completed in tidyverse/vroom@33c1a6b Nov 5, 2021
