In v2, quoted NaN values force a column to character #1277

slodge · 2021-08-18T17:00:43Z

Between v1.4 and v2 it seems like we've hit an issue with the way "NaN" is read and interpreted.

Previously, if a quoted numeric column included "NaN" values, the column was read as numeric. Now the same column is read as character.

A reproducible example is the "Value" column below - it's parsed as dbl in v1.4 and chr in 2.0.1

Is this something that can be changed in readr? Or is this something we need to code around as users?

text <- '"Key","Id","DataDate","ReleaseDate","Value"
"First","-2147483648","1900-01-01","1900-01-01","0.5"
"Second","543","2021-08-13","2021-08-13","NaN"
"First","730","2021-08-13","2021-08-13","0"
'
readr::read_csv(text)

Output readr v1.4.0

> text <- '"Key","Id","DataDate","ReleaseDate","Value"
+ "First","-2147483648","1900-01-01","1900-01-01","0.5"
+ "Second","543","2021-08-13","2021-08-13","NaN"
+ "First","730","2021-08-13","2021-08-13","0"
+ '
> readr::read_csv(text)
# A tibble: 3 x 5
  Key             Id DataDate   ReleaseDate Value
  <chr>        <dbl> <date>     <date>      <dbl>
1 First  -2147483648 1900-01-01 1900-01-01    0.5
2 Second         543 2021-08-13 2021-08-13  NaN  
3 First          730 2021-08-13 2021-08-13    0

Output readr v2.0.1

> text <- '"Key","Id","DataDate","ReleaseDate","Value"
+ "First","-2147483648","1900-01-01","1900-01-01","0.5"
+ "Second","543","2021-08-13","2021-08-13","NaN"
+ "First","730","2021-08-13","2021-08-13","0"
+ '
> readr::read_csv(text)
Rows: 3 Columns: 5
-- Column specification -----------------------------------------------------------------------
Delimiter: ","
chr  (2): Key, Value
dbl  (1): Id
date (2): DataDate, ReleaseDate

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 3 x 5
  Key             Id DataDate   ReleaseDate Value
  <chr>        <dbl> <date>     <date>      <chr>
1 First  -2147483648 1900-01-01 1900-01-01  0.5  
2 Second         543 2021-08-13 2021-08-13  NaN  
3 First          730 2021-08-13 2021-08-13  0

Tested on R 4.0.2 on Windows


slodge · 2021-08-18T17:05:27Z

Possibly linked to #1225 (although in this case we don't specify the column types)


jimhester · 2021-08-18T17:30:11Z

Thanks for opening the issue and including a reproducible example!

The double parser in vroom did not handle NaN values specially. This should now be fixed.

Fixed by tidyverse/vroom@f520d37
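
For readers still on an affected release (readr 2.0.1 with a vroom that predates the fix), one possible way to code around the problem is to read the column as character and convert it afterwards. This is a minimal sketch, not something proposed in the thread. It assumes readr >= 2.0.0 is installed (for `I()` and `show_col_types`) and relies on base R's `as.numeric()`, which converts the string "NaN" to `NaN`.

```r
# Workaround sketch (an assumption, not the maintainers' recommendation):
# read the affected column as character, then convert it with base R.
text <- '"Key","Id","DataDate","ReleaseDate","Value"
"First","-2147483648","1900-01-01","1900-01-01","0.5"
"Second","543","2021-08-13","2021-08-13","NaN"
"First","730","2021-08-13","2021-08-13","0"
'

df <- readr::read_csv(
  I(text),                                             # I() marks literal data
  col_types = readr::cols(Value = readr::col_character()),
  show_col_types = FALSE
)

# as.numeric() understands "NaN", so the column becomes a double vector
df$Value <- as.numeric(df$Value)
str(df$Value)
```

Only `Value` is pinned to character here; the remaining columns are still guessed, so the rest of the result should match what `read_csv()` produced before.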

slodge · 2021-08-19T06:54:36Z

Thanks!

# vroom 1.5.7

* Jenny Bryan is now the official maintainer.
* Fix uninitialized bool detected by CRAN's UBSAN check (tidyverse/vroom#386)
* Fix buffer overflow when trying to parse an integer field that is over 64 characters long (tidyverse/readr#1326)
* Fix subset indexing when indexes span a file boundary multiple times (#383)

# vroom 1.5.6

* `vroom(col_select=)` now works if `col_names = FALSE` as intended (#381)
* `vroom(n_max=)` now correctly handles cases when reading from a connection and the file does _not_ end with a newline (tidyverse/readr#1321)
* `vroom()` no longer issues a spurious warning when the parsing needs to be restarted due to the presence of embedded newlines (tidyverse/readr#1313)
* Fix performance issue when materializing subsetted vectors (#378)
* `vroom_format()` now uses the same internal multi-threaded code as `vroom_write()`, improving its performance in most cases (#377)
* `vroom_fwf()` no longer omits the last line if it does _not_ end with a newline (tidyverse/readr#1293)
* Empty files or files with only a header line and no data no longer cause a crash if read with multiple files (tidyverse/readr#1297)
* Files with a header but no contents, or an empty file if `col_names = FALSE`, no longer cause a hang when `progress = TRUE` (tidyverse/readr#1297)
* Commented lines with comments at the end of lines no longer hang R (tidyverse/readr#1309)
* Comment lines containing unpaired quotes are no longer treated as unterminated quotations (tidyverse/readr#1307)
* Values with only a `Inf` or `NaN` prefix but additional data afterwards, such as `Inform`, are no longer inappropriately guessed as doubles (tidyverse/readr#1319)
* Time types now support `%h` format to denote hour durations greater than 24, like readr (tidyverse/readr#1312)

# vroom 1.5.5

* `vroom()` now supports files with only carriage return newlines (`\r`). (#360, tidyverse/readr#1236)
* `vroom()` now parses single digit datetimes more consistently as readr has done (tidyverse/readr#1276)
* `vroom()` now parses `Inf` values as doubles (tidyverse/readr#1283)
* `vroom()` now parses `NaN` values as doubles (tidyverse/readr#1277)
* `VROOM_CONNECTION_SIZE` is now parsed as a double, which supports scientific notation (#364)
* `vroom()` now works around specifying a `\n` as the delimiter (#365, tidyverse/dplyr#5977)
* `vroom()` no longer crashes if given a `col_name` and `col_type` both less than the number of columns (tidyverse/readr#1271)
* `vroom()` no longer hangs if given an empty value for `locale(grouping_mark=)` (tidyverse/readr#1241)
* Fix performance regression when guessing with large numbers of rows (tidyverse/readr#1267)
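
The vroom 1.5.5 notes list the `NaN` change for this issue (tidyverse/readr#1277), so a quick way to tell whether an installation should already have it is to compare the installed vroom version against 1.5.5. The sketch below uses only base R. `has_nan_fix()` is a hypothetical helper name introduced for illustration, not a function from readr or vroom.

```r
# Hypothetical helper (not part of readr or vroom): TRUE when the given
# vroom version string is at least 1.5.5, the release whose notes list
# the NaN fix.
has_nan_fix <- function(version) {
  package_version(version) >= package_version("1.5.5")
}

# Report on the locally installed vroom, if there is one
if (requireNamespace("vroom", quietly = TRUE)) {
  installed <- as.character(utils::packageVersion("vroom"))
  message("vroom ", installed, " has the NaN fix: ", has_nan_fix(installed))
}
```

If the check reports `FALSE`, updating vroom (for example with `install.packages("vroom")`) should be enough, since readr 2.x delegates parsing to vroom.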

jimhester added a commit to tidyverse/vroom that referenced this issue Aug 18, 2021

Parse NaN values as doubles

f520d37

Fixes #tidyverse/readr#1277

jimhester closed this as completed Aug 18, 2021

khusmann mentioned this issue Dec 7, 2023

type_convert() does not parse IEEE 754 double values (NaN, Inf, -Inf) #1526

Open
