In v2, quoted NaN values force a column to character #1277

Closed
slodge opened this issue Aug 18, 2021 · 3 comments

slodge commented Aug 18, 2021

Between v1.4 and v2 it seems like we've hit an issue with the way "NaN" is read and interpreted.

Previously if a quoted numeric column included "NaN" values, then values would be read as numeric - but now values are being read as character.

A reproducible example is the "Value" column below - it's parsed as dbl in v1.4 and chr in 2.0.1

Is this something that can be changed in readr? Or is this something we need to code around as users?

text <- '"Key","Id","DataDate","ReleaseDate","Value"
"First","-2147483648","1900-01-01","1900-01-01","0.5"
"Second","543","2021-08-13","2021-08-13","NaN"
"First","730","2021-08-13","2021-08-13","0"
'
readr::read_csv(text)

Output readr v1.4.0

> text <- '"Key","Id","DataDate","ReleaseDate","Value"
+ "First","-2147483648","1900-01-01","1900-01-01","0.5"
+ "Second","543","2021-08-13","2021-08-13","NaN"
+ "First","730","2021-08-13","2021-08-13","0"
+ '
> readr::read_csv(text)
# A tibble: 3 x 5
  Key             Id DataDate   ReleaseDate Value
  <chr>        <dbl> <date>     <date>      <dbl>
1 First  -2147483648 1900-01-01 1900-01-01    0.5
2 Second         543 2021-08-13 2021-08-13  NaN  
3 First          730 2021-08-13 2021-08-13    0  

Output readr v2.0.1

> text <- '"Key","Id","DataDate","ReleaseDate","Value"
+ "First","-2147483648","1900-01-01","1900-01-01","0.5"
+ "Second","543","2021-08-13","2021-08-13","NaN"
+ "First","730","2021-08-13","2021-08-13","0"
+ '
> readr::read_csv(text)
Rows: 3 Columns: 5                                                                                                                
-- Column specification -----------------------------------------------------------------------
Delimiter: ","
chr  (2): Key, Value
dbl  (1): Id
date (2): DataDate, ReleaseDate

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 3 x 5
  Key             Id DataDate   ReleaseDate Value
  <chr>        <dbl> <date>     <date>      <chr>
1 First  -2147483648 1900-01-01 1900-01-01  0.5  
2 Second         543 2021-08-13 2021-08-13  NaN  
3 First          730 2021-08-13 2021-08-13  0    

Tested on R 4.0.2 on Windows

slodge commented Aug 18, 2021

Possibly linked to #1225 (although in this case we don't specify the column types)

jimhester added a commit to tidyverse/vroom that referenced this issue Aug 18, 2021

jimhester (Collaborator) commented

Thanks for opening the issue and including a reproducible example!

The double parser in vroom did not handle NaN values specially. This should now be fixed.

Fixed by tidyverse/vroom@f520d37
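
For anyone who cannot yet install a vroom release containing this fix, one possible way to "code around" the problem is sketched below. It is not something proposed in this thread, just a minimal illustration: it assumes readr is installed, forces the `Value` column to character so the result does not depend on type guessing, and then converts it with base R's `as.numeric()`, which understands the string `"NaN"`. The variable name `df` is only for illustration.

```r
library(readr)

text <- '"Key","Id","DataDate","ReleaseDate","Value"
"First","-2147483648","1900-01-01","1900-01-01","0.5"
"Second","543","2021-08-13","2021-08-13","NaN"
"First","730","2021-08-13","2021-08-13","0"
'

# Read Value as character explicitly; every other column is still guessed.
df <- read_csv(text, col_types = cols(Value = col_character()))

# Convert with base R, which maps the string "NaN" to the numeric NaN.
df$Value <- as.numeric(df$Value)
```

Because the column type is pinned, the same code should give a double `Value` column whether or not the installed parser recognises `NaN`.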

slodge commented Aug 19, 2021

Thanks!

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue May 1, 2022
# vroom 1.5.7

* Jenny Bryan is now the official maintainer.

* Fix uninitialized bool detected by CRAN's UBSAN check
  (tidyverse/vroom#386)

* Fix buffer overflow when trying to parse an integer field that is
  over 64 characters long
  (tidyverse/readr#1326)

* Fix subset indexing when indexes span a file boundary multiple times
  (#383)

# vroom 1.5.6

* `vroom(col_select=)` now works if `col_names = FALSE` as intended (#381)

* `vroom(n_max=)` now correctly handles cases when reading from a
  connection and the file does _not_ end with a newline
  (tidyverse/readr#1321)

* `vroom()` no longer issues a spurious warning when the parsing needs
  to be restarted due to the presence of embedded newlines
  (tidyverse/readr#1313)

* Fix performance issue when materializing subsetted vectors (#378)

* `vroom_format()` now uses the same internal multi-threaded code as
  `vroom_write()`, improving its performance in most cases (#377)

* `vroom_fwf()` no longer omits the last line if it does _not_ end
  with a newline (tidyverse/readr#1293)

* Empty files or files with only a header line and no data no longer
  cause a crash if read with multiple files
  (tidyverse/readr#1297)

* Files with a header but no contents, or an empty file if `col_names =
  FALSE`, no longer cause a hang when `progress = TRUE`
  (tidyverse/readr#1297)

* Commented lines with comments at the end of lines no longer hang R
  (tidyverse/readr#1309)

* Comment lines containing unpaired quotes are no longer treated as
  unterminated quotations
  (tidyverse/readr#1307)

* Values with only an `Inf` or `NaN` prefix but additional data
  afterwards, like `Inform`, are no longer inappropriately guessed as
  doubles (tidyverse/readr#1319)

* Time types now support `%h` format to denote hour durations greater
  than 24, like readr (tidyverse/readr#1312)


# vroom 1.5.5

* `vroom()` now supports files with only carriage return newlines
  (`\r`). (#360, tidyverse/readr#1236)

* `vroom()` now parses single digit datetimes more consistently as
  readr has done (tidyverse/readr#1276)

* `vroom()` now parses `Inf` values as doubles
  (tidyverse/readr#1283)

* `vroom()` now parses `NaN` values as doubles
  (tidyverse/readr#1277)

* `VROOM_CONNECTION_SIZE` is now parsed as a double, which supports
  scientific notation (#364)

* `vroom()` now works around specifying a `\n` as the delimiter (#365,
  tidyverse/dplyr#5977)

* `vroom()` no longer crashes if given `col_names` and `col_types`
  that both have fewer entries than the number of columns
  (tidyverse/readr#1271)

* `vroom()` no longer hangs if given an empty value for
  `locale(grouping_mark=)`
  (tidyverse/readr#1241)

* Fix performance regression when guessing with large numbers of rows
  (tidyverse/readr#1267)