parse_number can't handle special space character as grouping_mark locale argument #796

Closed
ghost opened this issue Feb 16, 2018 · 4 comments
Labels
feature a feature request or enhancement multibyte 🦋

Comments

@ghost

ghost commented Feb 16, 2018

Description

Using the parse_number function with a locale whose grouping_mark is a special space character (e.g. a non-breaking space), as in locale(grouping_mark = '\U202F'), does not produce a correctly parsed number.

This kind of space is often used in French as a thousands separator (e.g. 1 234,00 where the UK format would be 1,234.00). An example of CSV files containing this kind of character can be found by following this link.

Minimal reproducible example

library(readr)
number_space <- "1 234"
parse_number(number_space)
# Get '1', which is expected because the default grouping_mark is ',' and not a space
parse_number(number_space, locale = locale(grouping_mark = ' '))
# Get as expected '1234'

# Define a non breaking space
special_space <-  "\U00A0"

number_special_space <- paste0(1, special_space, 234)

number_special_space
# "1 234"
parse_number(number_special_space, locale = locale(grouping_mark = ' '))
# Get '1', which is expected because the grouping_mark is a regular space, not the non-breaking space
parse_number(number_special_space, locale = locale(grouping_mark = special_space))
# Get 1 although '1234' was expected
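
Since the two strings print identically, it can help to confirm which space character a value actually contains. The following is a minimal sketch using only base R; the variable names are illustrative and not part of the original report.

```r
# Two strings that look the same when printed but differ in their separator
regular <- "1 234"                      # regular space, U+0020
nbsp    <- paste0(1, "\u00A0", 234)     # non-breaking space, U+00A0

# utf8ToInt() returns the Unicode code points of a single string
regular_points <- utf8ToInt(regular)    # 49 32 50 51 52
nbsp_points    <- utf8ToInt(nbsp)       # 49 160 50 51 52

# The strings are not identical, even though they print alike
same <- identical(regular, nbsp)        # FALSE
```

A code point of 160 (U+00A0) or 8239 (U+202F) in the output indicates that the separator is not a regular space.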

What I already tried

R can recognize this character, and a possible workaround is to gsub these characters before proceeding.

special_space <-  "\U00A0"
number_special_space <- paste0(1, special_space, 234)
gsub(pattern = special_space, replacement = '', x = number_special_space)
# Get as expected '1234'
gsub(pattern = ' ', replacement = '', x = number_special_space)
# Get as expected '1 234'
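
The gsub workaround above can be generalized to several space characters at once. This is only a sketch: strip_grouping_spaces() is a hypothetical helper, not a readr function, and it assumes the only grouping characters of interest are the regular space, U+00A0 and U+202F.

```r
# Hypothetical helper (not part of readr): remove common grouping spaces.
# The character class lists the regular space, the non-breaking space
# (U+00A0) and the narrow non-breaking space (U+202F).
strip_grouping_spaces <- function(x) {
  gsub("[ \u00A0\u202F]", "", x)
}

x <- c("1\u00A0234", "1\u202F234", "1 234")
cleaned <- strip_grouping_spaces(x)     # "1234" "1234" "1234"
values  <- as.numeric(cleaned)          # 1234 1234 1234
```

The cleaned strings could also be passed to parse_number() instead of as.numeric() if other non-numeric characters need to be handled.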

The problem seems to come from an internal method in readr. I tried to dig into the code, but my knowledge is too limited. readr:::parse_vector_ calls a function, readr_parse_vector_, that I can't find in the source code. It may be in the C++ part of the code, but I have little knowledge of C++.

Session info

R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
[5] LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Unicode_10.0.0-1 magrittr_1.5     lubridate_1.7.2  bindrcpp_0.2    
[5] dplyr_0.7.4      readr_1.1.1     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.15     utf8_1.1.3       crayon_1.3.4     assertthat_0.2.0
 [5] R6_2.2.2         pillar_1.1.0     stringi_1.1.6    cli_1.0.0       
 [9] rlang_0.1.6      rstudioapi_0.7   tools_3.4.3      stringr_1.2.0   
[13] glue_1.2.0       hms_0.4.1        yaml_2.1.16      compiler_3.4.3  
[17] pkgconfig_2.0.1  bindr_0.1        tibble_1.4.2  
@ghost ghost changed the title parse_number can't handle special space character as grouping_mark argument parse_number can't handle special space character as grouping_mark locale argument Feb 16, 2018
@cderv
Contributor

cderv commented Feb 17, 2018

@jomuller another possible workaround, while waiting for a fix, is to:

  • parse your columns as character (setting col_character() for these columns in the cols argument),
  • use the deprecated function tidyr::extract_numeric() in a mutate_at() call for these columns, as this function still works, even with \U00A0. The more powerful parse_number() is replacing tidyr::extract_numeric(), as its deprecation message says. Note that tidyr::extract_numeric() only works with . as the decimal mark.

Minimal reprex:

(num <- paste(1, 234.5, sep = "\U00A0"))
#> [1] "1 234.5"
library(readr)
parse_number(num, locale = locale(grouping_mark = "\U00A0"))
#> [1] 1
tidyr::extract_numeric(num)
#> extract_numeric() is deprecated: please use readr::parse_number() instead
#> [1] 1234.5

Created on 2018-02-17 by the reprex package (v0.2.0).

@ghost
Author

ghost commented Feb 19, 2018

@cderv Thank you for this workaround. Currently I'm simply using a regex to remove any non-numeric character, gsub(pattern = '[^0-9\\.]', '', x, perl = TRUE), which is very close to the extract_numeric definition, as.numeric(gsub("[^0-9.-]+", "", as.character(x))). Maybe not elegant, but it does the job. Using a deprecated function could be dangerous for the future of the code I'm writing.
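
Both regexes above assume . as the decimal mark, and the first one also drops the minus sign. For French-formatted values with a decimal comma, a variant along the following lines might work. It is a sketch only: parse_decimal_comma() is a hypothetical name, and it assumes each value contains at most one comma, used as the decimal mark.

```r
# Hypothetical helper: keep digits, the comma and the minus sign,
# then turn the decimal comma into a dot before converting.
parse_decimal_comma <- function(x) {
  cleaned <- gsub("[^0-9,-]", "", as.character(x))  # drops spaces, U+00A0, etc.
  cleaned <- sub(",", ".", cleaned, fixed = TRUE)    # decimal comma -> dot
  as.numeric(cleaned)
}

french_values <- parse_decimal_comma(c("1\u00A0234,50", "-12,5", "1 000"))
# 1234.5  -12.5  1000.0
```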

@pachevalier

pachevalier commented Apr 9, 2018

I've got two wrapper functions in the tricky package: parse_French_number() uses parse_number(), and unfrench_formatting() uses the gsub approach.

@jimhester jimhester added the feature a feature request or enhancement label May 4, 2018
@jimhester jimhester added this to the backlog milestone Nov 19, 2018
@jimhester jimhester removed this from the backlog milestone May 11, 2021
@jimhester
Collaborator

Closed by tidyverse/vroom@959b4b7; this should now be supported.
