`parse_number` can't handle special space character as grouping_mark locale argument #796

ghost · 2018-02-16T12:25:38Z

Description

Calling parse_number with a locale whose grouping mark is a special space character (e.g. a non-breaking space), as in locale(grouping_mark = '\U202F'), does not yield a correctly parsed number.

This kind of space is often used in French as a thousands separator (e.g. 1 234,00, where UK formatting would give 1,234.00). An example of a CSV file containing this kind of character can be found by following this link.

Minimal reproducible example

library(readr)
number_space <- "1 234"
parse_number(number_space)
# Returns 1, which is expected because the space is not defined as the grouping_mark
parse_number(number_space, locale = locale(grouping_mark = ' '))
# Get as expected '1234'

# Define a non-breaking space (U+00A0)
special_space <- "\U00A0"

number_special_space <- paste0(1, special_space, 234)

number_special_space
# "1 234"
parse_number(number_special_space, locale = locale(grouping_mark = ' '))
# Returns 1, which is expected because the grouping_mark is a regular space, not the non-breaking space
parse_number(number_special_space, locale = locale(grouping_mark = special_space))
# Get 1 although '1234' was expected
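The regular space, the non-breaking space (U+00A0) used in the example, and the narrow no-break space (U+202F) mentioned in the description look alike when printed but are distinct characters. The following is a minimal base R sketch for telling them apart; the variable names are illustrative only.

```r
# Code points of the three "space" characters discussed in this issue
spaces <- c(regular = " ", no_break = "\u00a0", narrow_no_break = "\u202f")
code_points <- vapply(spaces, utf8ToInt, integer(1))
code_points
# regular = 32, no_break = 160, narrow_no_break = 8239

# The characters print alike but do not compare equal
identical(spaces[["regular"]], spaces[["no_break"]])
# FALSE
```

This is why a grouping_mark of ' ' cannot match a number written with "\U00A0".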

What I already tried

R can recognize this character, and a possible workaround is to remove these characters with gsub before proceeding.

special_space <- "\U00A0"
number_special_space <- paste0(1, special_space, 234)
gsub(pattern = special_space, replacement = '', x = number_special_space)
# Get as expected '1234'
gsub(pattern = ' ', replacement = '', x = number_special_space)
# Returns '1 234' unchanged, as expected: a regular space does not match the non-breaking space

This seems to come from an internal method in readr. I tried to dig into the code, but my knowledge is too limited. readr:::parse_vector_ calls a function, readr_parse_vector_, that I can't find in the source code. It may be in the C++ part of the code, but I have little knowledge of C++.

Session info

R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
[5] LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Unicode_10.0.0-1 magrittr_1.5     lubridate_1.7.2  bindrcpp_0.2    
[5] dplyr_0.7.4      readr_1.1.1     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.15     utf8_1.1.3       crayon_1.3.4     assertthat_0.2.0
 [5] R6_2.2.2         pillar_1.1.0     stringi_1.1.6    cli_1.0.0       
 [9] rlang_0.1.6      rstudioapi_0.7   tools_3.4.3      stringr_1.2.0   
[13] glue_1.2.0       hms_0.4.1        yaml_2.1.16      compiler_3.4.3  
[17] pkgconfig_2.0.1  bindr_0.1        tibble_1.4.2

cderv · 2018-02-17T10:14:29Z

@jomuller another possible workaround, while waiting for a fix, is to:

1. parse your columns as character (setting col_character() for these columns in the cols argument),
2. use the deprecated function tidyr::extract_numeric() in a mutate_at call for these columns, as this function still works, even with \U00A0.

parse_number, which is more powerful, is replacing tidyr::extract_numeric, and a deprecation message says so. Note that tidyr::extract_numeric only works with . as the decimal mark.

Minimal reprex:

(num <- paste(1, 234.5, sep = "\U00A0"))
#> [1] "1 234.5"
library(readr)
parse_number(num, locale = locale(grouping_mark = "\U00A0"))
#> [1] 1
tidyr::extract_numeric(num)
#> extract_numeric() is deprecated: please use readr::parse_number() instead
#> [1] 1234.5

Created on 2018-02-17 by the reprex package (v0.2.0).

ghost · 2018-02-19T14:51:59Z

@cderv Thank you for this workaround. Currently I simply use a regex to remove any non-numeric character, gsub(pattern = '[^0-9\\.]', '', x, perl = TRUE), which is very close to the extract_numeric definition, as.numeric(gsub("[^0-9.-]+", "", as.character(x))). It may not be elegant, but it does the job. Using a deprecated function could be risky for the future of the code I'm writing.
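The regex approach described above can be wrapped in a small base R helper. This is only a sketch: the function name strip_and_parse and its decimal_mark argument are hypothetical and not part of readr or tidyr. It uses the extract_numeric pattern quoted above, which keeps the minus sign (the '[^0-9\\.]' pattern would drop it).

```r
# Hypothetical helper; the name and the decimal_mark argument are assumptions
# made for this sketch and are not part of readr or tidyr.
strip_and_parse <- function(x, decimal_mark = ".") {
  x <- as.character(x)
  # Normalise the decimal mark to "." before stripping other characters
  if (decimal_mark != ".") {
    x <- gsub(decimal_mark, ".", x, fixed = TRUE)
  }
  # Keep digits, "." and "-"; everything else (including U+00A0 and U+202F) is dropped
  as.numeric(gsub("[^0-9.-]+", "", x))
}

strip_and_parse("1\u00a0234.5")
strip_and_parse("1\u202f234,00", decimal_mark = ",")
```

Because every non-numeric character is discarded, this sketch cannot tell a grouping "." from a decimal ".", so an input such as "1.234,00" with decimal_mark = "," would not parse correctly.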

pachevalier · 2018-04-09T14:10:09Z

I've got two wrapper functions in the tricky package: parse_French_number() uses parse_number(), and unfrench_formatting() uses the gsub approach.

jimhester · 2021-05-25T15:00:38Z

Closed by tidyverse/vroom@959b4b7, this should now be supported.

ghost changed the title ~~parse_number can't handle special space character as grouping_mark argument~~ parse_number can't handle special space character as grouping_mark locale argument Feb 16, 2018

pachevalier mentioned this issue Apr 9, 2018

Failure to parse French numbers with parse_double() and parse_integer() #827

Closed

jimhester added the feature a feature request or enhancement label May 4, 2018

jimhester added the multibyte 🦋 label Nov 15, 2018

jimhester added this to the backlog milestone Nov 19, 2018

jimhester removed this from the backlog milestone May 11, 2021

jimhester closed this as completed in tidyverse/vroom@959b4b7 May 25, 2021

cjyetman mentioned this issue Jan 13, 2023

multi-byte grouping_mark doesn't work when source file is different encoding #1459

Open
