How lenient should labelled be when combining different labels that map to the same value? #667

Closed
manhnguyen48 opened this issue Mar 19, 2022 · 4 comments


manhnguyen48 commented Mar 19, 2022

It seems pivot_longer() loses value labels (created with labelled) when the pivoted columns carry different labels. This can be avoided by converting the columns to factors first. Is this expected behaviour? It would be nice to at least get a warning.


Brief description of the problem

library(labelled)
library(tidyverse)
example_data <- tibble::tibble(serial = 1:100, 
                       age = sample(1:6, size = 100, replace = TRUE) |>
                         labelled::set_variable_labels(.labels = "Age") |> 
                         labelled::set_value_labels(.labels = c("18-24"=1, "25-34"=2,"35-44"=3,
                                                                "45-54"=4,"55-64"=5,"65+"=6)), 
                       gender = sample(1:2, size = 100, replace = TRUE) |>
                         labelled::set_variable_labels(.labels = "Gender") |> 
                         labelled::set_value_labels(.labels = c("Male" = 1, "Female"=2)))

# Value labels for gender are lost: gender rows show the age labels
tidyr::pivot_longer(example_data, c(age,gender)) |> 
  dplyr::count(name,value)
#> # A tibble: 8 x 3
#>   name       value     n
#>   <chr>  <int+lbl> <int>
#> 1 age    1 [18-24]    16
#> 2 age    2 [25-34]    17
#> 3 age    3 [35-44]    16
#> 4 age    4 [45-54]    14
#> 5 age    5 [55-64]    18
#> 6 age    6 [65+]      19
#> 7 gender 1 [18-24]    48
#> 8 gender 2 [25-34]    52
# It works if you convert to factors
tidyr::pivot_longer(example_data, c(age,gender), values_transform = forcats::as_factor) |> 
  dplyr::count(name,value)
#> # A tibble: 8 x 3
#>   name   value      n
#>   <chr>  <fct>  <int>
#> 1 age    18-24     16
#> 2 age    25-34     17
#> 3 age    35-44     16
#> 4 age    45-54     14
#> 5 age    55-64     18
#> 6 age    65+       19
#> 7 gender Male      48
#> 8 gender Female    52

Created on 2022-03-19 by the reprex package (v2.0.1)

@manhnguyen48 manhnguyen48 changed the title Lost value labels when pivot_longer Lost value labels when pivot_longer() Mar 19, 2022
@DavisVaughan
Member

This isn't a pivot_longer() issue; it is just how haven's labelled class works when you combine vectors with different labels. It seems to prefer the labels of the LHS first, and the labels of the RHS are then used for any values the LHS doesn't label.

@hadley it seems like this is expected (sort of like what we do with time zones), but I could also see where haven could be more strict and not allow you to combine vectors that have different labels that map to the same value (i.e. 18-24 -> 1 and Male -> 1)

library(haven)

age <- labelled(
  1:6,
  labels = c("18-24"=1, "25-34"=2,"35-44"=3, "45-54"=4,"55-64"=5,"65+"=6),
  label = "Age"
)

gender <- labelled(
  1:2,
  labels = c("Male" = 1, "Female"=2),
  label = "Gender"
)

age
#> <labelled<integer>[6]>: Age
#> [1] 1 2 3 4 5 6
#> 
#> Labels:
#>  value label
#>      1 18-24
#>      2 25-34
#>      3 35-44
#>      4 45-54
#>      5 55-64
#>      6   65+

gender
#> <labelled<integer>[2]>: Gender
#> [1] 1 2
#> 
#> Labels:
#>  value  label
#>      1   Male
#>      2 Female

# Uses labels of LHS then RHS
c(age, gender)
#> <labelled<integer>[8]>: Age
#> [1] 1 2 3 4 5 6 1 2
#> 
#> Labels:
#>  value label
#>      1 18-24
#>      2 25-34
#>      3 35-44
#>      4 45-54
#>      5 55-64
#>      6   65+

c(gender, age)
#> <labelled<integer>[8]>: Gender
#> [1] 1 2 1 2 3 4 5 6
#> 
#> Labels:
#>  value  label
#>      1   Male
#>      2 Female
#>      3  35-44
#>      4  45-54
#>      5  55-64
#>      6    65+
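
The merge order visible in these outputs can be mimicked in a few lines of base R. This is only a sketch of the observed behaviour, not haven's actual implementation: `merge_labels()` is a hypothetical helper, and it works on plain named vectors of labels instead of labelled vectors.

```r
# Hypothetical helper (not part of haven): keep every LHS label, then add
# RHS labels only for values that the LHS leaves unlabelled.
merge_labels <- function(lhs, rhs) {
  c(lhs, rhs[!rhs %in% lhs])
}

age_labels <- c("18-24" = 1, "25-34" = 2, "35-44" = 3,
                "45-54" = 4, "55-64" = 5, "65+" = 6)
gender_labels <- c("Male" = 1, "Female" = 2)

# Gender on the LHS: values 1 and 2 keep the gender labels,
# values 3 to 6 fall back to the age labels.
merged <- merge_labels(gender_labels, age_labels)
print(merged)
```

With the arguments swapped, `merge_labels(age_labels, gender_labels)` returns the age labels unchanged, because age already labels every value that gender does.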

@DavisVaughan DavisVaughan transferred this issue from tidyverse/tidyr Mar 19, 2022
@DavisVaughan DavisVaughan changed the title Lost value labels when pivot_longer() How lenient should labelled be when combining different labels that map to the same value? Mar 19, 2022
@gorcha
Member

gorcha commented Mar 21, 2022

Thanks @DavisVaughan!

Although not ideal, this is by design: there's no easy way to reconcile mismatched labels, and this is the path of least resistance (i.e. least likely to throw errors when combining vectors while still supporting common operations). See #543 for a brief discussion and some context on the current, more permissive stance.

@manhnguyen48, as noted in the conversion semantics vignette and a few other places, the labelled class is mostly intended as an intermediate class between stats packages and R. The correct approach is therefore to convert to factors, as you've done above, or to remove the labels using zap_labels() before doing any more complex processing or reshaping.
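
As a rough illustration of those two options, here is a minimal sketch using haven's `as_factor()` and `zap_labels()`. The two short vectors are made up for the example, and the factors are converted to character before stacking so the result does not depend on how a given R version combines factors.

```r
library(haven)

# Small illustrative vectors (not the data from the report above)
age <- labelled(c(1L, 2L), labels = c("18-24" = 1L, "25-34" = 2L))
gender <- labelled(c(1L, 2L), labels = c("Male" = 1L, "Female" = 2L))

# Option 1: resolve labels to text before stacking, so each value
# keeps the label from its own variable.
stacked_labels <- c(
  as.character(as_factor(age)),
  as.character(as_factor(gender))
)
print(stacked_labels)

# Option 2: drop the value labels and stack the bare codes.
stacked_values <- c(zap_labels(age), zap_labels(gender))
print(stacked_values)
```

Option 1 keeps the meaning of each code; option 2 keeps only the codes, so a separate column (such as `name` after pivot_longer()) is needed to tell the variables apart.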

Agreed that a warning would be good when two labelled vectors with conflicting labels are combined.
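
One way such a check could work is sketched below in base R. `label_conflicts()` is a hypothetical helper, not haven's code: it takes two named vectors of value labels and returns the values that both sides label, but with different text.

```r
# Hypothetical helper (not part of haven): find values that both label
# sets cover but describe with different label text.
label_conflicts <- function(x_labels, y_labels) {
  shared <- intersect(unname(x_labels), unname(y_labels))
  x_names <- names(x_labels)[match(shared, x_labels)]
  y_names <- names(y_labels)[match(shared, y_labels)]
  shared[x_names != y_names]
}

age_labels <- c("18-24" = 1, "25-34" = 2, "35-44" = 3,
                "45-54" = 4, "55-64" = 5, "65+" = 6)
gender_labels <- c("Male" = 1, "Female" = 2)

conflicts <- label_conflicts(gender_labels, age_labels)

# Warn only when at least one shared value has mismatched labels.
if (length(conflicts) > 0) {
  warning(
    "Conflicting value labels for values: ",
    paste(conflicts, collapse = ", "),
    call. = FALSE
  )
}
```

Here values 1 and 2 are reported, since gender and age both label them. Two label sets that agree on every shared value produce no warning.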

@gorcha
Member

gorcha commented Mar 23, 2022

I'm thinking something like this:

library(haven)
library(labelled)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
example_data <- tibble(
  serial = 1:100,
  age = labelled_spss(
    sample(1:6, size = 100, replace = TRUE),
    c(
      "18-24" = 1,
      "25-34" = 2,
      "35-44" = 3,
      "45-54" = 4,
      "55-64" = 5,
      "65+" = 6,
      "Unknown" = 99
    ),
    na_values = 99
  ),
  gender = labelled(
    sample(1:3, size = 100, replace = TRUE),
    c("Male" = 1, "Female" = 2, "Other" = 3)
  ),
  q1 = labelled(
    sample(1:2, size = 100, replace = TRUE),
    c("Yes" = 1, "No" = 2)
  )
)

pivot_longer(example_data, c(gender, age, q1)) %>%
  count(name, value)
#> Warning: `gender` and `age` have conflicting value labels.
#> ℹ Labels for these values will be taken from `gender`
#> x Values: 1, 2, 3
#> Warning: `gender` and `q1` have conflicting value labels.
#> ℹ Labels for these values will be taken from `gender`
#> x Values: 1, 2
#> # A tibble: 11 × 3
#>    name        value     n
#>    <chr>   <int+lbl> <int>
#>  1 age    1 [Male]      16
#>  2 age    2 [Female]    12
#>  3 age    3 [Other]     17
#>  4 age    4 [45-54]     17
#>  5 age    5 [55-64]     21
#>  6 age    6 [65+]       17
#>  7 gender 1 [Male]      31
#>  8 gender 2 [Female]    39
#>  9 gender 3 [Other]     30
#> 10 q1     1 [Male]      46
#> 11 q1     2 [Female]    54

Created on 2022-03-24 by the reprex package (v2.0.1)

@hadley, any thoughts on the warning message? Too verbose?

@hadley
Member

hadley commented Mar 23, 2022

@gorcha that warning looks great to me!
