Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature proposal: enhanced base::factor() with coherance warnings #299

Closed
DanChaltiel opened this issue Feb 13, 2022 · 12 comments · Fixed by #305
Closed

Feature proposal: enhanced base::factor() with coherance warnings #299

DanChaltiel opened this issue Feb 13, 2022 · 12 comments · Fixed by #305
Labels
feature a feature request or enhancement

Comments

@DanChaltiel
Copy link

The more I review my code and others', the more I realize that base::factor() causes a lot of problems by not throwing warnings when encountering an unknown level. Instead, it silently generates NA, which can cause heavy misunderstanding later.

Since it makes little sense to me to specify levels that do not exist in the actual input, I would expect some verbosity about it.

Here is an example:

library(glue)
library(rlang)
library(magrittr)

df_labels = read.table(header=TRUE, text="
    level label
    setsa SETO                #typo
    verssicolor VERSICO #typo
    virginica VIRGINI
")

x=as.character(iris$Species)

f1 = factor(x, levels=df_labels$level, labels=df_labels$label)
table(f1)
#> f1
#>    SETO VERSICO VIRGINI 
#>       0       0      50

Before calling table(), the user has no idea that the previous call had "failing" cases.

Here is the function I'm using instead:

fct = function(x=character(), levels, labels=levels, ...){
    miss_x = !x %in% levels
    if(any(miss_x)){
        miss_x_s = unique(x[miss_x]) %>% glue_collapse(", ")
        warn(c("Unknown factor level in `x`, NA generated.", 
               x=glue("Unknown levels: {miss_x_s}")))
    }
    factor(x, levels, labels, ...)
}
f2 = fct(x, levels=df_labels$level, labels=df_labels$label)
#> Warning: Unknown factor level in `x`, NA generated.
#> x Unknown levels: setosa, versicolor
table(f2)
#> f2
#>    SETO VERSICO VIRGINI 
#>       0       0      50

Created on 2022-02-13 by the reprex package (v2.0.1)

As more and more people rely on tidyverse to write cleaner code, I would guess this could belong here.

@hadley hadley added the feature a feature request or enhancement label Mar 2, 2022
@hadley
Copy link
Member

hadley commented Mar 2, 2022

I think this seems like a good idea in principle, but there are some tricky details, like what should this function be called, and should it warn or error? (warnings are easily missed).

@DanChaltiel
Copy link
Author

Indeed, I cannot see any case when this "replace unknown with NA" is a wanted behavior, so an error is definitely more appropriate. Maybe an argument could be added for more flexibility, something like fail_if_unknown=TRUE.

For the name, I liked fct(), as this function would be the very base of forcat API, but otherwise you might like fct_create(), fct_encode(), fct_coerce(), ... A bolder but nice move would be to even override base and bluntly call it factor(). Any code that breaks because of this was most likely rather bad code, don't you think?

@huftis
Copy link

huftis commented Mar 3, 2022

I like the name fct(). It’s easy to remember that using fct() is safer than using factor(). And I don’t thinking overriding the base function is a good idea. It will break tons of code (sometimes in a good way, but breaking code is a bad idea).

@DanChaltiel
Copy link
Author

Here is another idea for fct() that I really miss from base::factor(): use the ellipsis to propose a level=label syntax:

fct = function(x=character(), ..., levels, labels=levels)
fct(iris$Species, setosa="S", versicolor="Ve", virginica="Vi")
fct(iris$Species, setosa="S", versicolor="Ve") #warning, level "virginica" not labelled

This also asks the more global question: should a missed level be set missing like base::factor() or should it be kept unchanged?

@jennybc
Copy link
Member

jennybc commented Mar 18, 2022

By coincidence, we're looking at factor-related issues in vroom/readr right now.

FWIW, in terms of what's going on elsewhere in the tidyverse, parse_factor() and col_factor() throw a warning and report problems if levels are explicitly specified and some element of the input is not found in those levels (and also does not match an na string).

I'm linking to the dev documentation, because I happened to work on it today:

https://readr.tidyverse.org/dev/reference/parse_factor.html


I want to push back a bit on this particular framing:

Since it makes little sense to me to specify levels that do not exist in the actual input ...

Often there is external knowledge that informs explicit factor levels, but there's no reason to believe that every instance of such a factor must exhibit at least one specimen of each level. Examples: birth month, day of week, blood type.

I frame it more like so: if you care and know enough to set factor levels explicitly, you probably want to know if the input data exhibits values outside that set. I don't think we disagree about the goal; this is more a comment about wording.

Also relevant: most of the functions we're talking about here also have a way to explicitly declare values that should become NA: factor(exclude =), read.table(na.strings =), readr::read_*(na =), readr::parse_factor(na =).

@hadley
Copy link
Member

hadley commented May 4, 2022

Something like this maybe:

fct <- function(x = character(), levels = unique(x), na = character()) {
  x[x %in% na] <- NA
  
  invalid <- setdiff(x, c(levels, NA))
  
  if (length(invalid) > 0 ) {
    cli::cli_abort(c(
      "Values of {.arg x} must be members of {.arg levels}", 
      i = "Invalid value{?s}: {.str {invalid}}"
    ))
  }
  factor(x, levels = levels, exclude = NULL)
}

fct(c("x", "y", "z"))
#> [1] x y z
#> Levels: x y z
fct(c("x", "y", "z"), c("x", "y"))
#> Error in `fct()`:
#> ! Values of `x` must be members of `levels`
#> ℹ Invalid value: "z"
fct(c("x", "y", "z"), c("x", "y"), na = "z")
#> [1] x    y    <NA>
#> Levels: x y
fct(c("x", "y", "z"), c("x", "y", NA), na = "z")
#> [1] x    y    <NA>
#> Levels: x y <NA>

Created on 2022-05-04 by the reprex package (v2.0.1)

@hadley
Copy link
Member

hadley commented May 4, 2022

Note to self: if this implemented before R4DS 2e is published, need to replace readr::parse_factor() with fctr().

@hadley
Copy link
Member

hadley commented May 4, 2022

@jennybc do you think the idea of a set of valid levels is such a key part of factors that you should be forced to supply it on creation?

@jennybc
Copy link
Member

jennybc commented May 5, 2022

do you think the idea of a set of valid levels is such a key part of factors that you should be forced to supply it on creation?

No, that feels too extreme to me. Level discovery seems super useful most of the time.

Looking at the proposal above I felt compelled to check this example:

fct(c("x", "y", "z"), na = "z")
#> [1] x    y    <NA>
#> Levels: x y <NA>

because I predicted that "z" would be among the levels and that the last element of the return value would be NA. I was wrong. Because the default of unique(x) doesn't get forced until after x[x %in% na] <- NA.

The outcome feels correct, but it does feel a bit magical / subtle. If the user specifies unique(x) for themselves, they get a different outcome:

x <- c("x", "y", "z")
fct(x, levels = unique(x), na = "z")
#> [1] x    y    <NA>
#> Levels: x y z

fct( c("x", "y", "z"), levels = unique(c("x", "y", "z")), na = "z")
#> [1] x    y    <NA>
#> Levels: x y z

@hadley
Copy link
Member

hadley commented May 5, 2022

So maybe make levels default to NULL and then compute explicitly?

Otherwise, I think if you're supplying both levels and na, it's up to you do so consistently.

@jennybc
Copy link
Member

jennybc commented May 5, 2022

So maybe make levels default to NULL and then compute explicitly?

Yeah I think that does what people expect and has less potential for confusion.

Otherwise, I think if you're supplying both levels and na, it's up to you do so consistently.

Agreed.

@hadley
Copy link
Member

hadley commented May 5, 2022

Plus a bit more argument checking:

fct <- function(x = character(), levels = NULL, na = character()) {
  if (!is.character(x)) {
    cli::cli_abort("{.arg x} must be a character vector")
  }
  if (!is.character(na)) {
    cli::cli_abort("{.arg na} must be a character vector")
  }
  
  x[x %in% na] <- NA
  
  if (is.null(levels)) {
    levels <- unique(x)
  } else if (!is.character(levels)) {
    abort("`{.arg levels} must be a character vector")
  }
  
  invalid <- setdiff(x, c(levels, NA))
  
  if (length(invalid) > 0 ) {
    cli::cli_abort(c(
      "Values of {.arg x} must be members of {.arg levels}", 
      i = "Invalid value{?s}: {.str {invalid}}"
    ))
  }
  factor(x, levels = levels, exclude = NULL)
}

fct(c("x", "y", "z"))
fct(c("x", "y", "z"), c("x", "y"))
fct(c("x", "y", "z"), na = "z")
fct(c("x", "y", "z"), c("x", "y"), na = "z")
fct(c("x", "y", "z"), c("x", "y", NA), na = "z")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants