-
Notifications
You must be signed in to change notification settings - Fork 127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature proposal: enhanced base::factor() with coherance warnings #299
Comments
I think this seems like a good idea in principle, but there are some tricky details, like what should this function be called, and should it warn or error? (warnings are easily missed). |
Indeed, I cannot see any case when this "replace unknown with NA" is a wanted behavior, so an error is definitely more appropriate. Maybe an argument could be added for more flexibility, something like For the name, I liked |
I like the name |
Here is another idea for fct = function(x=character(), ..., levels, labels=levels)
fct(iris$Species, setosa="S", versicolor="Ve", virginica="Vi")
fct(iris$Species, setosa="S", versicolor="Ve") #warning, level "virginica" not labelled This also asks the more global question: should a missed level be set missing like |
By coincidence, we're looking at factor-related issues in vroom/readr right now. FWIW, in terms of what's going on elsewhere in the tidyverse, I'm linking to the dev documentation, because I happened to work on it today: https://readr.tidyverse.org/dev/reference/parse_factor.html I want to push back a bit on this particular framing:
Often there is external knowledge that informs explicit factor levels, but there's no reason to believe that every instance of such a factor must exhibit at least one specimen of each level. Examples: birth month, day of week, blood type. I frame it more like so: if you care and know enough to set factor levels explicitly, you probably want to know if the input data exhibits values outside that set. I don't think we disagree about the goal; this is more a comment about wording. Also relevant: most of the functions we're talking about here also have a way to explicitly declare values that should become |
Something like this maybe: fct <- function(x = character(), levels = unique(x), na = character()) {
x[x %in% na] <- NA
invalid <- setdiff(x, c(levels, NA))
if (length(invalid) > 0 ) {
cli::cli_abort(c(
"Values of {.arg x} must be members of {.arg levels}",
i = "Invalid value{?s}: {.str {invalid}}"
))
}
factor(x, levels = levels, exclude = NULL)
}
fct(c("x", "y", "z"))
#> [1] x y z
#> Levels: x y z
fct(c("x", "y", "z"), c("x", "y"))
#> Error in `fct()`:
#> ! Values of `x` must be members of `levels`
#> ℹ Invalid value: "z"
fct(c("x", "y", "z"), c("x", "y"), na = "z")
#> [1] x y <NA>
#> Levels: x y
fct(c("x", "y", "z"), c("x", "y", NA), na = "z")
#> [1] x y <NA>
#> Levels: x y <NA> Created on 2022-05-04 by the reprex package (v2.0.1) |
Note to self: if this implemented before R4DS 2e is published, need to replace |
@jennybc do you think the idea of a set of valid levels is such a key part of factors that you should be forced to supply it on creation? |
No, that feels too extreme to me. Level discovery seems super useful most of the time. Looking at the proposal above I felt compelled to check this example: fct(c("x", "y", "z"), na = "z")
#> [1] x y <NA>
#> Levels: x y <NA> because I predicted that The outcome feels correct, but it does feel a bit magical / subtle. If the user specifies x <- c("x", "y", "z")
fct(x, levels = unique(x), na = "z")
#> [1] x y <NA>
#> Levels: x y z
fct( c("x", "y", "z"), levels = unique(c("x", "y", "z")), na = "z")
#> [1] x y <NA>
#> Levels: x y z |
So maybe make Otherwise, I think if you're supplying both |
Yeah I think that does what people expect and has less potential for confusion.
Agreed. |
Plus a bit more argument checking: fct <- function(x = character(), levels = NULL, na = character()) {
if (!is.character(x)) {
cli::cli_abort("{.arg x} must be a character vector")
}
if (!is.character(na)) {
cli::cli_abort("{.arg na} must be a character vector")
}
x[x %in% na] <- NA
if (is.null(levels)) {
levels <- unique(x)
} else if (!is.character(levels)) {
abort("`{.arg levels} must be a character vector")
}
invalid <- setdiff(x, c(levels, NA))
if (length(invalid) > 0 ) {
cli::cli_abort(c(
"Values of {.arg x} must be members of {.arg levels}",
i = "Invalid value{?s}: {.str {invalid}}"
))
}
factor(x, levels = levels, exclude = NULL)
}
fct(c("x", "y", "z"))
fct(c("x", "y", "z"), c("x", "y"))
fct(c("x", "y", "z"), na = "z")
fct(c("x", "y", "z"), c("x", "y"), na = "z")
fct(c("x", "y", "z"), c("x", "y", NA), na = "z") |
The more I review my code and others', the more I realize that
base::factor()
causes a lot of problems by not throwing warnings when encountering an unknown level. Instead, it silently generatesNA
, which can cause heavy misunderstanding later.Since it makes little sense to me to specify levels that do not exist in the actual input, I would expect some verbosity about it.
Here is an example:
Before calling
table()
, the user has no idea that the previous call had "failing" cases.Here is the function I'm using instead:
Created on 2022-02-13 by the reprex package (v2.0.1)
As more and more people rely on
tidyverse
to write cleaner code, I would guess this could belong here.The text was updated successfully, but these errors were encountered: