Selecting count of missing values in the result of fct_count() #151
Comments
I think the forcats philosophy is to consistent making explicitly missing factors (i.e. where NA occurs in the levels). So before we can change
I'm going to work on this at tidyverse developer day 2019.
I applied most of the functions in the forcats package (all the ones I thought relevant) to the example from the forcats GitHub home page (month abbreviations), with missing values added. fct_count() and fct_explicit_na() seem to be the only functions that actually acknowledge NA values in the input vector, and fct_count() seems to be the only one that allows for and creates an NA factor level. So the philosophy does seem to be inconsistent across the package. Since fct_explicit_na() is built specifically for handling missing values, I agree with Hadley that fct_count() should probably be the function that changes, so that it no longer creates the NA level, rather than changing every other function. Users would then have to use fct_explicit_na() to address missing values before doing anything else with forcats.
#### Test NA Level Consistency Across forcats Functions ####
library(forcats)
x1 <- c(rep(c("Dec", "Apr", "Jan", "Mar"), each = 4), NA, NA)
x2 <- factor(x1)
#### as_factor ####
y1 <- as_factor(x1)
y1
#> [1] Dec Dec Dec Dec Apr Apr Apr Apr Jan Jan Jan Jan Mar Mar
#> [15] Mar Mar <NA> <NA>
#> Levels: Dec Apr Jan Mar
levels(y1)
#> [1] "Dec" "Apr" "Jan" "Mar"
x2
#> [1] Dec Dec Dec Dec Apr Apr Apr Apr Jan Jan Jan Jan Mar Mar
#> [15] Mar Mar <NA> <NA>
#> Levels: Apr Dec Jan Mar
y2 <- as_factor(x2)
levels(y2)
#> [1] "Apr" "Dec" "Jan" "Mar"
# NA has not become a level of y1 or y2
#### fct_anon ####
y2 <- fct_anon(x2)
y2
#> [1] 3 3 3 3 4 4 4 4 2 2 2 2 1 1
#> [15] 1 1 <NA> <NA>
#> Levels: 1 2 3 4
levels(y2)
#> [1] "1" "2" "3" "4"
# NA is still not included as a level
#### fct_c ####
y3 <- fct_c(x2, x3)
#> Error in rlang::dots_list(...): object 'x3' not found
levels(y3)
#> Error in levels(y3): object 'y3' not found
# x3 was never defined, so fct_c() errors here and its NA handling is not
# actually tested by this call
#### fct_collapse ####
fct_collapse(x2,
missing = "NA",
winter = c("Dec", "Jan"),
spring = c("Apr", "Mar"))
#> Warning: Unknown levels in `f`: NA
#> [1] winter winter winter winter spring spring spring spring winter winter
#> [11] winter winter spring spring spring spring <NA> <NA>
#> Levels: spring winter
fct_collapse(x1,
missing = "NA",
winter = c("Dec", "Jan"),
spring = c("Apr", "Mar"))
#> Warning: Unknown levels in `f`: NA
#> [1] winter winter winter winter spring spring spring spring winter winter
#> [11] winter winter spring spring spring spring <NA> <NA>
#> Levels: spring winter
# Warning here about no NA level
fct_collapse(x1,
missing = NA,
winter = c("Dec", "Jan"),
spring = c("Apr", "Mar"))
#> Warning: Unknown levels in `f`: NA
#> [1] winter winter winter winter spring spring spring spring winter winter
#> [11] winter winter spring spring spring spring <NA> <NA>
#> Levels: spring winter
# Same warning about no NA level
fct_collapse(x2,
winter = c("Dec", "Jan"),
spring = c("Apr", "Mar"))
#> [1] winter winter winter winter spring spring spring spring winter winter
#> [11] winter winter spring spring spring spring <NA> <NA>
#> Levels: spring winter
fct_collapse(x1,
winter = c("Dec", "Jan"),
spring = c("Apr", "Mar"))
#> [1] winter winter winter winter spring spring spring spring winter winter
#> [11] winter winter spring spring spring spring <NA> <NA>
#> Levels: spring winter
# Collapse works, but NA values persist with no NA level
#### fct_count ####
y <- fct_count(x2)
y
#> # A tibble: 5 x 2
#> f n
#> <fct> <int>
#> 1 Apr 4
#> 2 Dec 4
#> 3 Jan 4
#> 4 Mar 4
#> 5 <NA> 2
y$f
#> [1] Apr Dec Jan Mar <NA>
#> Levels: Apr Dec Jan Mar <NA>
# NA has become a factor level here. The level is NA and not "NA"
y2 <- fct_count(x1)
y2
#> # A tibble: 5 x 2
#> f n
#> <fct> <int>
#> 1 Apr 4
#> 2 Dec 4
#> 3 Jan 4
#> 4 Mar 4
#> 5 <NA> 2
y2$f
#> [1] Apr Dec Jan Mar <NA>
#> Levels: Apr Dec Jan Mar <NA>
# NA has become a factor level here. The level is NA and not "NA"
#### fct_expand ####
fct_expand(x2, NA)
#> Error: Can't convert a logical vector to a character vector
fct_expand(x1, NA)
#> Error: Can't convert a logical vector to a character vector
# Error here: Can't convert a logical vector to a character vector. The NA
# factor level in y and y2 above is not character (not in quotes), so this
# fct_expand() call might have been expected to work.
#### fct_explicit_na ####
fct_explicit_na(x2)
#> [1] Dec Dec Dec Dec Apr Apr Apr
#> [8] Apr Jan Jan Jan Jan Mar Mar
#> [15] Mar Mar (Missing) (Missing)
#> Levels: Apr Dec Jan Mar (Missing)
fct_explicit_na(x1)
#> [1] Dec Dec Dec Dec Apr Apr Apr
#> [8] Apr Jan Jan Jan Jan Mar Mar
#> [15] Mar Mar (Missing) (Missing)
#> Levels: Apr Dec Jan Mar (Missing)
# This works, but the missing values and factor level are character -- not NA
fct_explicit_na(x2, na_level = NA)
#> Error: Can't convert a logical vector to a character vector
# Error: Can't convert a logical vector to a character vector. We want to do the
# same thing as above, but actually use NA as the factor level like fct_count()
# does. This doesn't work.
fct_explicit_na(y$f)
#> [1] Apr Dec Jan Mar (Missing)
#> Levels: Apr Dec Jan Mar (Missing)
# This overwrites both the NA values and the NA factor level
#### fct_inorder ####
fct_inorder(x2)
#> [1] Dec Dec Dec Dec Apr Apr Apr Apr Jan Jan Jan Jan Mar Mar
#> [15] Mar Mar <NA> <NA>
#> Levels: Dec Apr Jan Mar
# NA is not a level here
fct_inorder(x1)
#> [1] Dec Dec Dec Dec Apr Apr Apr Apr Jan Jan Jan Jan Mar Mar
#> [15] Mar Mar <NA> <NA>
#> Levels: Dec Apr Jan Mar
# NA is not a level here
#### fct_infreq ####
fct_infreq(x2)
#> [1] Dec Dec Dec Dec Apr Apr Apr Apr Jan Jan Jan Jan Mar Mar
#> [15] Mar Mar <NA> <NA>
#> Levels: Apr Dec Jan Mar
# NA is not a level here
fct_infreq(x1)
#> [1] Dec Dec Dec Dec Apr Apr Apr Apr Jan Jan Jan Jan Mar Mar
#> [15] Mar Mar <NA> <NA>
#> Levels: Apr Dec Jan Mar
# NA is not a level here
#### fct_lump ####
fct_lump(x2)
#> [1] Dec Dec Dec Dec Apr Apr Apr Apr Jan Jan Jan Jan Mar Mar
#> [15] Mar Mar <NA> <NA>
#> Levels: Apr Dec Jan Mar
# NA values not included in lumping and NA still not a factor level
fct_lump(x1)
#> [1] Dec Dec Dec Dec Apr Apr Apr Apr Jan Jan Jan Jan Mar Mar
#> [15] Mar Mar <NA> <NA>
#> Levels: Apr Dec Jan Mar
# NA values not included in lumping and NA still not a factor level
#### fct_unique ####
fct_unique(x2)
#> [1] Apr Dec Jan Mar
#> Levels: Apr Dec Jan Mar
# NA values not included here and NA still not a factor level
Created on 2019-01-19 by the reprex package (v0.2.1)
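The distinction the comments above keep returning to (NA as a missing value versus NA as a factor level) can be reproduced with base R alone. This is only a minimal sketch on a hypothetical three-element vector, not forcats code; it assumes nothing beyond `factor()` and `addNA()`.

```r
# Hypothetical toy input, shortened from the month example above
x <- c("Dec", "Apr", NA)

f_plain <- factor(x)        # NA stays a missing value; it is not a level
f_level <- addNA(f_plain)   # NA is added as an extra level

levels(f_plain)
#> [1] "Apr" "Dec"
levels(f_level)
#> [1] "Apr" "Dec" NA

# is.na() looks at the integer codes, so it misses elements whose level is NA
is.na(f_plain)
#> [1] FALSE FALSE  TRUE
is.na(f_level)
#> [1] FALSE FALSE FALSE

# To find them, test the levels and index by the factor's codes
is.na(levels(f_level))[f_level]
#> [1] FALSE FALSE  TRUE
```

This is presumably why a result whose NA row is stored as a level cannot be picked out with a plain `is.na()` test.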
Thanks for the investigation @hglanz!
Unlike in dplyr::count(), it's not possible to select the count of missing values in the result of fct_count() unless it is filtered with levels.
Created on 2018-10-19 by the reprex package (v0.2.0).
I don't understand the magic behind dplyr and why it works, but maybe including an na.rm = FALSE option and changing f2 <- addNA(f, ifany = TRUE) in fct_count() into something like this could be useful?
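As a rough illustration of the idea only (this is not the snippet from the issue and not forcats code), a counting helper with an `na.rm` argument might look like the base R sketch below. The name `count_levels()` and its exact behaviour are assumptions made for the example: missing values get their own row, stored as a true NA value rather than as an NA level, so that `is.na()` can select it.

```r
# Hypothetical helper, not part of forcats: count levels and, unless
# na.rm = TRUE, report missing values in a row that is.na() can find.
count_levels <- function(f, na.rm = FALSE) {
  if (!is.factor(f)) f <- factor(f)
  lv <- levels(f)
  vals <- lv
  n <- tabulate(f, nbins = length(lv))  # tabulate() silently ignores NA
  n_missing <- sum(is.na(f))
  if (!na.rm && n_missing > 0) {
    vals <- c(vals, NA)
    n <- c(n, n_missing)
  }
  # NA is kept as a missing value here, not turned into a level
  data.frame(f = factor(vals, levels = lv), n = n)
}

x <- c(rep(c("Dec", "Apr", "Jan", "Mar"), each = 4), NA, NA)
res <- count_levels(x)

# The count of missing values can now be selected directly
res$n[is.na(res$f)]
#> [1] 2
```

Keeping NA as a value instead of a level is a design choice for this sketch; it trades the explicit NA level for a result that filters the same way as an ordinary column containing NA.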