Selecting count of missing values in the result of fct_count() #151

Closed
mdjeric opened this issue Oct 19, 2018 · 4 comments

Labels
bug an unexpected problem or unintended behavior

Comments

mdjeric commented Oct 19, 2018

Unlike with the result of dplyr::count(), the count of missing values in the result of fct_count() can't be selected (or dropped) with is.na(f). It only works when filtering on the levels, i.e. with is.na(levels(f)).

library(tidyverse)

f1 <- c("A", "A", "B", NA)

fct_count(f1) %>% 
  filter(!is.na(f))
#> # A tibble: 3 x 2
#>   f         n
#>   <fct> <int>
#> 1 A         2
#> 2 B         1
#> 3 <NA>      1

fct_count(f1) %>% 
  filter(!is.na(levels(f)))
#> # A tibble: 2 x 2
#>   f         n
#>   <fct> <int>
#> 1 A         2
#> 2 B         1

Created on 2018-10-19 by the reprex package (v0.2.0).
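
For anyone wondering why the first filter() keeps the <NA> row: a minimal base R sketch, assuming the f column behaves like any factor that has NA as a level (which is what addNA() produces). In that case the missing value is stored as a regular level code, so is.na() on the factor itself is never TRUE. The vector below is made up for illustration.

```r
# A factor where NA is an actual level, as produced by addNA()
f <- addNA(factor(c("A", "A", "B", NA)))

levels(f)          # "A" "B" NA
as.integer(f)      # 1 1 2 3  (the NA level has its own integer code)

# is.na() looks at the codes, and none of them is missing
is.na(f)           # FALSE FALSE FALSE FALSE

# Testing the levels finds the NA level
is.na(levels(f))   # FALSE FALSE TRUE

# Per-element test that works for any factor with an NA level
is.na(as.character(f))  # FALSE FALSE FALSE TRUE
```

Note that filter(!is.na(levels(f))) only lines up in the example above because the count table has exactly one row per level, in level order. For a general factor, is.na(as.character(f)) is the safer per-element test.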

I don't understand the magic behind dplyr and why it works there, but maybe fct_count() could take an na.rm = FALSE argument, with the line f2 <- addNA(f, ifany = TRUE) changed into something like this?

    if (!na.rm) {
      f2 <- addNA(f, ifany = TRUE)
    } else {
      f2 <- f
    }
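
To make the suggestion concrete, here is a rough base R sketch of a counting function with such an argument. count_levels() is a hypothetical name and this is not the forcats implementation. Only the addNA(f, ifany = TRUE) line is taken from the issue; everything else is an assumption for illustration.

```r
# Hypothetical helper, NOT forcats::fct_count(): counts values per level,
# optionally dropping missing values instead of adding an NA level.
count_levels <- function(f, na.rm = FALSE) {
  f <- factor(f)
  if (!na.rm) {
    f2 <- addNA(f, ifany = TRUE)  # NA becomes a level only if NAs are present
  } else {
    f2 <- f
  }
  codes <- as.integer(f2)
  # Drop missing codes explicitly before tabulating
  n <- tabulate(codes[!is.na(codes)], nbins = nlevels(f2))
  data.frame(
    f = factor(levels(f2), levels = levels(f2), exclude = NULL),
    n = n
  )
}

f1 <- c("A", "A", "B", NA)
with_na    <- count_levels(f1)                # rows: A, B, <NA>
without_na <- count_levels(f1, na.rm = TRUE)  # rows: A, B
```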
Member

hadley commented Jan 4, 2019

I think the forcats philosophy is to consistently make missing values explicit in factors (i.e. where NA occurs in the levels). So before we can change fct_count(), I think it would require explicitly writing up this philosophy and checking that it's being consistently applied. If it's not, we could consider changing it.

@hadley hadley added the feature a feature request or enhancement label Jan 4, 2019

hglanz commented Jan 19, 2019

I'm going to work on this at tidyverse developer day 2019.


hglanz commented Jan 19, 2019

I have applied most of the functions in the forcats package (all of the ones I thought relevant) to the example used on the forcats GitHub home page (month abbreviations), with missing values added. fct_count() and fct_explicit_na() seem to be the only functions that actually acknowledge NA values in the input vector, and fct_count() seems to be the only one that allows for and creates an NA factor level.

So it does seem the philosophy is applied inconsistently across the package. Since fct_explicit_na() seems built specifically for handling missing values, I agree with Hadley that fct_count() should probably be the function that changes (so that it no longer creates the NA level), rather than every other function. Users, then, must use fct_explicit_na() to address missing values before doing anything else with forcats.
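
As a rough illustration of that "make missing values explicit first" workflow, here is a base R sketch. explicit_na() is a hypothetical stand-in written for this comment, not the code behind fct_explicit_na(). The "(Missing)" label matches the default shown in the reprex output below; the rest is an assumption.

```r
# Hypothetical base R stand-in for the idea behind fct_explicit_na():
# turn implicit NA values into a regular, named factor level.
explicit_na <- function(f, na_level = "(Missing)") {
  f <- factor(f)
  if (any(is.na(f))) {
    levels(f) <- c(levels(f), na_level)  # append the new level
    f[is.na(f)] <- na_level              # recode the missing values
  }
  f
}

x <- c("A", "A", "B", NA)
g <- explicit_na(x)

levels(g)   # "A" "B" "(Missing)"
table(g)    # the missing value is counted under "(Missing)"
```

Once the missing values are an ordinary level, any counting or filtering step can treat them like every other level, without needing is.na() at all.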

#### Test NA Level Consistency Across forcats Functions ####
library(forcats)


x1 <- c(rep(c("Dec", "Apr", "Jan", "Mar"), each = 4), NA, NA)
x2 <- factor(x1)


#### as_factor ####
y1 <- as_factor(x1)
y1
#>  [1] Dec  Dec  Dec  Dec  Apr  Apr  Apr  Apr  Jan  Jan  Jan  Jan  Mar  Mar 
#> [15] Mar  Mar  <NA> <NA>
#> Levels: Dec Apr Jan Mar
levels(y1)
#> [1] "Dec" "Apr" "Jan" "Mar"

x2
#>  [1] Dec  Dec  Dec  Dec  Apr  Apr  Apr  Apr  Jan  Jan  Jan  Jan  Mar  Mar 
#> [15] Mar  Mar  <NA> <NA>
#> Levels: Apr Dec Jan Mar
y2 <- as_factor(x2)
levels(y2)
#> [1] "Apr" "Dec" "Jan" "Mar"

# NA has not become a level of y1 or y2

#### fct_anon ####
y2 <- fct_anon(x2)
y2
#>  [1] 3    3    3    3    4    4    4    4    2    2    2    2    1    1   
#> [15] 1    1    <NA> <NA>
#> Levels: 1 2 3 4
levels(y2)
#> [1] "1" "2" "3" "4"

# NA is still not included as a level

#### fct_c ####
y3 <- fct_c(x2, x3)
#> Error in rlang::dots_list(...): object 'x3' not found
levels(y3)
#> Error in levels(y3): object 'y3' not found

# x3 was never defined, so this call errors and fct_c() is not actually tested here

#### fct_collapse ####
fct_collapse(x2,
             missing = "NA",
             winter = c("Dec", "Jan"),
             spring = c("Apr", "Mar"))
#> Warning: Unknown levels in `f`: NA
#>  [1] winter winter winter winter spring spring spring spring winter winter
#> [11] winter winter spring spring spring spring <NA>   <NA>  
#> Levels: spring winter

fct_collapse(x1,
             missing = "NA",
             winter = c("Dec", "Jan"),
             spring = c("Apr", "Mar"))
#> Warning: Unknown levels in `f`: NA
#>  [1] winter winter winter winter spring spring spring spring winter winter
#> [11] winter winter spring spring spring spring <NA>   <NA>  
#> Levels: spring winter

# Warning here about no NA level

fct_collapse(x1,
             missing = NA,
             winter = c("Dec", "Jan"),
             spring = c("Apr", "Mar"))
#> Warning: Unknown levels in `f`: NA
#>  [1] winter winter winter winter spring spring spring spring winter winter
#> [11] winter winter spring spring spring spring <NA>   <NA>  
#> Levels: spring winter

# Same warning about no NA level

fct_collapse(x2,
             winter = c("Dec", "Jan"),
             spring = c("Apr", "Mar"))
#>  [1] winter winter winter winter spring spring spring spring winter winter
#> [11] winter winter spring spring spring spring <NA>   <NA>  
#> Levels: spring winter

fct_collapse(x1,
             winter = c("Dec", "Jan"),
             spring = c("Apr", "Mar"))
#>  [1] winter winter winter winter spring spring spring spring winter winter
#> [11] winter winter spring spring spring spring <NA>   <NA>  
#> Levels: spring winter


# Collapse works, but NA values persist with no NA level

#### fct_count ####
y <- fct_count(x2)
y
#> # A tibble: 5 x 2
#>   f         n
#>   <fct> <int>
#> 1 Apr       4
#> 2 Dec       4
#> 3 Jan       4
#> 4 Mar       4
#> 5 <NA>      2
y$f
#> [1] Apr  Dec  Jan  Mar  <NA>
#> Levels: Apr Dec Jan Mar <NA>

# NA has become a factor level here. The level is NA and not "NA"

y2 <- fct_count(x1)
y2
#> # A tibble: 5 x 2
#>   f         n
#>   <fct> <int>
#> 1 Apr       4
#> 2 Dec       4
#> 3 Jan       4
#> 4 Mar       4
#> 5 <NA>      2
y2$f
#> [1] Apr  Dec  Jan  Mar  <NA>
#> Levels: Apr Dec Jan Mar <NA>

# NA has become a factor level here. The level is NA and not "NA"


#### fct_expand ####
fct_expand(x2, NA)
#> Error: Can't convert a logical vector to a character vector

fct_expand(x1, NA)
#> Error: Can't convert a logical vector to a character vector

# Error here: Can't convert a logical vector to a character vector. The NA
# factor level in y and y2 above is not character (not in quotes), so this
# fct_expand() call might have been expected to work.


#### fct_explicit_na ####
fct_explicit_na(x2)
#>  [1] Dec       Dec       Dec       Dec       Apr       Apr       Apr      
#>  [8] Apr       Jan       Jan       Jan       Jan       Mar       Mar      
#> [15] Mar       Mar       (Missing) (Missing)
#> Levels: Apr Dec Jan Mar (Missing)

fct_explicit_na(x1)
#>  [1] Dec       Dec       Dec       Dec       Apr       Apr       Apr      
#>  [8] Apr       Jan       Jan       Jan       Jan       Mar       Mar      
#> [15] Mar       Mar       (Missing) (Missing)
#> Levels: Apr Dec Jan Mar (Missing)

# This works, but the missing values and factor level are character -- not NA

fct_explicit_na(x2, na_level = NA)
#> Error: Can't convert a logical vector to a character vector

# Error: Can't convert a logical vector to a character vector. We want to do the
# same thing as above, but actually use NA as the factor level like fct_count()
# does. This doesn't work.

fct_explicit_na(y$f)
#> [1] Apr       Dec       Jan       Mar       (Missing)
#> Levels: Apr Dec Jan Mar (Missing)

# This overwrites both the NA values and the NA factor level

#### fct_inorder ####
fct_inorder(x2)
#>  [1] Dec  Dec  Dec  Dec  Apr  Apr  Apr  Apr  Jan  Jan  Jan  Jan  Mar  Mar 
#> [15] Mar  Mar  <NA> <NA>
#> Levels: Dec Apr Jan Mar

# NA is not a level here

fct_inorder(x1)
#>  [1] Dec  Dec  Dec  Dec  Apr  Apr  Apr  Apr  Jan  Jan  Jan  Jan  Mar  Mar 
#> [15] Mar  Mar  <NA> <NA>
#> Levels: Dec Apr Jan Mar

# NA is not a level here

#### fct_infreq ####
fct_infreq(x2)
#>  [1] Dec  Dec  Dec  Dec  Apr  Apr  Apr  Apr  Jan  Jan  Jan  Jan  Mar  Mar 
#> [15] Mar  Mar  <NA> <NA>
#> Levels: Apr Dec Jan Mar

# NA is not a level here

fct_infreq(x1)
#>  [1] Dec  Dec  Dec  Dec  Apr  Apr  Apr  Apr  Jan  Jan  Jan  Jan  Mar  Mar 
#> [15] Mar  Mar  <NA> <NA>
#> Levels: Apr Dec Jan Mar

# NA is not a level here

#### fct_lump ####
fct_lump(x2)
#>  [1] Dec  Dec  Dec  Dec  Apr  Apr  Apr  Apr  Jan  Jan  Jan  Jan  Mar  Mar 
#> [15] Mar  Mar  <NA> <NA>
#> Levels: Apr Dec Jan Mar

# NA values not included in lumping and NA still not a factor level

fct_lump(x1)
#>  [1] Dec  Dec  Dec  Dec  Apr  Apr  Apr  Apr  Jan  Jan  Jan  Jan  Mar  Mar 
#> [15] Mar  Mar  <NA> <NA>
#> Levels: Apr Dec Jan Mar

# NA values not included in lumping and NA still not a factor level

#### fct_unique ####
fct_unique(x2)
#> [1] Apr Dec Jan Mar
#> Levels: Apr Dec Jan Mar

# NA values not included here and NA still not a factor level

Created on 2019-01-19 by the reprex package (v0.2.1)

@hadley hadley added bug an unexpected problem or unintended behavior and removed feature a feature request or enhancement labels Feb 28, 2020
Member

hadley commented Feb 28, 2020

Thanks for the investigation @hglanz!

@hadley hadley closed this as completed in 4220934 Feb 28, 2020