Preserve zero-length groups #341
Comments
Thanks for opening up this issue Hadley. |
👍 ran into the same issue today, |
Any idea on the time frame for putting a .drop = FALSE equivalent into dplyr? I need this for certain rCharts to render correctly. In the meantime I DID get the answer in your link to work. I grouped by two variables. |
+1 for option to not drop empty groups |
Not dropping empty groups would be very useful. Often needed when creating summary tables. |
+1 - this is a deal-breaker for many analyses |
I agree with all the above--would be very useful. |
@romainfrancois Currently
t1 <- data_frame(
x = runif(10),
g1 = rep(1:2, each = 5),
g2 = factor(g1, 1:3)
)
g1 <- grouped_df(t1, list(quote(g2)), drop = FALSE)
attr(g1, "group_size")
# should be c(5L, 5L, 0L)
attr(g1, "indices")
# should be list(0:4, 5:9, integer(0))
The drop attribute only applies when grouping by a factor, in which case we need to have one group per factor level, regardless of whether or not the level actually applies to the data. This will also affect the single table verbs in the following ways:
Eventually,
Does that make sense? If it's a lot of work, we can push off to 0.4.
@statwonk, @wsurles, @jennybc, @slackline, @mcfrank, @eipi10 If you'd like to help, the best thing to do would be to work on a set of test cases that exercises all the ways the different verbs might interact with zero-length groups. |
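For readers arriving later, the expected group sizes described above can be checked with a short sketch. It assumes dplyr 0.8.0 or later, where `group_by()` takes a `.drop` argument, and it uses `tibble()` and `group_size()` rather than the `grouped_df()` call and attributes quoted in the comment, so it illustrates the requested behaviour rather than the internals discussed here.

```r
library(dplyr)

g1 <- rep(1:2, each = 5)
t1 <- tibble(
  x  = runif(10),
  g1 = g1,
  g2 = factor(g1, levels = 1:3)  # level 3 never occurs in the data
)

# .drop = FALSE asks for one group per factor level, observed or not
grouped <- group_by(t1, g2, .drop = FALSE)

group_size(grouped)          # sizes per level, including the empty level
summarise(grouped, n = n())  # the empty level appears as a row with n = 0
```

Grouping by a non-factor column is not expanded this way, which matches the statement that the drop setting only matters for factors.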
Ah. I think I just did not know what |
I have opened pull request #833 which tests whether the single table verbs above handle zero-length groups correctly. Most of the tests are commented out, because dplyr currently fails them, of course. |
+1 , any status updates here? love summarise, need to keep empty levels! |
@ebergelson, Here is my current hack to get zero-length groups. I often need this so my bar charts will stack. Here df has 3 columns: name, group, and metric
|
I do something similar–check for missing groups, then if any generate all combinations and Unfortunately, it doesn't seem like this issue is getting much love...perhaps because there is this straightforward workaround. |
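The workaround described in the two comments above (summarise, then add back the missing combinations) can be sketched as follows. This is not the commenters' original code, which did not survive in this thread. It assumes a hypothetical data frame `df` with the columns `name`, `group`, and `metric` mentioned above, and it uses `tidyr::complete()` as one possible way to generate all combinations and fill the gaps.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: name "y" has no rows in group "b"
df <- tibble(
  name   = c("x", "x", "y"),
  group  = c("a", "b", "a"),
  metric = c(1, 2, 3)
)

summary_tbl <- df %>%
  group_by(name, group) %>%
  summarise(total = sum(metric), n = n()) %>%
  ungroup() %>%
  # Add a row for every name/group combination missing from the summary,
  # filling the summary columns with zero instead of NA
  complete(name, group, fill = list(total = 0, n = 0L))

summary_tbl
```

The zero-filled rows are what lets stacked bar charts line up, since every `name` then has an entry for every `group`.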
Just to add and agree with everyone above - this is a super critical aspect of many analyses. Would love to see an implementation. |
Some more details needed here: If I have this:
And I group by now, what if I have this:
and I want to group by |
Both |
Shouldn’t it agree with
I would expect |
Does |
All factor levels and combinations of factor levels must be preserved by default. This behavior can be controlled by parameters such as
df <- data.frame(x = 1:2, y = factor(c(1, 1), levels = 1:2))
df %>% dplyr::count(x, y)
#> # A tibble: 4 x 3
#> x y n
#> <int> <fct> <int>
#> 1 1 1 1
#> 2 2 1 1
#> 3 1 2 0
#> 4 2 2 0
Zero-length groups (combinations of groups) can be filtered later. But for exploratory analysis we must see the full picture.
|
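A sketch of how the request above can be written, assuming dplyr 0.8.0 or later, where `count()` accepts a `.drop` argument. In those versions the empty levels are kept only on request rather than by default, and only factor columns are expanded.

```r
library(dplyr)

df <- data.frame(x = 1:2, y = factor(c(1, 1), levels = 1:2))

# .drop = FALSE keeps factor levels that have no rows (y == 2 here)
counts <- count(df, x, y, .drop = FALSE)
counts

# Empty combinations can be filtered out afterwards if they are not wanted
filter(counts, n > 0)
```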
2: yes definitely |
We might get away with this by expanding the data after the fact, something like this:
library(tidyverse)
truly_group_by <- function(data, ...) {
  dots <- quos(...)
  data <- group_by(data, !!!dots)

  # Labels of the observed groups, plus each group's row indices
  labels <- attr(data, "labels")
  labnames <- names(labels)
  labels <- mutate(labels, ..index.. = attr(data, "indices"))

  # Expand to all combinations of the grouping values; combinations that
  # never occur get an empty index vector
  expanded <- labels %>%
    tidyr::expand(!!!dots) %>%
    left_join(labels, by = labnames) %>%
    mutate(..index.. = map(..index.., ~ if (is.null(.x)) integer() else .x))

  indices <- pull(expanded, ..index..)
  group_sizes <- map_int(indices, length)
  labels <- select(expanded, -..index..)

  # Overwrite the grouping metadata with the expanded version
  attr(data, "labels") <- labels
  attr(data, "indices") <- indices
  attr(data, "group_sizes") <- group_sizes
  data
}
df <- data_frame(
x = 1:2,
y = factor(c(1, 1), levels = 1:2)
)
tally( truly_group_by(df, x, y) )
#> # A tibble: 4 x 3
#> # Groups: x [?]
#> x y n
#> <int> <fct> <int>
#> 1 1 1 1
#> 2 1 2 0
#> 3 2 1 1
#> 4 2 2 0
tally( truly_group_by(df, y, x) )
#> # A tibble: 4 x 3
#> # Groups: y [?]
#> y x n
#> <fct> <int> <int>
#> 1 1 1 1
#> 2 1 2 1
#> 3 2 1 0
#> 4 2 2 0
Obviously down the line, this would be handled internally, sans using tidyr or purrr. |
This seems to take care of the original question on SO:
> df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
> df$b = factor(df$b, levels=1:3)
> df %>%
+ group_by(b) %>%
+ summarise(count_a=length(a), .drop=FALSE)
# A tibble: 2 x 3
b count_a .drop
<fct> <int> <lgl>
1 1 6 FALSE
2 2 6 FALSE
> df %>%
+ truly_group_by(b) %>%
+ summarise(count_a=length(a), .drop=FALSE)
# A tibble: 3 x 3
b count_a .drop
<fct> <int> <lgl>
1 1 6 FALSE
2 2 6 FALSE
3 3 0 FALSE |
The key here being this
which means expanding all possibilities regardless of the variables being factors or not. I'd say we either:
perhaps have a function to toggle dropness. This is a relatively cheap operation I'd say because it only involves manipulating the metadata, so perhaps it is less risky to do this in R first ? |
Did you mean Looking at the internals, do you agree that we "only" need to change Can we perhaps start with expanding only factors with group_by(data, crossing(col1, col2), col3) Semantics: Using all combinations of |
Yes, I'd say this only affects The "only expanding factors" part of this discussion is what took so long. What would be the results of these:
library(dplyr)
d <- data_frame(
f1 = factor( rep( c("a", "b"), each = 4 ), levels = c("a", "b", "c") ),
f2 = factor( rep( c("d", "e", "f", "g"), each = 2 ), levels = c("d", "e", "f", "g", "h") ),
x = 1:8,
y = rep( 1:4, each = 2)
)
f <- function(data, ...){
group_by(data, !!!quos(...)) %>%
tally()
}
f(d, f1, f2, x)
f(d, x, f1, f2)
f(d, f1, f2, x, y)
f(d, x, f1, f2, y) |
I think Also interesting:
f(d, f2, x, f1, y)
d %>% sample_frac(0.3) %>% f(...)
I like the idea of implementing full expansion only for factors. For non-character data (including logicals), we could define/use a factor-like class that inherits the respective data type. Perhaps provided by forcats? This makes it more difficult to shoot yourself in the foot. |
Implementation in progress in #3492:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data_frame( f = factor( c(1,1,2,2), levels = 1:3), x = c(1,2,1,4) )
( res1 <- tally(group_by(df,f,x, drop = FALSE)) )
#> # A tibble: 9 x 3
#> # Groups: f [?]
#> f x n
#> <fct> <dbl> <int>
#> 1 1 1. 1
#> 2 1 2. 1
#> 3 1 4. 0
#> 4 2 1. 1
#> 5 2 2. 0
#> 6 2 4. 1
#> 7 3 1. 0
#> 8 3 2. 0
#> 9 3 4. 0
( res2 <- tally(group_by(df,x,f, drop = FALSE)) )
#> # A tibble: 9 x 3
#> # Groups: x [?]
#> x f n
#> <dbl> <fct> <int>
#> 1 1. 1 1
#> 2 1. 2 1
#> 3 1. 3 0
#> 4 2. 1 1
#> 5 2. 2 0
#> 6 2. 3 0
#> 7 4. 1 0
#> 8 4. 2 1
#> 9 4. 3 0
all.equal( res1, arrange(res2, f, x) )
#> [1] TRUE
all.equal( filter(res1, n>0), tally(group_by(df, f, x)) )
#> [1] TRUE
all.equal( filter(res2, n>0), tally(group_by(df, x, f)) )
#> [1] TRUE
Created on 2018-04-10 by the reprex package (v0.2.0). |
As for whether
|
@kenahoo I'm not sure I understand. This is what you get with the current dev version. So the only thing that you don't get is the warning from
library(dplyr)
data.frame(x=factor(1, levels=1:2), y=4:5) %>%
group_by(x) %>%
summarize(min=min(y), sum=sum(y), prod=prod(y))
#> # A tibble: 2 x 4
#> x min sum prod
#> <fct> <dbl> <int> <dbl>
#> 1 1 4 9 20
#> 2 2 Inf 0 1
min(integer())
#> Warning in min(integer()): no non-missing arguments to min; returning Inf
#> [1] Inf
sum(integer())
#> [1] 0
prod(integer())
#> [1] 1
Created on 2018-05-15 by the reprex package (v0.2.0). |
@romainfrancois Oh cool, I didn't realize you were already so far along on this implementation. Looks great! |
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/ |
http://stackoverflow.com/questions/22523131
Not sure what the interface to this should be - probably should default to drop = FALSE.