Empty grouping levels in the "groups" attribute #5830
Let's have a look.
library(dplyr, warn.conflicts = FALSE)
df <- data.frame(
  e = 1,
  f = factor(c(1, 1, 2, 2), levels = 1:3),
  g = c(1, 1, 2, 2),
  x = c(1, 2, 1, 4)
)
df
#> e f g x
#> 1 1 1 1 1
#> 2 1 1 1 2
#> 3 1 2 2 1
#> 4 1 2 2 4
When you group by e and f (with .drop = FALSE) at this stage, you'd get this:
g <- group_by(df, e, f, .drop = FALSE)
group_split(g)
#> <list_of<
#> tbl_df<
#> e: double
#> f: factor<d3bfc>
#> g: double
#> x: double
#> >
#> >[3]>
#> [[1]]
#> # A tibble: 2 x 4
#> e f g x
#> <dbl> <fct> <dbl> <dbl>
#> 1 1 1 1 1
#> 2 1 1 1 2
#>
#> [[2]]
#> # A tibble: 2 x 4
#> e f g x
#> <dbl> <fct> <dbl> <dbl>
#> 1 1 2 2 1
#> 2 1 2 2 4
#>
#> [[3]]
#> # A tibble: 0 x 4
#> # … with 4 variables: e <dbl>, f <fct>, g <dbl>, x <dbl>
group_data(g)
#> # A tibble: 3 x 3
#> e f .rows
#> <dbl> <fct> <list<int>>
#> 1 1 1 [2]
#> 2 1 2 [2]
#> 3 1 3 [0]
The first group (e = 1, f = 1) has two rows.
The second group (e = 1, f = 2) also has two rows.
The third group is empty, but still corresponds to e = 1 and the unused factor level f = 3.
The notion of "valid group in the data" is governed by the presence of factors and where they appear in the sequence of grouping vars. The group
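To make the point about factors concrete, here is a minimal sketch that reuses the df defined above. It assumes a dplyr version that supports the .drop argument; the object names no_factor, with_factor and factor_first are made up for illustration. Grouping by the non-factor columns e and g records only the combinations that occur, while grouping by e and the factor f keeps the unused level 3 as an empty group. The last call puts the factor first so that the recorded keys can be compared by eye.

```r
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(
  e = 1,
  f = factor(c(1, 1, 2, 2), levels = 1:3),
  g = c(1, 1, 2, 2),
  x = c(1, 2, 1, 4)
)

# No factor among the grouping variables: .drop = FALSE has nothing to expand,
# so only the combinations present in the data are recorded.
no_factor <- group_by(df, e, g, .drop = FALSE)
group_keys(no_factor)

# A factor with an unused level: that level is kept as an empty group.
with_factor <- group_by(df, e, f, .drop = FALSE)
group_keys(with_factor)
group_size(with_factor)

# Same variables with the factor first: print the keys and compare them
# with the keys of with_factor.
factor_first <- group_by(df, f, e, .drop = FALSE)
group_keys(factor_first)
```

group_keys() and group_size() are used here instead of reading the attribute by hand, since they expose the same grouping information in a compact form.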
We've worked through this at length in the past and this is the best we've been able to come up with. It's a reasonable approach that, while not perfect, works for most cases, and I don't think we want to reconsider our past decisions at this point.
@romainfrancois thanks for the explanation. It makes much more sense now and I understand the comment in the code about recursively splitting. @hadley I'm sure it has been discussed - but where? This kind of insight into how the package works is really useful to understand.
@nathaneastwood probably some in issues and some in our private team chat. Unfortunately we don't have the time to clearly explain every development decision we make.
I recently posted this question on StackOverflow but I didn't get very satisfactory answers. I believe that this topic has been discussed in several places already when it comes to the .drop argument (#4392, #341 for example) but maybe not about the "groups" attribute itself. I repeat the StackOverflow question below for brevity.
Taking an example from the dplyr tests:
I don't quite understand why or how the "groups" attribute is defined as such
The third row doesn't make any sense to me, it's not a valid group within the original data. I'd have thought the result would be:
The consensus seems to be that dplyr is recycling the single valued column e based on dplyr's recycling rules.
So my first question is, is this the case? And could you please point me to the documentation where it says as much such that I can better educate myself about the rule?
Secondly, if this is true, I don't quite understand why this would be the case (see suggested alternative). It makes a huge assumption about the data in question which is that the column e cannot - and does not - take on any other value. It also assumes that the combination of factor level 3 for column f along with the value 1 from column e is a valid combination. To me, the result should either create all possible combinations of missing levels (i.e. including values which are both available and missing) from e, f and g or simply return only "known missing data", i.e. known factor levels (again, see suggested alternative).
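For reference, a grouping that records only the combinations that actually occur is what the default .drop = TRUE produces. The following is a minimal sketch, assuming the same df as in the reply earlier in this thread (the name observed is illustrative). It reads the "groups" attribute directly and through the group_data() accessor.

```r
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(
  e = 1,
  f = factor(c(1, 1, 2, 2), levels = 1:3),
  g = c(1, 1, 2, 2),
  x = c(1, 2, 1, 4)
)

# .drop defaults to TRUE for an ungrouped data frame, so the unused
# factor level 3 of f does not get a row in the grouping structure.
observed <- group_by(df, e, f)

# The grouping structure as stored on the object ...
attr(observed, "groups")

# ... and the same information through the accessor.
group_data(observed)
```

Passing .drop = FALSE to the same call gives the three-row structure shown in the reply above, so the two behaviours differ only in that argument.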