tidy grouped data attributes #3489
Comments
I think so. Added to the list of breaking changes which we can start working on right after the upcoming release. |
Maybe this is part of a bigger change in how we do groupings. Having something tidy could perhaps open the door to other kinds of groupings. Way back we discussed bootstrap groupings, for example. The way things are done now with grouped and rowwise is less than ideal. |
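To make the bootstrap idea concrete, here is a minimal base R sketch, assuming the grouping structure is a plain data frame with a list-column of row positions. The helper `bootstrap_groups()` and the column names `.replicate` and `.rows` are hypothetical illustrations and not part of dplyr; positions are 1-based here.

```r
# Hypothetical helper (not a dplyr function): each row of the result is one
# bootstrap replicate, and `.rows` holds row positions sampled with replacement.
bootstrap_groups <- function(data, times) {
  n <- nrow(data)
  groups <- data.frame(.replicate = seq_len(times))
  groups$.rows <- lapply(
    seq_len(times),
    function(i) sample.int(n, n, replace = TRUE)
  )
  groups
}

set.seed(1)
df <- data.frame(x = 1:4)
boot <- bootstrap_groups(df, times = 3)

# One summary per "group", computed by indexing the data through `.rows`
boot_means <- vapply(boot$.rows, function(idx) mean(df$x[idx]), numeric(1))
```

Unlike ordinary groups, the row sets here overlap and repeat rows, which is the kind of grouping a tidy structure could represent without special casing.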
Seems related to #2311 then? |
In the long run, I think the right way to handle this sort of namespacing for variables (i.e. we need to use
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data.frame(x = 1:2, y = factor(c(1, 1), levels = 1:2))
gdf <- group_by(df, x, y)
attr <- attributes(gdf)[ c("labels", "indices", "group_sizes")]
df <- data.frame(size = attr$group_sizes)
df$label <- attr$labels
df$index <- attr$indices
df <- df[c(2, 3, 1)]
str(df)
#> 'data.frame': 2 obs. of 3 variables:
#> $ label:'data.frame': 2 obs. of 2 variables:
#> ..$ x: int 1 2
#> ..$ y: Factor w/ 2 levels "1","2": 1 1
#> ..- attr(*, "vars")= chr "x" "y"
#> ..- attr(*, "drop")= logi TRUE
#> $ index:List of 2
#> ..$ : int 0
#> ..$ : int 1
#> $ size : int 1 1
This will require substantial work throughout dplyr so this is a note for the future, rather than something we should try and do now. |
I have something like this:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# all information is in the labels attribute
group_by(mtcars, am, cyl) %>% attr("labels")
#> # A tibble: 6 x 3
#> am cyl ..indices..
#> <dbl> <dbl> <list>
#> 1 0 4 <int [3]>
#> 2 0 6 <int [4]>
#> 3 0 8 <int [12]>
#> 4 1 4 <int [8]>
#> 5 1 6 <int [3]>
#> 6 1 8 <int [2]>
# some operations make lazy grouped_df
# in that case, "labels" is a character vector
df1 <- data_frame(a = 1:3) %>% group_by(a)
df2 <- data_frame(a = rep(1:4, 2)) %>% group_by(a)
res <- left_join(df1, df2, by = "a")
attr(res, "labels")
#> [1] "a"
# which is materialised into a tibble when needed
# this happens by reference :scream:
group_size(res)
#> [1] 2 2 2
attr(res, "labels")
#> # A tibble: 3 x 2
#> a ..indices..
#> <int> <list>
#> 1 1 <int [2]>
#> 2 2 <int [2]>
#> 3 3 <int [2]>
Not sure we want to keep the laziness feature. |
questions:
|
Joins are the only producers of lazy grouped data frames now since #3492, through the
reconstruct_join <- function(out, x, vars) {
if (is_grouped_df(x)) {
groups_in_old <- match(group_vars(x), tbl_vars(x))
groups_in_alias <- match(groups_in_old, vars$x)
out <- grouped_df_impl(out, vars$alias[groups_in_alias], FALSE)
}
out
}
I'd argue we should get rid of it altogether and only have materialized grouping structures. |
How do we update the grouping structure in a join? How about |
|
preserving in the context of joins might be tricky because:
|
`rows()` would then be a good name for extracting `.rows`. |
With @hadley's naming suggestions 👌 +
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- tibble(g=c(1,1,2,2), x = 1:4)
# all information is in the `groups` attribute
group_by(df, g) %>% attr("groups")
#> # A tibble: 2 x 2
#> g .rows
#> <dbl> <list>
#> 1 1 <int [2]>
#> 2 2 <int [2]>
# can also extract it with group_data() and rows()
group_by(df, g) %>% group_data()
#> # A tibble: 2 x 2
#> g .rows
#> <dbl> <list>
#> 1 1 <int [2]>
#> 2 2 <int [2]>
group_by(df, g) %>% rows()
#> [[1]]
#> [1] 0 1
#>
#> [[2]]
#> [1] 2 3
# works also on ungrouped data frames
group_data(df)
#> # A tibble: 1 x 1
#> .rows
#> <list>
#> 1 <int [4]>
rows(df)
#> [[1]]
#> [1] 0 1 2 3
# ... and rowwise
group_data(rowwise(df))
#> # A tibble: 4 x 1
#> .rows
#> <list>
#> 1 <int [1]>
#> 2 <int [1]>
#> 3 <int [1]>
#> 4 <int [1]>
rows(rowwise(df))
#> [[1]]
#> [1] 0
#>
#> [[2]]
#> [1] 1
#>
#> [[3]]
#> [1] 2
#>
#> [[4]]
#> [1] 3
Created on 2018-05-10 by the reprex package (v0.2.0). |
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/ |
Would it make sense to tidy the attributes that we use internally for grouped data frames? We could, for example, keep the labels, indices and group sizes in the same data frame, which would be convenient for manipulation because it is then tidy. (Inspired by working on #341.)
Created on 2018-04-09 by the reprex package (v0.2.0).
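As a rough illustration of why a single tidy structure is convenient, the following base R sketch builds one data frame with one row per group. The helper `tidy_groups()` and the `.size` column are hypothetical and are not dplyr's implementation; row positions are 1-based here, and the key column comes back as character because it is taken from the names produced by `split()`.

```r
# Hypothetical helper (not dplyr's implementation): one row per group, holding
# the key, a list-column of row positions, and the group size.
tidy_groups <- function(data, by) {
  rows <- split(seq_len(nrow(data)), data[[by]], drop = TRUE)
  groups <- data.frame(key = names(rows), stringsAsFactors = FALSE)
  names(groups) <- by
  groups$.rows <- unname(rows)
  groups$.size <- lengths(groups$.rows)
  groups
}

df <- data.frame(g = c(1, 1, 2, 2), x = 1:4)
groups <- tidy_groups(df, "g")

# Because the structure is a plain data frame, ordinary tools apply to it:
big <- groups[groups$.size > 1, ]
sums <- vapply(groups$.rows, function(idx) sum(df$x[idx]), numeric(1))
```

Filtering groups by size or summarising per group becomes ordinary data frame manipulation, rather than coordinated edits to three separate attributes.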