tidy grouped data attributes #3489

romainfrancois · 2018-04-09T15:34:37Z

Would it make sense to tidy the attributes that we use internally for grouped data frames? We could, e.g., have labels, indices and index sizes in the same data frame, which would be convenient for manipulation because it's then tidy. (Inspired by working on #341.)

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data.frame(x = 1:2, y = factor(c(1, 1), levels = 1:2))
gdf <- group_by(df, x, y)

attributes(gdf)[ c("labels", "indices", "group_sizes")]
#> $labels
#>   x y
#> 1 1 1
#> 2 2 1
#> 
#> $indices
#> $indices[[1]]
#> [1] 0
#> 
#> $indices[[2]]
#> [1] 1
#> 
#> 
#> $group_sizes
#> [1] 1 1
as_tibble(
  mutate( attr(gdf, "labels"), 
    ..index.. = attr(gdf, "indices"), 
    ..size.. = attr(gdf, "group_sizes")
  )
)
#> # A tibble: 2 x 4
#>       x y     ..index.. ..size..
#>   <int> <fct> <list>       <int>
#> 1     1 1     <int [1]>        1
#> 2     2 1     <int [1]>        1

Created on 2018-04-09 by the reprex package (v0.2.0).
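
For readers who want to see the shape of such a tidy table without relying on dplyr internals, here is a minimal base R sketch. The data frame `df`, the column names `index` and `size`, and the use of `split()` are illustrative assumptions, not how dplyr computes its grouping metadata.

```r
# Hypothetical ungrouped data with a single grouping variable `g`
df <- data.frame(g = c("a", "b", "a", "b", "b"), stringsAsFactors = FALSE)

# 0-based row positions per group, mirroring the 0-based `indices` attribute
index <- unname(split(seq_len(nrow(df)) - 1L, df$g))

# One row per group: the label, a list column of indices, and the group size
groups <- data.frame(g = sort(unique(df$g)), stringsAsFactors = FALSE)
groups$index <- index
groups$size <- lengths(index)

str(groups)
```

Keeping labels, indices and sizes side by side means the usual data frame tools (subsetting, reordering rows) apply to the grouping structure as a whole.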

krlmlr · 2018-04-10T00:36:36Z

I think so. Added to the list of breaking changes which we can start working on right after the upcoming release.

romainfrancois · 2018-04-10T08:10:49Z

Maybe this is part of a bigger change on how we do groupings. Having something tidy could perhaps open the door to other kinds of groupings. Way back we discussed bootstrap groupings for example.

The way things are done now with grouped and rowwise data frames is less than ideal.

krlmlr · 2018-04-10T08:14:01Z

Seems related to #2311 then?

hadley · 2018-05-07T15:34:29Z

In the long run, I think the right way to handle this sort of namespacing for variables (i.e. we need to use ..index to avoid clashing with a grouping variable called index) is to use a df-column:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- data.frame(x = 1:2, y = factor(c(1, 1), levels = 1:2))
gdf <- group_by(df, x, y)

attr <- attributes(gdf)[ c("labels", "indices", "group_sizes")]

df <- data.frame(size = attr$group_sizes)
df$label <- attr$labels
df$index <- attr$indices
df <- df[c(2, 3, 1)]
str(df)
#> 'data.frame':    2 obs. of  3 variables:
#>  $ label:'data.frame':   2 obs. of  2 variables:
#>   ..$ x: int  1 2
#>   ..$ y: Factor w/ 2 levels "1","2": 1 1
#>   ..- attr(*, "vars")= chr  "x" "y"
#>   ..- attr(*, "drop")= logi TRUE
#>  $ index:List of 2
#>   ..$ : int 0
#>   ..$ : int 1
#>  $ size : int  1 1

This will require substantial work throughout dplyr so this is a note for the future, rather than something we should try and do now.
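
To make the namespacing point concrete, the following base R sketch uses a grouping variable that is itself called `index`. The objects `keys` and `meta` are hypothetical and only illustrate the df-column idea; they are not dplyr's actual metadata.

```r
# Hypothetical grouping labels where the grouping variable is named `index`
keys <- data.frame(index = c("a", "b"), stringsAsFactors = FALSE)

meta <- data.frame(size = c(2L, 1L))
meta$label <- keys            # df-column holding all grouping variables
meta$index <- list(0:1, 2L)   # list column of 0-based row indices

# The two `index` names live at different levels, so they do not clash
meta$label$index  # the grouping variable
meta$index        # the row indices
```

Because the grouping variables are nested inside `label`, the metadata columns need no `..` decoration to stay unique.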

romainfrancois · 2018-05-09T14:08:17Z

I have something like this:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# all information is in the labels attribute
group_by(mtcars, am, cyl) %>% attr("labels")
#> # A tibble: 6 x 3
#>      am   cyl ..indices..
#>   <dbl> <dbl> <list>     
#> 1     0     4 <int [3]>  
#> 2     0     6 <int [4]>  
#> 3     0     8 <int [12]> 
#> 4     1     4 <int [8]>  
#> 5     1     6 <int [3]>  
#> 6     1     8 <int [2]>

# some operations make lazy grouped_df
# in that case, "labels" is a character vector
df1 <- data_frame(a = 1:3) %>% group_by(a)
df2 <- data_frame(a = rep(1:4, 2)) %>% group_by(a)
res <- left_join(df1, df2, by = "a")
attr(res, "labels")
#> [1] "a"

# which is materialised into a tibble when needed
# this happens by reference :scream: 
group_size(res)
#> [1] 2 2 2
attr(res, "labels")
#> # A tibble: 3 x 2
#>       a ..indices..
#>   <int> <list>     
#> 1     1 <int [2]>  
#> 2     2 <int [2]>  
#> 3     3 <int [2]>

I'm not sure we want to keep the laziness feature.

romainfrancois · 2018-05-09T14:10:03Z

Questions:

I guess we need some way to access "labels".
The name ..indices.. is a placeholder until someone has a better idea. Internally, the code mostly only cares that this is the last column.
Should the ..indices.. column be a list of 1-based indices? It is currently 0-based because it was previously meant to be used only internally.
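
As a small illustration of why the base matters once the column is user-facing, this base R sketch shifts 0-based indices to 1-based ones so they can be used directly with `[`. The vectors `indices0` and `x` are made-up example data.

```r
# Hypothetical 0-based group indices, shaped like the list column above
indices0 <- list(c(0L, 1L), c(2L, 3L))

# Shift every group's indices by one to get 1-based positions
indices1 <- lapply(indices0, function(i) i + 1L)

# With 1-based indices, ordinary R subsetting picks out each group's values
x <- c(10, 20, 30, 40)
per_group <- lapply(indices1, function(i) x[i])
per_group
```

With the 0-based version, `x[c(0L, 1L)]` would silently drop the zero and return only the first element, which is an easy mistake to make outside of C++ code.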

romainfrancois · 2018-05-09T14:28:10Z

Since #3492, joins are the only producers of lazy grouped data frames, through the reconstruct_join function:

reconstruct_join <- function(out, x, vars) {
  if (is_grouped_df(x)) {
    groups_in_old <- match(group_vars(x), tbl_vars(x))
    groups_in_alias <- match(groups_in_old, vars$x)
    out <- grouped_df_impl(out, vars$alias[groups_in_alias], FALSE)
  }
  out
}

I'd argue we should get rid of it altogether and only have materialized grouping structures.

krlmlr · 2018-05-09T15:19:27Z

How do we update the grouping structure in a join?

How about group_labels()?

hadley · 2018-05-09T15:29:46Z

Yes, we should probably use 1-based indices, but it might be better to do that in a separate PR
I think we should call the attribute groups
I think we should call the column .rows
I think it's OK to get rid of the lazy grouping, although we should file an issue to consider preserving the grouping as part of the join process

romainfrancois · 2018-05-09T15:45:11Z

Preserving the grouping in the context of joins might be tricky because of:

mixed grouping, e.g. when the lhs is a factor and the rhs a chr
right joins, and any joins where the lhs is not in control of the group population.
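
The second point can be sketched with base R, using `merge(..., all.y = TRUE)` as a stand-in for a right join (it is not dplyr's join code). The data frames `lhs` and `rhs` are hypothetical.

```r
# Hypothetical inputs: `lhs` plays the role of the grouped left-hand side
lhs <- data.frame(a = c("x", "y"), v = 1:2, stringsAsFactors = FALSE)
rhs <- data.frame(a = c("y", "z"), w = 3:4, stringsAsFactors = FALSE)

# Base R equivalent of a right join: keep every key found in `rhs`
res <- merge(lhs, rhs, by = "a", all.y = TRUE)

# Keys in the result that had no group in `lhs`
new_keys <- setdiff(unique(res$a), unique(lhs$a))
# Groups of `lhs` that vanished from the result
lost_keys <- setdiff(unique(lhs$a), unique(res$a))

new_keys
lost_keys
```

Since groups can both appear and disappear, the grouping structure of `lhs` cannot simply be carried over to the result; it has to be recomputed or carefully remapped.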

romainfrancois · 2018-05-09T16:10:26Z

`rows()` would then be a good name for extracting .rows

romainfrancois · 2018-05-09T22:28:02Z

With @hadley's naming suggestions 👌, plus rows() and group_data(). Too bad groups() is taken.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- tibble(g=c(1,1,2,2), x = 1:4)

# all information is in the `groups` attribute
group_by(df, g) %>% attr("groups")
#> # A tibble: 2 x 2
#>       g .rows    
#>   <dbl> <list>   
#> 1     1 <int [2]>
#> 2     2 <int [2]>

# can also extract it with group_data() and rows()
group_by(df, g) %>% group_data()
#> # A tibble: 2 x 2
#>       g .rows    
#>   <dbl> <list>   
#> 1     1 <int [2]>
#> 2     2 <int [2]>
group_by(df, g) %>% rows()
#> [[1]]
#> [1] 0 1
#> 
#> [[2]]
#> [1] 2 3

# works also on ungrouped data frames
group_data(df)
#> # A tibble: 1 x 1
#>   .rows    
#>   <list>   
#> 1 <int [4]>
rows(df)
#> [[1]]
#> [1] 0 1 2 3

# ... and rowwise
group_data(rowwise(df))
#> # A tibble: 4 x 1
#>   .rows    
#>   <list>   
#> 1 <int [1]>
#> 2 <int [1]>
#> 3 <int [1]>
#> 4 <int [1]>
rows(rowwise(df))
#> [[1]]
#> [1] 0
#> 
#> [[2]]
#> [1] 1
#> 
#> [[3]]
#> [1] 2
#> 
#> [[4]]
#> [1] 3

Created on 2018-05-10 by the reprex package (v0.2.0).

lock · 2018-11-24T08:02:44Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

krlmlr mentioned this issue Apr 10, 2018

Breaking (and other) changes for dplyr 0.9.0 #2790

Closed

15 tasks

This was referenced Apr 10, 2018

Preserve zero-length groups #341

Closed

group_indices on rowwise data #3491

Closed

grouped_df internal structure: okay to use? #3362

Closed

krlmlr added the "data frame" and "breaking change ☠️" (API change likely to affect existing code) labels Apr 18, 2018

romainfrancois mentioned this issue May 3, 2018

WIP: empty groups #3492

Merged

romainfrancois modified the milestones: 0.8.0, 0.7.3 May 5, 2018

romainfrancois added a commit that referenced this issue May 8, 2018

➖ group_sizes from the metadata #3489

f352bed

romainfrancois added a commit that referenced this issue May 8, 2018

➡️ "indices" attribute of group metadata into a list column of the labels attribute. #3489

baaf2de

romainfrancois referenced this issue in r-spatial/sf May 9, 2018

should address #714

9a6f033

romainfrancois added a commit that referenced this issue May 9, 2018

squash labels, vars and indices together in one tidy tibble. #3489

08c1eaa

romainfrancois mentioned this issue May 9, 2018

joins should not make lazy grouped_df #3566

Closed

romainfrancois added a commit that referenced this issue May 14, 2018

➖ group_sizes from the metadata #3489

28a6631

romainfrancois added a commit that referenced this issue May 14, 2018

➡️ "indices" attribute of group metadata into a list column of the labels attribute. #3489

b1298cf

romainfrancois added a commit that referenced this issue May 14, 2018

squash labels, vars and indices together in one tidy tibble. #3489

066f581

romainfrancois closed this as completed May 28, 2018

This was referenced May 28, 2018

move to 1-based indices for .rows #3605

Closed

Using 1-based indices in grouping metadata #3610

Merged

lock bot locked and limited conversation to collaborators Nov 24, 2018
