-
Notifications
You must be signed in to change notification settings - Fork 213
/
Copy pathcreating-summary-lines.Rmd
435 lines (360 loc) · 18.9 KB
/
creating-summary-lines.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
---
title: "Creating Summary Lines"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Creating Summary Lines}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r options, message=FALSE, warning=FALSE, include=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
library(gt)
library(dplyr)
```
We can use `gt::summary_rows()` to insert summary rows into a table. There are two main types of summary rows: (1) group-wise summaries, and (2) the grand summary. The group-wise summaries operate on one or more row groups, which can themselves be generated in a variety of ways:
- with the `gt()` function's `groupname_col` option
- using `tab_row_group()` on a gt object
- by passing in a grouped `tbl_df` object to `gt()` (using `dplyr::group_by()`)
We generate the summary data through specification of aggregation functions. You choose how to format the values in the resulting summary cells by use of a formatter function (e.g., `fmt_number()`) and any relevant options. These summary rows are automatically inserted within the relevant row groups (for group-wise summaries) or at the bottom of the table (as a grand summary), where each summary row is the result of a different aggregation function.
### Preparing the Input Data Table
Let's use the `exibble` dataset (included in the **gt** package) to demonstrate how summary rows can be added. That table contains categorical columns named `row` and `group`, along several columns with varying data types. Here is a preview of the `exibble` dataset using solely the `gt()` function with no other options:
```{r exibble_preview_gt}
exibble |> gt()
```
We'll create a table stub with both row labels (using the `row` column) and row groups (using the `group` column). The end result will be a table organized with labeled rows that are grouped together (the row group labels identify each of the row groups). To make the examples a bit easier to follow, some of the columns in exibble will first be dropped through a `dplyr::select()` statement.
```{r exibble_a}
# Create a gt table using the `exibble` dataset
exibble_a <-
exibble |>
select(-c(fctr, date, time, datetime)) |>
gt(rowname_col = "row", groupname_col = "group") |>
sub_missing()
exibble_a
```
There are two groups in this data table: `grp_a` and `grp_b`. This gives us flexibility to create both a grand summary and group-wise summary rows.
### Generating Group-wise Summary Rows
Group-wise summary rows can be generated by using `summary_rows()` and, importantly, through specifying the groups that will receive summary rows. We can provide a vector group names, as in `c("grp_a", "grp_b")`, or, use `everything()` to signify that all groups should receive summary rows. Aside from the selection of groups, there is control over which columns are to used for the summary. Since each call to `summary_rows()` only performs one set of aggregation functions, we may want specific aggregations for different subsets of columns.
To make any sort of summary, we need to use functions that will perform the aggregation. We can provide base functions such as `mean()`, `sum()`, `min()`, `max()`, and more, within a `list()`. Each function provided will result in a summary row for each `group`.
Because each function will yield a row, we need to be able to label that row. So, each summary row will receive a summary row label. We can provide our preferred names by naming the functions within the list (e.g, `list(average = "mean", total = "sum", SD = "sd")`).
```{r exibble_b}
# Create group-wise summary rows for both
# groups (using `groups = everything()`); use the
# `mean()`, `sum()`, and `sd()` functions
# (only for the `num` column)
exibble_b <-
exibble_a |>
summary_rows(
groups = everything(),
columns = num,
fns = list(
average = "mean",
total = "sum",
SD = "sd"
)
)
exibble_b
```
We can specify the aggregation functions by use of function names in quotes (e.g., `"sum"`), as bare functions (e.g., `sum`), or as one-sided R formulas using a leading `~`. In the formula representation, a `.` serves as the data to be summarized, so we can use `sum(., na.rm = TRUE)`. The use of named arguments is recommended as those names will serve as summary row labels (the labels can derived from the function names but only when not providing bare function names).
Now that `summary_rows()` has been somewhat explained, let’s look at how we can create group-wise summary rows for the `exibble_a` table. We'll create summaries for both available groups (`groups = everything()`) and use the `mean()`, `sum()`, and `sd()` functions with the _function-name-in-quotes_ method (and this will only pertain to the `num` column).
In the previous example we have an `NA` value in the `num/row_6` cell, and so we get `NA` outputs from `mean()`, `sum()`, and `sd()` in `grp_b`'s summary rows (which are replaced with em dashes, itself controllable through the `missing_text` option). To avoid this, let's rewrite the above using the _names-and-formulae_ method.
```{r exibble_c}
# Create group-wise summary rows for both
# groups (using `groups = everything()`); we will
# use names and formulas this time in `fns`
exibble_c <-
exibble_a |>
summary_rows(
groups = everything(),
columns = num,
fns = list(
avg = ~ mean(., na.rm = TRUE),
total = ~ sum(., na.rm = TRUE),
s.d. = ~ sd(., na.rm = TRUE)
)
)
exibble_c
```
Here we see that summary rows were created for both groups. However, the output of the summary row data is quite different than that of the cell data. The `formatter` argument (and extra `...` arguments) allows for use of any of the `fmt_*()` functions that we'd normally use to format cell data. In this example (another rewrite of the above), the cell data in the `num` column is formatted in scientific notation and the resulting summary cell data is also formatted in the same way (including the options of `decimals = 3`).
```{r exibble_d}
# Define a named list of aggregation
# functions and summary row labels
fns_labels <-
list(
avg = ~ mean(., na.rm = TRUE),
total = ~ sum(., na.rm = TRUE),
s.d. = ~ sd(., na.rm = TRUE)
)
# Create group-wise summary rows as
# before, supply `fns_labels` to `fns`,
# and format the cell summary data
exibble_d <-
exibble_a |>
fmt_scientific(
columns = num,
decimals = 3
) |>
summary_rows(
groups = everything(),
columns = num,
fns = fns_labels,
fmt = list(~ fmt_scientific(., decimals = 3))
)
exibble_d
```
The input to `fns` is very permissive in regard to how the functions are defined. It is entirely valid to provide functions in the various forms shown earlier such that `list("sum", avg = ~mean(., na.rm = TRUE), SD = "sd")` will be correctly interpreted. It is recommended to use formula notation.
The default for `formatter` is set to `fmt_number` which is a sensible default for many scenarios. The setting of argument values for a particular formatter can be done in the `...` area of the function call (as was done above for the `decimals` argument).
### Using Multiple Calls of `summary_rows()`
We can re-use summary row labels and fill the otherwise empty summary cells with similar aggregations but perhaps with different formatting options. Here's an example where the `currency` column contains aggregate values that share the same summary rows as for the `num` column, adds two more rows, and uses currency formatting:
```{r exibble_e}
# Create group-wise summary rows as
# before, supply `fns_labels` to `fns`,
# and format the cell summary data
exibble_e <-
exibble_a |>
fmt_scientific(
columns = num,
decimals = 3
) |>
fmt_currency(
columns = currency,
currency = "EUR"
) |>
summary_rows(
groups = everything(),
columns = num,
fns = fns_labels,
fmt = list(~ fmt_scientific(., decimals = 3))
) |>
summary_rows(
groups = "grp_a",
columns = currency,
fns = c(
fns_labels,
min = ~ min(.),
max = ~ max(.)
),
fmt = list(~ fmt_currency(., currency = "EUR"))
)
exibble_e
```
A thing to again note in the above example is that even though two independent calls of `summary_rows()` were made, summary data within common summary row names were 'squashed' together, thus avoiding the fragmentation of summary rows. Put another way, we don't create additional summary rows across separate calls if we are referencing the same summary row labels. If the summary row labels provided in `fns` were to be different across columns however, additional summary rows would be produced even if the types of data aggregations were to be functionally equivalent.
We can also store these argument values as local variables and pass them in both separate `fmt_*number*()` calls also to arguments within `summary_rows()` calls. This is useful for standardizing formatting parameters across different table cell types. Here's an example of that, which additional passes the `fr_BE` locale to all functions that take a `locale` value.
```{r exibble_f}
# Provide common formatting parameters to a list
# object named `formats`; the number of decimal
# places will be `2` and the locale is "fr_BE"
formats <-
list(
decimals = 3,
locale = "fr_BE",
currency = "EUR"
)
# Format the `num` and `currency` columns
# (using the values stored in `formats`);
# when generating summary rows we can also
# supply formatter options from this list
exibble_f <-
exibble_a |>
fmt_scientific(
columns = num,
decimals = formats$decimals,
locale = formats$locale
) |>
fmt_currency(
columns = currency,
currency = formats$currency,
locale = formats$locale
) |>
summary_rows(
groups = everything(),
columns = num,
fns = fns_labels,
fmt = list(~ fmt_scientific(.,
decimals = formats$decimals,
locale = formats$locale
))
) |>
summary_rows(
groups = "grp_a",
columns = currency,
fns = c(
fns_labels,
min = ~ min(.),
max = ~ max(.)
),
fmt = list(~ fmt_currency(.,
currency = formats$currency,
locale = formats$locale
))
)
exibble_f
```
Passing in parameters like this is useful, especially if there are larger numbers of columns. When we store formatting parameters outside of the `gt()` pipeline, we separate our concerns between data structuring and data formatting. Putting styles and options into objects becomes more important if we intend to centralize formatting options for reuse.
### Creating a Grand Summary
A grand summary aggregates column data regardless of the groups within the data. Grand summaries can also be created for **gt** tables that don't have row groups, or, don't have a stub. Finally, we can create a table that has both group-wise summaries and a grand summary.
Let's keep it simple and create grand summary rows on a table without a stub. We'll use `exibble` dataset for this once more. A few `exibble` columns are `select()`ed, passed to `gt()`, and then `grand_summary_rows()`. Notice that, in the resulting table, a stub is created just for the summary row labels (they have to go somewhere).
```{r exibble_g}
# Create a simple grand summary on a gt
# table that contains no stub
exibble_g <-
exibble |>
select(num, char, currency) |>
gt() |>
grand_summary_rows(
columns = c(num, currency),
fns = fns_labels
)
exibble_g
```
A grand summary can be used in conjunction with group-wise summaries. Here's an table where both types of summaries are present:
```{r exibble_h}
# Using the table in `exibble_f`, create
# grand summary rows (using two separate
# calls of `grand_summary_rows()` since
# the formatting will be different)
exibble_h <-
exibble_f |>
grand_summary_rows(
columns = num,
fns = fns_labels,
fmt = list(~ fmt_number(.,
suffixing = TRUE,
locale = formats$locale
))
) |>
grand_summary_rows(
columns = currency,
fns = fns_labels,
fmt = list(~ fmt_currency(.,
suffixing = TRUE,
locale = formats$locale
))
)
exibble_h
```
Note that the grand summary has a double line separating it from group-wise summary that's part of `grp_b`. If this default styling appears to be too subtle, we can elect to add further styling to both group-wise summaries and the grand summary by using `tab_options()`.
### Adding Some Style to the Summary Cells
While the summary cells (both group-wise and grand) have a distinct appearance that sets them apart from the data cells, there's always the option to modify their appearance. We can use `tab_options()` to perform these customizations. Here are the options specific to the summary cells (for group-wise summaries) and the grand summary cells:
- `summary_row.background.color` & `grand_summary_row.background.color`
- `summary_row.padding` & `grand_summary_row.padding`
- `summary_row.text_transform` & `grand_summary_row.text_transform`
We can also target the summary cells and grand summary cells with the location helper functions `cells_summary()` and `cells_grand_summary()`. This is important for adding footnotes with `tab_footnote()` and for setting styles with `tab_style()` (both have the `locations` argument).
Here is an example that uses multiple calls of `tab_options()` and `tab_footnote()`. The cell background color for both types of summary cells is modified and two footnotes are added.
```{r exibble_i}
# Using the gt table of `exibble_h` as a
# starting point, style summary cells with
# `tab_options()` and add two footnotes
exibble_i <-
exibble_h |>
tab_options(
summary_row.background.color = "lightblue",
grand_summary_row.background.color = "lightgreen"
) |>
tab_footnote(
footnote = md("Mean of all *num* values."),
locations = cells_grand_summary(
columns = "num", rows = "avg"
)
) |>
tab_footnote(
footnote = md("Highest `currency` value in **grp_a**"),
locations = cells_summary(
groups = "grp_a",
columns = "currency",
rows = "max"
)
)
exibble_i
```
### Extracting the Summary Data from the **gt** Table Object
For a reproducible workflow, we do not want to have situations where any data created or modified cannot be accessed. While having summarized values be created in a **gt** pipeline presents advantages to readability and intent of analysis, it is recognized that the output table itself is essentially 'read only', as the input data undergoes processing and movement to an entirely different format.
However, the object created still contains data and we can obtain the summary data from a **gt** table object with `extract_summary()`. Taking the `gt_summary` object, we get a list of tibbles containing the summary data while preserving the correct data types:
```{r summary_list}
# Extract the summary data from `exibble_d`
# to a list object
summary_list <- exibble_d |> extract_summary()
```
```{r summary_list_grp_a}
# Print out the summary for the `grp_a` group
summary_list$summary_df_data_list$grp_a
```
```{r summary_list_grp_b}
# Print out the summary for the `grp_b` group
summary_list$summary_df_data_list$grp_b
```
The output tibbles within the list always contain the `groupname` and `rowname` columns. The `groupname` column is filled with the name of the row group that was given to `summary_rows()`. The `rowname` column contains the descriptive stub labels for the summary rows (recall that values are either supplied explicitly in `summary_rows()`, or, are generated from the function names). The remaining columns are those from the original dataset.
The output data from `extract_summary()` can be reintroduced to a reproducible workflow and serve as downstream inputs or undergo validation. Perhaps interestingly, the output tibbles are structured in a way that facilitates direct input back to `gt()` (i.e., it has the magic `groupname` and `rowname` columns). This can produce a new, standalone summary table where the summary rows are now data rows:
```{r exibble_d_summary_list_gt}
# Take `exibble_d`, which internally has a list
# of summary data frames, extract the summaries,
# and then combine them; input that into `gt()`,
# and format the `num` column with `fmt_number()`
exibble_d |>
extract_summary() |>
unlist(recursive = FALSE) |>
bind_rows() |>
gt() |>
fmt_number(
columns = num,
decimals = 1
) |>
sub_missing(columns = c(char, currency, row, group))
```
### Providing Our Own Aggregation Functions to Generate Summary Rows
While many of the functions available in base R and within packages are useful as aggregate functions, we may occasionally have the need to create our own custom functions. When taking this approach the main things to keep in mind are that a vector of values is the main input, and, a single value should be returned. The return value can be pretty much any class (e.g., `numeric`, `character`, `logical`) and it's the `formatter` function that will handle any custom formatting while also converting to `character`.
Here, we'll define a function that takes a vector of numeric values and outputs the two highest values (sorted low to high) above a `threshold` value. The output from this function is always a formatted `character` string.
```{r define_agg_function}
# Define a function that gives the
# highest two values above a threshold
agg_highest_two_above_value <- function(x, threshold) {
# Get sorted values above threshold value
values <- sort(round(x[x >= threshold], 2))
# Return character string with 2 highest values above threshold
if (length(values) == 0) {
return(paste0("No values above ", threshold))
} else {
return(
paste(
formatC(
tail(
sort(round(x[x > threshold], 2)), 2
),
format = "f", digits = 2
),
collapse = ", "
)
)
}
}
# Let's test this function with some values
agg_highest_two_above_value(
x = c(0.73, 0.93, 0.75, 0.86, 0.23, 0.81),
threshold = 0.8
)
```
Because this is character value that's returned, we don't need formatting functions like `fmt_number()`, `fmt_percent()`, etc. However, a useful formatter (and we do need *some* formatter) is `fmt_passthrough()`. Like the name suggests, it to great extent passes values through but formats as `character` (like all the `fmt_*()` function do) and it provides the option to decorate the output with a `pattern`. Let's have a look at how `agg_highest_two_above_value()` can be used with `fmt_passthrough()`.
```{r exibble_j}
# Create a gt table with `exibble_a` and use
# the custom function with a threshold of `20`;
# `fmt_passthrough()` allows for minimal formatting
# of the aggregate values
exibble_j <-
exibble_a |>
grand_summary_rows(
columns = c(num, currency),
fns = list(high = ~ agg_highest_two_above_value(., 20)),
fmt = list(~ fmt_passthrough(., pattern = "({x})"))
)
exibble_j
```
We can extract the grand summary data from the `exibble_j` object. Note that columns `num` and `currency` are classed as character since it was character outputs that were generated by the `agg_highest_two_above_value()` function.
```{r}
# Extract the summary list from `exibble_j`
# and inspect using `str()`
exibble_j |>
extract_summary() |>
str()
```