vignettes/creating-summary-lines.Rmd

---
title: "Creating Summary Lines"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Creating Summary Lines}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
  
```{r options, message=FALSE, warning=FALSE, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(gt)
library(dplyr)
```

We can use `gt::summary_rows()` to insert summary rows into a table. There are two main types of summary rows: (1) group-wise summaries, and (2) the grand summary. The group-wise summaries operate on one or more row groups, which can themselves be generated in a variety of ways:

- with the `gt()` function's `groupname_col` option
- using `tab_row_group()` on a gt object
- by passing in a grouped `tbl_df` object to `gt()` (using `dplyr::group_by()`)

We generate the summary data through specification of aggregation functions. You choose how to format the values in the resulting summary cells by use of a formatter function (e.g., `fmt_number()`) and any relevant options. These summary rows are automatically inserted within the relevant row groups (for group-wise summaries) or at the bottom of the table (as a grand summary), where each summary row is the result of a different aggregation function. 

### Preparing the Input Data Table

Let's use the `exibble` dataset (included in the **gt** package) to demonstrate how summary rows can be added. That table contains categorical columns named `row` and `group`, along several columns with varying data types. Here is a preview of the `exibble` dataset using solely the `gt()` function with no other options: 

```{r exibble_preview_gt}
exibble |> gt()
```

We'll create a table stub with both row labels (using the `row` column) and row groups (using the `group` column). The end result will be a table organized with labeled rows that are grouped together (the row group labels identify each of the row groups). To make the examples a bit easier to follow, some of the columns in exibble will first be dropped through a `dplyr::select()` statement.

```{r exibble_a}
# Create a gt table using the `exibble` dataset
exibble_a <-
  exibble |>
  select(-c(fctr, date, time, datetime)) |>
  gt(rowname_col = "row", groupname_col = "group") |>
  sub_missing()

exibble_a
```

There are two groups in this data table: `grp_a` and `grp_b`. This gives us flexibility to create both a grand summary and group-wise summary rows.

### Generating Group-wise Summary Rows

Group-wise summary rows can be generated by using `summary_rows()` and, importantly, through specifying the groups that will receive summary rows. We can provide a vector group names, as in `c("grp_a", "grp_b")`, or, use `everything()` to signify that all groups should receive summary rows. Aside from the selection of groups, there is control over which columns are to used for the summary. Since each call to `summary_rows()` only performs one set of aggregation functions, we may want specific aggregations for different subsets of columns.

To make any sort of summary, we need to use functions that will perform the aggregation. We can provide base functions such as `mean()`, `sum()`, `min()`, `max()`, and more, within a `list()`. Each function provided will result in a summary row for each `group`.

Because each function will yield a row, we need to be able to label that row. So, each summary row will receive a summary row label. We can provide our preferred names by naming the functions within the list (e.g, `list(average = "mean", total = "sum", SD = "sd")`). 

```{r exibble_b}
# Create group-wise summary rows for both
# groups (using `groups = everything()`); use the
# `mean()`, `sum()`, and `sd()` functions
# (only for the `num` column)
exibble_b <-
  exibble_a |>
  summary_rows(
    groups = everything(),
    columns = num,
    fns = list(
      average = "mean",
      total = "sum",
      SD = "sd"
    )
  )

exibble_b
```

We can specify the aggregation functions by use of function names in quotes (e.g., `"sum"`), as bare functions (e.g., `sum`), or as one-sided R formulas using a leading `~`. In the formula representation, a `.` serves as the data to be summarized, so we can use `sum(., na.rm = TRUE)`. The use of named arguments is recommended as those names will serve as summary row labels (the labels can derived from the function names but only when not providing bare function names).

Now that `summary_rows()` has been somewhat explained, let’s look at how we can create group-wise summary rows for the `exibble_a` table. We'll create summaries for both available groups (`groups = everything()`) and use the `mean()`, `sum()`, and `sd()` functions with the _function-name-in-quotes_ method (and this will only pertain to the `num` column).

In the previous example we have an `NA` value in the `num/row_6` cell, and so we get `NA` outputs from `mean()`, `sum()`, and `sd()` in `grp_b`'s summary rows (which are replaced with em dashes, itself controllable through the `missing_text` option). To avoid this, let's rewrite the above using the _names-and-formulae_ method.

```{r exibble_c}
# Create group-wise summary rows for both
# groups (using `groups = everything()`); we will
# use names and formulas this time in `fns`
exibble_c <-
  exibble_a |>
  summary_rows(
    groups = everything(),
    columns = num,
    fns = list(
      avg = ~ mean(., na.rm = TRUE),
      total = ~ sum(., na.rm = TRUE),
      s.d. = ~ sd(., na.rm = TRUE)
    )
  )

exibble_c
```

Here we see that summary rows were created for both groups. However, the output of the summary row data is quite different than that of the cell data. The `formatter` argument (and extra `...` arguments) allows for use of any of the `fmt_*()` functions that we'd normally use to format cell data. In this example (another rewrite of the above), the cell data in the `num` column is formatted in scientific notation and the resulting summary cell data is also formatted in the same way (including the options of `decimals = 3`).

```{r exibble_d}
# Define a named list of aggregation
# functions and summary row labels
fns_labels <-
  list(
    avg = ~ mean(., na.rm = TRUE),
    total = ~ sum(., na.rm = TRUE),
    s.d. = ~ sd(., na.rm = TRUE)
  )

# Create group-wise summary rows as
# before, supply `fns_labels` to `fns`,
# and format the cell summary data
exibble_d <-
  exibble_a |>
  fmt_scientific(
    columns = num,
    decimals = 3
  ) |>
  summary_rows(
    groups = everything(),
    columns = num,
    fns = fns_labels,
    fmt = list(~ fmt_scientific(., decimals = 3))
  )

exibble_d
```

The input to `fns` is very permissive in regard to how the functions are defined. It is entirely valid to provide functions in the various forms shown earlier such that `list("sum", avg = ~mean(., na.rm = TRUE), SD = "sd")` will be correctly interpreted. It is recommended to use formula notation.

The default for `formatter` is set to `fmt_number` which is a sensible default for many scenarios. The setting of argument values for a particular formatter can be done in the `...` area of the function call (as was done above for the `decimals` argument).

### Using Multiple Calls of `summary_rows()` 

We can re-use summary row labels and fill the otherwise empty summary cells with similar aggregations but perhaps with different formatting options. Here's an example where the `currency` column contains aggregate values that share the same summary rows as for the `num` column, adds two more rows, and uses currency formatting:

```{r exibble_e}
# Create group-wise summary rows as
# before, supply `fns_labels` to `fns`,
# and format the cell summary data
exibble_e <-
  exibble_a |>
  fmt_scientific(
    columns = num,
    decimals = 3
  ) |>
  fmt_currency(
    columns = currency,
    currency = "EUR"
  ) |>
  summary_rows(
    groups = everything(),
    columns = num,
    fns = fns_labels,
    fmt = list(~ fmt_scientific(., decimals = 3))
  ) |>
  summary_rows(
    groups = "grp_a",
    columns = currency,
    fns = c(
      fns_labels,
      min = ~ min(.),
      max = ~ max(.)
    ),
    fmt = list(~ fmt_currency(., currency = "EUR"))
  )

exibble_e
```

A thing to again note in the above example is that even though two independent calls of `summary_rows()` were made, summary data within common summary row names were 'squashed' together, thus avoiding the fragmentation of summary rows. Put another way, we don't create additional summary rows across separate calls if we are referencing the same summary row labels. If the summary row labels provided in `fns` were to be different across columns however, additional summary rows would be produced even if the types of data aggregations were to be functionally equivalent.

We can also store these argument values as local variables and pass them in both separate `fmt_*number*()` calls also to arguments within `summary_rows()` calls. This is useful for standardizing formatting parameters across different table cell types. Here's an example of that, which additional passes the `fr_BE` locale to all functions that take a `locale` value.

```{r exibble_f}
# Provide common formatting parameters to a list
# object named `formats`; the number of decimal
# places will be `2` and the locale is "fr_BE"
formats <-
  list(
    decimals = 3,
    locale = "fr_BE",
    currency = "EUR"
  )

# Format the `num` and `currency` columns
# (using the values stored in `formats`);
# when generating summary rows we can also
# supply formatter options from this list
exibble_f <-
  exibble_a |>
  fmt_scientific(
    columns = num,
    decimals = formats$decimals,
    locale = formats$locale
  ) |>
  fmt_currency(
    columns = currency,
    currency = formats$currency,
    locale = formats$locale
  ) |>
  summary_rows(
    groups = everything(),
    columns = num,
    fns = fns_labels,
    fmt = list(~ fmt_scientific(.,
      decimals = formats$decimals,
      locale = formats$locale
    ))
  ) |>
  summary_rows(
    groups = "grp_a",
    columns = currency,
    fns = c(
      fns_labels,
      min = ~ min(.),
      max = ~ max(.)
    ),
    fmt = list(~ fmt_currency(.,
      currency = formats$currency,
      locale = formats$locale
    ))
  )

exibble_f
```

Passing in parameters like this is useful, especially if there are larger numbers of columns. When we store formatting parameters outside of the `gt()` pipeline, we separate our concerns between data structuring and data formatting. Putting styles and options into objects becomes more important if we intend to centralize formatting options for reuse.

### Creating a Grand Summary

A grand summary aggregates column data regardless of the groups within the data. Grand summaries can also be created for **gt** tables that don't have row groups, or, don't have a stub. Finally, we can create a table that has both group-wise summaries and a grand summary.

Let's keep it simple and create grand summary rows on a table without a stub. We'll use `exibble` dataset for this once more. A few `exibble` columns are `select()`ed, passed to `gt()`, and then `grand_summary_rows()`. Notice that, in the resulting table, a stub is created just for the summary row labels (they have to go somewhere).

```{r exibble_g}
# Create a simple grand summary on a gt
# table that contains no stub
exibble_g <-
  exibble |>
  select(num, char, currency) |>
  gt() |>
  grand_summary_rows(
    columns = c(num, currency),
    fns = fns_labels
  )

exibble_g
```

A grand summary can be used in conjunction with group-wise summaries. Here's an table where both types of summaries are present:

```{r exibble_h}
# Using the table in `exibble_f`, create
# grand summary rows (using two separate
# calls of `grand_summary_rows()` since
# the formatting will be different)
exibble_h <-
  exibble_f |>
  grand_summary_rows(
    columns = num,
    fns = fns_labels,
    fmt = list(~ fmt_number(.,
      suffixing = TRUE,
      locale = formats$locale
    ))
  ) |>
  grand_summary_rows(
    columns = currency,
    fns = fns_labels,
    fmt = list(~ fmt_currency(.,
      suffixing = TRUE,
      locale = formats$locale
    ))
  )

exibble_h
```

Note that the grand summary has a double line separating it from group-wise summary that's part of `grp_b`. If this default styling appears to be too subtle, we can elect to add further styling to both group-wise summaries and the grand summary by using `tab_options()`.

### Adding Some Style to the Summary Cells

While the summary cells (both group-wise and grand) have a distinct appearance that sets them apart from the data cells, there's always the option to modify their appearance. We can use `tab_options()` to perform these customizations. Here are the options specific to the summary cells (for group-wise summaries) and the grand summary cells:

- `summary_row.background.color` & `grand_summary_row.background.color`
- `summary_row.padding` & `grand_summary_row.padding`
- `summary_row.text_transform` & `grand_summary_row.text_transform`

We can also target the summary cells and grand summary cells with the location helper functions `cells_summary()` and `cells_grand_summary()`. This is important for adding footnotes with `tab_footnote()` and for setting styles with `tab_style()` (both have the `locations` argument).

Here is an example that uses multiple calls of `tab_options()` and `tab_footnote()`. The cell background color for both types of summary cells is modified and two footnotes are added.

```{r exibble_i}
# Using the gt table of `exibble_h` as a
# starting point, style summary cells with
# `tab_options()` and add two footnotes
exibble_i <-
  exibble_h |>
  tab_options(
    summary_row.background.color = "lightblue",
    grand_summary_row.background.color = "lightgreen"
  ) |>
  tab_footnote(
    footnote = md("Mean of all *num* values."),
    locations = cells_grand_summary(
      columns = "num", rows = "avg"
    )
  ) |>
  tab_footnote(
    footnote = md("Highest `currency` value in **grp_a**"),
    locations = cells_summary(
      groups = "grp_a",
      columns = "currency",
      rows = "max"
    )
  )

exibble_i
```

### Extracting the Summary Data from the **gt** Table Object

For a reproducible workflow, we do not want to have situations where any data created or modified cannot be accessed. While having summarized values be created in a **gt** pipeline presents advantages to readability and intent of analysis, it is recognized that the output table itself is essentially 'read only', as the input data undergoes processing and movement to an entirely different format.

However, the object created still contains data and we can obtain the summary data from a **gt** table object with `extract_summary()`. Taking the `gt_summary` object, we get a list of tibbles containing the summary data while preserving the correct data types:

```{r summary_list}
# Extract the summary data from `exibble_d`
# to a list  object
summary_list <- exibble_d |> extract_summary()
```

```{r summary_list_grp_a}
# Print out the summary for the `grp_a` group
summary_list$summary_df_data_list$grp_a
```

```{r summary_list_grp_b}
# Print out the summary for the `grp_b` group
summary_list$summary_df_data_list$grp_b
```

The output tibbles within the list always contain the `groupname` and `rowname` columns. The `groupname` column is filled with the name of the row group that was given to `summary_rows()`. The `rowname` column contains the descriptive stub labels for the summary rows (recall that values are either supplied explicitly in `summary_rows()`, or, are generated from the function names). The remaining columns are those from the original dataset.

The output data from `extract_summary()` can be reintroduced to a reproducible workflow and serve as downstream inputs or undergo validation. Perhaps interestingly, the output tibbles are structured in a way that facilitates direct input back to `gt()` (i.e., it has the magic `groupname` and `rowname` columns). This can produce a new, standalone summary table where the summary rows are now data rows:

```{r exibble_d_summary_list_gt}
# Take `exibble_d`, which internally has a list
# of summary data frames, extract the summaries,
# and then combine them; input that into `gt()`,
# and format the `num` column with `fmt_number()`
exibble_d |>
  extract_summary() |>
  unlist(recursive = FALSE) |>
  bind_rows() |>
  gt() |>
  fmt_number(
    columns = num,
    decimals = 1
  ) |>
  sub_missing(columns = c(char, currency, row, group))
```

### Providing Our Own Aggregation Functions to Generate Summary Rows

While many of the functions available in base R and within packages are useful as aggregate functions, we may occasionally have the need to create our own custom functions. When taking this approach the main things to keep in mind are that a vector of values is the main input, and, a single value should be returned. The return value can be pretty much any class (e.g., `numeric`, `character`, `logical`) and it's the `formatter` function that will handle any custom formatting while also converting to `character`.

Here, we'll define a function that takes a vector of numeric values and outputs the two highest values (sorted low to high) above a `threshold` value. The output from this function is always a formatted `character` string.

```{r define_agg_function}
# Define a function that gives the
# highest two values above a threshold
agg_highest_two_above_value <- function(x, threshold) {
  # Get sorted values above threshold value
  values <- sort(round(x[x >= threshold], 2))

  # Return character string with 2 highest values above threshold
  if (length(values) == 0) {
    return(paste0("No values above ", threshold))
  } else {
    return(
      paste(
        formatC(
          tail(
            sort(round(x[x > threshold], 2)), 2
          ),
          format = "f", digits = 2
        ),
        collapse = ", "
      )
    )
  }
}

# Let's test this function with some values
agg_highest_two_above_value(
  x = c(0.73, 0.93, 0.75, 0.86, 0.23, 0.81),
  threshold = 0.8
)
```

Because this is character value that's returned, we don't need formatting functions like `fmt_number()`, `fmt_percent()`, etc. However, a useful formatter (and we do need *some* formatter) is `fmt_passthrough()`. Like the name suggests, it to great extent passes values through but formats as `character` (like all the `fmt_*()` function do) and it provides the option to decorate the output with a `pattern`. Let's have a look at how `agg_highest_two_above_value()` can be used with `fmt_passthrough()`.

```{r exibble_j}
# Create a gt table with `exibble_a` and use
# the custom function with a threshold of `20`;
# `fmt_passthrough()` allows for minimal formatting
# of the aggregate values
exibble_j <-
  exibble_a |>
  grand_summary_rows(
    columns = c(num, currency),
    fns = list(high = ~ agg_highest_two_above_value(., 20)),
    fmt = list(~ fmt_passthrough(., pattern = "({x})"))
  )

exibble_j
```

We can extract the grand summary data from the `exibble_j` object. Note that columns `num` and `currency` are classed as character since it was character outputs that were generated by the `agg_highest_two_above_value()` function.

```{r}
# Extract the summary list from `exibble_j`
# and inspect using `str()`
exibble_j |>
  extract_summary() |>
  str()
```