-
Notifications
You must be signed in to change notification settings - Fork 213
/
Copy pathgt-datasets.Rmd
510 lines (445 loc) · 22 KB
/
gt-datasets.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
---
title: "gt Datasets"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{gt Datasets}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r options, message=FALSE, warning=FALSE, include=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
library(gt)
library(dplyr)
library(tidyr)
```
The **gt** package comes with six built-in datasets for experimenting with the **gt** API: `countrypops`, `sza`, `gtcars`, `sp500`, `pizzaplace`, and `exibble`. While each dataset has different subject matter, all of them will be used to develop **gt** examples with consistent syntax.
<img src="../man/figures/countrypops.png" width=30%><img src="../man/figures/sza.svg" width=30%><img src="../man/figures/gtcars.svg" width=30%>
<img src="../man/figures/sp500.svg" width=30%><img src="../man/figures/pizzaplace.svg" width=30%><img src="../man/figures/exibble.svg" width=30%>
Each dataset is stored as a tibble, ranging from very small (like `exibble`, an example tibble of 8 rows) to quite large in size (e.g., at nearly 50,000 rows: `pizzaplace`). Larger datasets are typically impractical as **gt output tables** but they provide opportunities for demonstrating preprocessing using tidyverse tools like **dplyr** and **tidyr** (upstream of **gt**'s `gt()` entry point).
In many **gt** workflows, there will often be prep work like this where the creation of the input table and any salient information (e.g., metadata for footnotes) will be done alongside the preparation of the display table.
In the next few examples, we'll show how tables can be created with each of these datasets. Each example will be preceded with a set of requirements that serve as a design brief. This planning stage can be very useful in your own work for the purpose of organization. The hope is that this collection of simple examples will inspire the creation of much more interesting display tables with **gt**.
### `countrypops`
This dataset provides the total populations of 215 countries on a yearly basis, from 1960 to 2021. The source data comes from the [World Bank](https://data.worldbank.org/indicator/SP.POP.TOTL), where it has been cleaned and tidied up for inclusion into this package. Each row contains a population value for a country in a given year, where `NA` values for `population` indicate that the country did not exist in that particular year.
The `countrypops` dataset is a tibble with 13,330 rows and 5 variables. Here are explanations for each of the variables:
```{r countrypops_make_data_dictionary, echo=FALSE}
countrypops_ddtab <-
dplyr::tribble(
~Column, ~Type, ~Description,
"country_name", "chr", "Name of the country",
"country_code_2", "chr", "The 2-letter ISO 3166-1 country code",
"country_code_3", "chr", "The 3-letter ISO 3166-1 country code",
"year", "int", "The year for the population estimate",
"population", "int", "The population estimate, midway through the year",
) |>
gt() |>
tab_style(
style = cell_text(font = "Courier"),
locations = list(
cells_body(columns = "Column"),
cells_body(columns = "Type")
)
) |>
tab_options(
table.width = pct(80),
table.font.size = "smaller",
column_labels.font.size = "small"
)
```
```{r countrypops_ddtab, echo = FALSE}
countrypops_ddtab
```
#### A `countrypops` Example
The table that we'll create from `countrypops` will meet these requirements:
1. use countries from Oceania
2. countries in different regions of Oceania will be grouped together
3. provide populations for the 2000, 2010, and 2020 years only; they should appear as separate columns with a spanner group column stating that these columns refer to population values
4. format population figures to contain commas
5. provide a descriptive title
```{r countrypops}
# Get vectors of 2-letter country codes for each region of Oceania
Australasia <- c("AU", "NZ")
Melanesia <- c("NC", "PG", "SB", "VU")
Micronesia <- c("FM", "GU", "KI", "MH", "MP", "NR", "PW")
Polynesia <- c("PF", "WS", "TO", "TV")
# Create a gt table based on a preprocessed `countrypops`
countrypops |>
filter(country_code_2 %in% c(
Australasia, Melanesia, Micronesia, Polynesia)
) |>
filter(year %in% c(2000, 2010, 2020)) |>
mutate(region = case_when(
country_code_2 %in% Australasia ~ "Australasia",
country_code_2 %in% Melanesia ~ "Melanesia",
country_code_2 %in% Micronesia ~ "Micronesia",
country_code_2 %in% Polynesia ~ "Polynesia",
)) |>
pivot_wider(names_from = year, values_from = population) |>
arrange(region, desc(`2020`)) |>
select(-starts_with("country_code")) |>
gt(
rowname_col = "country_name",
groupname_col = "region"
) |>
tab_header(title = "Populations of Oceania's Countries in 2000, 2010, and 2020") |>
tab_spanner(
label = "Total Population",
columns = everything()
) |>
fmt_integer()
```
### `sza`
The solar zenith angle is one measure of the solar position. It can be thought of as 'height' of the sun in relation to an observer. A solar zenith angle of 0° indicates that the sun is directly overhead (a different solar angle, the *solar altitude angle*, is 90° in this instance). With the sun at the horizontal (e.g., during sunrise/sunset) we observe the solar zenith angle to be around 90° (there is the matter of atmospheric refraction). During nighttime, solar zenith angles in the range of 90–180 are possible (again, depending on the location of the observer).
The `sza` dataset has calculated values for the solar zenith angles every 30 minutes from 04:00 to 12:00 (true solar time). Temporally, these calculations are for the 1st of every month of the year. Spatially, the observer is located at somewhere along latitudes 20°N, 30°N, 40°N, and 50°N (because we are using true solar time, the longitude is unimportant). This is quite an extensive topic, and more information can be found by using `?sza` in the R console or by visiting [this page](https://en.wikipedia.org/wiki/Solar_zenith_angle).
The `sza` dataset is a tibble with 816 rows and 4 variables. Here are explanations for each of the variables:
```{r sza_make_data_dictionary, echo=FALSE}
sza_ddtab <-
dplyr::tribble(
~Column, ~Type, ~Description,
"latitude", "dbl", "The latitude in decimal degrees for the observations",
"month", "fct", "The measurement month; all calculations where conducted for the first day of each month",
"tst", "chr", "The true solar time at the given latitude and date (first of month) for which the solar zenith angle is calculated",
"sza", "dbl", "The solar zenith angle in degrees, where NAs indicate that sunrise hadn't yet occurred by the tst value",
) |>
gt() |>
tab_style(
style = cell_text(font = "Courier"),
locations = list(
cells_body(columns = "Column"),
cells_body(columns = "Type")
)
) |>
tab_options(
table.width = pct(80),
table.font.size = "smaller",
column_labels.font.size = "small"
)
```
```{r sza_ddtab, echo = FALSE}
sza_ddtab
```
#### An `sza` Example
The table that we'll create from `sza` will meet these requirements:
1. filter the data to just use the 20°N data and remove the `latitude` column
2. `NA` values from `sza` column are to be removed
3. reshape the table so that columns of `tst` (true solar time) contain angles in degrees (from the `sza` column)
4. the **gt output table** will have the `month`s as row labels in the stub
5. missing values will be replaced with an empty string (so that those cells are blank)
6. a stubhead label will state what's inside the stubs (months, at 20°N)
7. the table will have a heading decorated with the HTML *Black Sun with Rays* (`☀`) symbol
8. to fit the large amount of data in a small area, use some table options to reduce text size and row padding
```{r sza}
# Create a gt table based on a preprocessed `sza`
sza |>
filter(latitude == 20) |>
select(-latitude) |>
filter(!is.na(sza)) |>
pivot_wider(names_from = "tst", values_from = sza) |>
gt(rowname_col = "month") |>
sub_missing(missing_text = "") |>
tab_stubhead(label = md("month<br>(20°N)")) |>
tab_header(title = md("☀ Solar Zenith Angles ☀")) |>
tab_options(
column_labels.font.size = "smaller",
table.font.size = "smaller",
data_row.padding = px(3)
)
```
### `gtcars`
The `gtcars` dataset takes off where `mtcars` left off. It contains 47 cars from the 2014-2017 model years. Many of the `gtcars` vehicles are grand tourers. Indeed, many of these provide the ability to cross an entire continent at speed and in comfort yet, when it's called for, they will allow you to experience driving thrills. The chassis and suspension are in most cases top-notch and supply superb handling and roadholding on all routes one would conceivably encounter during the grand touring experience. The two plus two (2 + 2) seating configuration is smartly designed to deliver comfort for a driver and passenger, adequate space for luggage, and have room to spare.
The `gtcars` dataset is a tibble with 47 rows and 15 variables. Here are explanations for each of the variables:
```{r gtcars_make_data_dictionary, echo=FALSE}
gtcars_ddtab <-
dplyr::tribble(
~Column, ~Type, ~Description,
"mfr", "chr", "The name of the car manufacturer",
"model", "chr", "The car's model name",
"year", "int", "The car's model year",
"trim", "chr", "A short description of the car model's trim",
"bdy_style", "chr", "An identifier of the car's body style, which is either coupe, convertible, sedan, or hatchback",
"hp, hp_rpm", "int", "The car's horsepower and the associated RPM level",
"trq, trq_rpm", "int", "The car's torque and the associated RPM level",
"mpg_c, mpg_h", "int", "The miles per gallon fuel efficiency rating for city and highway driving",
"drivetrain", "chr", "The car's drivetrain which, for this dataset is either rwd (Rear Wheel Drive) or awd (All Wheel Drive)",
"trsmn", "chr", "The codified transmission type, where the number part is the number of gears; the car could have automatic transmission (a), manual transmission (m), an option to switch between both types (am), or, direct drive (dd)",
"ctry_origin", "chr", "The country name for where the vehicle manufacturer is headquartered",
) |>
gt() |>
tab_style(
style = cell_text(font = "Courier"),
locations = list(
cells_body(columns = "Column"),
cells_body(columns = "Type")
)
) |>
tab_options(
table.width = pct(80),
table.font.size = "smaller",
column_labels.font.size = "small"
)
```
```{r gtcars_ddtab, echo = FALSE}
gtcars_ddtab
```
#### A `gtcars` Example
The table that we'll create from `gtcars` will meet these requirements:
1. only include German cars
2. limit the dataset to the top two most expensive offerings from each German manufacturer
3. the information included will be the manufacturer (`mfr`), the car model (`model`), the `drivetrain`, and the price (`msrp`)
4. add a table title
5. combine the car make and model into a single column
6. capitalize the `drivetrain` text
7. format the prices as USD currency with commas and no decimal places shown
8. relabel the column headings to provide nicer labels
9. add two footnotes to explain the `drivetrain` abbreviations and to specify the currency of the `msrp` prices; ensure that the footnote marks are lowercase letters
```{r gtcars}
# Create a gt table based on a preprocessed `gtcars`
gtcars |>
filter(ctry_origin == "Germany") |>
slice_max(n = 2, msrp, by = mfr) |>
select(mfr, model, drivetrain, msrp) |>
gt() |>
tab_header(title = "Select German Automobiles") |>
cols_merge(columns = c(mfr, model)) |>
text_transform(
locations = cells_body(columns = drivetrain),
fn = function(x) toupper(x)
) |>
fmt_currency(decimals = 0) |>
cols_label(
mfr = "Car",
drivetrain = "Drivetrain",
msrp = "MSRP"
) |>
tab_footnote(
footnote = "Prices in USD.",
locations = cells_column_labels(columns = msrp)
) |>
tab_footnote(
footnote = "AWD = All Wheel Drive, RWD = Rear Wheel Drive.",
locations = cells_column_labels(columns = drivetrain)
) |>
opt_footnote_marks(marks = "letters")
```
### `sp500`
The S&P 500 is a capitalization-weighted index of about 500 leading companies (where bigger companies have more influence within the index) that have common stock listed in either the NYSE or NASDAQ markets. The companies chosen are intended to provide representation of the U.S. economy. This index is a managed list (managed by S&P Dow Jones Indices LLC) with occasional changes of the constituent companies based on their performance and changes in the economy.
There is daily S&P 500 data available in the `sp500` dataset, with daily indicators (price statistics, `volume`, etc.) from 1950 to 2015, inclusive. There are 16,607 rows in the dataset, and 7 variables:
```{r sp500_make_data_dictionary, echo=FALSE}
sp500_ddtab <-
dplyr::tribble(
~Column, ~Type, ~Description,
"date", "date", "The date expressed as `Date` values",
"open, high, low, close", "dbl", "The day's opening, high, low, and closing prices in USD; the close price is adjusted for splits",
"volume", "dbl", "The number of trades for the given `date`",
"adj_close", "dbl", "The close price adjusted for both dividends and splits",
) |>
gt() |>
tab_style(
style = cell_text(font = "Courier"),
locations = list(
cells_body(columns = "Column"),
cells_body(columns = "Type")
)
) |>
tab_options(
table.width = pct(80),
table.font.size = "smaller",
column_labels.font.size = "small"
)
```
```{r sp500_ddtab, echo = FALSE}
sp500_ddtab
```
#### An `sp500` Example
The table that we'll create from `sp500` will meet these requirements:
1. use only data from the period `2010-06-02` to `2010-06-15`
2. the adjusted close `adj_close` column won't be included
3. a title and subtitle will be added to describe the contents of the table
4. put the column labels in title case
5. format the `date` column to appear as '2 Jun 2010'
6. have the price columns (`open`, `high`, `low`, `close`) appear in USD
7. the large numbers in `volume` should be shown as billions (with the `B` suffix)
8. up- and down-pointing triangles (in green and red) will be added alongside the `close` price as appropriate
```{r sp500}
# Define the start and end dates for the data range
start_date <- "2010-06-02"
end_date <- "2010-06-15"
# The HTML decimal references for the black
# up- and down-pointing triangles are: #9650 and #9660;
# use an in-line style to apply color
up_arrow <- "<span style=\"color:green\">▲</span>"
down_arrow <- "<span style=\"color:red\">▼</span>"
# Create a gt table based on a preprocessed `sp500`
sp500 |>
filter(date >= start_date, date <= end_date) |>
select(-adj_close) |>
gt() |>
tab_header(
title = "S&P 500",
subtitle = glue::glue("{start_date} to {end_date}")
) |>
fmt_date(
columns = date,
date_style = "day_m_year"
) |>
fmt_currency(columns = c(open, high, low, close)) |>
fmt_number(columns = volume, suffixing = TRUE) |>
text_transform(
locations = cells_body(
columns = close,
rows = close > open
),
fn = function(x) paste(x, up_arrow)
) |>
text_transform(
locations = cells_body(
columns = close,
rows = close < open
),
fn = function(x) paste(x, down_arrow)
) |>
cols_label_with(
columns = everything(),
fn = ~ paste0(toupper(substr(., 1, 1)), substr(., 2, nchar(.)))
)
```
### `pizzaplace`
The `pizzaplace` dataset is unusual to say the least. It brings up more questions than answers. Why is it that the 'The Greek' pizza (`the_greek`) comes in `XL` and `XXL` sizes whilst (almost) all the other pizzas adhere to the `S`-`M`-`L` paradigm? Why is the 'Brie Carre' pizza (`brie_carre`) only small? Also, is any of this real (?), and, what is the nature of reality? (All of these questions are quite complicated... however, while I can doubt the existence of the material world, I cannot doubt the existence of myself as someone thinking about all the delicious pizzas on offer at `pizzaplace`.)
We have somehow obtained the 2015 sales from the pizzaplace, where each row is a pizza sold. There are 32 different types of pizza in 4 different categories: `classic`, `chicken`, `supreme`, and `veggie`. It was a great year of sales, personal problems notwithstanding. A kitchen fire in late September did not help with the morale situation. Nevertheless, $817,860 in sales for the year! That was indeed something to be cheerful about.
Let's learn more about how this fascinating dataset is structured:
```{r pizzaplace_make_data_dictionary, echo=FALSE}
pizzaplace_ddtab <-
dplyr::tribble(
~Column, ~Type, ~Description,
"id", "chr", "The ID for the order, which consists of one or more pizzas at a given `date` and `time`",
"date", "chr", "A character representation of the order `date`, expressed in the ISO 8601 date format (YYYY-MM-DD)",
"time", "chr", "A character representation of the order time, expressed as a 24-hour time the ISO 8601 extended time format (hh:mm:ss)",
"name", "chr", "The short name for the pizza",
"size", "chr", "The size of the pizza, which can either be S, M, L, XL (rare!), or XXL (even rarer!); most pizzas are available in the S, M, and L sizes but exceptions apply",
"type", "chr", "The category or type of pizza, which can either be `classic`, `chicken`, `supreme`, or `veggie`",
"price", "dbl", "The price of the pizza and the amount that it sold for (in USD)",
) |>
gt() |>
tab_style(
style = cell_text(font = "Courier"),
locations = list(
cells_body(columns = "Column"),
cells_body(columns = "Type")
)
) |>
tab_options(
table.width = pct(80),
table.font.size = "smaller",
column_labels.font.size = "small"
)
```
```{r pizzaplace_ddtab, echo = FALSE}
pizzaplace_ddtab
```
#### A `pizzaplace` Example
Let's make a reporting table from the `pizzaplace` dataset with these requirements:
1. obtain the total sale numbers and revenue from each `size` of pizza from each category (`type`)
2. create a gt table where each row represents a combination of `size`-`type` (`size` provides the row labels and `type` forms the row groups)
3. add a title to explain the contents of the table
4. format the numeric `sold` column to use commas and no decimal places
5. format the currency values (`income`) to be in `USD` currency
5. add a summary for each grouping that provides total sell counts and revenue amounts
6. color the row groups and summary cells to add a little pizzazz
```{r pizzaplace}
# Create a gt table based on a preprocessed `pizzaplace`
pizzaplace |>
dplyr::group_by(type, size) |>
dplyr::summarize(
sold = n(),
income = sum(price),
.groups = "drop_last"
) |>
gt(rowname_col = "size") |>
tab_header(title = "Pizzas Sold in 2015") |>
fmt_integer(columns = sold) |>
fmt_currency(columns = income) |>
summary_rows(
columns = sold,
fns = list(TOTAL = "sum"),
fmt = list(~ fmt_integer(.))
) |>
summary_rows(
columns = income,
fns = list(TOTAL = "sum"),
fmt = list(~ fmt_currency(.))
) |>
tab_options(
summary_row.background.color = "#ACEACE",
row_group.background.color = "#FFEFDB"
)
```
### `exibble`
The example tibble that's useful for **gt** is called `exibble`. It's 8 rows, has clear ordering of data, and the columns contain data that can be tested with the various **gt** formatter functions (`fmt*()`). Here is a table describing the columns of `exibble`:
```{r exibble_make_data_dictionary, echo=FALSE}
exibble_ddtab <-
dplyr::tribble(
~Column, ~Type, ~Description,
"num", "dbl", "A numeric column ordered with increasingly larger values",
"char", "chr", "A character column composed of names of fruits from a to h",
"fctr", "fct", "A factor column with numbers from 1 to 8, written out",
"date, time, datetime", "chr", "Character columns with dates, times, and datetimes",
"currency", "dbl", "A numeric column that is useful for testing currency-based formatting",
"row", "chr", "A character column in the format `row_X` which can be useful for testing with row label in a table stub",
"group", "chr", "A character column with four `grp_a` values and four `grp_b` values which can be useful for testing tables that contain row groups",
) |>
gt() |>
tab_style(
style = cell_text(font = "Courier"),
locations = list(
cells_body(columns = "Column"),
cells_body(columns = "Type")
)
) |>
tab_options(
table.width = pct(80),
table.font.size = "smaller",
column_labels.font.size = "small"
)
```
```{r exibble_ddtab, echo = FALSE}
exibble_ddtab
```
#### An `exibble` Example
Let's test as many formatter functions as possible with `exibble` while also using row labels and row groups (furnished by the `row` and `group` columns). We'll format `num` to display numbers with 2 decimal places. The dates in `date` will be formatted with `date_style` `6` (the `m_day_year` style, use `info_date_style()` to learn about all of them). The 24-h time values in `time` will use `time_style` `4` (`hm_p`, more info at `info_time_style()`). Datetimes as in `datetime` column can be formatted with the `fmt_datetime()` function (which uses the `date_style` and `time_style` arguments). The column `currency` will be formatted as a currency with `fmt_currency` and we'll consider these values to be euros (`currency = "EUR"`).
```{r exibble}
# Create a gt table based on `exibble`
exibble |>
gt(
rowname_col = "row",
groupname_col = "group"
) |>
fmt_number(
columns = num,
decimals = 2
) |>
fmt_date(
columns = date,
date_style = "m_day_year"
) |>
fmt_time(
columns = time,
time_style = "h_m_p"
) |>
fmt_datetime(
columns = datetime,
date_style = "m_day_year",
time_style = "h_m_p"
) |>
fmt_currency(
columns = currency,
currency = "EUR"
) |>
tab_options(
column_labels.font.size = "small",
table.font.size = "small",
row_group.font.size = "small",
data_row.padding = px(3)
)
```