Reiko Okamoto 2024-09-16
The tidyverse is a collection of R packages designed for data science. Arguably, two of the most popular packages in the tidyverse are dplyr for data manipulation and ggplot2 for data visualization. These are also two of the packages we are covering today!
The skills you gain are not just limited to R. The concepts, like filtering data and creating plots, are applicable to other languages like SQL and Python. This makes it easier to pick up other tools and languages later. Additionally, the tidyverse is in tune with open science practices, helping you create analyses that are more accessible, transparent, and reproducible.
There’s no expectation for you to memorize everything. Even experienced programmers don’t have every function memorized - they’re constantly googling things! My goal today is to help you get comfortable with, and hopefully interested in, using the tidyverse for data analysis.
We’ll use the penguins
data set from the
palmerpenguins
package. It contains measurements for adult penguins observed in the
islands in the Palmer Archipelago off the Antarctic Peninsula. The data
were collected and made available by Dr. Kristen Gorman and the Palmer
Station Long Term Ecological Research (LTER) Program. This data set is
great for learning because it’s small enough to be manageable for
beginners but contains a range of different types of data and features.
💻Load the necessary packages:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(palmerpenguins)
We only need to install a package once, but we’ll need to reload it every time we start a new session.
💻Open the data set documentation:
?penguins
starting httpd help server ... done
💻Inspect the data using the
glimpse()
function.
glimpse(penguins)
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
-
How many rows?
-
How many columns?
-
How many data types?
-
Are there any missing values?
Sometimes, we only need to work with a few specific columns instead of
the entire data. The
select()
function
makes it easy to do just that. It allows us to focus on the columns we
need and ignore the rest, making our code cleaner.
💻Select one column by name:
select(penguins, species)
# A tibble: 344 × 1
species
<fct>
1 Adelie
2 Adelie
3 Adelie
4 Adelie
5 Adelie
6 Adelie
7 Adelie
8 Adelie
9 Adelie
10 Adelie
# ℹ 334 more rows
💻Select two columns by name:
select(penguins, species, island)
# A tibble: 344 × 2
species island
<fct> <fct>
1 Adelie Torgersen
2 Adelie Torgersen
3 Adelie Torgersen
4 Adelie Torgersen
5 Adelie Torgersen
6 Adelie Torgersen
7 Adelie Torgersen
8 Adelie Torgersen
9 Adelie Torgersen
10 Adelie Torgersen
# ℹ 334 more rows
select(penguins, c(species, island))
# A tibble: 344 × 2
species island
<fct> <fct>
1 Adelie Torgersen
2 Adelie Torgersen
3 Adelie Torgersen
4 Adelie Torgersen
5 Adelie Torgersen
6 Adelie Torgersen
7 Adelie Torgersen
8 Adelie Torgersen
9 Adelie Torgersen
10 Adelie Torgersen
# ℹ 334 more rows
💻Select all columns except certain ones:
select(penguins, -species, -island)
# A tibble: 344 × 6
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
<dbl> <dbl> <int> <int> <fct> <int>
1 39.1 18.7 181 3750 male 2007
2 39.5 17.4 186 3800 female 2007
3 40.3 18 195 3250 female 2007
4 NA NA NA NA <NA> 2007
5 36.7 19.3 193 3450 female 2007
6 39.3 20.6 190 3650 male 2007
7 38.9 17.8 181 3625 female 2007
8 39.2 19.6 195 4675 male 2007
9 34.1 18.1 193 3475 <NA> 2007
10 42 20.2 190 4250 <NA> 2007
# ℹ 334 more rows
💻Select columns that start with “bill”:
select(penguins, starts_with("bill"))
# A tibble: 344 × 2
bill_length_mm bill_depth_mm
<dbl> <dbl>
1 39.1 18.7
2 39.5 17.4
3 40.3 18
4 NA NA
5 36.7 19.3
6 39.3 20.6
7 38.9 17.8
8 39.2 19.6
9 34.1 18.1
10 42 20.2
# ℹ 334 more rows
Similar functions like ends_with()
and contains()
are also available
to select columns based on the ending or presence of specific
characters.
💻Select numeric variables:
select(penguins, where(is.numeric))
# A tibble: 344 × 5
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
<dbl> <dbl> <int> <int> <int>
1 39.1 18.7 181 3750 2007
2 39.5 17.4 186 3800 2007
3 40.3 18 195 3250 2007
4 NA NA NA NA 2007
5 36.7 19.3 193 3450 2007
6 39.3 20.6 190 3650 2007
7 38.9 17.8 181 3625 2007
8 39.2 19.6 195 4675 2007
9 34.1 18.1 193 3475 2007
10 42 20.2 190 4250 2007
# ℹ 334 more rows
Filtering data is a common task in data analysis. Use the
filter()
function
to focus on specific observations.
💻Show only the penguins of the Gentoo species:
filter(penguins, species == "Gentoo")
# A tibble: 124 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 46.1 13.2 211 4500
2 Gentoo Biscoe 50 16.3 230 5700
3 Gentoo Biscoe 48.7 14.1 210 4450
4 Gentoo Biscoe 50 15.2 218 5700
5 Gentoo Biscoe 47.6 14.5 215 5400
6 Gentoo Biscoe 46.5 13.5 210 4550
7 Gentoo Biscoe 45.4 14.6 211 4800
8 Gentoo Biscoe 46.7 15.3 219 5200
9 Gentoo Biscoe 43.3 13.4 209 4400
10 Gentoo Biscoe 46.8 15.4 215 5150
# ℹ 114 more rows
# ℹ 2 more variables: sex <fct>, year <int>
💻Find all Adelie penguins that are also on Torgersen Island:
filter(penguins, species == "Adelie", island == "Torgersen")
# A tibble: 52 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 42 more rows
# ℹ 2 more variables: sex <fct>, year <int>
The comma acts as an AND operator, meaning both conditions must be true for a row to be included.
💻Find all penguins that either Adelie or Gentoo:
filter(penguins, species == "Adelie" | species == "Gentoo")
# A tibble: 276 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 266 more rows
# ℹ 2 more variables: sex <fct>, year <int>
filter(penguins, species != "Chinstrap")
# A tibble: 276 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 266 more rows
# ℹ 2 more variables: sex <fct>, year <int>
The vertical bar acts as an OR operator, meaning a row is returned if any of the conditions are true.
-
Select the
sex
andyear
columns from the data. -
Filter the data to show only the rows where
flipper_length_mm
is greater than 200. -
Filter the data to find all penguins that are either on Biscoe Island or Torgersen Island.
select(penguins, sex, year)
# A tibble: 344 × 2 sex year <fct> <int> 1 male 2007 2 female 2007 3 female 2007 4 <NA> 2007 5 female 2007 6 male 2007 7 female 2007 8 male 2007 9 <NA> 2007 10 <NA> 2007 # ℹ 334 more rows
filter(penguins, flipper_length_mm > 200)
# A tibble: 148 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Dream 35.7 18 202 3550 2 Adelie Dream 41.1 18.1 205 4300 3 Adelie Dream 40.8 18.9 208 4300 4 Adelie Biscoe 41 20 203 4725 5 Adelie Torgersen 41.4 18.5 202 3875 6 Adelie Torgersen 44.1 18 210 4000 7 Adelie Dream 41.5 18.5 201 4000 8 Gentoo Biscoe 46.1 13.2 211 4500 9 Gentoo Biscoe 50 16.3 230 5700 10 Gentoo Biscoe 48.7 14.1 210 4450 # ℹ 138 more rows # ℹ 2 more variables: sex <fct>, year <int>
filter(penguins, island == "Biscoe" | island == "Torgersen")
# A tibble: 220 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 3475 10 Adelie Torgersen 42 20.2 190 4250 # ℹ 210 more rows # ℹ 2 more variables: sex <fct>, year <int>
💻Filter the data to only include Adelie penguins, and then keep only the columns that start with “bill”:
select(filter(penguins, species == "Adelie"), starts_with("bill"))
# A tibble: 152 × 2
bill_length_mm bill_depth_mm
<dbl> <dbl>
1 39.1 18.7
2 39.5 17.4
3 40.3 18
4 NA NA
5 36.7 19.3
6 39.3 20.6
7 38.9 17.8
8 39.2 19.6
9 34.1 18.1
10 42 20.2
# ℹ 142 more rows
This works, but it can get difficult to read, especially as our code grows more complex. The nested functions force us to read the code inside-out, making it less intuitive
💻Use the pipe to do the same thing in a more streamlined way:
Keyboard shortcuts for the pipe:
-
Windows: Ctrl + Shift + M
-
Mac: Cmd + Shift + M
penguins |>
filter(species == "Adelie") |>
select(starts_with("bill"))
# A tibble: 152 × 2
bill_length_mm bill_depth_mm
<dbl> <dbl>
1 39.1 18.7
2 39.5 17.4
3 40.3 18
4 NA NA
5 36.7 19.3
6 39.3 20.6
7 38.9 17.8
8 39.2 19.6
9 34.1 18.1
10 42 20.2
# ℹ 142 more rows
The pipe allows us to pass the output of one function directly to the next. This approach makes our code easier to read because the operations flow left to right, top to bottom.
Using the pipe, filter the data to include only female penguins with
bill_length_mm
less than or equal to 40, and then select the species
and bill_length_mm
columns.
penguins |>
filter(sex == "female", bill_length_mm <= 40) |>
select(species, bill_length_mm)
# A tibble: 66 × 2
species bill_length_mm
<fct> <dbl>
1 Adelie 39.5
2 Adelie 36.7
3 Adelie 38.9
4 Adelie 36.6
5 Adelie 38.7
6 Adelie 34.4
7 Adelie 37.8
8 Adelie 35.9
9 Adelie 35.3
10 Adelie 37.9
# ℹ 56 more rows
In data analysis, it’s common to derive new variables from existing
ones. The
mutate()
function
is essential for these tasks.
💻Create a new column called body_mass_kg
(i.e., the body_mass_g
column converted from grams to kilograms):
penguins |>
mutate(body_mass_kg = body_mass_g / 1000)
# A tibble: 344 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, body_mass_kg <dbl>
By separating the new columns with a comma, we can create multiple new variables in one function call.
💻Use if_else()
to create a column called large_penguin
, which indicates whether the
penguin’s body mass is greater than 4,000 grams:
penguins |>
mutate(large_penguin = if_else(body_mass_g > 4000, TRUE, FALSE))
# A tibble: 344 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, large_penguin <lgl>
💻Use
case_when()
to
categorize the penguins based on their body mass:
penguins |>
mutate(size_category = case_when(
body_mass_g < 3000 ~ "small",
body_mass_g < 4000 ~ "medium",
body_mass_g >= 4000 ~ "large",
.default = "unknown"
))
# A tibble: 344 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, size_category <chr>
if_else()
is best for binary conditions where we need to choose
between two outcomes. In contrast, case_when()
is ideal for handling
multiple conditions, allowing us to return different values based on
various criteria.
💻Use across()
to
efficiently round values in multiple columns simultaneously:
penguins |>
mutate(across(.cols = c(bill_length_mm, bill_depth_mm), .fns = round))
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39 19 181 3750
2 Adelie Torgersen 40 17 186 3800
3 Adelie Torgersen 40 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 37 19 193 3450
6 Adelie Torgersen 39 21 190 3650
7 Adelie Torgersen 39 18 181 3625
8 Adelie Torgersen 39 20 195 4675
9 Adelie Torgersen 34 18 193 3475
10 Adelie Torgersen 42 20 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
-
Create a new column called
bill_depth_cm
that converts thebill_depth_mm
columns from millimeters to centimeters. -
Create a new column called
flipper_size
that categorizes penguins as short, average, or long based on theirflipper_length_mm
. Hint: Define short as less than 190 mm, average as between 190 and 210 mm, and long as greater than 210 mm.penguins |> mutate(bill_depth_cm = bill_depth_mm / 0.1)
# A tibble: 344 × 9 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 3475 10 Adelie Torgersen 42 20.2 190 4250 # ℹ 334 more rows # ℹ 3 more variables: sex <fct>, year <int>, bill_depth_cm <dbl>
penguins |> mutate(flipper_size = case_when( flipper_length_mm < 190 ~ "short", 190 <= flipper_length_mm & flipper_length_mm <= 210 ~ "average", flipper_length_mm > 210 ~ "long", .default = "unknown" ))
# A tibble: 344 × 9 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 3475 10 Adelie Torgersen 42 20.2 190 4250 # ℹ 334 more rows # ℹ 3 more variables: sex <fct>, year <int>, flipper_size <chr>
We often need to summarize our data to understand key characteristics
(e.g., measures of central tendency, measures of variability,
frequency). The
summarise()
function lets us calculate these summary statistics efficiently.
💻Calculate the mean body mass for all the penguins in the data.
penguins |>
summarise(mean_body_mass = mean(body_mass_g, na.rm = TRUE))
# A tibble: 1 × 1
mean_body_mass
<dbl>
1 4202.
The function creates a new data frame with a single row containing the summary statistic.
In data analysis, a common task is to split our data into groups, apply
a function to each group, and then combine the results. This approach is
known as the split-apply-combine paradigm. The
group_by()
function helps us achieve this by allowing us to specify how we want to
split our data into groups.
💻Calculate mean body mass for each species:
penguins |>
group_by(species) |>
summarise(mean_body_mass = mean(body_mass_g, na.rm = TRUE))
# A tibble: 3 × 2
species mean_body_mass
<fct> <dbl>
1 Adelie 3701.
2 Chinstrap 3733.
3 Gentoo 5076.
A grouped data frame has all the properties of a regular data frame but
has an additional property that describes the grouping structure. R will
treat each group as it were a separate data frame, so operations like
summarise()
are applied to each group separately, and then brought
back together.
💻Count the number of penguins on each island:
penguins |>
group_by(island) |>
tally()
# A tibble: 3 × 2
island n
<fct> <int>
1 Biscoe 168
2 Dream 124
3 Torgersen 52
Alternatively, use the
count()
function,
which combines group_by()
and tally()
in one step.
💻Calculate the mean and standard deviation of body mass for each combination of species and sex:
penguins |>
group_by(species, sex) |>
summarise(mean_body_mass = mean(body_mass_g, na.rm = TRUE),
sd_body_mass = sd(body_mass_g, na.rm = TRUE))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 8 × 4
# Groups: species [3]
species sex mean_body_mass sd_body_mass
<fct> <fct> <dbl> <dbl>
1 Adelie female 3369. 269.
2 Adelie male 4043. 347.
3 Adelie <NA> 3540 477.
4 Chinstrap female 3527. 285.
5 Chinstrap male 3939. 362.
6 Gentoo female 4680. 282.
7 Gentoo male 5485. 313.
8 Gentoo <NA> 4588. 338.
By default, when we apply a grouping with multiple factors, dplyr will
keep the last level of grouping after the summary. Here, the output is
still grouped by species
. To remove grouping from a data frame, use
the ungroup()
function or the .groups = "drop"
argument in the
summarise()
function. Both methods will allow us to continue working
with the data as a regular data frame.
# option 1
penguins |>
group_by(species, sex) |>
summarise(mean_body_mass = mean(body_mass_g, na.rm = TRUE),
sd_body_mass = sd(body_mass_g, na.rm = TRUE)) |>
ungroup()
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 8 × 4
species sex mean_body_mass sd_body_mass
<fct> <fct> <dbl> <dbl>
1 Adelie female 3369. 269.
2 Adelie male 4043. 347.
3 Adelie <NA> 3540 477.
4 Chinstrap female 3527. 285.
5 Chinstrap male 3939. 362.
6 Gentoo female 4680. 282.
7 Gentoo male 5485. 313.
8 Gentoo <NA> 4588. 338.
# option 2
df <- penguins |>
group_by(species, sex) |>
summarise(mean_body_mass = mean(body_mass_g, na.rm = TRUE),
sd_body_mass = sd(body_mass_g, na.rm = TRUE),
.groups = "drop")
Similar to what we’ve seen in other functions, we can create multiple summaries in a single step by separating them with commas.
Sometimes, we’re interested in extracting a particular value from a data frame, like finding the largest or smallest value in a column.
-
💻Use the
arrange()
function to reorder our data bymean_body_mass
in descending order -
💻Once our data is sorted, use
slice()
to retrieve the row with the largest mean body mass -
💻Use
pull()
to extract the mean body mass value from the data as a single numeric vector
df |>
arrange(desc(mean_body_mass)) |>
slice(1) |>
pull(mean_body_mass)
[1] 5484.836
Use the dplyr verbs we have learned so far to analyze penguin flipper length:
-
Select the relevant columns:
island
,species
, andflipper_length_mm
-
Group the data by
island
andspecies
, then calculate the average and maximum flipper length for each group -
Arrange the results by the average flipper length in ascending order
Hint: Remember to handle missing values using na.rm = TRUE
when
calculating the summary statistics.
penguins |>
select(island, species, flipper_length_mm) |>
group_by(island, species) |>
summarise(mean_flipper_len = mean(flipper_length_mm, na.rm = TRUE),
max_flipped_len = max(flipper_length_mm, na.rm = TRUE),
.groups = "drop") |>
arrange(mean_flipper_len)
# A tibble: 5 × 4
island species mean_flipper_len max_flipped_len
<fct> <fct> <dbl> <int>
1 Biscoe Adelie 189. 203
2 Dream Adelie 190. 208
3 Torgersen Adelie 191. 210
4 Dream Chinstrap 196. 212
5 Biscoe Gentoo 217. 231
The tidyr package is another key component of the tidyverse. It focuses on reshaping data to ensure it’s in the right format for analysis.
Sometimes we need data in a “long” format (more rows, fewer columns) for certain analyses, or in a “wide” format (more columns, fewer rows) for others. For example, a wide data set might have separate columns for different measurements (e.g., bill length, flipper length), but for some analyses or visualizations, we might need to convert this into a long format where all measurements are in a single column, with another column specifying the measurement type.
Imagine our observation of interest is the measurement itself, rather than the penguin…
💻Use the
pivot_longer()
function to reshape the penguins
data so that all the measurements
related to the penguins are in a single column, and another column
indicates what measurement type it is:
penguins_long <- penguins |>
mutate(id = row_number()) |>
pivot_longer(
cols = ends_with("_mm") | ends_with("_g"),
names_to = "measurement_type",
values_to = "value"
)
What if we want to go back to the original wide format?
💻Use the
pivot_wider()
function to reverse the process:
penguins_wide <- penguins_long |>
pivot_wider(
names_from = measurement_type,
values_from = value
)
Function | Description |
---|---|
dplyr::glimpse() |
Get a glimpse of your data |
dplyr::select() |
Keep or drop columns using their names and types |
dplyr::filter() |
Keep rows that match a condition |
dplyr::mutate() |
Create, modify, and delete columns |
dplyr::summarise() |
Summarise each group down to one row |
dplyr::group_by() |
Group by one or more variables |
dplyr::count() |
Count the observations in each group |
dplyr::arrange() |
Order rows using column values |
dplyr::slice() |
Subset rows using their positions |
dplyr::pull() |
Extract a single column |
tidyr::pivot_longer() |
Pivot data from wide to long |
tidyr::pivot_wider() |
Pivot data from long to wide |