---
title: "Introduction to dplyr"
author: "Reiko Okamoto"
date: "`r Sys.Date()`"
format: gfm
editor: visual
execute:
  echo: true
---
## 👋Welcome to the tidyverse
#### *What is the tidyverse?*
The [tidyverse](https://tidyverse.tidyverse.org/) is a collection of R packages designed for data science. Arguably, two of the most popular packages in the tidyverse are [dplyr](https://dplyr.tidyverse.org/) for data manipulation and [ggplot2](https://ggplot2.tidyverse.org/) for data visualization. These are also two of the packages we are covering today!
#### *Why learn it?*
The skills you gain are not just limited to R. The concepts, like filtering data and creating plots, are applicable to other languages like SQL and Python. This makes it easier to pick up other tools and languages later. Additionally, the tidyverse is in tune with open science practices, helping you create analyses that are more accessible, transparent, and reproducible.
#### *Keep in mind...*
There's no expectation for you to memorize everything. Even experienced programmers don't have every function memorized - they're constantly googling things! My goal today is to help you get comfortable with, and hopefully interested in, using the tidyverse for data analysis.
## 🐧Meet the Palmer penguins
We'll use the `penguins` data set from the [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/) package. It contains measurements for adult penguins observed on islands in the Palmer Archipelago, off the Antarctic Peninsula. The data were collected and made available by Dr. Kristen Gorman and the Palmer Station Long Term Ecological Research (LTER) Program. This data set is great for learning because it's small enough to be manageable for beginners but still contains several different data types and features.
💻Load the necessary packages:
```{r}
# YOUR CODE HERE
```
We only need to install a package once, but we'll need to reload it every time we start a new session.
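As a minimal sketch of the install-once, load-every-session pattern (shown here with dplyr only; the install line is commented out and its package list is an assumption about what this workshop needs):

```r
# One-time setup, commented out so it is not re-run in every session:
# install.packages(c("dplyr", "palmerpenguins"))

# Needed in every new R session: attach the package to the search path
library(dplyr)

# Check that dplyr is now attached
"package:dplyr" %in% search()
```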
💻Open the data set documentation:
```{r}
# YOUR CODE HERE
```
## 1️⃣Get a glimpse of our data: glimpse()
💻Inspect the data using the [`glimpse()`](https://dplyr.tidyverse.org/reference/glimpse.html) function.
```{r}
# YOUR CODE HERE
```
- *How many rows?*
- *How many columns?*
- *How many data types?*
- *Are there any missing values?*
## 2️⃣Keep or drop columns: select()
Sometimes, we only need to work with a few specific columns instead of the entire data set. The [`select()`](https://dplyr.tidyverse.org/reference/select.html) function makes it easy to do just that. It allows us to focus on the columns we need and ignore the rest, making our code cleaner.
💻Select one column by name:
```{r}
# YOUR CODE HERE
```
💻Select two columns by name:
```{r}
# YOUR CODE HERE
```
💻Select all columns except certain ones:
```{r}
# YOUR CODE HERE
```
💻Select columns that start with "bill":
```{r}
# YOUR CODE HERE
```
Similar functions like `ends_with()` and `contains()` are also available to select columns based on the ending or presence of specific characters.
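To illustrate these helpers, here is a small sketch on a toy data frame. The values are invented for illustration and are not taken from the real `penguins` data; only the column names are borrowed from it.

```r
library(dplyr)

# Toy data with made-up values (not the real penguins data)
toy <- tibble(
  species = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 46.5),
  bill_depth_mm = c(18.7, 14.3),
  body_mass_g = c(3750L, 5000L)
)

# Columns whose names end in "_mm"
names(select(toy, ends_with("_mm")))

# Columns whose names contain "mass"
names(select(toy, contains("mass")))
```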
💻Select numeric variables:
```{r}
# YOUR CODE HERE
```
## 3️⃣Keep rows that match a condition: filter()
Filtering data is a common task in data analysis. Use the [`filter()`](https://dplyr.tidyverse.org/reference/filter.html) function to focus on specific observations.
💻Show only the penguins of the Gentoo species:
```{r}
# YOUR CODE HERE
```
💻Find all Adelie penguins that are also on Torgersen Island:
```{r}
# YOUR CODE HERE
```
The comma acts as an AND operator, meaning both conditions must be true for a row to be included.
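One way to convince ourselves of this is to compare the comma with an explicit `&`. The sketch below uses a toy data frame with invented values, and the 3760 g threshold is arbitrary.

```r
library(dplyr)

# Toy data with made-up values (not the real penguins data)
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo"),
  island = c("Torgersen", "Biscoe", "Biscoe"),
  body_mass_g = c(3750, 3800, 5000)
)

# Conditions separated by a comma...
with_comma <- filter(toy, species == "Adelie", body_mass_g > 3760)

# ...and the same conditions joined with an explicit AND
with_and <- filter(toy, species == "Adelie" & body_mass_g > 3760)

# Both versions keep the same rows
identical(with_comma, with_and)
```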
💻Find all penguins that are either Adelie or Gentoo:
```{r}
# YOUR CODE HERE
```
The vertical bar acts as an OR operator, meaning a row is returned if any of the conditions are true.
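The difference between OR and AND is easiest to see by running the same two conditions both ways. This is a sketch on a toy data frame with invented values; the conditions themselves are arbitrary.

```r
library(dplyr)

# Toy data with made-up values (not the real penguins data)
toy <- tibble(
  species = c("Adelie", "Chinstrap", "Gentoo"),
  island = c("Torgersen", "Dream", "Biscoe"),
  body_mass_g = c(3750, 3700, 5000)
)

# OR: a row is kept if at least one condition is TRUE
either <- filter(toy, island == "Dream" | body_mass_g > 4500)

# AND: a row is kept only if both conditions are TRUE
both <- filter(toy, island == "Dream", body_mass_g > 4500)

nrow(either)  # rows matching at least one condition
nrow(both)    # rows matching both conditions
```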
#### 📝Exercise 1
1. Select the `sex` and `year` columns from the data.
2. Filter the data to show only the rows where `flipper_length_mm` is greater than 200.
3. Filter the data to find all penguins that are either on Biscoe Island or Torgersen Island.
```{r}
# YOUR CODE HERE
```
## 4️⃣Pipes
💻Filter the data to only include Adelie penguins, and then keep only the columns that start with "bill":
```{r}
# YOUR CODE HERE
```
This works, but it can get difficult to read, especially as our code grows more complex. The nested functions force us to read the code inside-out, making it less intuitive.
💻Use the pipe to do the same thing in a more streamlined way:
Keyboard shortcuts for the pipe:
- Windows: Ctrl + Shift + M
- Mac: Cmd + Shift + M
```{r}
# YOUR CODE HERE
```
The pipe allows us to pass the output of one function directly to the next. This approach makes our code easier to read because the operations flow left to right, top to bottom.
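As a sketch of that equivalence, the nested call and the piped call below produce the same result. The toy data and the 4000 g cutoff are invented for illustration, and `%>%` is assumed to be the pipe inserted by the keyboard shortcut.

```r
library(dplyr)

# Toy data with made-up values (not the real penguins data)
toy <- tibble(
  species = c("Adelie", "Gentoo", "Gentoo"),
  flipper_length_mm = c(181, 217, 230),
  body_mass_g = c(3750, 5000, 5700)
)

# Nested: read from the inside out
nested <- select(filter(toy, body_mass_g > 4000), species, flipper_length_mm)

# Piped: read from top to bottom
piped <- toy %>%
  filter(body_mass_g > 4000) %>%
  select(species, flipper_length_mm)

# Same result either way
identical(nested, piped)
```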
#### 📝Exercise 2
Using the pipe, filter the data to include only female penguins with `bill_length_mm` less than or equal to 40, and then select the `species` and `bill_length_mm` columns.
```{r}
# YOUR CODE HERE
```
## 5️⃣Create and modify columns: mutate()
In data analysis, it's common to derive new variables from existing ones. The [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.html) function is essential for these tasks.
💻Create a new column called `body_mass_kg` (i.e., the `body_mass_g` column converted from grams to kilograms):
```{r}
# YOUR CODE HERE
```
By separating the new columns with a comma, we can create multiple new variables in one function call.
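A minimal sketch of this, using a toy data frame with invented values and hypothetical new column names. It also shows that a column created earlier in the same `mutate()` call can be used by a later one.

```r
library(dplyr)

# Toy data with made-up values (not the real penguins data)
toy <- tibble(
  flipper_length_mm = c(181, 217),
  body_mass_g = c(3750, 5000)
)

toy2 <- toy %>%
  mutate(
    # First new column: millimeters to centimeters
    flipper_length_cm = flipper_length_mm / 10,
    # Second new column, built from the column created just above
    mass_per_cm = body_mass_g / flipper_length_cm
  )

names(toy2)
```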
💻Use [`if_else()`](https://dplyr.tidyverse.org/reference/if_else.html) to create a column called `large_penguin`, which indicates whether the penguin's body mass is greater than 4,000 grams:
```{r}
# YOUR CODE HERE
```
💻Use [`case_when()`](https://dplyr.tidyverse.org/reference/case_when.html) to categorize the penguins based on their body mass:
```{r}
# YOUR CODE HERE
```
`if_else()` is best for binary conditions where we need to choose between two outcomes. In contrast, `case_when()` is ideal for handling multiple conditions, allowing us to return different values based on various criteria.
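The contrast can be sketched side by side on a toy column. The values, thresholds, and labels below are invented for illustration; note how a missing input stays missing in both new columns.

```r
library(dplyr)

# Toy data with made-up values, including one missing measurement
toy <- tibble(bill_length_mm = c(35.2, 42.0, 51.3, NA))

labelled <- toy %>%
  mutate(
    # Two outcomes: if_else()
    long_bill = if_else(bill_length_mm > 45, "yes", "no"),
    # Three outcomes: case_when(), conditions are checked in order
    bill_class = case_when(
      bill_length_mm < 40 ~ "short",
      bill_length_mm < 50 ~ "medium",
      bill_length_mm >= 50 ~ "long"
    )
  )

labelled
```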
💻Use [`across()`](https://dplyr.tidyverse.org/reference/across.html) to efficiently round values in multiple columns simultaneously:
```{r}
# YOUR CODE HERE
```
#### 📝Exercise 3
1. Create a new column called `bill_depth_cm` that converts the `bill_depth_mm` column from millimeters to centimeters.
2. Create a new column called `flipper_size` that categorizes penguins as short, average, or long based on their `flipper_length_mm`. Hint: Define short as less than 190 mm, average as between 190 and 210 mm (inclusive), and long as greater than 210 mm.
```{r}
# YOUR CODE HERE
```
## 6️⃣Compute summary statistics: summarise()
We often need to summarize our data to understand key characteristics (e.g., measures of central tendency, measures of variability, frequency). The [`summarise()`](https://dplyr.tidyverse.org/reference/summarise.html) function lets us calculate these summary statistics efficiently.
💻Calculate the mean body mass for all the penguins in the data.
```{r}
# YOUR CODE HERE
```
The function creates a new data frame with a single row containing the summary statistic.
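Here is a small sketch of that behavior on a toy column with invented values. It also shows why `na.rm = TRUE` matters when a column contains missing values.

```r
library(dplyr)

# Toy data with made-up values, including one missing measurement
toy <- tibble(flipper_length_mm = c(181, 217, NA, 230))

# Without na.rm = TRUE, a single NA makes the mean NA
toy %>% summarise(mean_flipper = mean(flipper_length_mm))

# With na.rm = TRUE, missing values are ignored; n() counts all rows
overall <- toy %>%
  summarise(
    mean_flipper = mean(flipper_length_mm, na.rm = TRUE),
    n_rows = n()
  )

# One row, one column per summary
dim(overall)
```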
## 7️⃣Group by one or more variables: group_by()
In data analysis, a common task is to split our data into groups, apply a function to each group, and then combine the results. This approach is known as the split-apply-combine paradigm. The [`group_by()`](https://dplyr.tidyverse.org/reference/group_by.html) function helps us achieve this by allowing us to specify how we want to split our data into groups.
💻Calculate mean body mass for each species:
```{r}
# YOUR CODE HERE
```
A grouped data frame has all the properties of a regular data frame but has an additional property that describes the grouping structure. dplyr will treat each group as if it were a separate data frame, so operations like `summarise()` are applied to each group separately, and the results are then brought back together.
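To make the grouping structure visible, the sketch below inspects it with `group_vars()` before summarising. The toy data are invented for illustration.

```r
library(dplyr)

# Toy data with made-up values (not the real penguins data)
toy <- tibble(
  island = c("Biscoe", "Biscoe", "Dream", "Dream"),
  flipper_length_mm = c(210, 220, 190, 196)
)

# Grouping does not change the rows, it only records the groups
grouped <- toy %>% group_by(island)
group_vars(grouped)

# summarise() is then applied to each group and the results are combined
by_island <- grouped %>%
  summarise(mean_flipper = mean(flipper_length_mm))

by_island
```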
💻Count the number of penguins on each island:
```{r}
# YOUR CODE HERE
```
Alternatively, use the [`count()`](https://dplyr.tidyverse.org/reference/count.html) function, which combines `group_by()` and `tally()` in one step.
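As a quick sketch of that shortcut, the two pipelines below give the same counts. The toy column is invented for illustration.

```r
library(dplyr)

# Toy data with made-up values (not the real penguins data)
toy <- tibble(species = c("Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie"))

# Two steps: group, then tally the rows in each group
two_step <- toy %>%
  group_by(species) %>%
  tally()

# One step: count() does the grouping and tallying together
one_step <- toy %>% count(species)

one_step
```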
💻Calculate the mean and standard deviation of body mass for each combination of species and sex:
```{r}
# YOUR CODE HERE
```
By default, when we group by multiple variables, `summarise()` drops only the last grouping variable and keeps the rest. Here, the output is still grouped by `species`. To remove the remaining grouping from a data frame, use the `ungroup()` function or the `.groups = "drop"` argument in the `summarise()` function. Both methods will allow us to continue working with the data as a regular data frame.
```{r}
# YOUR CODE HERE
```
Similar to what we've seen in other functions, we can create multiple summaries in a single step by separating them with commas.
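The grouping that remains after `summarise()` can be checked with `group_vars()`. The sketch below uses a toy data frame with invented values and compares the default behavior with `.groups = "drop"`.

```r
library(dplyr)

# Toy data with made-up values (not the real penguins data)
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  sex = c("female", "male", "female", "male"),
  flipper_length_mm = c(185, 193, 212, 221)
)

# Default: only the last grouping variable (sex) is dropped
still_grouped <- toy %>%
  group_by(species, sex) %>%
  summarise(mean_flipper = mean(flipper_length_mm))
group_vars(still_grouped)

# .groups = "drop" returns an ungrouped data frame
dropped <- toy %>%
  group_by(species, sex) %>%
  summarise(mean_flipper = mean(flipper_length_mm), .groups = "drop")
group_vars(dropped)
```

Calling `ungroup(still_grouped)` would leave the same ungrouped result as the second pipeline.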
## 8️⃣Sort rows and extract specific values: arrange(), slice(), pull()
Sometimes, we're interested in extracting a particular value from a data frame, like finding the largest or smallest value in a column.
1. 💻Use the [`arrange()`](https://dplyr.tidyverse.org/reference/arrange.html) function to reorder our data by `mean_body_mass` in descending order
2. 💻Once our data is sorted, use [`slice()`](https://dplyr.tidyverse.org/reference/slice.html) to retrieve the row with the largest mean body mass
3. 💻Use [`pull()`](https://dplyr.tidyverse.org/reference/pull.html) to extract the mean body mass value from the data as a single numeric vector
```{r}
# YOUR CODE HERE
```
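The same three verbs also work in the other direction. As a sketch on a toy summary table (invented values, hypothetical `mean_flipper` column), this finds the island with the smallest mean by sorting in ascending order:

```r
library(dplyr)

# Toy summary table with made-up values
toy <- tibble(
  island = c("Biscoe", "Dream", "Torgersen"),
  mean_flipper = c(210, 193, 191)
)

shortest_island <- toy %>%
  arrange(mean_flipper) %>%  # ascending order is the default
  slice(1) %>%               # keep the first row
  pull(island)               # extract one column as a plain vector

shortest_island
```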
#### 📝Exercise 4
Use the dplyr verbs we have learned so far to analyze penguin flipper length:
- Select the relevant columns: `island`, `species`, and `flipper_length_mm`
- Group the data by `island` and `species`, then calculate the **average** and **maximum** flipper length for each group
- Arrange the results by the **average** flipper length in **ascending** order
Hint: Remember to handle missing values using `na.rm = TRUE` when calculating the summary statistics.
```{r}
# YOUR CODE HERE
```
## 🌟Bonus: lengthen and widen data: pivot_longer(), pivot_wider()
The [tidyr](https://tidyr.tidyverse.org/) package is another key component of the tidyverse. It focuses on reshaping data to ensure it's in the right format for analysis.
Sometimes we need data in a "long" format (more rows, fewer columns) for certain analyses, or in a "wide" format (more columns, fewer rows) for others. For example, a wide data set might have separate columns for different measurements (e.g., bill length, flipper length), but for some analyses or visualizations, we might need to convert this into a long format where all measurements are in a single column, with another column specifying the measurement type.
Imagine our observation of interest is the measurement itself, rather than the penguin...
💻Use the [`pivot_longer()`](https://tidyr.tidyverse.org/reference/pivot_longer.html) function to reshape the `penguins` data so that all the measurements related to the penguins are in a single column, and another column indicates what measurement type it is:
```{r}
# YOUR CODE HERE
```
What if we want to go back to the original wide format?
💻Use the [`pivot_wider()`](https://tidyr.tidyverse.org/reference/pivot_wider.html) function to reverse the process:
```{r}
# YOUR CODE HERE
```
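For reference, here is a minimal round trip between wide and long formats on a toy data frame. The values, the `id` column, and the `measurement` and `value` column names are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Toy wide data with made-up values: one row per individual
toy <- tibble(
  id = c(1, 2),
  bill_length_mm = c(39.1, 46.5),
  flipper_length_mm = c(181, 217)
)

# Wide to long: one row per individual and measurement
long <- toy %>%
  pivot_longer(
    cols = ends_with("_mm"),
    names_to = "measurement",
    values_to = "value"
  )

# Long back to wide: one column per measurement again
wide <- long %>%
  pivot_wider(names_from = measurement, values_from = value)

dim(long)
dim(wide)
```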
## 📚Resources
| Function | Description |
|------------------------|-----------------------------------------------|
| `dplyr::glimpse()` | Get a glimpse of your data |
| `dplyr::select()` | Keep or drop columns using their names and types |
| `dplyr::filter()` | Keep rows that match a condition |
| `dplyr::mutate()` | Create, modify, and delete columns |
| `dplyr::summarise()` | Summarise each group down to one row |
| `dplyr::group_by()` | Group by one or more variables |
| `dplyr::count()` | Count the observations in each group |
| `dplyr::arrange()` | Order rows using column values |
| `dplyr::slice()` | Subset rows using their positions |
| `dplyr::pull()` | Extract a single column |
| `tidyr::pivot_longer()` | Pivot data from wide to long |
| `tidyr::pivot_wider()` | Pivot data from long to wide |
- [Data transformation with dplyr cheatsheet](https://rstudio.github.io/cheatsheets/html/data-transformation.html)