-
Notifications
You must be signed in to change notification settings - Fork 103
/
Copy pathlab_5_key.Rmd
299 lines (203 loc) · 9.51 KB
/
lab_5_key.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
---
title: 'ESM 206 Lab 5'
author: "Allison Horst"
date: "10/25/2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
### Counts, uncounts, t-tests, lubridate
#### Lab 5 objectives:
- Get counts of grouped observations using `count()`
- Perform one and two sample t-tests with `t.test()`
- Parse dates with `lubridate`
- Build a `geom_tile()` heatmap
Set-up: Students should Fork & Clone the **allisonhorst/esm-206-f2019-lab-5** repo from GitHub prior to the lab to work with materials locally.
### Lab 5 data: SBC LTER Lobster abundance and fishing pressure
**Link:** https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-sbc&identifier=77&revision=newest
**Citation:** Reed D. 2019. SBC LTER: Reef: Abundance, size and fishing effort for California Spiny Lobster (Panulirus interruptus), ongoing since 2012. Environmental Data Initiative. https://doi.org/10.6073/pasta/a593a675d644fdefb736750b291579a0. Dataset accessed 10/25/2019.
#### 1. Attach packages
Required: `tidyverse`, `here`, `janitor`
```{r, message = FALSE, warnings = FALSE}
library(tidyverse)
library(here)
library(janitor)
```
#### 2. Read in data
- **lobster_abundance.csv**
- **lobster_traps.csv**
```{r, message = FALSE}
lobster_abundance <- read_csv(here("data", "lobster_abundance.csv"),
na = "-99999") %>%
clean_names()
```
First, let's ask: How is *lobster_abundance* not in tidy format?
- Each observation (lobster) is not in it's own row. (e.g. sort by `lobster_count` from high to low: there are sometimes many lobsters represented in a single row!).
- This is called a **frequency table**, but we want this to be in **case format** - so we need to expand the data frame so that each lobster is in its own row in order to be able to do analyses.
We can use `tidyr::uncount()` to expand a df by a frequency variable (here, by the value in `lobster_count`).
#### 3. Use `tidyr::uncount()` to convert to tidy format (one observation per row)
```{r}
lobster_tidy <- lobster_abundance %>%
tidyr::uncount(lobster_count)
```
Now each lobster observed has been placed into its own *row* - which is good, because now we don't have to worry about weights when we find summary statistics.
#### 4. Always look at your data. Always.
Various by site (the variable we'll consider later on):
```{r}
# Jitterplot
ggplot(lobster_tidy, aes(x = site, y = size_mm)) +
geom_jitter(aes(color = site))
# Violin plot:
ggplot(lobster_tidy, aes(x = site, y = size_mm)) +
geom_violin()
# Histogram:
ggplot(lobster_tidy, aes(x = size_mm)) +
geom_histogram() +
facet_wrap(~site)
# Quantile-quantile plot:
ggplot(lobster_tidy, aes(sample = size_mm)) +
geom_qq() +
facet_wrap(~site)
```
#### 5. Parse dates for lobster observations
Notice that the existing class of the `date` variable is a character. Boo.
```{r}
class(lobster_abundance$date)
```
The `lubridate` package (part of the `tidyverse`), and is built to make it easier to deal with date-time data. Here, we'll use `lubridate::mdy()` to help R understand it's a date, and work with the pieces (month, day and year) more easily.
**Note**: we're using `mdy()` because that's the existing format of the date column (mm/dd/yy). Check `?ymd` to see all the different options based on the format of your date column. Cool.
Add a new column with `mutate()` that is a *date* using `mdy()`:
```{r}
lobster_date <- lobster_tidy %>%
mutate(
date_new = lubridate::mdy(date)
)
```
Check it out! The *date_new* column is in nice ISO date format. Let's check the class:
```{r}
class(lobster_date$date_new)
```
Wooooo.
Now that it's in date format, we can parse it. See `?month` and `?year` - to get or set components of a date or date-time. So here, we'll use `lubridate::month()` and `lubridate::year()` to create separate columns for the month and year.
For `month()`, we can even automatically convert to month abbreviation with argument `label = TRUE`.
```{r}
lobster_parse_date <- lobster_date %>%
mutate(obs_month = lubridate::month(date_new, label = TRUE),
obs_year = lubridate::year(date_new))
```
A really cool thing about that? When we use `lubridate::month()`, it's auto stored as an ordered factor booya (so we don't need to `fct_relevel()`).
Now let's find some counts of lobster observations using different groupings.
#### 6. Find counts by groups with `dplyr::count()`
There are a bunch of ways to count things based on different variables in R. So many that sometimes it's challenging to know which to use. We'll touch on a few of the most generally useful ones here.
- `dplyr::count()`: groups, tallies, then ungroups (whoa)
- `dplyr::tally()`: use `group_by()` first, then use `tally()` for counts, then manually need to `ungroup()`
- `dplyr::n()`: number of observations in current group
##### `count()` example 1: Count the number of lobsters by year and month:
```{r}
lobster_ym <- lobster_parse_date %>%
count(obs_year, obs_month)
lobster_ym
```
##### `count()` example 2: Count the number of lobsters by year and site:
```{r}
lobster_ysite <- lobster_parse_date %>%
count(obs_year, site)
lobster_ysite
```
##### `count()` example 3: Count the number of lobsters by site:
```{r}
lobster_site <- lobster_parse_date %>%
count(site)
lobster_site
```
##### For comparison: group first, then use `tally()` to do the same thing. Other alternatives: `add_tally()`, `add_count()` for mutate versions.
```{r}
lobster_tally <- lobster_parse_date %>%
group_by(site) %>%
tally()
lobster_tally
```
What if we want to make a summary table that has a mean, sd, and the sample size? Use `dplyr::n()` to find the number of observations in the current group.
```{r}
lobster_summary <- lobster_parse_date %>%
group_by(site) %>%
summarize(
mean_size = mean(size_mm, na.rm = TRUE),
sd_size = sd(size_mm, na.rm = TRUE),
sample_n = n()
)
lobster_summary
```
So, to reiterate - three ways to do the same thing to get sample sizes:
```{r, eval = FALSE}
lobster_n <- lobster_parse_date %>%
group_by(site) %>%
summarize(
n = n()
) %>%
ungroup()
lobster_tally <- lobster_parse_date %>%
group_by(site) %>%
tally() %>%
ungroup()
lobster_count <- lobster_parse_date %>%
count(site)
```
Hooray!
More information on tally/count: see descriptions and examples from [tidyverse.org](https://dplyr.tidyverse.org/reference/tally.html).
#### 7. Confidence intervals, t-tests with `t.test()`
So we found the mean and sd of lobster sizes (see object *lobster_summary*). Let's say we want to find the confidence interval and/or standard error for lobsters at the different sites?
We can use the `t.test()` function to find confidence intervals. First, I'll isolate the IVEE lobster lengths and find a 95% confidence interval.
```{r}
ivee_lobsters <- lobster_tidy %>%
filter(site == "IVEE") %>%
pull(size_mm)
t.test(ivee_lobsters)
# What does this 95% CI of [72.99, 74.17] mm mean?
```
Now, what if we ask the question: Is there a significant difference in lobster sizes between Naples Reef and Carpinteria Reef?
We've already looked at the data (jitterplots, histograms, etc.) and it seems like a t-test would be an appropriate test. Ask here: why? Sample size, distribution, data type, etc....
OK, so we've decided that a t-test is appropriate and useful. How do we run a t-test in R?
##### Method 1: Isolate vectors of observations, then use `t.test()`
First, we could get the vectors of observations from the two groups we want to compare, then add them as arguments to `t.test()`. Like this:
```{r}
# Get a vector only containing size observations (mm) for lobsters at Naples Reef (site == "NAPL"):
napl_sample <- lobster_tidy %>%
filter(site == "NAPL") %>%
pull(size_mm)
# Get a vector only containing size observations (mm) for lobsters at Mohawk Reef (site == "MOHK"):
mohk_sample <- lobster_tidy %>%
filter(site == "MOHK") %>%
pull(size_mm)
```
Check out those two objects. Note that they are *vectors* only containing size observations for lobsters at the two sites.
We can add those as our first two arguments in `t.test()` to run a two-sample t-test:
```{r}
mohk_napl_ttest <- t.test(mohk_sample, napl_sample)
mohk_napl_ttest
```
Interpret the outcome. Now let's say we want to write a statement about the result (for now, in very "stats class" language, which we'll ditch quickly):
Mean lobster size differed significantly between Mohawk and Naples Reefs (t(`r round(mohk_napl_ttest$parameter, 2)`) = `r round(mohk_napl_ttest$statistic, 2)`, *p* < 0.001).
**Remember**: the more important things are the data itself, means comparison, effect size, etc. - which we'll also get to.
##### Method 2: Filter to only include 2 groups, then use `t.test(A ~ B)`
```{r}
# First, filter to only include 2 groups (MOHK & NAPL):
lobster_2sample <- lobster_tidy %>%
filter(site %in% c("NAPL", "MOHK"))
# Then:
ttest_2 <- t.test(size_mm ~ site, data = lobster_2sample)
```
And similarly use in-line referencing to add text directly from coded objects. Woo reproducibility!
#### 8. A new type of graph: `geom_tile()` for heatmaps
Take a look at `lobster_ysite` to remind yourself of the data frame. Let's make a geom_tile() plot, with cell colors dependent on the value in column 'n'.
```{r}
ggplot(data = lobster_ysite, aes(x = obs_year, y = site)) +
geom_tile(aes(fill = n)) +
scale_fill_gradientn(colors = c("black","purple","magenta")) +
scale_x_continuous(breaks = seq(2012, 2018), expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x = "Year",
y = "Site")
```
#### END LAB.