---
layout: page
title: Programming with R
subtitle: Analyzing multiple data sets
minutes: 30
---
```{r, include = FALSE}
source("tools/chunk-options.R")
opts_chunk$set(fig.path = "fig/03-loops-R-")
```
> ## Learning Objectives {.objectives}
>
> * Explain what a `for` loop does.
> * Correctly write `for` loops to repeat simple calculations.
> * Trace changes to a loop variable as the loop runs.
> * Trace changes to other variables as they are updated by a `for` loop.
> * Use a function to get a list of filenames that match a simple pattern.
> * Use a `for` loop to process multiple files.
We have created a function called `analyze` that creates graphs of the minimum, average, and maximum daily inflammation rates for a single data set:
```{r inflammation-01}
analyze <- function(filename) {
  # Plots the average, min, and max inflammation over time.
  # Input is character string of a csv file.
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
}
analyze("data/inflammation-01.csv")
```
We can use it to analyze other data sets one by one:
```{r inflammation-02}
analyze("data/inflammation-02.csv")
```
but we have a dozen data sets right now and more on the way.
We want to create plots for all our data sets with a single statement.
To do that, we'll have to teach the computer how to repeat things.
### For Loops
Suppose we want to print each word in a sentence.
One way is to use six `print` statements:
```{r}
best_practice <- c("Let", "the", "computer", "do", "the", "work")
print_words <- function(sentence) {
  print(sentence[1])
  print(sentence[2])
  print(sentence[3])
  print(sentence[4])
  print(sentence[5])
  print(sentence[6])
}
print_words(best_practice)
```
but that's a bad approach for two reasons:
1. It doesn't scale: if we want to print the elements of a vector that is hundreds of elements long, we'd be better off just typing them in.
2. It's fragile: if we give it a longer vector, it only prints part of the data, and if we give it a shorter input, it prints `NA` values because we're asking for elements that don't exist!
```{r}
best_practice[-6]
print_words(best_practice[-6])
```
> ## Tip {.callout}
>
> R has a special value, `NA`, for designating missing values that are
> **N**ot **A**vailable in a data set. See `?NA` and [An Introduction to R][na]
> for more details.
[na]: http://cran.r-project.org/doc/manuals/r-release/R-intro.html#Missing-values
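The `NA` values printed by `print_words(best_practice[-6])` come from indexing past the end of the vector. The short sketch below reproduces that behaviour on its own; the vector `short` and the index `idx` are made up for this illustration.

```{r}
short <- c("Let", "the", "computer")

# Asking for an element past the end of a vector does not raise an error;
# R returns NA instead.
short[5]

# is.na() tests for missing values element by element.
is.na(short[c(1, 5)])

# Comparing the index with the vector's length is one way to avoid the NA.
idx <- 5
if (idx <= length(short)) print(short[idx]) else print("index out of range")
```

Checking the index by hand works, but it is exactly the kind of bookkeeping a loop does for us.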
Here's a better approach:
```{r}
print_words <- function(sentence) {
  for (word in sentence) {
    print(word)
  }
}
print_words(best_practice)
```
This is shorter (certainly shorter than a function that needs one `print` statement for every word of a hundred-word sentence) and more robust as well:
```{r}
print_words(best_practice[-6])
```
The improved version of `print_words` uses a [for loop](reference.html#for-loop) to repeat an operation---in this case, printing---once for each thing in a collection.
The general form of a loop is:
```{r, eval=FALSE}
for (variable in collection) {
do things with variable
}
```
We can name the [loop variable](reference.html#loop-variable) anything we like (with a few [restrictions][], e.g. the name of the variable cannot start with a digit).
`in` is part of the `for` syntax.
Note that the body of the loop is enclosed in curly braces `{ }`.
For a single-line loop body, as here, the braces aren't needed, but it is good practice to include them as we did.
[restrictions]: http://cran.r-project.org/doc/manuals/R-intro.html#R-commands_003b-case-sensitivity-etc
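To see why including the braces is good practice, here is a minimal sketch of what can go wrong without them. The counters `n_in_loop` and `n_after` are hypothetical names used only for this illustration.

```{r}
best_practice <- c("Let", "the", "computer", "do", "the", "work")

n_in_loop <- 0
n_after <- 0
# Without braces, only the single statement that follows `for (...)`
# belongs to the loop body.
for (w in best_practice)
  n_in_loop <- n_in_loop + 1
  n_after <- n_after + 1   # indented, but NOT in the loop: runs once, afterwards

n_in_loop
n_after
```

Indentation has no meaning to R, so the second statement runs only once. With braces around both statements, the intent would be unambiguous.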
Here's another loop that repeatedly updates a variable:
```{r}
len <- 0
vowels <- c("a", "e", "i", "o", "u")
for (v in vowels) {
  len <- len + 1
}
# Number of vowels
len
```
It's worth tracing the execution of this little program step by step.
Since there are five elements in the vector `vowels`, the statement inside the loop will be executed five times.
The first time around, `len` is zero (the value assigned to it on line 1) and `v` is `"a"`.
The statement adds 1 to the old value of `len`, producing 1, and updates `len` to refer to that new value.
The next time around, `v` is `"e"` and `len` is 1, so `len` is updated to be 2.
After three more updates, `len` is 5; since there is nothing left in the vector `vowels` for R to process, the loop finishes.
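One way to follow this trace on screen is to report both variables inside the loop. This is only a sketch for illustration; the `cat` call is an addition and not part of the original program.

```{r}
len <- 0
vowels <- c("a", "e", "i", "o", "u")
for (v in vowels) {
  len <- len + 1
  # Report the loop variable and the running count after each update.
  cat("v =", v, " len =", len, "\n")
}
```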
Note that a loop variable is just a variable that's being used to record progress in a loop.
It still exists after the loop is over, and we can re-use variables previously defined as loop variables as well:
```{r}
letter <- "z"
for (letter in c("a", "b", "c")) {
  print(letter)
}
# after the loop, letter is
letter
```
Note also that finding the length of a vector is such a common operation that R actually has a built-in function to do it called `length`:
```{r}
length(vowels)
```
`length` is much faster than any R function we could write ourselves, and much easier to read than a two-line loop; it will also give us the length of many other things that we haven't met yet, so we should always use it when we can (see this [lesson](01-supp-data-structures.html) to learn more about the different ways to store data in R).
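As a brief preview of those "other things", the sketch below calls `length` on a few different objects. The list and the data frame `df` are invented for this example and are not part of the lesson data.

```{r}
vowels <- c("a", "e", "i", "o", "u")
length(vowels)                     # number of elements in a vector

length(list(1, "two", c(3, 4)))    # number of top-level items in a list

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
length(df)                         # for a data frame: the number of columns

# A single string is a vector of length one; nchar() counts its characters.
length("inflammation")
nchar("inflammation")
```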
> ## Challenge - Using loops {.challenge}
>
> 1. R has a built-in function called `seq` that creates a vector of numbers:
>
> ```{r}
> seq(3)
> ```
>
> Using `seq`, write a function that prints the first **N** natural numbers, one per line:
>
> ```{r, echo=-1}
> print_N <- function(N) {
>   nseq <- seq(N)
>   for (num in nseq) {
>     print(num)
>   }
> }
> print_N(3)
> ```
>
> 2. Write a function called `total` that calculates the sum of the values in a vector.
> (R has a built-in function called `sum` that does this for you.
> Please don't use it for this exercise.)
>
> ```{r, echo=-1}
> total <- function(vec) {
>   # calculates the sum of the values in a vector
>   vec_sum <- 0
>   for (num in vec) {
>     vec_sum <- vec_sum + num
>   }
>   return(vec_sum)
> }
> ex_vec <- c(4, 8, 15, 16, 23, 42)
> total(ex_vec)
> ```
>
> 3. Exponentiation is built into R:
>
> ```{r}
> 2^4
> ```
>
> Write a function called `expo` that uses a loop to calculate the same result.
> ```{r, echo=-1}
> expo <- function(base, power) {
>   result <- 1
>   # seq_len(power) is empty when power is 0, so expo(base, 0) returns 1
>   for (i in seq_len(power)) {
>     result <- result * base
>   }
>   return(result)
> }
> expo(2, 4)
> ```
>
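One caveat when looping over `seq(N)`: if `N` is zero, `seq` counts down from 1 to 0 instead of returning an empty vector, so the loop body still runs. The sketch below shows the difference, using a hypothetical helper `count_iterations` written only for this illustration.

```{r}
seq(3)       # 1 2 3
seq(0)       # counts down: 1 0
seq_len(0)   # an empty vector

count_iterations <- function(indices) {
  # Hypothetical helper: counts how many times a loop body runs.
  n <- 0
  for (i in indices) {
    n <- n + 1
  }
  n
}

count_iterations(seq(0))
count_iterations(seq_len(0))
```

When the number of repetitions might be zero, `seq_len(N)` is usually the safer choice. `seq_along(x)` plays the same role for the positions of a vector `x`.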
### Processing Multiple Files
We now have almost everything we need to process all our data files.
The only thing that's missing is a function that finds files whose names match a pattern.
We do not need to write it ourselves because R already has a function to do this called `list.files`.
If we run the function without any arguments, `list.files()`, it returns every file in the current working directory.
We can understand this result by reading the help file (`?list.files`).
The first argument, `path`, is the path to the directory to be searched, and it has the default value of `"."` (recall from the [lesson](http://swcarpentry.github.io/shell-novice/01-filedir.html) on the Unix Shell that `"."` is shorthand for the current working directory).
The second argument, `pattern`, is a regular expression that file names are matched against, and it has the default value of `NULL`.
Since no pattern is specified to filter the files, all files are returned.
So to list all the csv files, we could run either of the following:
```{r}
list.files(path = "data", pattern = "csv")
list.files(path = "data", pattern = "inflammation")
```
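Because `pattern` is treated as a regular expression, a bare `"csv"` matches that text anywhere in a file name. The sketch below shows the difference in a throwaway directory; the directory and the file names in it are invented for the example and are not part of the lesson data.

```{r}
# Work in a temporary directory so the example does not depend on the lesson data.
demo_dir <- tempfile("pattern-demo-")
dir.create(demo_dir)
invisible(file.create(file.path(
  demo_dir, c("inflammation-01.csv", "csv-notes.txt", "small-01.csv")
)))

# A bare "csv" matches anywhere in the name, so the .txt file is returned too.
loose <- list.files(path = demo_dir, pattern = "csv")
loose

# Escaping the dot and anchoring with $ keeps only names that end in ".csv".
strict <- list.files(path = demo_dir, pattern = "\\.csv$")
strict

# glob2rx() converts a shell-style wildcard into a regular expression.
globbed <- list.files(path = demo_dir, pattern = glob2rx("inflammation*.csv"))
globbed

unlink(demo_dir, recursive = TRUE)
```

For the lesson's data directory the simpler patterns are enough, but the stricter forms help once a directory holds a mix of file types.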
> ## Tip {.callout}
>
> For larger projects, it is recommended to organize separate parts of the
> analysis into multiple subdirectories, e.g. one subdirectory for the raw data,
> one for the code, and one for the results like figures. We have done that here
> to some extent, putting all of our data files into the subdirectory "data".
> For more advice on this topic, you can read [A quick guide to organizing
> computational biology projects][Noble2009] by William Stafford Noble.
[Noble2009]: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000424
As these examples show, the result of `list.files` is a vector of strings, which means we can loop over it to do something with each filename in turn.
In our case, the "something" we want is our `analyze` function.
Because we have put our data in a separate subdirectory, if we want to access these files
using the output of `list.files`, we also need to include the "path" portion of the file name.
We can do that by using the argument `full.names = TRUE`.
```{r}
list.files(path = "data", pattern = "csv", full.names = TRUE)
list.files(path = "data", pattern = "inflammation", full.names = TRUE)
```
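The following sketch illustrates why the path portion matters, again using a temporary directory with made-up file names rather than the lesson data.

```{r}
demo_dir <- tempfile("fullnames-demo-")
dir.create(demo_dir)
invisible(file.create(file.path(
  demo_dir, c("inflammation-01.csv", "inflammation-02.csv")
)))

short_names <- list.files(path = demo_dir, pattern = "inflammation")
full_names <- list.files(path = demo_dir, pattern = "inflammation",
                         full.names = TRUE)

# Bare names are looked up relative to the working directory, so they
# are only found if the working directory happens to hold such files.
exists_short <- file.exists(short_names)
exists_short

# Full names include the directory, so the files can be found from anywhere.
exists_full <- file.exists(full_names)
exists_full

# basename() strips the directory part again, e.g. for use in plot titles.
basename(full_names)

unlink(demo_dir, recursive = TRUE)
```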
Let's test out running our `analyze` function by using it on the first three files in the vector returned by `list.files`:
```{r loop-analyze, fig=FALSE}
filenames <- list.files(path = "data", pattern = "inflammation", full.names = TRUE)
filenames <- filenames[1:3]
for (f in filenames) {
  print(f)
  analyze(f)
}
```
Sure enough, the maxima of these data sets show exactly the same ramp as the first, and their minima show the same staircase structure.
> ## Tip {.callout}
>
> In this lesson we saw how to use a simple `for` loop to repeat an operation.
> As you progress with R, you will learn that there are multiple ways to
> accomplish this. Sometimes the choice of one method over another is more a
> matter of personal style, but other times it can have consequences for the
> speed of your code. For instruction on best practices, see this supplementary
> [lesson](03-supp-loops-in-depth.html) that demonstrates how to properly repeat
> operations in R.
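As a small taste of those alternatives, the sketch below squares each element of a vector in three ways. It is an illustration under simple assumptions (a short numeric vector), not a recommendation of one style over another.

```{r}
vec <- c(4, 8, 15, 16, 23, 42)

# for loop: create the result vector first, then fill it in by position
squares_loop <- numeric(length(vec))
for (i in seq_along(vec)) {
  squares_loop[i] <- vec[i]^2
}

# sapply: apply a function to each element and collect the results
squares_sapply <- sapply(vec, function(x) x^2)

# vectorized arithmetic: operate on the whole vector at once
squares_vec <- vec^2

squares_loop
```

All three produce the same values. The vectorized form is the shortest, while the loop makes each step explicit.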
> ## Challenge - Using loops to analyze multiple files {.challenge}
>
> 1. Write a function called `analyze_all` that takes a filename pattern as its sole argument and runs `analyze` for each file whose name matches the pattern.
```{r analyze_all, include=FALSE}
analyze_all <- function(pattern) {
  # Runs the function analyze for each file in the "data" subdirectory
  # whose name matches the given pattern.
  filenames <- list.files(path = "data", pattern = pattern, full.names = TRUE)
  for (f in filenames) {
    analyze(f)
  }
}
# analyze_all("csv")
```
> ## Key Points {.callout}
>
> * Use `for (variable in collection)` to process the elements of a collection one at a time.
> * The body of a `for` loop is surrounded by curly braces (`{ }`).
> * Use `length(thing)` to determine the length of something that contains other values.
> * Use `list.files(path = "path", pattern = "pattern", full.names = TRUE)` to create a vector of the names of files that match a pattern.
> ## Next Steps {.callout}
>
> We have now solved our original problem: we can analyze any number of data files with a single command.
> More importantly, we have met two of the most important ideas in programming:
>
> * Use functions to make code easier to re-use and easier to understand.
> * Use vectors and data frames to store related values, and loops to repeat operations on them.
>
> We have one more big idea to introduce...