forked from STAT545-UBC/STAT545-UBC-original-website
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathblock010_dplyr-end-single-table.html
530 lines (506 loc) · 29 KB
/
block010_dplyr-end-single-table.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />
<title>dplyr functions for a single dataset</title>
<script src="libs/jquery-1.11.0/jquery.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link href="libs/bootstrap-2.3.2/css/united.min.css" rel="stylesheet" />
<link href="libs/bootstrap-2.3.2/css/bootstrap-responsive.min.css" rel="stylesheet" />
<script src="libs/bootstrap-2.3.2/js/bootstrap.min.js"></script>
<style type="text/css">code{white-space: pre;}</style>
<link rel="stylesheet"
href="libs/highlight/default.css"
type="text/css" />
<script src="libs/highlight/highlight.js"></script>
<style type="text/css">
pre:not([class]) {
background-color: white;
}
</style>
<script type="text/javascript">
if (window.hljs && document.readyState && document.readyState === "complete") {
window.setTimeout(function() {
hljs.initHighlighting();
}, 0);
}
</script>
<link rel="stylesheet" href="libs/local/nav.css" type="text/css" />
</head>
<body>
<style type = "text/css">
.main-container {
max-width: 940px;
margin-left: auto;
margin-right: auto;
}
</style>
<div class="container-fluid main-container">
<header>
<div class="nav">
<a class="nav-logo" href="index.html">
<img src="static/img/stat545-logo-s.png" width="70px" height="70px"/>
</a>
<ul>
<li class="home"><a href="index.html">Home</a></li>
<li class="faq"><a href="faq.html">FAQ</a></li>
<li class="syllabus"><a href="syllabus.html">Syllabus</a></li>
<li class="topics"><a href="topics.html">Topics</a></li>
<li class="people"><a href="people.html">People</a></li>
</ul>
</div>
</header>
<div id="header">
<h1 class="title">dplyr functions for a single dataset</h1>
</div>
<div id="TOC">
<ul>
<li><a href="#where-were-we">Where were we?</a><ul>
<li><a href="#load-dplyr-and-the-gapminder-data">Load <code>dplyr</code> and the Gapminder data</a></li>
</ul></li>
<li><a href="#use-mutate-to-add-new-variables">Use <code>mutate()</code> to add new variables</a></li>
<li><a href="#use-arrange-to-row-order-data-in-a-principled-way">Use <code>arrange()</code> to row-order data in a principled way</a></li>
<li><a href="#use-rename-to-rename-variables">Use <code>rename()</code> to rename variables</a></li>
<li><a href="#group_by-is-a-mighty-weapon"><code>group_by()</code> is a mighty weapon</a><ul>
<li><a href="#counting-things-up">Counting things up</a></li>
<li><a href="#general-summarization">General summarization</a></li>
<li><a href="#window-functions">Window functions</a></li>
<li><a href="#grand-finale">Grand Finale</a></li>
</ul></li>
<li><a href="#resources">Resources</a></li>
</ul>
</div>
<div id="where-were-we" class="section level3">
<h3>Where were we?</h3>
<p>In the <a href="block009_dplyr-intro.html">introduction to <code>plyr</code></a>, we used two very important verbs and an operator:</p>
<ul>
<li><code>filter()</code> for subsetting data row-wise</li>
<li><code>select()</code> for subsetting data variable- or column-wise</li>
<li>the pipe operator <code>%>%</code>, which feeds the LHS as the first argument to the expression on the RHS</li>
</ul>
<p>Here we explore other <code>dplyr</code> functions, especially more verbs, for working with a single dataset.</p>
<div id="load-dplyr-and-the-gapminder-data" class="section level4">
<h4>Load <code>dplyr</code> and the Gapminder data</h4>
<p>We use an excerpt of the Gapminder data and store it as a <code>tbl_df</code> object, basically an enhanced data.frame. I’ll use the pipe operator even here, to demonstrate its utility outside of <code>dplyr</code>.</p>
<pre class="r"><code>suppressPackageStartupMessages(library(dplyr))
gd_url <- "http://tiny.cc/gapminder"
gtbl <- gd_url %>% read.delim %>% tbl_df
gtbl %>% glimpse</code></pre>
<pre><code>## Variables:
## $ country (fctr) Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ year (int) 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ pop (dbl) 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ continent (fctr) Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ lifeExp (dbl) 28.80, 30.33, 32.00, 34.02, 36.09, 38.44, 39.85, 40....
## $ gdpPercap (dbl) 779.4, 820.9, 853.1, 836.2, 740.0, 786.1, 978.0, 852...</code></pre>
</div>
</div>
<div id="use-mutate-to-add-new-variables" class="section level3">
<h3>Use <code>mutate()</code> to add new variables</h3>
<p>Imagine we wanted to recover each country’s GDP. After all, the Gapminder data has a variable for population and GDP per capita. Let’s multiply them together.</p>
<pre class="r"><code>gtbl <- gtbl %>%
mutate(gdp = pop * gdpPercap)
gtbl %>% glimpse</code></pre>
<pre><code>## Variables:
## $ country (fctr) Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ year (int) 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ pop (dbl) 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ continent (fctr) Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ lifeExp (dbl) 28.80, 30.33, 32.00, 34.02, 36.09, 38.44, 39.85, 40....
## $ gdpPercap (dbl) 779.4, 820.9, 853.1, 836.2, 740.0, 786.1, 978.0, 852...
## $ gdp (dbl) 6.567e+09, 7.585e+09, 8.759e+09, 9.648e+09, 9.679e+0...</code></pre>
<p>Hmmmm … those GDP numbers are almost uselessly large and abstract. Consider the <a href="http://fivethirtyeight.com/datalab/xkcd-randall-munroe-qanda-what-if/">advice of Randall Munroe of xkcd</a>: “One thing that bothers me is large numbers presented without context… ‘If I added a zero to this number, would the sentence containing it mean something different to me?’ If the answer is ‘no,’ maybe the number has no business being in the sentence in the first place.” Maybe it would be more meaningful to consumers of my tables and figures if I reported GDP per capita, <em>relative to some benchmark country</em>. Since Canada is my adopted home, I’ll go with that.</p>
<pre class="r"><code>just_canada <- gtbl %>% filter(country == "Canada")
gtbl <- gtbl %>%
mutate(canada = just_canada$gdpPercap[match(year, just_canada$year)],
gdpPercapRel = gdpPercap / canada)
gtbl %>%
select(country, year, gdpPercap, canada, gdpPercapRel)</code></pre>
<pre><code>## Source: local data frame [1,704 x 5]
##
## country year gdpPercap canada gdpPercapRel
## 1 Afghanistan 1952 779.4 11367 0.06857
## 2 Afghanistan 1957 820.9 12490 0.06572
## 3 Afghanistan 1962 853.1 13462 0.06337
## 4 Afghanistan 1967 836.2 16077 0.05201
## 5 Afghanistan 1972 740.0 18971 0.03901
## 6 Afghanistan 1977 786.1 22091 0.03559
## 7 Afghanistan 1982 978.0 22899 0.04271
## 8 Afghanistan 1987 852.4 26627 0.03201
## 9 Afghanistan 1992 649.3 26343 0.02465
## 10 Afghanistan 1997 635.3 28955 0.02194
## .. ... ... ... ... ...</code></pre>
<pre class="r"><code>gtbl %>%
select(gdpPercapRel) %>%
summary</code></pre>
<pre><code>## gdpPercapRel
## Min. :0.007
## 1st Qu.:0.062
## Median :0.172
## Mean :0.327
## 3rd Qu.:0.447
## Max. :9.535</code></pre>
<p>Note that, <code>mutate()</code> builds new variables sequentially so you can reference earlier ones (like <code>canada</code>) when defining later ones (like <code>gdpPercapRel</code>). (I got a little off topic here using <code>match()</code> to do table look up, but <a href="http://www.rdocumentation.org/packages/base/functions/match">you can figure that out</a>.)</p>
<p>The relative GDP per capita numbers are, in general, well below 1. We see that most of the countries covered by this dataset have substantially lower GDP per capita, relative to Canada, across the entire time period.</p>
</div>
<div id="use-arrange-to-row-order-data-in-a-principled-way" class="section level3">
<h3>Use <code>arrange()</code> to row-order data in a principled way</h3>
<p>Imagine you wanted this data ordered by year then country, as opposed to by country then year.</p>
<pre class="r"><code>gtbl %>%
arrange(year, country)</code></pre>
<pre><code>## Source: local data frame [1,704 x 9]
##
## country year pop continent lifeExp gdpPercap gdp canada
## 1 Afghanistan 1952 8425333 Asia 28.80 779.4 6.567e+09 11367
## 2 Albania 1952 1282697 Europe 55.23 1601.1 2.054e+09 11367
## 3 Algeria 1952 9279525 Africa 43.08 2449.0 2.273e+10 11367
## 4 Angola 1952 4232095 Africa 30.02 3520.6 1.490e+10 11367
## 5 Argentina 1952 17876956 Americas 62.48 5911.3 1.057e+11 11367
## 6 Australia 1952 8691212 Oceania 69.12 10039.6 8.726e+10 11367
## 7 Austria 1952 6927772 Europe 66.80 6137.1 4.252e+10 11367
## 8 Bahrain 1952 120447 Asia 50.94 9867.1 1.188e+09 11367
## 9 Bangladesh 1952 46886859 Asia 37.48 684.2 3.208e+10 11367
## 10 Belgium 1952 8730405 Europe 68.00 8343.1 7.284e+10 11367
## .. ... ... ... ... ... ... ... ...
## Variables not shown: gdpPercapRel (dbl)</code></pre>
<p>Or maybe you want just the data from 2007, sorted on life expectancy?</p>
<pre class="r"><code>gtbl %>%
filter(year == 2007) %>%
arrange(lifeExp)</code></pre>
<pre><code>## Source: local data frame [142 x 9]
##
## country year pop continent lifeExp gdpPercap
## 1 Swaziland 2007 1133066 Africa 39.61 4513.5
## 2 Mozambique 2007 19951656 Africa 42.08 823.7
## 3 Zambia 2007 11746035 Africa 42.38 1271.2
## 4 Sierra Leone 2007 6144562 Africa 42.57 862.5
## 5 Lesotho 2007 2012649 Africa 42.59 1569.3
## 6 Angola 2007 12420476 Africa 42.73 4797.2
## 7 Zimbabwe 2007 12311143 Africa 43.49 469.7
## 8 Afghanistan 2007 31889923 Asia 43.83 974.6
## 9 Central African Republic 2007 4369038 Africa 44.74 706.0
## 10 Liberia 2007 3193942 Africa 45.68 414.5
## .. ... ... ... ... ... ...
## Variables not shown: gdp (dbl), canada (dbl), gdpPercapRel (dbl)</code></pre>
<p>Oh, you’d like to sort on life expectancy in <strong>desc</strong>ending order? Then use <code>desc()</code>.</p>
<pre class="r"><code>gtbl %>%
filter(year == 2007) %>%
arrange(desc(lifeExp))</code></pre>
<pre><code>## Source: local data frame [142 x 9]
##
## country year pop continent lifeExp gdpPercap gdp
## 1 Japan 2007 127467972 Asia 82.60 31656 4.035e+12
## 2 Hong Kong, China 2007 6980412 Asia 82.21 39725 2.773e+11
## 3 Iceland 2007 301931 Europe 81.76 36181 1.092e+10
## 4 Switzerland 2007 7554661 Europe 81.70 37506 2.833e+11
## 5 Australia 2007 20434176 Oceania 81.23 34435 7.037e+11
## 6 Spain 2007 40448191 Europe 80.94 28821 1.166e+12
## 7 Sweden 2007 9031088 Europe 80.88 33860 3.058e+11
## 8 Israel 2007 6426679 Asia 80.75 25523 1.640e+11
## 9 France 2007 61083916 Europe 80.66 30470 1.861e+12
## 10 Canada 2007 33390141 Americas 80.65 36319 1.213e+12
## .. ... ... ... ... ... ... ...
## Variables not shown: canada (dbl), gdpPercapRel (dbl)</code></pre>
<p>I advise that your analyses NEVER rely on rows or variables being in a specific order. But it’s still true that human beings write the code and the interactive development process can be much nicer if you reorder the rows of your data as you go along. Also, once you are preparing tables for human eyeballs, it is imperative that you step up and take control of row order.</p>
</div>
<div id="use-rename-to-rename-variables" class="section level3">
<h3>Use <code>rename()</code> to rename variables</h3>
<p><em>NOTE: I am using the development version of <code>dplyr</code> which will soon become the official release 0.3. If <code>rename()</code> does not work for you, try <code>rename_vars()</code>, which is what this function is called in version 0.2 on CRAN. You could also use <code>plyr::rename()</code>, but then you have to be careful to always load <code>plyr</code> before <code>dplyr</code>.</em></p>
<p>I am in the awkward life stage of switching from <a href="http://en.wikipedia.org/wiki/CamelCase"><code>camelCase</code></a> to <a href="http://en.wikipedia.org/wiki/Snake_case"><code>snake_case</code></a>, so I am vexed by the variable names I chose when I cleaned this data years ago. Let’s rename some variables!</p>
<pre class="r"><code>gtbl %>%
rename(life_exp = lifeExp, gdp_percap = gdpPercap,
gdp_percap_rel = gdpPercapRel)</code></pre>
<pre><code>## Source: local data frame [1,704 x 9]
##
## country year pop continent life_exp gdp_percap gdp
## 1 Afghanistan 1952 8425333 Asia 28.80 779.4 6.567e+09
## 2 Afghanistan 1957 9240934 Asia 30.33 820.9 7.585e+09
## 3 Afghanistan 1962 10267083 Asia 32.00 853.1 8.759e+09
## 4 Afghanistan 1967 11537966 Asia 34.02 836.2 9.648e+09
## 5 Afghanistan 1972 13079460 Asia 36.09 740.0 9.679e+09
## 6 Afghanistan 1977 14880372 Asia 38.44 786.1 1.170e+10
## 7 Afghanistan 1982 12881816 Asia 39.85 978.0 1.260e+10
## 8 Afghanistan 1987 13867957 Asia 40.82 852.4 1.182e+10
## 9 Afghanistan 1992 16317921 Asia 41.67 649.3 1.060e+10
## 10 Afghanistan 1997 22227415 Asia 41.76 635.3 1.412e+10
## .. ... ... ... ... ... ... ...
## Variables not shown: canada (dbl), gdp_percap_rel (dbl)</code></pre>
<p>I did NOT assign the post-rename object back to <code>gtbl</code> because that would make the chunks in this tutorial harder to copy/paste and run out of order. In real life, I would probably assign this back to <code>gtbl</code>, in a data preparation script, and proceed with the new variable names.</p>
</div>
<div id="group_by-is-a-mighty-weapon" class="section level3">
<h3><code>group_by()</code> is a mighty weapon</h3>
<p>I have found friends and family love to ask seemingly innocuous questions like, “which country experienced the sharpest 5-year drop in life expectancy?”. In fact, that is a totally natural question to ask. But if you are using a language that doesn’t know about data, it’s an incredibly annoying question to answer.</p>
<p><code>dplyr</code> offers powerful tools to solve this class of problem.</p>
<ul>
<li><code>group_by()</code> adds extra structure to your dataset – grouping information – which lays the groundwork for computations within the groups.</li>
<li><code>summarize()</code> takes a dataset with <span class="math">\(n\)</span> observations, computes requested summaries, and returns a dataset with 1 observation.</li>
<li>window functions take a dataset with <span class="math">\(n\)</span> observations and return a dataset with <span class="math">\(n\)</span> observations.</li>
</ul>
<p>Combined with the verbs you already know, these new tools allow you to solve an extremely diverse set of problems with relative ease.</p>
<div id="counting-things-up" class="section level4">
<h4>Counting things up</h4>
<p>Let’s start with simple counting. How many observations do we have per continent?</p>
<pre class="r"><code>gtbl %>%
group_by(continent) %>%
summarize(n_obs = n())</code></pre>
<pre><code>## Source: local data frame [5 x 2]
##
## continent n_obs
## 1 Africa 624
## 2 Americas 300
## 3 Asia 396
## 4 Europe 360
## 5 Oceania 24</code></pre>
<p>The <code>tally()</code> function is a convenience function for this sort of thing.</p>
<pre class="r"><code>gtbl %>%
group_by(continent) %>%
tally</code></pre>
<pre><code>## Source: local data frame [5 x 2]
##
## continent n
## 1 Africa 624
## 2 Americas 300
## 3 Asia 396
## 4 Europe 360
## 5 Oceania 24</code></pre>
<p>What if we wanted to add the number of unique countries for each continent?</p>
<pre class="r"><code>gtbl %>%
group_by(continent) %>%
summarize(n_obs = n(), n_countries = n_distinct(country))</code></pre>
<pre><code>## Source: local data frame [5 x 3]
##
## continent n_obs n_countries
## 1 Africa 624 52
## 2 Americas 300 25
## 3 Asia 396 33
## 4 Europe 360 30
## 5 Oceania 24 2</code></pre>
</div>
<div id="general-summarization" class="section level4">
<h4>General summarization</h4>
<p>The functions you’ll apply within <code>summarize()</code> include classical statistical summaries, like <code>mean()</code>, <code>median()</code>, <code>sd()</code>, and <code>IQR</code>. Remember they are functions that take <span class="math">\(n\)</span> inputs and distill them down into 1 output.</p>
<p>Although this may be statistically ill-advised, let’s compute the average life expectancy by continent.</p>
<pre class="r"><code>gtbl %>%
group_by(continent) %>%
summarize(avg_lifeExp = mean(lifeExp))</code></pre>
<pre><code>## Source: local data frame [5 x 2]
##
## continent avg_lifeExp
## 1 Africa 48.87
## 2 Americas 64.66
## 3 Asia 60.06
## 4 Europe 71.90
## 5 Oceania 74.33</code></pre>
<p><code>summarize_each()</code> applies the same summary function(s) to multiple variables. Let’s compute average and median life expectancy and GDP per capita by continent by year … but only for 1952 and 2007.</p>
<p><em>NOTE: you won’t have <code>summarize_each()</code> if you’re using <code>dplyr</code> version 0.2. Just wait for it.</em></p>
<pre class="r"><code>gtbl %>%
filter(year %in% c(1952, 2007)) %>%
group_by(continent, year) %>%
summarise_each(funs(mean, median), lifeExp, gdpPercap)</code></pre>
<pre><code>## Source: local data frame [10 x 6]
## Groups: continent
##
## continent year lifeExp_mean gdpPercap_mean lifeExp_median
## 1 Africa 1952 39.14 1253 38.83
## 2 Africa 2007 54.81 3089 52.93
## 3 Americas 1952 53.28 4079 54.74
## 4 Americas 2007 73.61 11003 72.90
## 5 Asia 1952 46.31 5195 44.87
## 6 Asia 2007 70.73 12473 72.40
## 7 Europe 1952 64.41 5661 65.90
## 8 Europe 2007 77.65 25054 78.61
## 9 Oceania 1952 69.25 10298 69.25
## 10 Oceania 2007 80.72 29810 80.72
## Variables not shown: gdpPercap_median (dbl)</code></pre>
<p>Let’s focus just on Asia. What are the minimum and maximum life expectancies seen by year?</p>
<pre class="r"><code>gtbl %>%
filter(continent == "Asia") %>%
group_by(year) %>%
summarize(min_lifeExp = min(lifeExp), max_lifeExp = max(lifeExp))</code></pre>
<pre><code>## Source: local data frame [12 x 3]
##
## year min_lifeExp max_lifeExp
## 1 1952 28.80 65.39
## 2 1957 30.33 67.84
## 3 1962 32.00 69.39
## 4 1967 34.02 71.43
## 5 1972 36.09 73.42
## 6 1977 31.22 75.38
## 7 1982 39.85 77.11
## 8 1987 40.82 78.67
## 9 1992 41.67 79.36
## 10 1997 41.76 80.69
## 11 2002 42.13 82.00
## 12 2007 43.83 82.60</code></pre>
<p>Of course it would be much more interesting to see <em>which</em> country contributed these extreme observations. Is the minimum (maximum) always coming from the same country? That’s where window functions come in.</p>
</div>
<div id="window-functions" class="section level4">
<h4>Window functions</h4>
<p>Recall that window functions take <span class="math">\(n\)</span> inputs and give back <span class="math">\(n\)</span> outputs. Here we use window functions based on ranks and offsets.</p>
<p>Let’s revisit the worst and best life expectancies in Asia over time, but retaining info about <em>which</em> country contributes these extreme values.</p>
<pre class="r"><code>gtbl %>%
filter(continent == "Asia") %>%
select(year, country, lifeExp) %>%
arrange(year) %>%
group_by(year) %>%
filter(min_rank(desc(lifeExp)) < 2 | min_rank(lifeExp) < 2)</code></pre>
<pre><code>## Source: local data frame [24 x 3]
## Groups: year
##
## year country lifeExp
## 1 1952 Afghanistan 28.80
## 2 1952 Israel 65.39
## 3 1957 Afghanistan 30.33
## 4 1957 Israel 67.84
## 5 1962 Afghanistan 32.00
## 6 1962 Israel 69.39
## 7 1967 Afghanistan 34.02
## 8 1967 Japan 71.43
## 9 1972 Afghanistan 36.09
## 10 1972 Japan 73.42
## 11 1977 Cambodia 31.22
## 12 1977 Japan 75.38
## 13 1982 Afghanistan 39.85
## 14 1982 Japan 77.11
## 15 1987 Afghanistan 40.82
## 16 1987 Japan 78.67
## 17 1992 Afghanistan 41.67
## 18 1992 Japan 79.36
## 19 1997 Afghanistan 41.76
## 20 1997 Japan 80.69
## 21 2002 Afghanistan 42.13
## 22 2002 Japan 82.00
## 23 2007 Afghanistan 43.83
## 24 2007 Japan 82.60</code></pre>
<p>We see that (min = Agfhanistan, max = Japan) is the most frequent result, but Cambodia and Israel pop up at least once each as the min or max, respectively. That table should make you impatient for our upcoming work on tidying and reshaping data! Wouldn’t it be nice to have one row per year?</p>
<p>How did that actually work? First, I store and view the result including everything but the last <code>filter()</code> statement. All of these operations are familiar.</p>
<pre class="r"><code>asia <- gtbl %>%
filter(continent == "Asia") %>%
select(year, country, lifeExp) %>%
arrange(year) %>%
group_by(year)
asia</code></pre>
<pre><code>## Source: local data frame [396 x 3]
## Groups: year
##
## year country lifeExp
## 1 1952 Afghanistan 28.80
## 2 1952 Bahrain 50.94
## 3 1952 Bangladesh 37.48
## 4 1952 Cambodia 39.42
## 5 1952 China 44.00
## 6 1952 Hong Kong, China 60.96
## 7 1952 India 37.37
## 8 1952 Indonesia 37.47
## 9 1952 Iran 44.87
## 10 1952 Iraq 45.32
## .. ... ... ...</code></pre>
<p>Now we apply a window function – <code>min_rank()</code>. Since <code>asia</code> is grouped by year, <code>min_rank()</code> operates within mini-datasets, each for a specific year. Applied to the variable <code>lifeExp</code>, <code>min_rank()</code> returns the rank of each country’s observed life expectancy. FYI, the <code>min</code> part just specifies how ties are broken. Here is an explicit peek at these within-year life expectancy ranks, in both the (default) ascending and descending order.</p>
<pre class="r"><code>asia %>%
mutate(le_rank = min_rank(lifeExp),
le_desc_rank = min_rank(desc(lifeExp)))</code></pre>
<pre><code>## Source: local data frame [396 x 5]
## Groups: year
##
## year country lifeExp le_rank le_desc_rank
## 1 1952 Afghanistan 28.80 1 33
## 2 1952 Bahrain 50.94 25 9
## 3 1952 Bangladesh 37.48 7 27
## 4 1952 Cambodia 39.42 9 25
## 5 1952 China 44.00 16 18
## 6 1952 Hong Kong, China 60.96 31 3
## 7 1952 India 37.37 5 29
## 8 1952 Indonesia 37.47 6 28
## 9 1952 Iran 44.87 17 17
## 10 1952 Iraq 45.32 18 16
## .. ... ... ... ... ...</code></pre>
<p>You can understand the original <code>filter()</code> statement now:</p>
<pre class="r"><code>filter(min_rank(desc(lifeExp)) < 2 | min_rank(lifeExp) < 2)</code></pre>
<p>These two sets of ranks are formed, within year group, and <code>filter()</code> retains rows with rank less than 2, which means … the row with rank = 1. Since we do for ascending and descending ranks, we get both the min and the max.</p>
<p>If we had wanted just the min OR the max, an alternative approach using <code>top_n()</code> would have worked.</p>
<pre class="r"><code>gtbl %>%
filter(continent == "Asia") %>%
select(year, country, lifeExp) %>%
arrange(year) %>%
group_by(year) %>%
#top_n(1) ## gets the min
top_n(1, desc(lifeExp)) ## gets the max</code></pre>
<pre><code>## Source: local data frame [12 x 3]
## Groups: year
##
## year country lifeExp
## 1 1952 Afghanistan 28.80
## 2 1957 Afghanistan 30.33
## 3 1962 Afghanistan 32.00
## 4 1967 Afghanistan 34.02
## 5 1972 Afghanistan 36.09
## 6 1977 Cambodia 31.22
## 7 1982 Afghanistan 39.85
## 8 1987 Afghanistan 40.82
## 9 1992 Afghanistan 41.67
## 10 1997 Afghanistan 41.76
## 11 2002 Afghanistan 42.13
## 12 2007 Afghanistan 43.83</code></pre>
</div>
<div id="grand-finale" class="section level4">
<h4>Grand Finale</h4>
<p>So let’s answer that “simple” question: which country experienced the sharpest 5-year drop in life expectancy? Recall that this excerpt of the Gapminder data only has data every five years, e.g. for 1952, 1957, etc. So this really means looking at life expectancy changes between adjacent timepoints.</p>
<p>At this point, that’s just too easy, so let’s do it by continent while we’re at it.</p>
<pre class="r"><code>gtbl %>%
group_by(continent, country) %>%
select(country, year, continent, lifeExp) %>%
mutate(le_delta = lifeExp - lag(lifeExp)) %>%
summarize(worst_le_delta = min(le_delta, na.rm = TRUE)) %>%
filter(min_rank(worst_le_delta) < 2) %>%
arrange(worst_le_delta)</code></pre>
<pre><code>## Source: local data frame [5 x 3]
##
## continent country worst_le_delta
## 1 Africa Rwanda -20.421
## 2 Americas El Salvador -1.511
## 3 Asia Cambodia -9.097
## 4 Europe Montenegro -1.464
## 5 Oceania Australia 0.170</code></pre>
<p>Ponder that for a while. The subject matter and the code. Mostly you’re seeing what genocide looks like in dry statistics on average life expectancy.</p>
<p>Break the code into pieces, starting at the top, and inspect the intermediate results. That’s certainly how I was able to <em>write</em> such a thing. These commands do not <a href="http://tinyurl.com/athenaforehead">leap fully formed out of anyone’s forehead</a> – they are built up gradually, with lots of errors and refinements along the way. I’m not even sure it’s a great idea to do so much manipulation in one fell swoop. Is the statement above really hard for you to read? If yes, then by all means break it into pieces and make some intermediate objects. Your code should be easy to write and read when you’re done.</p>
<p>In later tutorials, we’ll explore more of <code>dplyr</code>, such as operations based on two datasets.</p>
</div>
</div>
<div id="resources" class="section level3">
<h3>Resources</h3>
<p><code>dplyr</code> official stuff</p>
<ul>
<li>package home <a href="http://cran.r-project.org/web/packages/dplyr/index.html">on CRAN</a>
<ul>
<li>note there are several vignettes, with the <a href="http://cran.r-project.org/web/packages/dplyr/vignettes/introduction.html">introduction</a> being the most relevant right now</li>
<li>the <a href="http://cran.rstudio.com/web/packages/dplyr/vignettes/window-functions.html">one on window functions</a> will also be interesting to you now</li>
</ul></li>
<li>development home <a href="https://github.com/hadley/dplyr">on GitHub</a></li>
<li><a href="https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a">tutorial HW delivered</a> (note this links to a DropBox folder) at useR! 2014 conference</li>
</ul>
<p>Blog post <a href="http://www.dataschool.io/dplyr-tutorial-for-faster-data-manipulation-in-r/">Hands-on dplyr tutorial for faster data manipulation in R</a> by Data School, that includes a link to an R Markdown document and links to videos</p>
<p><a href="bit001_dplyr-cheatsheet.html">Cheatsheet</a> I made for <code>dplyr</code> join functions (not relevant yet but soon)</p>
</div>
<div class="footer">
This work is licensed under the <a href="http://creativecommons.org/licenses/by-nc/3.0/">CC BY-NC 3.0 Creative Commons License</a>.
</div>
</div>
<script>
// add bootstrap table styles to pandoc tables
$(document).ready(function () {
$('tr.header').parent('thead').parent('table').addClass('table table-condensed');
});
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
</body>
</html>