block010_dplyr-end-single-table.html

<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">

<head>

<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />


<title>dplyr functions for a single dataset</title>

<script src="libs/jquery-1.11.0/jquery.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link href="libs/bootstrap-2.3.2/css/united.min.css" rel="stylesheet" />
<link href="libs/bootstrap-2.3.2/css/bootstrap-responsive.min.css" rel="stylesheet" />
<script src="libs/bootstrap-2.3.2/js/bootstrap.min.js"></script>

<style type="text/css">code{white-space: pre;}</style>
<link rel="stylesheet"
      href="libs/highlight/default.css"
      type="text/css" />
<script src="libs/highlight/highlight.js"></script>
<style type="text/css">
  pre:not([class]) {
    background-color: white;
  }
</style>
<script type="text/javascript">
if (window.hljs && document.readyState && document.readyState === "complete") {
   window.setTimeout(function() {
      hljs.initHighlighting();
   }, 0);
}
</script>


<link rel="stylesheet" href="libs/local/nav.css" type="text/css" />

</head>

<body>

<style type = "text/css">
.main-container {
  max-width: 940px;
  margin-left: auto;
  margin-right: auto;
}
</style>
<div class="container-fluid main-container">

<header>
  <div class="nav">
    <a class="nav-logo" href="index.html">
      <img src="static/img/stat545-logo-s.png" width="70px" height="70px"/>
    </a>
    <ul>
      <li class="home"><a href="index.html">Home</a></li>
      <li class="faq"><a href="faq.html">FAQ</a></li>
      <li class="syllabus"><a href="syllabus.html">Syllabus</a></li>
      <li class="topics"><a href="topics.html">Topics</a></li>
      <li class="people"><a href="people.html">People</a></li>
    </ul>
  </div>
</header>

<div id="header">
<h1 class="title">dplyr functions for a single dataset</h1>
</div>

<div id="TOC">
<ul>
<li><a href="#where-were-we">Where were we?</a><ul>
<li><a href="#load-dplyr-and-the-gapminder-data">Load <code>dplyr</code> and the Gapminder data</a></li>
</ul></li>
<li><a href="#use-mutate-to-add-new-variables">Use <code>mutate()</code> to add new variables</a></li>
<li><a href="#use-arrange-to-row-order-data-in-a-principled-way">Use <code>arrange()</code> to row-order data in a principled way</a></li>
<li><a href="#use-rename-to-rename-variables">Use <code>rename()</code> to rename variables</a></li>
<li><a href="#group_by-is-a-mighty-weapon"><code>group_by()</code> is a mighty weapon</a><ul>
<li><a href="#counting-things-up">Counting things up</a></li>
<li><a href="#general-summarization">General summarization</a></li>
<li><a href="#window-functions">Window functions</a></li>
<li><a href="#grand-finale">Grand Finale</a></li>
</ul></li>
<li><a href="#resources">Resources</a></li>
</ul>
</div>

<div id="where-were-we" class="section level3">
<h3>Where were we?</h3>
<p>In the <a href="block009_dplyr-intro.html">introduction to <code>plyr</code></a>, we used two very important verbs and an operator:</p>
<ul>
<li><code>filter()</code> for subsetting data row-wise</li>
<li><code>select()</code> for subsetting data variable- or column-wise</li>
<li>the pipe operator <code>%&gt;%</code>, which feeds the LHS as the first argument to the expression on the RHS</li>
</ul>
<p>Here we explore other <code>dplyr</code> functions, especially more verbs, for working with a single dataset.</p>
<div id="load-dplyr-and-the-gapminder-data" class="section level4">
<h4>Load <code>dplyr</code> and the Gapminder data</h4>
<p>We use an excerpt of the Gapminder data and store it as a <code>tbl_df</code> object, basically an enhanced data.frame. I’ll use the pipe operator even here, to demonstrate its utility outside of <code>dplyr</code>.</p>
<pre class="r"><code>suppressPackageStartupMessages(library(dplyr))
gd_url &lt;- &quot;http://tiny.cc/gapminder&quot;
gtbl &lt;- gd_url %&gt;% read.delim %&gt;% tbl_df
gtbl %&gt;% glimpse</code></pre>
<pre><code>## Variables:
## $ country   (fctr) Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ year      (int) 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ pop       (dbl) 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ continent (fctr) Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ lifeExp   (dbl) 28.80, 30.33, 32.00, 34.02, 36.09, 38.44, 39.85, 40....
## $ gdpPercap (dbl) 779.4, 820.9, 853.1, 836.2, 740.0, 786.1, 978.0, 852...</code></pre>
</div>
</div>
<div id="use-mutate-to-add-new-variables" class="section level3">
<h3>Use <code>mutate()</code> to add new variables</h3>
<p>Imagine we wanted to recover each country’s GDP. After all, the Gapminder data has a variable for population and GDP per capita. Let’s multiply them together.</p>
<pre class="r"><code>gtbl &lt;- gtbl %&gt;%
  mutate(gdp = pop * gdpPercap)
gtbl %&gt;% glimpse</code></pre>
<pre><code>## Variables:
## $ country   (fctr) Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ year      (int) 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ pop       (dbl) 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ continent (fctr) Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ lifeExp   (dbl) 28.80, 30.33, 32.00, 34.02, 36.09, 38.44, 39.85, 40....
## $ gdpPercap (dbl) 779.4, 820.9, 853.1, 836.2, 740.0, 786.1, 978.0, 852...
## $ gdp       (dbl) 6.567e+09, 7.585e+09, 8.759e+09, 9.648e+09, 9.679e+0...</code></pre>
<p>Hmmmm … those GDP numbers are almost uselessly large and abstract. Consider the <a href="http://fivethirtyeight.com/datalab/xkcd-randall-munroe-qanda-what-if/">advice of Randall Munroe of xkcd</a>: “One thing that bothers me is large numbers presented without context… ‘If I added a zero to this number, would the sentence containing it mean something different to me?’ If the answer is ‘no,’ maybe the number has no business being in the sentence in the first place.” Maybe it would be more meaningful to consumers of my tables and figures if I reported GDP per capita, <em>relative to some benchmark country</em>. Since Canada is my adopted home, I’ll go with that.</p>
<pre class="r"><code>just_canada &lt;- gtbl %&gt;% filter(country == &quot;Canada&quot;)
gtbl &lt;- gtbl %&gt;%
  mutate(canada = just_canada$gdpPercap[match(year, just_canada$year)],
         gdpPercapRel = gdpPercap / canada)
gtbl %&gt;%
  select(country, year, gdpPercap, canada, gdpPercapRel)</code></pre>
<pre><code>## Source: local data frame [1,704 x 5]
## 
##        country year gdpPercap canada gdpPercapRel
## 1  Afghanistan 1952     779.4  11367      0.06857
## 2  Afghanistan 1957     820.9  12490      0.06572
## 3  Afghanistan 1962     853.1  13462      0.06337
## 4  Afghanistan 1967     836.2  16077      0.05201
## 5  Afghanistan 1972     740.0  18971      0.03901
## 6  Afghanistan 1977     786.1  22091      0.03559
## 7  Afghanistan 1982     978.0  22899      0.04271
## 8  Afghanistan 1987     852.4  26627      0.03201
## 9  Afghanistan 1992     649.3  26343      0.02465
## 10 Afghanistan 1997     635.3  28955      0.02194
## ..         ...  ...       ...    ...          ...</code></pre>
<pre class="r"><code>gtbl %&gt;%
  select(gdpPercapRel) %&gt;%
  summary</code></pre>
<pre><code>##   gdpPercapRel  
##  Min.   :0.007  
##  1st Qu.:0.062  
##  Median :0.172  
##  Mean   :0.327  
##  3rd Qu.:0.447  
##  Max.   :9.535</code></pre>
<p>Note that, <code>mutate()</code> builds new variables sequentially so you can reference earlier ones (like <code>canada</code>) when defining later ones (like <code>gdpPercapRel</code>). (I got a little off topic here using <code>match()</code> to do table look up, but <a href="http://www.rdocumentation.org/packages/base/functions/match">you can figure that out</a>.)</p>
<p>The relative GDP per capita numbers are, in general, well below 1. We see that most of the countries covered by this dataset have substantially lower GDP per capita, relative to Canada, across the entire time period.</p>
</div>
<div id="use-arrange-to-row-order-data-in-a-principled-way" class="section level3">
<h3>Use <code>arrange()</code> to row-order data in a principled way</h3>
<p>Imagine you wanted this data ordered by year then country, as opposed to by country then year.</p>
<pre class="r"><code>gtbl %&gt;%
  arrange(year, country)</code></pre>
<pre><code>## Source: local data frame [1,704 x 9]
## 
##        country year      pop continent lifeExp gdpPercap       gdp canada
## 1  Afghanistan 1952  8425333      Asia   28.80     779.4 6.567e+09  11367
## 2      Albania 1952  1282697    Europe   55.23    1601.1 2.054e+09  11367
## 3      Algeria 1952  9279525    Africa   43.08    2449.0 2.273e+10  11367
## 4       Angola 1952  4232095    Africa   30.02    3520.6 1.490e+10  11367
## 5    Argentina 1952 17876956  Americas   62.48    5911.3 1.057e+11  11367
## 6    Australia 1952  8691212   Oceania   69.12   10039.6 8.726e+10  11367
## 7      Austria 1952  6927772    Europe   66.80    6137.1 4.252e+10  11367
## 8      Bahrain 1952   120447      Asia   50.94    9867.1 1.188e+09  11367
## 9   Bangladesh 1952 46886859      Asia   37.48     684.2 3.208e+10  11367
## 10     Belgium 1952  8730405    Europe   68.00    8343.1 7.284e+10  11367
## ..         ...  ...      ...       ...     ...       ...       ...    ...
## Variables not shown: gdpPercapRel (dbl)</code></pre>
<p>Or maybe you want just the data from 2007, sorted on life expectancy?</p>
<pre class="r"><code>gtbl %&gt;%
  filter(year == 2007) %&gt;%
  arrange(lifeExp)</code></pre>
<pre><code>## Source: local data frame [142 x 9]
## 
##                     country year      pop continent lifeExp gdpPercap
## 1                 Swaziland 2007  1133066    Africa   39.61    4513.5
## 2                Mozambique 2007 19951656    Africa   42.08     823.7
## 3                    Zambia 2007 11746035    Africa   42.38    1271.2
## 4              Sierra Leone 2007  6144562    Africa   42.57     862.5
## 5                   Lesotho 2007  2012649    Africa   42.59    1569.3
## 6                    Angola 2007 12420476    Africa   42.73    4797.2
## 7                  Zimbabwe 2007 12311143    Africa   43.49     469.7
## 8               Afghanistan 2007 31889923      Asia   43.83     974.6
## 9  Central African Republic 2007  4369038    Africa   44.74     706.0
## 10                  Liberia 2007  3193942    Africa   45.68     414.5
## ..                      ...  ...      ...       ...     ...       ...
## Variables not shown: gdp (dbl), canada (dbl), gdpPercapRel (dbl)</code></pre>
<p>Oh, you’d like to sort on life expectancy in <strong>desc</strong>ending order? Then use <code>desc()</code>.</p>
<pre class="r"><code>gtbl %&gt;%
  filter(year == 2007) %&gt;%
  arrange(desc(lifeExp))</code></pre>
<pre><code>## Source: local data frame [142 x 9]
## 
##             country year       pop continent lifeExp gdpPercap       gdp
## 1             Japan 2007 127467972      Asia   82.60     31656 4.035e+12
## 2  Hong Kong, China 2007   6980412      Asia   82.21     39725 2.773e+11
## 3           Iceland 2007    301931    Europe   81.76     36181 1.092e+10
## 4       Switzerland 2007   7554661    Europe   81.70     37506 2.833e+11
## 5         Australia 2007  20434176   Oceania   81.23     34435 7.037e+11
## 6             Spain 2007  40448191    Europe   80.94     28821 1.166e+12
## 7            Sweden 2007   9031088    Europe   80.88     33860 3.058e+11
## 8            Israel 2007   6426679      Asia   80.75     25523 1.640e+11
## 9            France 2007  61083916    Europe   80.66     30470 1.861e+12
## 10           Canada 2007  33390141  Americas   80.65     36319 1.213e+12
## ..              ...  ...       ...       ...     ...       ...       ...
## Variables not shown: canada (dbl), gdpPercapRel (dbl)</code></pre>
<p>I advise that your analyses NEVER rely on rows or variables being in a specific order. But it’s still true that human beings write the code and the interactive development process can be much nicer if you reorder the rows of your data as you go along. Also, once you are preparing tables for human eyeballs, it is imperative that you step up and take control of row order.</p>
</div>
<div id="use-rename-to-rename-variables" class="section level3">
<h3>Use <code>rename()</code> to rename variables</h3>
<p><em>NOTE: I am using the development version of <code>dplyr</code> which will soon become the official release 0.3. If <code>rename()</code> does not work for you, try <code>rename_vars()</code>, which is what this function is called in version 0.2 on CRAN. You could also use <code>plyr::rename()</code>, but then you have to be careful to always load <code>plyr</code> before <code>dplyr</code>.</em></p>
<p>I am in the awkward life stage of switching from <a href="http://en.wikipedia.org/wiki/CamelCase"><code>camelCase</code></a> to <a href="http://en.wikipedia.org/wiki/Snake_case"><code>snake_case</code></a>, so I am vexed by the variable names I chose when I cleaned this data years ago. Let’s rename some variables!</p>
<pre class="r"><code>gtbl %&gt;%
  rename(life_exp = lifeExp, gdp_percap = gdpPercap,
         gdp_percap_rel = gdpPercapRel)</code></pre>
<pre><code>## Source: local data frame [1,704 x 9]
## 
##        country year      pop continent life_exp gdp_percap       gdp
## 1  Afghanistan 1952  8425333      Asia    28.80      779.4 6.567e+09
## 2  Afghanistan 1957  9240934      Asia    30.33      820.9 7.585e+09
## 3  Afghanistan 1962 10267083      Asia    32.00      853.1 8.759e+09
## 4  Afghanistan 1967 11537966      Asia    34.02      836.2 9.648e+09
## 5  Afghanistan 1972 13079460      Asia    36.09      740.0 9.679e+09
## 6  Afghanistan 1977 14880372      Asia    38.44      786.1 1.170e+10
## 7  Afghanistan 1982 12881816      Asia    39.85      978.0 1.260e+10
## 8  Afghanistan 1987 13867957      Asia    40.82      852.4 1.182e+10
## 9  Afghanistan 1992 16317921      Asia    41.67      649.3 1.060e+10
## 10 Afghanistan 1997 22227415      Asia    41.76      635.3 1.412e+10
## ..         ...  ...      ...       ...      ...        ...       ...
## Variables not shown: canada (dbl), gdp_percap_rel (dbl)</code></pre>
<p>I did NOT assign the post-rename object back to <code>gtbl</code> because that would make the chunks in this tutorial harder to copy/paste and run out of order. In real life, I would probably assign this back to <code>gtbl</code>, in a data preparation script, and proceed with the new variable names.</p>
</div>
<div id="group_by-is-a-mighty-weapon" class="section level3">
<h3><code>group_by()</code> is a mighty weapon</h3>
<p>I have found friends and family love to ask seemingly innocuous questions like, “which country experienced the sharpest 5-year drop in life expectancy?”. In fact, that is a totally natural question to ask. But if you are using a language that doesn’t know about data, it’s an incredibly annoying question to answer.</p>
<p><code>dplyr</code> offers powerful tools to solve this class of problem.</p>
<ul>
<li><code>group_by()</code> adds extra structure to your dataset – grouping information – which lays the groundwork for computations within the groups.</li>
<li><code>summarize()</code> takes a dataset with <span class="math">\(n\)</span> observations, computes requested summaries, and returns a dataset with 1 observation.</li>
<li>window functions take a dataset with <span class="math">\(n\)</span> observations and return a dataset with <span class="math">\(n\)</span> observations.</li>
</ul>
<p>Combined with the verbs you already know, these new tools allow you to solve an extremely diverse set of problems with relative ease.</p>
<div id="counting-things-up" class="section level4">
<h4>Counting things up</h4>
<p>Let’s start with simple counting. How many observations do we have per continent?</p>
<pre class="r"><code>gtbl %&gt;%
  group_by(continent) %&gt;%
  summarize(n_obs = n())</code></pre>
<pre><code>## Source: local data frame [5 x 2]
## 
##   continent n_obs
## 1    Africa   624
## 2  Americas   300
## 3      Asia   396
## 4    Europe   360
## 5   Oceania    24</code></pre>
<p>The <code>tally()</code> function is a convenience function for this sort of thing.</p>
<pre class="r"><code>gtbl %&gt;%
  group_by(continent) %&gt;%
  tally</code></pre>
<pre><code>## Source: local data frame [5 x 2]
## 
##   continent   n
## 1    Africa 624
## 2  Americas 300
## 3      Asia 396
## 4    Europe 360
## 5   Oceania  24</code></pre>
<p>What if we wanted to add the number of unique countries for each continent?</p>
<pre class="r"><code>gtbl %&gt;%
  group_by(continent) %&gt;%
  summarize(n_obs = n(), n_countries = n_distinct(country))</code></pre>
<pre><code>## Source: local data frame [5 x 3]
## 
##   continent n_obs n_countries
## 1    Africa   624          52
## 2  Americas   300          25
## 3      Asia   396          33
## 4    Europe   360          30
## 5   Oceania    24           2</code></pre>
</div>
<div id="general-summarization" class="section level4">
<h4>General summarization</h4>
<p>The functions you’ll apply within <code>summarize()</code> include classical statistical summaries, like <code>mean()</code>, <code>median()</code>, <code>sd()</code>, and <code>IQR</code>. Remember they are functions that take <span class="math">\(n\)</span> inputs and distill them down into 1 output.</p>
<p>Although this may be statistically ill-advised, let’s compute the average life expectancy by continent.</p>
<pre class="r"><code>gtbl %&gt;%
  group_by(continent) %&gt;%
  summarize(avg_lifeExp = mean(lifeExp))</code></pre>
<pre><code>## Source: local data frame [5 x 2]
## 
##   continent avg_lifeExp
## 1    Africa       48.87
## 2  Americas       64.66
## 3      Asia       60.06
## 4    Europe       71.90
## 5   Oceania       74.33</code></pre>
<p><code>summarize_each()</code> applies the same summary function(s) to multiple variables. Let’s compute average and median life expectancy and GDP per capita by continent by year … but only for 1952 and 2007.</p>
<p><em>NOTE: you won’t have <code>summarize_each()</code> if you’re using <code>dplyr</code> version 0.2. Just wait for it.</em></p>
<pre class="r"><code>gtbl %&gt;%
  filter(year %in% c(1952, 2007)) %&gt;%
  group_by(continent, year) %&gt;%
  summarise_each(funs(mean, median), lifeExp, gdpPercap)</code></pre>
<pre><code>## Source: local data frame [10 x 6]
## Groups: continent
## 
##    continent year lifeExp_mean gdpPercap_mean lifeExp_median
## 1     Africa 1952        39.14           1253          38.83
## 2     Africa 2007        54.81           3089          52.93
## 3   Americas 1952        53.28           4079          54.74
## 4   Americas 2007        73.61          11003          72.90
## 5       Asia 1952        46.31           5195          44.87
## 6       Asia 2007        70.73          12473          72.40
## 7     Europe 1952        64.41           5661          65.90
## 8     Europe 2007        77.65          25054          78.61
## 9    Oceania 1952        69.25          10298          69.25
## 10   Oceania 2007        80.72          29810          80.72
## Variables not shown: gdpPercap_median (dbl)</code></pre>
<p>Let’s focus just on Asia. What are the minimum and maximum life expectancies seen by year?</p>
<pre class="r"><code>gtbl %&gt;%
  filter(continent == &quot;Asia&quot;) %&gt;%
  group_by(year) %&gt;%
  summarize(min_lifeExp = min(lifeExp), max_lifeExp = max(lifeExp))</code></pre>
<pre><code>## Source: local data frame [12 x 3]
## 
##    year min_lifeExp max_lifeExp
## 1  1952       28.80       65.39
## 2  1957       30.33       67.84
## 3  1962       32.00       69.39
## 4  1967       34.02       71.43
## 5  1972       36.09       73.42
## 6  1977       31.22       75.38
## 7  1982       39.85       77.11
## 8  1987       40.82       78.67
## 9  1992       41.67       79.36
## 10 1997       41.76       80.69
## 11 2002       42.13       82.00
## 12 2007       43.83       82.60</code></pre>
<p>Of course it would be much more interesting to see <em>which</em> country contributed these extreme observations. Is the minimum (maximum) always coming from the same country? That’s where window functions come in.</p>
</div>
<div id="window-functions" class="section level4">
<h4>Window functions</h4>
<p>Recall that window functions take <span class="math">\(n\)</span> inputs and give back <span class="math">\(n\)</span> outputs. Here we use window functions based on ranks and offsets.</p>
<p>Let’s revisit the worst and best life expectancies in Asia over time, but retaining info about <em>which</em> country contributes these extreme values.</p>
<pre class="r"><code>gtbl %&gt;%
  filter(continent == &quot;Asia&quot;) %&gt;%
  select(year, country, lifeExp) %&gt;%
  arrange(year) %&gt;%
  group_by(year) %&gt;%
  filter(min_rank(desc(lifeExp)) &lt; 2 | min_rank(lifeExp) &lt; 2)</code></pre>
<pre><code>## Source: local data frame [24 x 3]
## Groups: year
## 
##    year     country lifeExp
## 1  1952 Afghanistan   28.80
## 2  1952      Israel   65.39
## 3  1957 Afghanistan   30.33
## 4  1957      Israel   67.84
## 5  1962 Afghanistan   32.00
## 6  1962      Israel   69.39
## 7  1967 Afghanistan   34.02
## 8  1967       Japan   71.43
## 9  1972 Afghanistan   36.09
## 10 1972       Japan   73.42
## 11 1977    Cambodia   31.22
## 12 1977       Japan   75.38
## 13 1982 Afghanistan   39.85
## 14 1982       Japan   77.11
## 15 1987 Afghanistan   40.82
## 16 1987       Japan   78.67
## 17 1992 Afghanistan   41.67
## 18 1992       Japan   79.36
## 19 1997 Afghanistan   41.76
## 20 1997       Japan   80.69
## 21 2002 Afghanistan   42.13
## 22 2002       Japan   82.00
## 23 2007 Afghanistan   43.83
## 24 2007       Japan   82.60</code></pre>
<p>We see that (min = Agfhanistan, max = Japan) is the most frequent result, but Cambodia and Israel pop up at least once each as the min or max, respectively. That table should make you impatient for our upcoming work on tidying and reshaping data! Wouldn’t it be nice to have one row per year?</p>
<p>How did that actually work? First, I store and view the result including everything but the last <code>filter()</code> statement. All of these operations are familiar.</p>
<pre class="r"><code>asia &lt;- gtbl %&gt;%
  filter(continent == &quot;Asia&quot;) %&gt;%
  select(year, country, lifeExp) %&gt;%
  arrange(year) %&gt;%
  group_by(year)
asia</code></pre>
<pre><code>## Source: local data frame [396 x 3]
## Groups: year
## 
##    year          country lifeExp
## 1  1952      Afghanistan   28.80
## 2  1952          Bahrain   50.94
## 3  1952       Bangladesh   37.48
## 4  1952         Cambodia   39.42
## 5  1952            China   44.00
## 6  1952 Hong Kong, China   60.96
## 7  1952            India   37.37
## 8  1952        Indonesia   37.47
## 9  1952             Iran   44.87
## 10 1952             Iraq   45.32
## ..  ...              ...     ...</code></pre>
<p>Now we apply a window function – <code>min_rank()</code>. Since <code>asia</code> is grouped by year, <code>min_rank()</code> operates within mini-datasets, each for a specific year. Applied to the variable <code>lifeExp</code>, <code>min_rank()</code> returns the rank of each country’s observed life expectancy. FYI, the <code>min</code> part just specifies how ties are broken. Here is an explicit peek at these within-year life expectancy ranks, in both the (default) ascending and descending order.</p>
<pre class="r"><code>asia %&gt;%
  mutate(le_rank = min_rank(lifeExp),
         le_desc_rank = min_rank(desc(lifeExp)))</code></pre>
<pre><code>## Source: local data frame [396 x 5]
## Groups: year
## 
##    year          country lifeExp le_rank le_desc_rank
## 1  1952      Afghanistan   28.80       1           33
## 2  1952          Bahrain   50.94      25            9
## 3  1952       Bangladesh   37.48       7           27
## 4  1952         Cambodia   39.42       9           25
## 5  1952            China   44.00      16           18
## 6  1952 Hong Kong, China   60.96      31            3
## 7  1952            India   37.37       5           29
## 8  1952        Indonesia   37.47       6           28
## 9  1952             Iran   44.87      17           17
## 10 1952             Iraq   45.32      18           16
## ..  ...              ...     ...     ...          ...</code></pre>
<p>You can understand the original <code>filter()</code> statement now:</p>
<pre class="r"><code>filter(min_rank(desc(lifeExp)) &lt; 2 | min_rank(lifeExp) &lt; 2)</code></pre>
<p>These two sets of ranks are formed, within year group, and <code>filter()</code> retains rows with rank less than 2, which means … the row with rank = 1. Since we do for ascending and descending ranks, we get both the min and the max.</p>
<p>If we had wanted just the min OR the max, an alternative approach using <code>top_n()</code> would have worked.</p>
<pre class="r"><code>gtbl %&gt;%
  filter(continent == &quot;Asia&quot;) %&gt;%
  select(year, country, lifeExp) %&gt;%
  arrange(year) %&gt;%
  group_by(year) %&gt;%
  #top_n(1)               ## gets the min
  top_n(1, desc(lifeExp)) ## gets the max</code></pre>
<pre><code>## Source: local data frame [12 x 3]
## Groups: year
## 
##    year     country lifeExp
## 1  1952 Afghanistan   28.80
## 2  1957 Afghanistan   30.33
## 3  1962 Afghanistan   32.00
## 4  1967 Afghanistan   34.02
## 5  1972 Afghanistan   36.09
## 6  1977    Cambodia   31.22
## 7  1982 Afghanistan   39.85
## 8  1987 Afghanistan   40.82
## 9  1992 Afghanistan   41.67
## 10 1997 Afghanistan   41.76
## 11 2002 Afghanistan   42.13
## 12 2007 Afghanistan   43.83</code></pre>
</div>
<div id="grand-finale" class="section level4">
<h4>Grand Finale</h4>
<p>So let’s answer that “simple” question: which country experienced the sharpest 5-year drop in life expectancy? Recall that this excerpt of the Gapminder data only has data every five years, e.g. for 1952, 1957, etc. So this really means looking at life expectancy changes between adjacent timepoints.</p>
<p>At this point, that’s just too easy, so let’s do it by continent while we’re at it.</p>
<pre class="r"><code>gtbl %&gt;%
  group_by(continent, country) %&gt;%
  select(country, year, continent, lifeExp) %&gt;%
  mutate(le_delta = lifeExp - lag(lifeExp)) %&gt;%
  summarize(worst_le_delta = min(le_delta, na.rm = TRUE)) %&gt;%
  filter(min_rank(worst_le_delta) &lt; 2) %&gt;%
  arrange(worst_le_delta)</code></pre>
<pre><code>## Source: local data frame [5 x 3]
## 
##   continent     country worst_le_delta
## 1    Africa      Rwanda        -20.421
## 2  Americas El Salvador         -1.511
## 3      Asia    Cambodia         -9.097
## 4    Europe  Montenegro         -1.464
## 5   Oceania   Australia          0.170</code></pre>
<p>Ponder that for a while. The subject matter and the code. Mostly you’re seeing what genocide looks like in dry statistics on average life expectancy.</p>
<p>Break the code into pieces, starting at the top, and inspect the intermediate results. That’s certainly how I was able to <em>write</em> such a thing. These commands do not <a href="http://tinyurl.com/athenaforehead">leap fully formed out of anyone’s forehead</a> – they are built up gradually, with lots of errors and refinements along the way. I’m not even sure it’s a great idea to do so much manipulation in one fell swoop. Is the statement above really hard for you to read? If yes, then by all means break it into pieces and make some intermediate objects. Your code should be easy to write and read when you’re done.</p>
<p>In later tutorials, we’ll explore more of <code>dplyr</code>, such as operations based on two datasets.</p>
</div>
</div>
<div id="resources" class="section level3">
<h3>Resources</h3>
<p><code>dplyr</code> official stuff</p>
<ul>
<li>package home <a href="http://cran.r-project.org/web/packages/dplyr/index.html">on CRAN</a>
<ul>
<li>note there are several vignettes, with the <a href="http://cran.r-project.org/web/packages/dplyr/vignettes/introduction.html">introduction</a> being the most relevant right now</li>
<li>the <a href="http://cran.rstudio.com/web/packages/dplyr/vignettes/window-functions.html">one on window functions</a> will also be interesting to you now</li>
</ul></li>
<li>development home <a href="https://github.com/hadley/dplyr">on GitHub</a></li>
<li><a href="https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a">tutorial HW delivered</a> (note this links to a DropBox folder) at useR! 2014 conference</li>
</ul>
<p>Blog post <a href="http://www.dataschool.io/dplyr-tutorial-for-faster-data-manipulation-in-r/">Hands-on dplyr tutorial for faster data manipulation in R</a> by Data School, that includes a link to an R Markdown document and links to videos</p>
<p><a href="bit001_dplyr-cheatsheet.html">Cheatsheet</a> I made for <code>dplyr</code> join functions (not relevant yet but soon)</p>
</div>

<div class="footer">
This work is licensed under the  <a href="http://creativecommons.org/licenses/by-nc/3.0/">CC BY-NC 3.0 Creative Commons License</a>.
</div>

</div>

<script>

// add bootstrap table styles to pandoc tables
$(document).ready(function () {
  $('tr.header').parent('thead').parent('table').addClass('table table-condensed');
});

</script>

<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

</body>
</html>