-
Notifications
You must be signed in to change notification settings - Fork 21
/
Copy pathHadleyverse.Rmd
129 lines (99 loc) · 5.31 KB
/
Hadleyverse.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
---
title: "Introduction to the Hadleyverse"
author: "Douglas Bates"
date: "September 24, 2015"
output:
ioslides_presentation:
fig_caption: yes
fig_retina: null
keep_md: yes
smaller: yes
widescreen: yes
---
```{r preliminaries,echo=FALSE,results='hide',cache=FALSE}
suppressPackageStartupMessages(library(readr))
suppressPackageStartupMessages(library(dplyr))
options(width=112)
```
## RStudio and Hadley Wickham
- The biggest changes in the __R__ environment over the last 10 years have come from [Rstudio](http://rstudio.com)
and its chief scientist [Hadley Wickham](https://github.com/hadley)
- Hadley did his Ph.D. at Iowa State, during which he developed the [ggplot2](https://github.com/hadley/ggplot2) package implementing the "Grammar of Graphics"
- He continues to be a driving force in innovation for the __R__ environment, to the extent that his collection of packages is known as the _Hadleyverse_
- He was also an early adopter of [github](https://github.com) and through its integration with RStudio encouraged its use with __R__
- The packages he has written are characterized by composability
* [readr](http://github.com/hadley/readr) - replacements for data-reading functions suitable for use with
* [dplyr](https://github.com/hadley/dplyr) - data manipulation, grouping and summaries
* [tidyr](https://github.com/hadley/tidyr) - transfer from `wide` to `long` format and vice-versa
* [rvest](https://github.com/hadley/rvest) - web-scraping in R
* [devtools](https://github.com/hadley/devtools) - utilities for building and maintaining __R__ packages
## [readr](http://github.com/hadley/readr) and [dplyr](https://github.com/hadley/dplyr)
Earlier we read the `classroom` data using
```{r classroom,eval=FALSE}
str(classroom <- read.csv("http://www-personal.umich.edu/~bwest/classroom.csv"))
```
The equivalent expression using [readr](http://github.com/hadley/readr) is
```{r classroomreadr}
(classroom <- read_csv("http://www-personal.umich.edu/~bwest/classroom.csv"))
```
## Verbs and composition in dplyr
- The philosophy of `dplyr` is to use a number of "verbs" to express operations on the data and chain calls together using "pipes"
- The pipe operator is written `%>%`. It takes the output of the operand on the left and uses it as the first argument to the operand on the right. There is a keyboard shortcut (ctrl-shift-M) in RStudio
- Various verbs applied to a single frame are:
* `select` - select a subset of the columns
* `filter` - select a subset of the rows
* `arrange` - re-order the rows
* `mutate` - add new variables
* `sunnarise` - reduce a group to a smaller number of observations
* `group_by` - create the groups
## `mutate`
- The equivalent of the `with` and `within` functions in base R is `mutate`
- I prefer to have categorical data expressed as factors. Some disagree, sometimes vehemently, because of the retaining unused levels behavior. They would rather express categorical data as strings. In either case, expressing categorical data as integers is asking for trouble.
- to convert the categorical data we can use
```{r mutate}
(clf <- mutate(classroom,sex = factor(sex,labels=c('F','M')), minority = factor(minority,labels=c('N','Y')),
classid = factor(classid), schoolid = factor(schoolid), childid = NULL))
```
## Alternative
```{r mutate2}
(clc <- mutate(classroom, sex = ifelse(sex,'M','F'), minority = ifelse(minority,'Y','N'), childid = NULL))
```
- Neither of these alter the `classroom` table.
## What's the difference?
The difference between a factor and a character or integer variable is most important when subsetting, which is called `filter` in `dplyr`.
```{r filter}
girlsf <- filter(clf, sex == 'F')
girlsc <- filter(clc, sex == 'F')
xtabs(~sex, girlsf)
xtabs(~sex, girlsc)
```
- the factor retains information about the original levels. The character variable doesn't.
## `select` and `filter`
- The base `R` function `subset` allows for subsetting both rows and columns. These operations are distinct in `dplyr`; `select` for rows and `filter` for columns
```{r}
classroom %>%
select(schoolid,housepov) %>%
distinct()
```
## split-apply-combine
- A common idiom in data exploration is __split-apply-combine__ in which the data are split according to the levels of a variable, an operation is applied and the result is expressed in a new summary.
- The `group_by` verb is valuable here
```{r sac}
sclcl <- classroom %>% group_by(schoolid) %>% select(classid) %>% unique()
xtabs(~schoolid,sclcl)
tabulate(xtabs(~schoolid,sclcl))
```
## more split-apply-combine
```{r sac1}
classroom %>% group_by(classid) %>% summarise(mngain = mean(mathgain))
```
## "Data munging"
- Experienced consultants often find that a substantial part of a project is taken up with "data munging" - getting the data into a usable form.
- Extracting data from spreadsheets is often very difficult
- A good place to start exploring data is in the data sets for our Master's exam, in `/afs/cs.wisc.edu/p/stat/Data/MS.exam/`
- Try, for example, initial exploration of the data in `f14`.
- I was unable to read the `LakeSuperior.txt` file with `read_tsv`. I could use `read.delim` with `header=TRUE`
- The results from `read_tsv` on `organoid.txt` look peculiar.
```{r organoid}
(org <- read_tsv("/afs/cs.wisc.edu/p/stat/Data/MS.exam/f14/organoid.txt",skip=1))
```