-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.Rmd
487 lines (279 loc) · 19.1 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
---
title: "R for institutional research"
author: "Jordan Prendez"
date: "`r format(Sys.Date())`"
output:
github_document:
toc: true
toc_depth: 2
---
## Introduction to R for institutional Research.
R is a programming language that is the Lingua franca of statistics. One of the most cited benefits of R is that it can produce publication quality graphics that can highly modified. The base R language includes common statistical procedures, however, R's wide spread use is undoutably related to the the ever growing set of user created statistical packages. Finally, another *huge* benefit is **R is availible for free**. This is especially important in the era of decreasing support and tighter budgets in higher education.
While R was not made specifically for institutional research it can perform all of the same types of functions and manipulations you might be accustomed to in other statistical software packages. However, R's learning curve can be somewhat steep and this short presentation will help those not acquanted with R to understand how to install, import data, and perform basic manipulations and other procedures that are common in institutional research, including a few examples of R's graphing capability.
## Installation of R and Rstudio
R is maintained by the R foundation and can be downloaded from the following website.
R is availible to Windows, OS X and and Linux. Click on the appropriate link and follow the instructions to install R. This guide is written for the windows version of R. If you have already installed R it is important to make sure you have the most updated version (some parts of this tutorial may not work with older versions of R.)
https://cran.r-project.org/
<img style="float: right" src="https://www.r-project.org/logo/Rlogo.svg" width="200" height="200"/>
Everything we do from this point onwards can be accomplished in base R (what we have installed so far) but I prefer and would recommend using the Rstudio environment. Rstudio is a user interface that keeps R organized and automates some of the written code into a point in click interface.
###Download the R studio interface from here
https://www.rstudio.com/products/rstudio/download2/
<!-- <img style="float: left" src="https://www.rstudio.com/wp-content/uploads/2014/07/RStudio-Logo-All-Gray.png" width="200" height="200" /> -->
Once this is installed you are ready to start using R!
This presentation can be opened in R and the code executed directly from the R scripting window. Rstudio can open traditional R scripts .R, in addition to Rmarkdown files .Rmd (this presentation) and other more interactive file types.
####Rstudio
The Rstudio environment consists of four different panes of which
* The script window and
* Console window
are usually visible as the left two quadrants. The two panels on the right side change depending on need. Most of your work should be typed into the scripting window (where it can be saved) but also can be run live from the console.
###Additional pre-reqs: You must download the following files listed at the top of this github page if you wish to run these analysis yourself.
1. **README.Rmd**
2. **Both files located in the Prerequisite files**
*Once you have downloaded these files open the README.Rmd and follow along. You will need to rename the file paths to point to the two prerequisite files at the appropriate time.*
For those who are transisitioning from a point in click interface there are several graphical user interfaces (GUI) availible for R that might be useful for someone transitioning.
###Download and Install R commander
```{r eval=FALSE, echo=TRUE}
##We won't be focusing on R commander. For those interested here are the necessary lines of code to install and run the program.
install.packages("Rcmdr") ##Run this to install R commander. (Only needs to be run once). ##Evaluate this, or any line in R, by highlighting it then using "Ctrl+R".
library(Rcmdr) #Run this line
Rcommander() #and execute this line to start R commander from the R console (or script window) each time you want to start R commander.
```
For more info about Rcmdr see the appropriate footnotes at the bottom of this page. [^1] [^2]
[^1]: This Rcommander [intro](https://cran.r-project.org/doc/contrib/Karp-Rcommander-intro2.pdf) is a couple of years old and is less technical.
[^2]: For those wanting a more in depth and technical guide see the official cran [documentation](https://cran.r-project.org/web/packages/Rcmdr/Rcmdr.pdf) on Rcommander
##Basic R
Practice executing code: R as a calculator. For this example, practice executing the following commands in the R console.
```{r eval=FALSE, echo=TRUE}
##Run these examples yourself. Try your own values.
10+12 ##Evaluate this, or any line in R, by highlighting it then using "Ctrl+R"
abs(-10-500)
sqrt(pi+2)
```
Executing the commands above, and others, will help give you practice in the Rstudio environment and demonstrate basic R functionality.
## R for Institutional Research
The goal of this section is to provide concrete examples of tasks that are useful to the institutional researcher.
###Importing data
To import a dataset click on the import dataset button in the "Environment" tab (by default this can be found in the top right quadrant of Rstudio). Select a file to import, select the appropriate options, name the dataset and press "import" to import the dataset into R.
*Remember the file paths shown throughout this document are examples and must be changed to the actual location on your local machine.*
```{r eval=FALSE, echo=TRUE}
##We can also use the command line to perform the same function as the import dataset button in Rstudio.
original_dataset_name <- read.csv("C:/Users/jprendez/Desktop/c2015_a.csv")
```
###Basic R functions to memorize
There are a few basic functions that are very important to understand in R.
1. Assigning variables
2. How to view data
3. Basic subsetting
4. Exporting data from R.
<!-- 5. Joining a dataset [coming soon] -->
####Variable assignment
Assigning a variable
```{r eval=FALSE, echo=TRUE}
new_dataset_name <- original_dataset_name # we can rename any variable or dataset in R to any other name using the assignment arrow " <- ". It is not a good idea to name your datasets, however, the same names as R functions (i.e. do not name your dataset mean, or lm, etc.)
```
####Viewing data
To view a dataset use the "view" command.
```{r eval=FALSE, echo=TRUE}
##We can view our dataset using the "view" command.
View(new_dataset_name)
view(new_dataset_name) #Note, that R is case sensitive. So a lowercase "view" will not work.
```
####Basic subsetting
The basic subsetting function in R is defined using the following form
*new_dataset_name[row,column]*
```{r eval=FALSE, echo=TRUE}
new_dataset_name[3,2] ##this gives us the value corresponding to the third row, and second column of our dataset.
x <- new_dataset_name[3,2] ##we can assign this single value to a variable or..
y <- new_dataset_name[,2] ##by not defining the row, assign an entire column to a variable (or vice versa)
```
####Exporting a file
```{r eval=FALSE, echo=TRUE}
write.table(x=new_dataset_name, file="C:/Users/jprendez/Desktop/name_of_exported_table.txt", sep=",")
#Notice the first argument is the name of the variable in R, the second is the path we want to write the file to, and lastly, the type of delimeter.
```
In order to learn more about the write.table function or the View function (or any function), execute the function name with a question mark immediately before it
```{r eval=FALSE, echo=TRUE}
?View()
?write.table()
#alternatively you can simply type in the funciton name into the help search bar. This tab is by default in the bottom right quadrant of Rstudio.
```
One of R's biggest strength is the large amount of packages written for it. A package is a user created set of functions that expand R functionality.
These are the commands to install the devtools package. We use this to install packages hosted on bitbucket, and github.
```{r eval=FALSE, echo=TRUE}
install.packages("devtools", repos="http://cran.us.r-project.org") #installs the devtools package - this should only be run once
library(devtools) #attaches the devtools package - this must be run at the start of every R session (usually just place this at the beginning of your R script)
```
####Install the inst.research package.
R does not come with a default way of assigning names to variables. I wrote a package called inst.research that includes the function called import_lables that gives R this capability. First we need to install the package...
```{r eval=FALSE, echo=TRUE, cache=TRUE}
##Run these two lines of code to install the inst.research package
library(devtools)
devtools::install_bitbucket("nietsnel/inst.research", force=TRUE)
library(inst.research)
```
Remember, In R we can learn more about a particular package by typing the "?" before any particular function.
```{r eval=FALSE, echo=TRUE}
##For instance ?mean calls R documentation for the mean function (a base function in R)
?mean
##We can also view the code of the import_labels funtion which is part of the inst.research package.
?import_labels
```
Once the inst.research package is installed We must attach it using the library function.
We can then use the import_labels to load a dataset for our next example
```{r eval=TRUE, echo=TRUE}
library(inst.research) ##attach the inst.research package
import_labels(dataset="C:/Users/jprendez/Desktop/c2015_a.csv", definitions="C:/Users/jprendez/Desktop/definitions_1.txt")
```
####More tools for institutional research:
The dplyr and tidyr packages are two of the best packages for data wrangling. Here we will demonstrate some of their powerful and intuitive functionality.
```{r eval=FALSE, echo=TRUE, cache=TRUE, include=TRUE}
# To select a subset of the dataset above use
install.packages("tidyr", repos="http://cran.us.r-project.org", dependencies = TRUE) #installs the tidyr package - this should only be run once
install.packages("dplyr", repos="http://cran.us.r-project.org", dependencies = TRUE) #installs the dplyr package - this should only be run once
install.packages("scales", repos="http://cran.us.r-project.org", dependencies = TRUE) #installs the dplyr package - this should only be run once. The scales packages is used in our graphing example.
library(tidyr) ##then attach them
library(dplyr)
library(scales)
```
```{r eval=FALSE, echo=FALSE, cache=TRUE, messages=FALSE, warning=FALSE}
flowers <- iris
install.packages('printr', type = 'source', repos = c('http://yihui.name/xran', 'http://cran.rstudio.com'))
library(knitr)
# library(dataset)
# table(data_set$MAJORNUM, data_set$CTOTALM)
# install.packages('printr', type = 'source', repos = c('http://yihui.name/xran', 'http://cran.rstudio.com'))
````
The following filters and commands should apply to "data_set". See the RStudio cheatsheet on datawrangling for more info.
Here is an example on how we can subset our loaded "dataset"
```{r eval=TRUE, echo=TRUE, cache=TRUE, messages=FALSE, warning=FALSE}
library(tidyverse)
data_set2 <- data_set %>% #the first line of code indicates that we want our subsetted dataset assigned to a new variable called "dataset2".
filter(MAJORNUM=="First major") %>% #select only records with the column MAJORNUM equal to "First major".
filter(AWLEVEL!="Associate^s degree") %>% #select only records with the column AWLEVEL NOT equal to "Associate^s degree"
mutate(my_new_var_name=CTOTALM+CTOTALW) %>% #creates new variable based on existing variables and appends to dataframe
select(UNITID, CIPCODE, MAJORNUM, AWLEVEL, CTOTALM, CTOTALW, my_new_var_name) %>% #selects only those columns listed (drops all others)
rename(degree_level=AWLEVEL) #rename a column ("new name=old name")
# View(
library(knitr)
# library(dataset)
# table(data_set$MAJORNUM, data_set$CTOTALM)
glimpse(data_set2) ##Use "glimpse"" to obtain a summary of the dataset
```
###Combining datasets
Lets take a sample of the dataframe in memory called "data_set" and name it "data_set_small".
```{r eval=TRUE, echo=TRUE, cache=TRUE}
data_set_small<- sample_frac(tbl=data_set, size=0.5, replace=FALSE)
```
Next, merge our new "data_set_small" to our dataframe that we named "data_set".
```{r eval=TRUE, echo=TRUE, cache=TRUE}
in_common_dataset<- semi_join(x=data_set_small, y=data_set, by="UNITID")
```
This command gives us a new dataset called "in_common_dataset". The environment quadrant in the top right of Rstudio should show us that this new dataset contains the same number of records that have a match in the second dataset (i.e. the number that match a record in "data_set").
There are many different types of ways to combine datasets. For a list of other useful commands including merges see the Rstudio cheatsheet for datawrangling [here](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
```{r eval=TRUE, echo=FALSE, collapse=TRUE, include=FALSE, cache=TRUE}
## Here I am simulating data for the upcoming graphing example. You do not need to understand this to proceed. Simply execute the code in this code chunk in order to generate the dataset used in the next example.
library(tidyr)
library(dplyr)
library(scales)
aa <- data_set %>%
filter(AWLEVEL=="Associate^s degree") %>%
mutate(cost_attend= rnorm(51656, mean=6000, sd=2000)) %>%
mutate(earnings=rnorm(51656, mean=30000, sd=3000)) %>%
select(UNITID, CIPCODE, MAJORNUM, CNRALW, cost_attend, earnings, AWLEVEL)
ba <- data_set %>%
filter(AWLEVEL=="Bachelor^s degree") %>%
mutate(cost_attend= rnorm(116689, mean=9000, sd=3000)) %>%
mutate(earnings=rnorm(116689, mean=40000, sd=2000)) %>%
select(UNITID, CIPCODE, MAJORNUM, CNRALW, cost_attend, earnings, AWLEVEL)
ma <- data_set %>%
filter(AWLEVEL=="Master^s degree") %>%
mutate(cost_attend= rnorm(41431, mean=20000, sd=4200)) %>%
mutate(earnings=rnorm(41431, mean=42000, sd=4000)) %>%
select(UNITID, CIPCODE, MAJORNUM, CNRALW, cost_attend, earnings, AWLEVEL)
aa <- sample_n(aa, 5000)
ba <- sample_n(ba, 5000)
ma <- sample_n(ma, 5000)
total <- rbind(aa, ba, ma)
data <- total
rm(aa,ba,ma)
```
```{r eval=FALSE, echo=FALSE, include=FALSE}
##this is used to make the datafile included in this presentation.
write.table(x=data, file="C:/Users/jprendez/Desktop/cost_of_attend.csv", sep=",")
#Notice the first argument is the name of the variable in R, the second is the path we want to write the file to, and lastly, the type of delimeter.
```
###Graphics in R
R produced high quality graphs. Here is an example of graphing using the popular ggplot2 package.
Practice...
1. First import the file "cost_of_attend" .csv file.
2. rename the file data using the assignment variable
3. run the code below to generate graph
```{r eval=TRUE, echo=TRUE, fig.retina=TRUE, fig.align='center', cache=TRUE}
d_bg <- data[, -7] # Background Data - full without the 5th column
ggplot(data, aes(x = cost_attend, y = earnings)) + #specifies which columns we want to graph
# geom_point(data = d_bg, colour = "grey", alpha = .2) + #specifies which color we want the background data
geom_point() + ##adds layer (we use this to add subsequent colors)
# facet_wrap(~ AWLEVEL) + #panels data by degree
# guides(colour = FALSE) + #supresses default option for color legend
theme_bw() + #changes background color from default grey to white
ggtitle("Earnings by Degree") + #adds Main title
xlab("Yearly cost of attendance") + #adds x lable
ylab("5-year post-grad income") + #adds y lable
scale_y_continuous(labels=dollar) + #indicates that the y variable is continuous and currency (scales package addon)
scale_x_continuous(labels=dollar) #indicates that the x variable is continuous and currency (scales package addon)
```
Here we can add color to each group while putting into the foreground the remaining groups. While the black and white version, above, is fairly clear, graphing with color can highlight group differences in a more realistic example.
```{r eval=TRUE, echo=TRUE, fig.retina=TRUE, fig.align='center', cache=TRUE}
d_bg <- data[, -7] # Background Data - full without the 5th column
ggplot(data, aes(x = cost_attend, y = earnings, colour = AWLEVEL)) + #specifies which columns we want to graph. Note: colour=AWLEVEL here. This indicates that we are specifiying the colour by the degree level.
geom_point(data = d_bg, colour = "grey", alpha = .2) + #specifies which color we want the background data
geom_point() + ##adds layer (we use this to add subsequent colors)
facet_wrap(~ AWLEVEL) + #panels data by degree
guides(colour = FALSE) + #supresses default option for color legend
theme_bw() + #changes background color from default grey to white
ggtitle("Earnings by Degree") + #adds Main title
xlab("Yearly cost of attendance") + #adds x lable
ylab("5-year post-grad income") + #adds y lable
scale_y_continuous(labels=dollar) + #indicates that the y variable is continuous and currency (scales package addon)
scale_x_continuous(labels=dollar) #indicates that the x variable is continuous and currency (scales package addon)
```
Source [^3]
```{r eval=TRUE, echo=FALSE, cache=TRUE}
###this is used to generate the data for the next linear regression example.
set.seed(955)
# Make some noisily increasing data
dat <- data.frame(cond = rep(c("A", "B"), each=10),
xvar = 1:20 + rnorm(20,sd=3),
yvar = 1:20 + rnorm(20,sd=3))
```
We can also easily perform statistical analysis in R. Here we see we can see that it takes only
two lines of code to perform a linear regression. (data for this regression is not included in HTML)
```{r eval=TRUE, echo=TRUE}
ans <- lm(dat$xvar ~ dat$yvar) ##code to run a linear model on two variables within the "dat" dataset. We can access a variable within a data.frame using the "$".
summary(ans) ##Gives us the summary of the results -- shown below
```
```{r eval=FALSE, echo=TRUE, cache=TRUE, include=TRUE}
install.packages("ggplot2", repos="http://cran.us.r-project.org", dependencies = TRUE) #installs the dplyr package - this should only be run once. The scales packages is used in our graphing example.
library(ggplot2)
```
We can then graph those results ...
```{r eval=TRUE, echo=TRUE, fig.align='center', cache=TRUE}
library(ggplot2)
##credit http://www.cookbook-r.com/Graphs/Scatterplots_(ggplot2)/
##^ great website for more graphing ideas.
##this code block graphs that dataset
ggplot(dat, aes(x=xvar, y=yvar)) +
geom_point(shape=1) + # Use hollow circles
geom_smooth(method=lm) + # Add linear regression line -- (by default includes 95% confidence region)
theme_bw() + #changes the background color
ggtitle("Important research") + #adds a main title
xlab("Temperature") + #adds a x lable
ylab("Need for ice-cream") #adds a y lable
```
[^3]: The code shown in this example is adapted from this site [site](http://buff.ly/2aOKpzY)
###Further Resources
1. The Rstudio authored [cheat sheets](https://www.rstudio.com/resources/cheatsheets/)
2. Google + Stackoverflow = [answer](http://bfy.tw/7E5L)
(The let-me-google-that-for-you site is a bit snarky but the point is that it is easy to search for solutions in stack overflow using google)
***
RStudio and Shiny are trademarks of RStudio, Inc.