Jordan Prendez 2016-11-18
- Introduction to R for institutional Research.
- Installation of R and Rstudio
- Basic R
- R for Institutional Research
R is a programming language that is the Lingua franca of statistics. One of the most cited benefits of R is that it can produce publication quality graphics that can highly modified. The base R language includes common statistical procedures, however, R's wide spread use is undoutably related to the the ever growing set of user created statistical packages. Finally, another huge benefit is R is availible for free. This is especially important in the era of decreasing support and tighter budgets in higher education.
While R was not made specifically for institutional research it can perform all of the same types of functions and manipulations you might be accustomed to in other statistical software packages. However, R's learning curve can be somewhat steep and this short presentation will help those not acquanted with R to understand how to install, import data, and perform basic manipulations and other procedures that are common in institutional research, including a few examples of R's graphing capability.
R is maintained by the R foundation and can be downloaded from the following website. R is availible to Windows, OS X and and Linux. Click on the appropriate link and follow the instructions to install R. This guide is written for the windows version of R. If you have already installed R it is important to make sure you have the most updated version (some parts of this tutorial may not work with older versions of R.)
Everything we do from this point onwards can be accomplished in base R (what we have installed so far) but I prefer and would recommend using the Rstudio environment. Rstudio is a user interface that keeps R organized and automates some of the written code into a point in click interface.
https://www.rstudio.com/products/rstudio/download2/
Once this is installed you are ready to start using R!
This presentation can be opened in R and the code executed directly from the R scripting window. Rstudio can open traditional R scripts .R, in addition to Rmarkdown files .Rmd (this presentation) and other more interactive file types.
The Rstudio environment consists of four different panes of which
- The script window and
- Console window
are usually visible as the left two quadrants. The two panels on the right side change depending on need. Most of your work should be typed into the scripting window (where it can be saved) but also can be run live from the console.
Additional pre-reqs: You must download the following files listed at the top of this github page if you wish to run these analysis yourself.
- README.Rmd
- Both files located in the "Prerequisite files" folder
Once you have downloaded these files open the README.Rmd and follow along. You will need to rename the file paths to point to the two prerequisite files at the appropriate time.
For those who are transisitioning from a point in click interface there are several graphical user interfaces (GUI) availible for R that might be useful for someone transitioning.
##We won't be focusing on R commander. For those interested here are the necessary lines of code to install and run the program.
install.packages("Rcmdr") ##Run this to install R commander. (Only needs to be run once). ##Evaluate this, or any line in R, by highlighting it then using "Ctrl+R".
library(Rcmdr) #Run this line
Rcommander() #and execute this line to start R commander from the R console (or script window) each time you want to start R commander.
For more info about Rcmdr see the appropriate footnotes at the bottom of this page. [1] [2]
Practice executing code: R as a calculator. For this example, practice executing the following commands in the R console.
##Run these examples yourself. Try your own values.
10+12 ##Evaluate this, or any line in R, by highlighting it then using "Ctrl+R"
abs(-10-500)
sqrt(pi+2)
Executing the commands above, and others, will help give you practice in the Rstudio environment and demonstrate basic R functionality.
The goal of this section is to provide concrete examples of tasks that are useful to the institutional researcher.
To import a dataset click on the import dataset button in the "Environment" tab (by default this can be found in the top right quadrant of Rstudio). Select a file to import, select the appropriate options, name the dataset and press "import" to import the dataset into R.
Remember the file paths shown throughout this document are examples and must be changed to the actual location on your local machine.
##We can also use the command line to perform the same function as the import dataset button in Rstudio.
original_dataset_name <- read.csv("C:/Users/jprendez/Desktop/c2015_a.csv")
There are a few basic functions that are very important to understand in R.
- Assigning variables
- How to view data
- Basic subsetting
- Exporting data from R.
Assigning a variable
new_dataset_name <- original_dataset_name # we can rename any variable or dataset in R to any other name using the assignment arrow " <- ". It is not a good idea to name your datasets, however, the same names as R functions (i.e. do not name your dataset mean, or lm, etc.)
To view a dataset use the "view" command.
##We can view our dataset using the "view" command.
View(new_dataset_name)
view(new_dataset_name) #Note, that R is case sensitive. So a lowercase "view" will not work.
The basic subsetting function in R is defined using the following form
new_dataset_name[row,column]
new_dataset_name[3,2] ##this gives us the value corresponding to the third row, and second column of our dataset.
x <- new_dataset_name[3,2] ##we can assign this single value to a variable or..
y <- new_dataset_name[,2] ##by not defining the row, assign an entire column to a variable (or vice versa)
write.table(x=new_dataset_name, file="C:/Users/jprendez/Desktop/name_of_exported_table.txt", sep=",")
#Notice the first argument is the name of the variable in R, the second is the path we want to write the file to, and lastly, the type of delimeter.
In order to learn more about the write.table function or the View function (or any function), execute the function name with a question mark immediately before it
?View()
?write.table()
#alternatively you can simply type in the funciton name into the help search bar. This tab is by default in the bottom right quadrant of Rstudio.
One of R's biggest strength is the large amount of packages written for it. A package is a user created set of functions that expand R functionality.
These are the commands to install the devtools package. We use this to install packages hosted on bitbucket, and github.
install.packages("devtools", repos="http://cran.us.r-project.org") #installs the devtools package - this should only be run once
library(devtools) #attaches the devtools package - this must be run at the start of every R session (usually just place this at the beginning of your R script)
R does not come with a default way of assigning names to variables. I wrote a package called inst.research that includes the function called import_lables that gives R this capability. First we need to install the package...
##Run these two lines of code to install the inst.research package
library(devtools)
devtools::install_bitbucket("nietsnel/inst.research", force=TRUE)
library(inst.research)
Remember, In R we can learn more about a particular package by typing the "?" before any particular function.
##For instance ?mean calls R documentation for the mean function (a base function in R)
?mean
##We can also view the code of the import_labels funtion which is part of the inst.research package.
?import_labels
Once the inst.research package is installed We must attach it using the library function. We can then use the import_labels to load a dataset for our next example
library(inst.research) ##attach the inst.research package
import_labels(dataset="C:/Users/jprendez/Desktop/c2015_a.csv", definitions="C:/Users/jprendez/Desktop/definitions_1.txt")
The dplyr and tidyr packages are two of the best packages for data wrangling. Here we will demonstrate some of their powerful and intuitive functionality.
# To select a subset of the dataset above use
install.packages("tidyr", repos="http://cran.us.r-project.org", dependencies = TRUE) #installs the tidyr package - this should only be run once
install.packages("dplyr", repos="http://cran.us.r-project.org", dependencies = TRUE) #installs the dplyr package - this should only be run once
install.packages("scales", repos="http://cran.us.r-project.org", dependencies = TRUE) #installs the dplyr package - this should only be run once. The scales packages is used in our graphing example.
library(tidyr) ##then attach them
library(dplyr)
library(scales)
The following filters and commands should apply to "data_set". See the RStudio cheatsheet on datawrangling for more info. Here is an example on how we can subset our loaded "dataset"
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
data_set2 <- data_set %>% #the first line of code indicates that we want our subsetted dataset assigned to a new variable called "dataset2".
filter(MAJORNUM=="First major") %>% #select only records with the column MAJORNUM equal to "First major".
filter(AWLEVEL!="Associate^s degree") %>% #select only records with the column AWLEVEL NOT equal to "Associate^s degree"
mutate(my_new_var_name=CTOTALM+CTOTALW) %>% #creates new variable based on existing variables and appends to dataframe
select(UNITID, CIPCODE, MAJORNUM, AWLEVEL, CTOTALM, CTOTALW, my_new_var_name) %>% #selects only those columns listed (drops all others)
rename(degree_level=AWLEVEL) #rename a column ("new name=old name")
# View(
library(knitr)
# library(dataset)
# table(data_set$MAJORNUM, data_set$CTOTALM)
glimpse(data_set2) ##Use "glimpse"" to obtain a summary of the dataset
## Observations: 209,389
## Variables: 7
## $ UNITID <int> 100654, 100654, 100654, 100654, 100654, 100654...
## $ CIPCODE <fctr> Agriculture, General, Animal Sciences, Genera...
## $ MAJORNUM <fctr> First major, First major, First major, First ...
## $ degree_level <fctr> Bachelor^s degree, Bachelor^s degree, Bachelo...
## $ CTOTALM <int> 3, 0, 2, 2, 1, 2, 2, 1, 1, 3, 2, 3, 15, 3, 2, ...
## $ CTOTALW <int> 0, 4, 7, 5, 1, 1, 4, 2, 0, 1, 7, 4, 12, 3, 1, ...
## $ my_new_var_name <int> 3, 4, 9, 7, 2, 3, 6, 3, 1, 4, 9, 7, 27, 6, 3, ...
Lets take a sample of the dataframe in memory called "data_set" and name it "data_set_small".
data_set_small<- sample_frac(tbl=data_set, size=0.5, replace=FALSE)
Next, merge our new "data_set_small" to our dataframe that we named "data_set".
in_common_dataset<- semi_join(x=data_set_small, y=data_set, by="UNITID")
This command gives us a new dataset called "in_common_dataset". The environment quadrant in the top right of Rstudio should show us that this new dataset contains the same number of records that have a match in the second dataset (i.e. the number that match a record in "data_set").
There are many different types of ways to combine datasets. For a list of other useful commands including merges see the Rstudio cheatsheet for datawrangling here
R produced high quality graphs. Here is an example of graphing using the popular ggplot2 package.
Practice...
- First import the file "cost_of_attend" .csv file.
- rename the file data using the assignment variable
- run the code below to generate graph
d_bg <- data[, -7] # Background Data - full without the 5th column
ggplot(data, aes(x = cost_attend, y = earnings)) + #specifies which columns we want to graph
# geom_point(data = d_bg, colour = "grey", alpha = .2) + #specifies which color we want the background data
geom_point() + ##adds layer (we use this to add subsequent colors)
# facet_wrap(~ AWLEVEL) + #panels data by degree
# guides(colour = FALSE) + #supresses default option for color legend
theme_bw() + #changes background color from default grey to white
ggtitle("Earnings by Degree") + #adds Main title
xlab("Yearly cost of attendance") + #adds x lable
ylab("5-year post-grad income") + #adds y lable
scale_y_continuous(labels=dollar) + #indicates that the y variable is continuous and currency (scales package addon)
scale_x_continuous(labels=dollar) #indicates that the x variable is continuous and currency (scales package addon)
Here we can add color to each group while putting into the foreground the remaining groups. While the black and white version, above, is fairly clear, graphing with color can highlight group differences in a more realistic example.
d_bg <- data[, -7] # Background Data - full without the 5th column
ggplot(data, aes(x = cost_attend, y = earnings, colour = AWLEVEL)) + #specifies which columns we want to graph. Note: colour=AWLEVEL here. This indicates that we are specifiying the colour by the degree level.
geom_point(data = d_bg, colour = "grey", alpha = .2) + #specifies which color we want the background data
geom_point() + ##adds layer (we use this to add subsequent colors)
facet_wrap(~ AWLEVEL) + #panels data by degree
guides(colour = FALSE) + #supresses default option for color legend
theme_bw() + #changes background color from default grey to white
ggtitle("Earnings by Degree") + #adds Main title
xlab("Yearly cost of attendance") + #adds x lable
ylab("5-year post-grad income") + #adds y lable
scale_y_continuous(labels=dollar) + #indicates that the y variable is continuous and currency (scales package addon)
scale_x_continuous(labels=dollar) #indicates that the x variable is continuous and currency (scales package addon)
We can also easily perform statistical analysis in R. Here we see we can see that it takes only two lines of code to perform a linear regression. (data for this regression is not included in HTML)
ans <- lm(dat$xvar ~ dat$yvar) ##code to run a linear model on two variables within the "dat" dataset. We can access a variable within a data.frame using the "$".
summary(ans) ##Gives us the summary of the results -- shown below
##
## Call:
## lm(formula = dat$xvar ~ dat$yvar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.385 -2.411 1.441 2.649 7.820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.9557 1.8431 1.604 0.126187
## dat$yvar 0.6267 0.1460 4.292 0.000438 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.659 on 18 degrees of freedom
## Multiple R-squared: 0.5058, Adjusted R-squared: 0.4784
## F-statistic: 18.43 on 1 and 18 DF, p-value: 0.0004384
install.packages("ggplot2", repos="http://cran.us.r-project.org", dependencies = TRUE) #installs the dplyr package - this should only be run once. The scales packages is used in our graphing example.
library(ggplot2)
We can then graph those results ...
library(ggplot2)
##credit http://www.cookbook-r.com/Graphs/Scatterplots_(ggplot2)/
##^ great website for more graphing ideas.
##this code block graphs that dataset
ggplot(dat, aes(x=xvar, y=yvar)) +
geom_point(shape=1) + # Use hollow circles
geom_smooth(method=lm) + # Add linear regression line -- (by default includes 95% confidence region)
theme_bw() + #changes the background color
ggtitle("Important research") + #adds a main title
xlab("Temperature") + #adds a x lable
ylab("Need for ice-cream") #adds a y lable
-
The Rstudio authored cheat sheets
-
Google + Stackoverflow = answer (The let-me-google-that-for-you site is a bit snarky but the point is that it is easy to search for solutions in stack overflow using google)
RStudio and Shiny are trademarks of RStudio, Inc.
[1] This Rcommander intro is a couple of years old and is less technical.
[2] For those wanting a more in depth and technical guide see the official cran documentation on Rcommander
[3] The code shown in this example is adapted from this site site