ProjectTemplate is an R package that facilitates data analysis with R. It makes it easy to start a new data analysis project. In short, ProjectTemplate is awesome and worth learning.
I have modified the default ProjectTemplate folder structure to align more closely with my workflow, and I think these tweaks should be useful for others too. The changes are described below.
- Install R https://cran.rstudio.com/
- Install RStudio https://www.rstudio.com/products/rstudio/download3/
- Install R packages: `knitr`, `ProjectTemplate` (e.g., `install.packages("knitr")`)
- Install any R packages required for data import or listed in `config/global.dcf`
In addition:
- Install any dependencies of these R packages. In particular, if you use xls files to store some data, you may need to install Perl (http://www.perl.org/get.html). OS X and Linux generally come with Perl installed; on Windows you need to install it yourself.
- If you want to be able to knit to pdf, then get a TeX distribution (https://www.latex-project.org/get/); otherwise, just knit to Word or HTML.
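The package installation step above can be sketched as a short script. This is a minimal sketch, not part of the template: `missing_packages` is a hypothetical helper name, and the package vector only contains the two packages named above, so extend it with anything listed in your `config/global.dcf`.

```r
# Packages named in the installation list; add any others your project needs.
required_packages <- c("knitr", "ProjectTemplate")

# Hypothetical helper: return the packages that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(required_packages)

# Only install when run interactively, so that sourcing this file from a
# non-interactive script has no side effects.
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

Checking first avoids reinstalling packages that are already present.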
If you want to use this workflow for doing data analysis, follow these steps:
- Download a zip file of the AnglimModifiedProjectTemplate
- Unzip this file in an appropriate location and give both the directory and the RStudio project file (i.e., `InsertProjectNameHere.Rproj`) a name corresponding to your project
- Open the project in RStudio by clicking the `.Rproj` file (this helps to ensure that the R working directory is correct)
- Prepare the raw data (i.e., ensure the name of each data file is what you want the object to be called in R) and add it to the `data` directory
- Open the included RMarkdown file (i.e., `explore.rmd`), run the chunk `library(ProjectTemplate); load.project()`, and check that the data imported correctly. You may well get errors at this point indicating that you need to install additional R packages or dependencies (particularly Perl with gdata); if so, install these.
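One way to check that the data imported correctly is to list the data frames that now exist in the workspace. The sketch below is an illustration using base R only; `list_data_frames` is a hypothetical helper, not a ProjectTemplate function.

```r
# Hypothetical helper: list every data frame in an environment along with its
# dimensions, so you can confirm that each file in data/ became an R object.
list_data_frames <- function(env = globalenv()) {
  object_names <- ls(envir = env)
  is_df <- vapply(object_names,
                  function(nm) is.data.frame(get(nm, envir = env)),
                  logical(1))
  df_names <- object_names[is_df]
  data.frame(
    name = df_names,
    rows = vapply(df_names, function(nm) nrow(get(nm, envir = env)), integer(1)),
    columns = vapply(df_names, function(nm) ncol(get(nm, envir = env)), integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# After running library(ProjectTemplate); load.project() you could call:
# list_data_frames()
# str(mydata)  # 'mydata' is a placeholder for one of your imported objects
```

If a file from the `data` directory is missing from this list, that suggests the import failed for that file.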
You're now ready to start manipulating and analysing your data.
- Data manipulation: If you need to modify your imported data, put this code in an R script in the munge directory (i.e., `munge/01-munge.R`). For example, add or modify variables in a data.frame, add or remove cases, or merge data frames. More generally, if there are any objects that need to be accessible across multiple analyses, then create them in the munge script.
- Additional R packages: If you need to use an additional R package, add the name of this package to the `libraries` line in `config/global.dcf` rather than adding `library("packagename")` to your script.
- Custom functions: If you write a function to help you perform your analyses, then place it in an R script in the `lib` directory. This way it will be automatically loaded every time you run `load.project()`.
- Data analysis: If you are running analyses that you plan to keep, then place them in an RMarkdown file (e.g., `explore.rmd`). Make sure that you have an initial chunk that just has `library(ProjectTemplate); load.project()`. Then add code chunks after that as required.
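To make the split between the `lib` and `munge` directories concrete, here is a minimal sketch. Everything in it is illustrative: the file name `lib/helpers.R`, the function `score_scale`, and the `survey` data frame with its columns are assumptions, and it assumes that functions in `lib` are available by the time the munge scripts run.

```r
# --- lib/helpers.R (hypothetical file name) ---
# Custom function: mean of several item columns, ignoring missing values.
score_scale <- function(data, items) {
  rowMeans(data[, items, drop = FALSE], na.rm = TRUE)
}

# --- munge/01-munge.R ---
# In a real project load.project() would create 'survey' from a file in data/;
# a toy stand-in is built here only so the sketch runs on its own.
survey <- data.frame(
  id = 1:4,
  item1 = c(1, 2, NA, 4),
  item2 = c(3, 2, 5, 4),
  age = c(25, 17, 40, 33)
)

# Add a variable.
survey$scale_mean <- score_scale(survey, c("item1", "item2"))

# Remove cases (here: keep adults only).
survey <- survey[survey$age >= 18, ]
```

Because the derived variable is created in the munge script, every RMarkdown chunk that runs after `load.project()` sees the same `survey` object.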
- What if data does not import correctly? In some cases, the default data import rules used by ProjectTemplate do not work as you might want. In that case, you can add your own code to import the data. Place this code in an R script in the `lib` directory (e.g., `data-import-override.r`). If the data is not importing at all, this can be a sign of several things: you don't have a package installed; you don't have a dependency like Perl or Java installed; or the data has formatting issues. For more information about file formats in ProjectTemplate, see http://projecttemplate.net/file_formats.html.
- What if there is an error in my data manipulation code (i.e., munge files)? Clear the workspace (i.e., click the broom in RStudio or run `rm(list=ls())`). Then run `library(ProjectTemplate); load.project(list(munging=FALSE))`. This will load the data, import packages and so on, but won't run the data manipulation code. Then run each line in the data manipulation file until you encounter the error. From there it's just a matter of adopting normal debugging procedures. Make sure the data manipulation file is saved.
- Returning to an analysis after closing RStudio? In general, you should just need to open RStudio using the Rproj file and then run `library(ProjectTemplate); load.project()`. This will automatically load the data, import the specified packages, import support functions, and run the initial data manipulations. You should then be able to begin analysis.
- What if RMarkdown code chunks cannot find a variable? This is usually a sign that data manipulation steps have been placed in the RMarkdown file. Try to find these and move them into the munge folder. In general, aim for RMarkdown code chunks to depend only on `load.project()` and not on other code chunks. So in theory, after running `load.project()` you should be able to run any other chunks in any order.
- General comments about workflow: For some steps, you have the choice between refreshing the project or adopting a manual step. For example, when importing a package, you can either use `library(...)` or you can add the package to `global.dcf` and run `load.project()`. Similarly, if you are creating a new function that you put in the `lib` directory, you can either source it manually or you can run `load.project()`. Either approach is fine. The key is to remember to do the steps required to ensure everything works with ProjectTemplate. A general trick for debugging is to clear the workspace and then run `load.project()`. In addition, when exploring the data, you often run analyses that don't need to be saved in your data analysis script.
- Saving figures and tables: When conducting reproducible data analysis and writing a manuscript, you can use a document format that allows for weaving results and formatted text (e.g., RMarkdown or LaTeX using either knitr or Sweave). An alternative is to make the process of generating tables, figures, and text fully reproducible by a script, but then manually import this output into your document. While weaving formats are more reproducible, they can result in practical problems. In particular, I'm typically collaborating with people who use Microsoft Word. Furthermore, it can be nice to know that the figures and tables are fixed and won't accidentally change without your knowledge. And some steps, like detailed table formatting, can be quite complicated to automate. If you adopt this semi-automated process, then it's useful to export the figures to the `output` directory (e.g., see http://www.statmethods.net/graphs/creating.html). Tables can also be exported to the output directory (e.g., using `write.csv(mytable, file = "output/mytable.csv")`), where they can have manual formatting applied (e.g., lines, centering, fonts, etc.).
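The semi-automated export described in the last item can be sketched with base R alone. This is a minimal sketch under stated assumptions: `mytable` is a toy stand-in for a real results table, the file names are placeholders, and `pdf()` is used only because it is available on every R installation (`png()` works the same way if you need an image for Word).

```r
output_dir <- "output"  # the template's general-purpose output directory
if (!dir.exists(output_dir)) dir.create(output_dir)

# Toy table standing in for a table of results.
mytable <- data.frame(group = c("a", "b"), mean = c(1.5, 2.5))
write.csv(mytable, file = file.path(output_dir, "mytable.csv"), row.names = FALSE)

# Save a figure: open a graphics device, draw the plot, then close the device.
pdf(file.path(output_dir, "myfigure.pdf"), width = 6, height = 4)
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")
dev.off()
```

The exported csv file can then be opened in a spreadsheet for manual formatting before being pasted into the manuscript.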
- A general explanation of how to use this customised ProjectTemplate. Note that it was written with version 0.1, and a few things have changed since then.
- The ProjectTemplate website, which includes more general information about ProjectTemplate
- General video giving an overview of ProjectTemplate
- General video on RMarkdown
- Tutorial on scoring psychological tests using RMarkdown and ProjectTemplate
- Tutorial on confirmatory factor analysis using lavaan and ProjectTemplate
v. 0.6
- Migrated project to make global.dcf consistent with version 0.9 of ProjectTemplate
- Default is data_frame
- Removed "data/input.xls"; it seemed to be distracting new users
- Hid code that deals with overriding tibble conversion now that there is an option to use data frames.
v. 0.5.2
- Added function to prevent tibble conversion of imported data.frames
v. 0.5.1
- Added `rm(list=ls())` in various locations to make it easier to reset the project before running `library(ProjectTemplate); load.project()`
v. 0.5
- Moved rmd files out of reports directory (rmd files are simpler to understand when they are in the working directory, even if they do create a little clutter)
- Removed figure and doc directories as they weren't being used; the `output` directory functions as a useful general store of output
- Added this readme file to explain how ProjectTemplate works
- Added ggplot2 as default package in global.dcf
- Disabled saving and loading of R workspaces in the RStudio project file as this workflow works against the purpose of ProjectTemplate
- Added file `raw-data/import-raw-data.r` as a placeholder file for preparing initial data files
- Created a change log and version information.
- Added file `lib/importxls.r`. It reads my two default xls files in the data directory using the `readxl` package rather than `gdata`. `readxl` has the advantage that it does not have an external dependency on Perl.
v. 0.4
- Added `raw-data` directory as a standardised location for converting raw data into data suitable for the data directory (i.e., convert original file names to those suitable for the data directory)
- Updated `config/global.dcf` to reflect updates to ProjectTemplate (v 0.6)
v. 0.1
- Modified default global.dcf (`as_factors: off` and changed default packages)
- Made readme.md blank
- Removed a couple of directories (e.g., diagnostics, logs, profiling)
- Added an initial RMarkdown file in the reports directory
- Added RStudio project file (i.e., .Rproj) to enable easy launching of RStudio.
- Created output directory as the general location for saving all output (figures, tables, data files). This name seemed more appropriate than the built-in "graphs" directory.
- Added `output/output-processing.xlsx` for manually preparing tables from exported data
- Added `data/meta.xls` as a general file for storing metadata (e.g., scoring rules for psychological tests)
- Added `data/input.xls`, a simple general-purpose spreadsheet for storing ad hoc data that needs to be imported into R