Writing download functions
After writing a few of these download functions (in the downloads.R file), I've compiled a list of some notes and helpful tips to make the experience all the more pleasant. Any additional tips or tricks are welcome!
-
After opening up R, there are a few things you need to load in before running any of the downloads.R code:
library(reshape2)
library(devtools)
install_github("willpearse/fulltext")
library(fulltext)
source('/path/to/natdb/R/utility.R')
-
Once you get settled, find a paper to get data from and download the data to your computer. Then you can open the file in R to have a look at it:
data <- read.delim("~/Desktop/PanTHERIA_1-0_WR05_Aug2008.txt")
-
Look through the metadata and begin to figure out which columns are useful, what the units are, etc.
-
You can use names(data) to pull out just the names for each column, which can make it easier to extract just the ones you'd like to keep.
-
Make sure that any meaningless info is removed, and that NAs are in place where data is absent.
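The two tidying steps above can be sketched on a toy table. Everything in it is invented for illustration: the trait column names, the values, and the `keep` vector are assumptions, while the -999 missing-value code and the MSW05_Binomial species column are borrowed from the PanTHERIA example later on this page.

```r
# Toy stand-in for a downloaded table; all values are invented for illustration
data <- data.frame(
  MSW05_Binomial = c("Canis lupus", "Vulpes vulpes", "Lynx lynx"),
  body_mass_g    = c(31757, -999, 19300),
  home_range_km2 = c("-999", "2.5", "150"),
  notes          = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# List the column names, then keep only the useful ones
names(data)
keep <- c("MSW05_Binomial", "body_mass_g", "home_range_km2")
data <- data[, keep]

# Replace the missing-value code with NA, column by column.
# which() drops NA comparisons, so columns that already hold NAs are handled safely.
for (i in seq_len(ncol(data))) {
  hits <- which(data[[i]] == -999)
  data[[i]][hits] <- NA
}

# A column read in as text because of a "-999" string can now be made numeric
data$home_range_km2 <- as.numeric(data$home_range_km2)
```

Comparing a text column against the number -999 works here because R converts the number to the string "-999" before comparing, so one test covers both numeric and text columns.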
You cannot write a download function without using .df.melt. .df.melt turns your downloaded data into a format that natdb can work with. If you don't use this function, you may as well not have bothered writing a function at all! .df.melt takes four arguments, only two of which are required:
-
data - a data.frame containing all of the trait measurements you have downloaded.
-
species - the name of the column in data that contains the species names of the trait measurements.
-
units - (optional, but recommended!) the units of each column in data, in the order they appear, with nothing (no NA - nothing!) for the species column.
-
metadata - (optional, but recommended!) a data.frame containing the meta-data for everything in data. This could be things like the longitude and latitude at which a measurement was made, which is useful information but isn't a trait.
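To make the argument list concrete, here is a hedged sketch of preparing each piece before the call. The toy table, its column names, and the unit strings are invented. Passing metadata by name and giving it one row per row of data are assumptions based only on the description above. The call is guarded so the snippet also runs when utility.R has not been sourced.

```r
# Toy table: a species column, two traits, and two columns that are metadata rather than traits
raw <- data.frame(
  species   = c("Canis lupus", "Vulpes vulpes"),
  mass      = c(31757, 4820),
  range     = c(150, 2.5),
  latitude  = c(60.1, 51.5),
  longitude = c(24.9, -0.1),
  stringsAsFactors = FALSE
)

# Split the metadata columns from the trait columns (species stays with the traits)
meta.cols <- c("latitude", "longitude")
metadata  <- raw[, meta.cols]
data      <- raw[, !(names(raw) %in% meta.cols)]

# One unit per trait column, in column order, with no entry for the species column
units <- c("g", "km^2")
stopifnot(length(units) == ncol(data) - 1)

# Only call .df.melt if it has been loaded, e.g. by sourcing natdb's utility.R.
# The metadata= argument name is assumed from the argument list above.
if (exists(".df.melt")) {
  melted <- .df.melt(data, "species", units = units, metadata = metadata)
}
```

Checking the length of units against the number of trait columns before the call catches the most likely slip, which is accidentally including an entry for the species column.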
Lines 13-56 in the downloads.R file contain sample functions for several different data repositories that you might be pulling data from. Before worrying too much about how to make the Best Function, look through those 5 functions to see if you can create your own from those examples.
# name your function with the author and year; if there is more than one dataset from a paper, use a, b, c... after the author name/year
.jones.2009 <- function(...){
    # read in your data from an internet source
    data <- read.delim(ft_get_si("E090-184", "PanTHERIA_1-0_WR05_Aug2008.txt", "esa_archives"))
    # do some things to make the dataset more useful - add in NAs where appropriate, remove meaningless columns, etc.
    for(i in 1:ncol(data))
        data[data[,i]==-999 | data[,i]=="-999",i] <- NA
    # from the metadata, find the units for each column and supply them with c(); sample() is only a placeholder here
    units <- sample(c("g","m^2"),length(names(data))-1,TRUE)
    # use the .df.melt() function to add in the data, species name column, and units
    data <- .df.melt(data, "MSW05_Binomial", units=units)
    # ta-da! data!
    return(data)
}
-
Below is the step-by-step construction of one of these download functions, as written by Will during one of our meetings. It can be helpful to work through the process for the first few functions, but after a while you can write one using just the last step:
# downloading from harddrive after manually unzipping
# editing to include NAs, units
.jones.2009 <- function(...){
    data <- read.delim("~/Desktop/PanTHERIA_1-0_WR05_Aug2008.txt")
    for(i in 1:ncol(data))
        data[data[,i]==-999 | data[,i]=="-999",i] <- NA
    units <- sample(c("g","m^2"),length(names(data))-1,TRUE)
    data <- .df.melt(data, "MSW05_Binomial", units=units)
    return(data)
}

# downloading .zip, unzipping
.jones.2009 <- function(...){
    file <- .unzip("PanTHERIA_1-0_WR05_Aug2008.txt", "~/Downloads/ECOL_90_184.zip")
    data <- read.delim(file)
    for(i in 1:ncol(data))
        data[data[,i]==-999 | data[,i]=="-999",i] <- NA
    units <- sample(c("g","m^2"),length(names(data))-1,TRUE)
    data <- .df.melt(data, "MSW05_Binomial", units=units)
    return(data)
}

# downloading from the internet, unzipping
.jones.2009 <- function(...){
    file <- tempfile()
    download.file("https://ndownloader.figshare.com/files/5604752", file)
    file <- .unzip("PanTHERIA_1-0_WR05_Aug2008.txt", file)
    data <- read.delim(file)
    for(i in 1:ncol(data))
        data[data[,i]==-999 | data[,i]=="-999",i] <- NA
    units <- sample(c("g","m^2"),length(names(data))-1,TRUE)
    data <- .df.melt(data, "MSW05_Binomial", units=units)
    return(data)
}

# downloading directly from ESA (in this case), selecting desired file to retrieve from database
# this is specific to which database you're downloading from (ESA, Dryad, etc...)
.jones.2009 <- function(...){
    data <- read.delim(ft_get_si("E090-184", "PanTHERIA_1-0_WR05_Aug2008.txt", "esa_archives"))
    for(i in 1:ncol(data))
        data[data[,i]==-999 | data[,i]=="-999",i] <- NA
    units <- sample(c("g","m^2"),length(names(data))-1,TRUE)
    data <- .df.melt(data, "MSW05_Binomial", units=units)
    return(data)
}

# test the bad boi
test <- .jones.2009()
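When testing a new function, it can help to wrap the call so that a failed download reports its error instead of halting the session. The helper below is a hypothetical sketch and is not part of natdb. The name `.try.download` and the two stand-in functions are invented for illustration.

```r
# Hypothetical helper: run a download function; on error, print a message and return NULL
.try.download <- function(fun, ...) {
  tryCatch(
    fun(...),
    error = function(e) {
      message("Download function failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-ins for real download functions, used only to show the behaviour
.works <- function(...) data.frame(species = "Canis lupus", value = 1)
.fails <- function(...) stop("file not found")

ok  <- .try.download(.works)
bad <- .try.download(.fails)

# Take a quick look at whatever came back
if (!is.null(ok)) str(ok)
```

With utility.R and downloads.R sourced, `test <- .try.download(.jones.2009)` would be used in the same way.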