Benchmark datasets for WGS analysis.
Grab the latest stable release under the releases tab. If you are feeling adventurous, use git clone!
Include the scripts directory in your path. For example, if you downloaded this project into your local bin directory:
$ export PATH=$PATH:$HOME/bin/datasets/scripts
In addition to the installation above, please install the following.
- edirect (see section on edirect below)
- sra-toolkit, built from source: https://github.com/ncbi/sra-tools/wiki/Building-and-Installing-from-Source
- Perl 5.12.0
- Make
- wget - Brew users: brew install wget
- sha256sum - Linux-based OSs should have this already; other users should see the relevant installation section below.
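A quick way to see which prerequisites are still missing is sketched below. The helper name check_tools is hypothetical, and the executable names esearch (from edirect) and fastq-dump (from sra-toolkit) are assumptions about which programs from those packages need to be on your path.

```shell
# Report any tool that cannot be found on the PATH.
# Returns 0 if everything was found and 1 otherwise.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# esearch and fastq-dump are assumed stand-ins for edirect and sra-toolkit.
check_tools esearch fastq-dump perl make wget sha256sum \
  || echo "Install the missing tools before continuing."
```

Because command -v also finds shell functions, a sha256sum function defined in the current session counts as present.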
Modified instructions from https://www.ncbi.nlm.nih.gov/books/NBK179288/
mkdir -p ~/bin
cd ~/bin
perl -MNet::FTP -e \
'$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1);
$ftp->login; $ftp->binary;
$ftp->get("/entrez/entrezdirect/edirect.tar.gz");'
gunzip -c edirect.tar.gz | tar xf -
rm edirect.tar.gz
export PATH=$PATH:$HOME/bin/edirect
./edirect/setup.sh
If you do not have sha256sum (e.g., if you are on macOS), define the following shell function and export it.
function sha256sum() { shasum -a 256 "$@" ; }
export -f sha256sum
This shell function must be defined in the current session. To make it available in future sessions, add it to $HOME/.bashrc.
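If you share one .bashrc between Linux and macOS machines, a guarded variant may be more convenient. This is only a sketch: it assumes bash (for export -f) and that shasum is installed, and it leaves a native sha256sum untouched when one exists.

```shell
# Define the fallback only when no sha256sum is already available.
if ! command -v sha256sum >/dev/null 2>&1; then
  sha256sum() { shasum -a 256 "$@"; }
  export -f sha256sum
fi

# Sanity check: print the SHA-256 digest of the string "abc".
printf 'abc' | sha256sum | cut -d' ' -f1
```

The digest printed for "abc" should be the same whether it comes from the native tool or from the shasum fallback.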
We have included a script that downloads all datasets, runs the CFSAN SNP Pipeline, infers a phylogeny, and compares the tree against the suggested tree. All example commands are present in the shell script for your manual inspection.
$ bash EXAMPLES/downloadAll.sh
To run GenFSGopher.pl, you need a dataset in tsv format. Here is the usage statement:
Usage: GenFSGopher.pl -o outdir spreadsheet.dataset.tsv
PARAM DEFAULT DESCRIPTION
--outdir <req'd> The output directory
--format tsv The input format. Default: tsv. No other format
is accepted at this time.
--layout onedir onedir - Everything goes into one directory
byrun - Each genome run gets its separate directory
byformat - Fastq files to one dir, assembly to another, etc
cfsan - Reference and samples in separate directories with
each sample in a separate subdirectory
--shuffled <NONE> Output the reads as interleaved instead of individual
forward and reverse files.
--norun <NONE> Do not run anything; just create a Makefile.
--numcpus 1 How many jobs to run at once. Be careful of disk I/O.
--citation Print the recommended citation for this script and exit
--version Print the version and exit
--help Print the usage statement and die
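As an illustration of how the options above combine, the sketch below builds a command line that uses the cfsan layout, allows four jobs at once, and only creates the Makefile. The file name my.dataset.tsv and the directory my.dataset.out are hypothetical placeholders; substitute your own.

```shell
# Hypothetical input spreadsheet and output directory.
dataset="my.dataset.tsv"
outdir="my.dataset.out"

# Assemble the command from options listed in the usage statement.
cmd=(GenFSGopher.pl --outdir "$outdir" --layout cfsan --numcpus 4 --norun "$dataset")

# Run it only if the script is on the PATH and the input file exists;
# otherwise just show what would be run.
if command -v GenFSGopher.pl >/dev/null 2>&1 && [ -f "$dataset" ]; then
  "${cmd[@]}"
else
  echo "Would run: ${cmd[*]}"
fi
```

Drop --norun once you are happy with the setup and want the downloads to start.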
There is a field, intendedUse, which suggests how a particular dataset might be used. For example, Epi-validated outbreak datasets might be used with a SNP-based or MLST-based workflow. As the number of different values for intendedUse increases, other use-cases will be available. Otherwise, how you use a dataset is up to you!
To create your own dataset and to make it compatible with the existing script(s) here, please follow these instructions. These instructions are subject to change.
- Create a new Excel spreadsheet with only one tab. Please delete any extraneous tabs to avoid confusion.
- The first part describes the dataset. This is given as a two-column key/value format. The keys are case-insensitive, but the values are case-sensitive. The order of rows is unimportant.
- Organism. Usually genus and species, but there is no hard rule at this time.
- Outbreak. This is usually an outbreak code but can be some other descriptor of the dataset.
- pmid. Any publications associated with this dataset should be listed as PubMed IDs.
- tree. This is a URL to the Newick-formatted tree. This tree serves as a guide to future analyses.
- source. Where did this dataset come from?
- intendedUse. How do you think others will use this dataset?
- Blank row - separates the two parts of the dataset
- Header row with these names (case-insensitive): biosample_acc, strain, genbankAssembly, SRArun_acc, outbreak, dataSetName, suggestedReference, sha256sumAssembly, sha256sumRead1, sha256sumRead2
- Accessions to the genomes for download. Each row represents a genome and must have the following fields. Use a dash (-) for any missing data.
- biosample_acc - The BioSample accession
- strain - Its genome name
- genbankAssembly - GenBank accession number
- SRArun_acc - SRR accession number
- outbreak - The name of the outbreak clade. Usually named after an outbreak code. If the genome is not part of an important clade, fill in the field with 'outgroup'.
- dataSetName - This should match the Outbreak value in the first part of the spreadsheet.
- suggestedReference - The suggested reference genome for analysis, e.g., SNP analysis.
- sha256sumAssembly - A checksum for the GenBank file
- sha256sumRead1 - A checksum for the first read from the SRR accession
- sha256sumRead2 - A checksum for the second read from the SRR accession
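To make the two-part layout concrete, here is a sketch that writes a skeleton file directly as tab-separated text. Every value is a placeholder (the file name, the organism, the example.org URL, the strain name), and whether GenFSGopher.pl accepts such placeholder values has not been checked; treat it as a template to fill in, not as a working dataset.

```shell
out="example.dataset.tsv"   # hypothetical file name

{
  # Part 1: two-column key/value description (placeholder values).
  printf '%s\t%s\n' Organism    "Genus species"
  printf '%s\t%s\n' Outbreak    "example-outbreak"
  printf '%s\t%s\n' pmid        "PLACEHOLDER"
  printf '%s\t%s\n' tree        "http://example.org/PLACEHOLDER.dnd"
  printf '%s\t%s\n' source      "PLACEHOLDER"
  printf '%s\t%s\n' intendedUse "PLACEHOLDER"

  # Blank row separating the two parts.
  printf '\n'

  # Part 2: header row with the ten required column names.
  printf '%s\t' biosample_acc strain genbankAssembly SRArun_acc outbreak \
                dataSetName suggestedReference sha256sumAssembly sha256sumRead1
  printf '%s\n' sha256sumRead2

  # One genome row; a dash marks missing data.
  printf '%s\t' - example_strain - - example-outbreak example-outbreak - - -
  printf '%s\n' -
} > "$out"

# The header and each genome row should have exactly ten tab-separated fields.
awk -F'\t' 'NF == 10' "$out" | wc -l
```

Real checksum values for the last three columns can be produced with sha256sum on the downloaded assembly and read files.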