Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Salmon index #211

Closed
wants to merge 10 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,10 @@

## Version 1.4dev

#### Pipeline updates

* Add Salmon index for transcriptome

#### Dependency Updates
* Added htseq=0.11.2

Expand Down
4 changes: 4 additions & 0 deletions conf/base.config
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,10 @@ process {
memory = { check_max( 200.GB * task.attempt, 'memory' ) }
time = { check_max( 5.h * task.attempt, 'time' ) }
}
withName: salmon_index {
cpus = { check_max( 8, 'cpus' ) }
memory = { check_max( 16.GB * task.attempt, 'memory' ) }
}
withLabel: low_memory {
memory = { check_max( 16.GB * task.attempt, 'memory' ) }
}
Expand Down
1 change: 1 addition & 0 deletions conf/test.config
Original file line number Diff line number Diff line change
Expand Up @@ -26,4 +26,5 @@ params {
// Genome references
fasta = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/genome.fa'
gtf = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/genes.gtf'
transcriptome = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/transcriptome.fasta'
}
27 changes: 27 additions & 0 deletions docs/output.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ and processes data using the following steps:
* [dupRadar](#dupradar) - technical / biological read duplication
* [Preseq](#preseq) - library complexity
* [featureCounts](#featurecounts) - gene counts, biotype counts, rRNA estimation.
* [Salmon](#salmon) - gene counts, biotype counts, rRNA estimation.
* [StringTie](#stringtie) - FPKMs for genes and transcripts
* [Sample_correlation](#Sample_correlation) - create MDS plot and sample pairwise distance heatmap / dendrogram
* [MultiQC](#multiqc) - aggregate report, describing results of the whole pipeline
Expand Down Expand Up @@ -301,6 +302,32 @@ We also use featureCounts to count overlaps with different classes of features.

**Output directory: `results/featureCounts`**

* `Sample.bam_biotype_counts.txt`
* Read counts for the different gene biotypes that featureCounts distinguishes.
* `Sample.featureCounts.txt`
* Read the counts for each gene provided in the reference `gtf` file
* `Sample.featureCounts.txt.summary`
* Summary file, containing statistics about the counts

## Salmon
[Salmon](https://salmon.readthedocs.io/en/latest/salmon.html) from [Ocean Genomics](https://oceangenomics.com/) quasi-maps reads to a transcriptome and counts gene expression per transcript.

### Salmon Index

**Output directory: `results/reference_transcriptome`**

* `Sample.bam_biotype_counts.txt`
* Read counts for the different gene biotypes that featureCounts distinguishes.
* `Sample.featureCounts.txt`
* Read the counts for each gene provided in the reference `gtf` file
* `Sample.featureCounts.txt.summary`
* Summary file, containing statistics about the counts


### Salmon quant

**Output directory: `results/salmon`**

* `Sample.bam_biotype_counts.txt`
* Read counts for the different gene biotypes that featureCounts distinguishes.
* `Sample.featureCounts.txt`
Expand Down
11 changes: 11 additions & 0 deletions docs/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,8 @@
* [FeatureCounts Extra Gene Names](#featurecounts-extra-gene-names)
* [Default Attribute Type](#default-attribute-type)
* [Extra Gene Names](#extra-gene-names)
* [Transcriptome mapping with Salmon](#transcriptome-mapping-with-salmon)
* [Indexing the transcriptome](#indexing-the-transcriptome)
* [Alignment tool](#alignment-tool)
* [Reference genomes](#reference-genomes)
* [`--genome` (using iGenomes)](#--genome-using-igenomes)
Expand Down Expand Up @@ -196,6 +198,15 @@ Note that you can also specify more than one desired value, separated by a comma
``--fcExtraAttributes gene_id,...``


## Transcriptome mapping with Salmon

If the option `--transcriptome` is provided to a fasta file of cDNA sequences, the pipeline will also run transcriptome quantification using [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html).

### Indexing the transcriptome

The transcriptome is indexed using the default parameters of Salmon, using the default k-mer size of 31. As [discussed](https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode), the a k-mer size off 31 works well with reads that are length 75bp or longer.


## Alignment tool
By default, the pipeline uses [STAR](https://github.com/alexdobin/STAR) to align the raw FastQ reads to the reference genome. STAR is fast and common, but requires a lot of memory to run, typically around 38GB for the Human GRCh37 reference genome.

Expand Down
61 changes: 61 additions & 0 deletions main.nf
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,7 @@ def helpMessage() {
--star_index Path to STAR index
--hisat2_index Path to HiSAT2 index
--fasta Path to Fasta reference
--transcriptome Path to Fasta transcriptome
--gtf Path to GTF file
--gff Path to GFF3 file
--bed12 Path to bed12 file
Expand Down Expand Up @@ -168,6 +169,15 @@ if( params.gtf ){
exit 1, "No GTF or GFF3 annotation specified!"
}

if (params.transcriptome){
Channel
.fromPath(params.transcriptome)
.ifEmpty { exit 1, "Transcript fasta file is unreachable: ${params.transcriptome}" }
.set { tx_fasta_ch }
}



if( params.bed12 ){
bed12 = Channel
.fromPath(params.bed12)
Expand Down Expand Up @@ -251,6 +261,7 @@ if(params.aligner == 'star'){
else if(params.fasta) summary['Fasta Ref'] = params.fasta
if(params.splicesites) summary['Splice Sites'] = params.splicesites
}
if(params.transcriptome) summary['Transcriptome'] = params.transcriptome
if(params.gtf) summary['GTF Annotation'] = params.gtf
if(params.gff) summary['GFF3 Annotation'] = params.gff
if(params.bed12) summary['BED Annotation'] = params.bed12
Expand Down Expand Up @@ -431,6 +442,31 @@ if(params.aligner == 'hisat2' && !params.hisat2_index && params.fasta){
"""
}
}


/*
* PREPROCESSING - Create Salmon transcriptome index
*/
if(params.transcriptome){
process salmon_index {
tag "$transcriptome.simpleName"
publishDir path: { "${params.outdir}/reference_transcriptome" },
mode: 'copy'

input:
file transcriptome from tx_fasta_ch

output:
file 'salmon_index' into index_ch

script:
"""
salmon index --threads $task.cpus -t $transcriptome -i salmon_index
"""
}
}


/*
* PREPROCESSING - Convert GFF3 to GTF
*/
Expand Down Expand Up @@ -968,6 +1004,31 @@ process dupradar {
"""
}

if (params.transcriptome){
process salmon_quant {
tag "$sample"
publishDir "${params.outdir}/Salmon", mode: 'copy'

input:
file index from index_ch.collect()
set sample, file(reads) from raw_salmon

output:
file(sample) into quant_ch

script:
if (params.singleEnd){
"""
salmon quant --validateMappings --threads $task.cpus --libType=A -i $index -r ${reads[0]} -o $sample
"""
}else{
"""
salmon quant --validateMappings --threads $task.cpus --libType=A -i $index -1 ${reads[0]} -2 ${reads[1]} -o $sample
"""
}
}
}


/*
* STEP 9 - Feature counts
Expand Down
3 changes: 2 additions & 1 deletion nextflow.config
Original file line number Diff line number Diff line change
Expand Up @@ -18,14 +18,15 @@ params {
fcExtraAttributes = 'gene_name'
fcGroupFeatures = 'gene_id'
fcGroupFeaturesType = 'gene_biotype'
markdup_java_options = '"-Xms4000m -Xmx7g"' //Established values for markDuplicate memory consumption, see issue PR #689 (in Sarek) for details
markdup_java_options = '"-Xms4000m -Xmx7g"' //Established values for markDuplicate memory consumption, see issue PR #689 (in Sarek) for details
splicesites = false
saveReference = false
saveTrimmed = false
saveAlignedIntermediates = false
singleEnd = false
reads = "data/*{1,2}.fastq.gz"
outdir = './results'
transcriptome = false
seqCenter = false
skip_qc = false
skip_fastqc = false
Expand Down