Skip to content

Commit

Permalink
Merge pull request #22 from Juke34/20
Browse files Browse the repository at this point in the history
#20 provide statistics information in readme
  • Loading branch information
Juke34 authored Jan 27, 2025
2 parents 6af36e4 + 7685a88 commit 7c28fb6
Show file tree
Hide file tree
Showing 11 changed files with 138 additions and 31 deletions.
85 changes: 85 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,8 @@ AliNe is a pipeline written in Nextflow that aims to efficiently align reads aga
* [Usage and test](#usage)
* [Parameters](#parameters)
* [Output](#output)
* [Structure](#structure)
* [Statistics](#statistics)
* [Contributing](#contributing)

## Foreword
Expand Down Expand Up @@ -353,6 +355,9 @@ On success you should get a message looking like this:
## Output
### Structure
Here the description of typical ouput you will get from AliNe:
```
Expand Down Expand Up @@ -404,7 +409,87 @@ Here the description of typical ouput you will get from AliNe:
└── MultiQC # MultiQC folder that aggregate results across many samples into a single report
├── multiqc_report.html # Report with interactive plots for statistics across many samples.
└── multiqc_report_data # Plot and data used by the multiqc_report.html
```
### Statistics
In order to compare several aligner output you should activate the `--fastqc` parameter. AliNe will run the FastQC program for each output file, thereafter all FastQC file will be gathered in an html file using MultiQC. The resulting file called `multiqc_report.html` can be found in <outdir>/MultiQC ( <outdir> by default is called `alignment_results` and can be setup using `--outdir` AliNe parameter).
FastQC collect the following information:
* Sequence Counts
* Sequence Quality
* Per Sequence Quality Scores
* Per Base Sequence Content
* Per Sequence GC Content
* Per Base N Content
* Sequence Length Distribution
* Sequence Duplication Levels
* Overrepresented sequences
* Adapter Content
#### General Statistics
Sequence Duplication Levels, Per Sequence GC Content and Sequence Counts are reported at the top of the `multiqc_report.html` file in a table called `General Statistics` as % Dups, %GC a,d M Seqs accordingly (see below).
<img src="img/multiqc.png" />
In order to facilitate the reading of this `General Statistics` you can export the table in tsv using the `Export as CSV...` button and execute the following piece of R code on the downloaded `general_stats_table.tsv` file :
```R
# install packages
install.packages("dplyr")
install.packages("stringr")
install.packages("tidyr")
# Load necessary libraries
library(dplyr)
library(stringr)
library(tidyr)
# Read the TSV file
file_path <- "general_stats_table.tsv"
df <- read.delim(file_path, check.names = FALSE)
# sample name as row name
rownames(df) <- df$Sample
# remove Sample column and clean up the column names
tmp <- cbind(ID = rownames(df), stack(df[-1])) |>
transform(ind = as.character(ind) |> stringr::str_remove_all("\\.\\d+"))
# remove na values
tmp <- tmp[!is.na(tmp$values),]
# remove . values
tmp$values <- tmp$values |> stringr::str_remove_all("^\\.$")
# pivot data
tmp <- tmp |> pivot_wider(id_cols = ID , names_from = ind, values_from = values,
values_fn = \(x) paste(unique(x), collapse = ""))
# print only 4 first columns
tmp[1:4]
```
You will get a table similar to this one:
```
ID Dups GC Seqs
<chr> <chr> <chr> <chr>
1 yeast_R1 73.01 43 0.01
2 yeast_R1_seqkit_trim 69.21661409043114 44 0.007607999999999999
3 yeast_R1_seqkit_trim_STAR_sorted 69.54090258811289 44 0.007689
4 yeast_R1_seqkit_trim_bbmap_sorted 69.21661409043114 44 0.007607999999999999
5 yeast_R1_seqkit_trim_bowtie2_sorted 69.21661409043114 44 0.007607999999999999
6 yeast_R1_seqkit_trim_bwaaln_sorted 69.21661409043114 44 0.007607999999999999
7 yeast_R1_seqkit_trim_bwamem_sorted 69.18933123111286 44 0.007611
8 yeast_R1_seqkit_trim_bwasw_sorted 69.23279033105622 44 0.007612
9 yeast_R1_seqkit_trim_graphmap2_sorted 69.21661409043114 44 0.007607999999999999
10 yeast_R1_seqkit_trim_hisat2_sorted 69.21661409043114 44 0.007607999999999999
11 yeast_R1_seqkit_trim_kallisto_sorted 69.21661409043114 44 0.007607999999999999
12 yeast_R1_seqkit_trim_minimap2_sorted 69.63058976020739 44 0.007715
13 yeast_R1_seqkit_trim_nucmer.fixed_sorted 64.70588235294117 42 0.00016999999999999999
14 yeast_R1_seqkit_trim_sublong 53.23343848580442 44 0.007607999999999999
15 yeast_R1_seqkit_trim_subread 69.21661409043114 44 0.007607999999999999
```
Expand Down
3 changes: 2 additions & 1 deletion aline.nf
Original file line number Diff line number Diff line change
Expand Up @@ -513,7 +513,7 @@ workflow align {
// ------------------- QC -----------------
if(params.fastqc){
fastqc_raw(raw_reads, "fastqc/raw", "raw")
logs.mix(fastqc_raw.out)
logs.concat(fastqc_raw.out).set{logs} // save log
}

// ------------------- Standardize Score to Be Phred+33 ----------------
Expand Down Expand Up @@ -1092,4 +1092,5 @@ In aline.nf
Think to convert sam to bam if necessary
Think to sort bam output if necessary
Add info in README.md
Add tool in multiqc config file (at least for fastqc)
*/
16 changes: 13 additions & 3 deletions config/multiqc_conf.yml
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
title: "Nextflow Alignment pipeline"
title: "AliNe - Nextflow Alignment pipeline (https://github.com/Juke34/AliNe)"

run_modules:
- fastqc
Expand Down Expand Up @@ -45,6 +45,10 @@ module_order:
name: "FastQC (graphmap2)"
path_filters:
- "*graphmap2_logs*"
- fastqc:
name: "FastQC (hisat)"
path_filters:
- "*hisat_logs*"
- hisat2
- fastqc:
name: "FastQC (hisat2)"
Expand Down Expand Up @@ -73,6 +77,12 @@ module_order:
- "*sublong_logs*"
- star
- fastqc:
name: "FastQC (star)"
name: "FastQC (STAR)"
path_filters:
- "*star_logs*"
- "*STAR_logs*"
- fastqc:
name: "FastQC (STARlong)"
path_filters:
- "*star_logs*"
- "*starlong_logs*"
- "*STARlong_logs*"
3 changes: 2 additions & 1 deletion config/softwares.config
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,8 @@ process {
container = 'quay.io/biocontainers/minimap2:2.26--he4a0461_1'
}
withName: 'multiqc' {
container = 'quay.io/biocontainers/multiqc:1.14--pyhdfd78af_0'
//container = 'quay.io/biocontainers/multiqc:1.14--pyhdfd78af_0'
container = 'quay.io/biocontainers/multiqc:1.27--pyhdfd78af_0'
}
withLabel: 'mummer4' {
container = 'quay.io/biocontainers/mummer4:4.0.0rc1--pl5321hdbdd923_6'
Expand Down
Binary file added img/multiqc.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
2 changes: 1 addition & 1 deletion modules/bbmap.nf
Original file line number Diff line number Diff line change
Expand Up @@ -83,7 +83,7 @@ process bbmap {
}

// set fileName
def fileName = fastq[0].baseName.replace('.fastq','')
def fileName = fastq[0].baseName.replace('.fastq','') + "_bbmap"
"""
${tool} \\
ref=${genome_index} \\
Expand Down
13 changes: 8 additions & 5 deletions modules/bowtie2.nf
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ process bowtie2 {

output:
tuple val(sample), path ("*.sam"), emit: tuple_sample_sam
path "*bowtie2.log", emit: bowtie2_summary
path "*.log", emit: bowtie2_summary

script:

Expand All @@ -48,22 +48,25 @@ process bowtie2 {
read_orientation = "--ff"
}
}

// catch filename
filename = reads[0].baseName.replace('.fastq','') + "_bowtie2"

if (params.read_type == "short_paired"){
"""
bowtie2 ${params.bowtie2_options} ${read_orientation}\\
-p ${task.cpus} \\
-x ${genome.baseName} \\
-S ${reads[0].baseName.replace('.fastq','')}_bowtie2.sam \\
-1 ${reads[0]} -2 ${reads[1]} 2> ${reads[0].baseName.replace('.fastq','')}_bowtie2.log
-S ${filename}.sam \\
-1 ${reads[0]} -2 ${reads[1]} 2> ${filename}_sorted.log
"""
} else {
"""
bowtie2 ${params.bowtie2_options} ${read_orientation}\\
-p ${task.cpus} \\
-x ${genome.baseName} \\
-S ${reads.baseName.replace('.fastq','')}_bowtie2.sam \\
-U ${reads} 2> ${reads.baseName}_bowtie2.log
-S ${filename}.sam \\
-U ${reads} 2> ${filename}_sorted.log
"""
}

Expand Down
14 changes: 7 additions & 7 deletions modules/hisat2.nf
Original file line number Diff line number Diff line change
Expand Up @@ -34,16 +34,16 @@ process hisat2 {

output:
tuple val(sample), path ("*.sam"), emit: tuple_sample_sam
path "${sample}_splicesite.txt"
path "*hisat2-summary.txt", emit: hisat2_summary
path "*_splicesite.txt"
path "*_summary.txt", emit: hisat2_summary

script:

// catch index basename
index_basename = hisat2_index_files[0].toString() - ~/.\d.ht2l?/

// catch filename
filename = reads[0].baseName.replace('.fastq','')
filename = reads[0].baseName.replace('.fastq','') + "_hisat2"

// deal with library type - default is unstranded.
def lib_strand=""
Expand Down Expand Up @@ -82,14 +82,14 @@ process hisat2 {

if (params.read_type == "short_paired"){
"""
hisat2 ${lib_strand} ${read_orientation} ${params.hisat2_options} --novel-splicesite-outfile ${sample}_splicesite.txt \\
--new-summary --summary-file ${sample}.hisat2-summary.txt \\
hisat2 ${lib_strand} ${read_orientation} ${params.hisat2_options} --novel-splicesite-outfile ${filename}_splicesite.txt \\
--new-summary --summary-file ${filename}_sorted_summary.txt \\
-p ${task.cpus} -x $index_basename -1 ${reads[0]} -2 ${reads[1]} > ${filename}.sam
"""
} else {
"""
hisat2 ${lib_strand} ${read_orientation} ${params.hisat2_options} --novel-splicesite-outfile ${sample}_splicesite.txt \\
--new-summary --summary-file ${sample}.hisat2-summary.txt \\
hisat2 ${lib_strand} ${read_orientation} ${params.hisat2_options} --novel-splicesite-outfile ${filename}_splicesite.txt \\
--new-summary --summary-file ${filename}_sorted_summary.txt \\
-p ${task.cpus} -x $index_basename -U $reads > ${filename}.sam
"""
}
Expand Down
21 changes: 14 additions & 7 deletions modules/kallisto.nf
Original file line number Diff line number Diff line change
Expand Up @@ -26,21 +26,21 @@ process kallisto_index {
process kallisto {
label 'kallisto'
tag "$sample"
publishDir "${params.outdir}/${outpath}", pattern: "*.bam", mode: 'copy'
publishDir "${params.outdir}/${outpath}", pattern: "${filename}/*.bam", mode: 'copy'

input:
tuple val(sample), path(reads), val(library), val(read_length)
path kallisto_index
val outpath

output:
tuple val(sample), path ("*.bam"), emit: tuple_sample_bam
path "*kallisto.log", emit: kallisto_summary
tuple val(sample), path ("${filename}/*.bam"), emit: tuple_sample_bam
path "*.log", emit: kallisto_summary

script:

// catch filename
filename = reads[0].baseName.replace('.fastq','')
filename = reads[0].baseName.replace('.fastq','') + "_kallisto_sorted"

// deal with library type
def read_orientation=""
Expand All @@ -61,7 +61,11 @@ process kallisto {
-t ${task.cpus} \
--pseudobam \
-i ${kallisto_index} \
${reads[0]} ${reads[1]} -o ${filename}.bam 2> ${filename}_kallisto.log
${reads[0]} ${reads[1]} -o ${filename} 2> ${filename}.log
mv ${filename}/pseudoalignments.bam ${filename}/${filename}.bam
# in order that the log file contains the name of the output fastq files (MultiQC)
sed -i 's/${reads[0]}/${filename}.fastq.gz/' ${filename}.log
"""
} else {

Expand All @@ -83,8 +87,11 @@ process kallisto {
--pseudobam \
-i ${kallisto_index} \
--single \
${reads} -o ${filename}.bam 2> ${filename}_kallisto.log
${reads} -o ${filename} 2> ${filename}.log
mv ${filename}/pseudoalignments.bam ${filename}/${filename}.bam
# in order that the log file contains the name of the output fastq files (MultiQC)
sed -i 's/${reads[0]}/${filename}.fastq.gz/' ${filename}.log
"""
}

}
8 changes: 4 additions & 4 deletions modules/star.nf
Original file line number Diff line number Diff line change
Expand Up @@ -107,7 +107,7 @@ process star {
--readFilesIn pipedRead1 pipedRead2 \\
--runThreadN ${task.cpus} \\
--runMode alignReads \\
--outFileNamePrefix ${reads[0].baseName.replace('.fastq','')} \\
--outFileNamePrefix ${reads[0].baseName.replace('.fastq','')}_${star_tool}_sorted \\
--outSAMunmapped Within \\
--outSAMtype BAM SortedByCoordinate
"""
Expand All @@ -121,7 +121,7 @@ process star {
--readFilesIn pipedRead \\
--runThreadN ${task.cpus} \\
--runMode alignReads \\
--outFileNamePrefix ${reads.baseName.replace('.fastq','')} \\
--outFileNamePrefix ${reads.baseName.replace('.fastq','')}_${star_tool}_sorted \\
--outSAMunmapped Within \\
--outSAMtype BAM SortedByCoordinate
"""
Expand Down Expand Up @@ -172,7 +172,7 @@ process star2pass{
--readFilesIn pipedRead1 pipedRead2 \\
--runThreadN ${task.cpus} \\
--runMode alignReads \\
--outFileNamePrefix ${reads.baseName.replace('.fastq','')}_2pass \\
--outFileNamePrefix ${reads.baseName.replace('.fastq','')}_${star_tool}_2pass_sorted \\
--outSAMunmapped Within \\
--outSAMtype BAM SortedByCoordinate \\
--sjdbFileChrStartEnd *SJ.out.tab
Expand All @@ -195,7 +195,7 @@ process star2pass{
--readFilesIn pipedRead \\
--runThreadN ${task.cpus} \\
--runMode alignReads \\
--outFileNamePrefix ${reads[0].baseName.replace('.fastq','')}_2pass \\
--outFileNamePrefix ${reads[0].baseName.replace('.fastq','')}_${star_tool}_2pass_sorted \\
--outSAMunmapped Within \\
--outSAMtype BAM SortedByCoordinate \\
--sjdbFileChrStartEnd *SJ.out.tab
Expand Down
4 changes: 2 additions & 2 deletions modules/subread.nf
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,7 @@ process subread {
}

// remove fastq.gz
def fileName = fastq[0].baseName.replace('.fastq','')
def fileName = fastq[0].baseName.replace('.fastq','') + "_subread"

// prepare index name
def index_prefix = genome.baseName + "_index"
Expand Down Expand Up @@ -120,7 +120,7 @@ process sublong {
script:

// remove fastq.gz
def fileName = fastq[0].baseName.replace('.fastq','')
def fileName = fastq[0].baseName.replace('.fastq','') + "_sublong"

// prepare index name
def index_prefix = genome.baseName + "_index"
Expand Down

0 comments on commit 7c28fb6

Please sign in to comment.