- Installation
- Prerequisites
- Usage
- Informatic resources
- Examples
- Performance
- Parallel filesystems
- Algorithm
- FAQ
- References
Follow the Installation guidelines to compile and install mpiBWA.
As mpiBWA relies on the Message Passing Interface (MPI) standard, mpirun must be available to run the program. Several MPI implementations exist, such as MPICH, Open MPI or the Intel® MPI Library. The mpirun program must be available in your PATH.
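A quick way to check this (the version string printed depends on your MPI implementation):

command -v mpirun     # should print the path to mpirun
mpirun --version      # prints the MPI implementation and its version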
We have two versions of mpiBWA:
- the first version creates a unique SAM file and is available in the binary mpiBWA
- the second version sorts the output by the chromosomes present in the header and is available in the binary mpiBWAByChr
After installation, 3 binaries are created:
- mpiBWA
- mpiBWAByChr
- mpiBWAIdx
mpiBWAIdx is responsible for creating a binary image of the reference genome. This image is subsequently loaded into shared memory by mpiBWA. Every mpiBWA process on the same computing node then shares the same reference genome in order to save memory. mpiBWAIdx does not need MPI to run. To create a binary image from a reference genome, type:
mpiBWAIdx myReferenceGenome.fa
This creates the file myReferenceGenome.fa.map.
Importantly, mpiBWAIdx requires the following files to build the .map file: myReferenceGenome.fa.sa, myReferenceGenome.fa.bwt, myReferenceGenome.fa.ann, myReferenceGenome.fa.pac and myReferenceGenome.fa.amb. These files are generated by bwa index (see the bwa documentation). It also works with fasta files.
The .map file only needs to be generated once.
The genome.sh script provides an example of how to build the .map file.
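As a minimal sketch of the two steps (myReferenceGenome.fa is a placeholder file name):

bwa index myReferenceGenome.fa     # produces the .sa, .bwt, .ann, .pac and .amb files
mpiBWAIdx myReferenceGenome.fa     # produces myReferenceGenome.fa.map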
mpiBWA is executed with the mpirun program, for example:
mpirun -n 2 mpiBWA mem -t 8 -o ./HCC1187C.sam hg19.small.fa examples/data/HCC1187C_R1_10K.fastq examples/data/HCC1187C_R2_10K.fastq
This command launches 2 MPI processes, and each mpiBWA process creates 8 threads.
WARNING: do not write the .map extension for the reference. If the file is myReferenceGenome.fa.map, just provide myReferenceGenome.fa to mpiBWA on the command line.
The -n option passed to mpirun indicates the number of processes to run in parallel. The total number of cores required is the product of the values provided with the -n and -t options (in the previous example, 16 cores are required). For more details on how to choose the number of processes, see the Informatic resources section.
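For example, a minimal sizing sketch along these lines picks -n so that the product of -n and -t matches the cores you have reserved (NODES, CORES_PER_NODE and THREADS are assumptions to adapt to your own reservation):

NODES=2                         # number of nodes reserved (assumption)
CORES_PER_NODE=$(nproc)         # cores available on one node
THREADS=8                       # bwa threads per MPI process (-t)
RANKS=$(( NODES * CORES_PER_NODE / THREADS ))
echo "suggested: mpirun -n ${RANKS} mpiBWA mem -t ${THREADS} ...  # uses $(( RANKS * THREADS )) cores"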
mpiBWA requires several mandatory arguments:
- Input: a reference genome in .map format, and one or two fastq files (trimmed or not)
- Output: a sam file containing the entire reference header (e.g. HCC1187C.sam)
If you want to split the results of the alignment by chromosome, use mpiBWAByChr, for example:
mpirun -n 2 mpiBWAByChr mem -t 8 -o ./HCC1187C.sam hg19.small.fa examples/data/HCC1187C_R1_10K.fastq examples/data/HCC1187C_R2_10K.fastq
mpiBWAByChr requires several mandatory arguments:
- Input: a reference genome in .map format, and one or two fastq files (trimmed or not)
- Output: a sam file containing the entire reference header, sam files by chromosome, discordant.sam and unmapped.sam
- a reference genome in .map format (but do not write the '.map' extension for the reference, e.g. myReferenceGenome.fa)
- a fastq file for R1 (trimmed or not)
- a fastq file for R2 (trimmed or not) if the data come from paired-end sequencing (optional)
- bwa mem options can be passed from the command line (e.g. mpiBWA mem -t 8 -k 18)
- -f: add this option if you want to fix the mates during the mapping (optional)
- -b: add this option to write output files in BAM format
- -g: add this option to write output files in BGZF format
- -z: add this option to place the reference genome in a NUMA domain. Options are shared (default), socket, numa, l1, l2, l3
Example 1: mpirun -n 2 mpiBWAByChr mem -t 8 -b -f -o SAM REF FASTQ1 FASTQ2
The -f option adds the mate CIGAR and mate quality tags to each mate. This is equivalent to "samtools fixmate -m" and makes it possible to mark the duplicates with samtools markdup in the downstream analysis. The overhead is negligible as this part is multithreaded.
Example 2: mpirun -n 2 --map-by numa mpiBWA mem -t 8 -z numa -o SAM REF FASTQ1 FASTQ2
With -z numa, the reference genome is placed in the NUMA domain (NUMA node). Each MPI job is placed in a NUMA node (--map-by numa) and the reference genome is loaded in the memory of this domain.
NUMA domains are important for memory bandwidth optimization (see the memory paragraph in the Informatic resources section).
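As a hedged illustration of the downstream step enabled by -f (file names are placeholders; adapt the samtools calls to your own pipeline), duplicates can then be marked without re-running fixmate:

samtools sort -@ 8 -o HCC1187C.sorted.bam HCC1187C.sam            # coordinate-sort the mpiBWA output
samtools markdup -@ 8 HCC1187C.sorted.bam HCC1187C.markdup.bam    # relies on the mate tags added by -f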
In the case of mpiBWA:
- a sam file containing the entire reference header
In the case of mpiBWAByChr, additional files are provided:
- individual chrN.sam files with the reads aligned on each chrN (the chrN being the chromosome names from the header). Each chrN.sam contains the header for chrN and the reads mapping to that chromosome, with the primary and supplementary alignments of a read. Supplementary alignments that are discordant are not filtered out. The file can be sorted independently with mpiSORT, and during the sorting these supplementary alignments are filtered out into the discordant.gz file (see the mpiSORT documentation).
- the discordant.sam file contains primary chimeric alignments with their secondary alignments.
- the unmapped.sam file contains unmapped reads.
Note that:
- Supplementary or secondary alignments can be ignored with the -M bwa-mem option passed to mpiBWA. In all cases they are filtered out by mpiSORT.
- The discordant fragments are not extracted from the chromosome files, they are just copied. The purpose of the discordant fragment SAM file is to help the marking of duplicates in the downstream analysis: it is easier to mark them separately and pass the result back into the chromosome file.
The total memory used during the alignment is approximately the size of the .map file plus the size of the SAM chunks loaded by each bwa task. A bwa thread takes around 300 MB of memory.
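A rough per-node estimate based on these figures can be computed as follows (a sketch; the 300 MB per thread is the approximation given above, and RANKS_PER_NODE and THREADS are placeholders):

MAP_SIZE_MB=$(( $(stat -c %s myReferenceGenome.fa.map) / 1024 / 1024 ))   # size of the shared .map image
RANKS_PER_NODE=2    # mpiBWA processes per node (assumption)
THREADS=8           # bwa threads per process (-t)
echo "approx. memory per node: $(( MAP_SIZE_MB + RANKS_PER_NODE * THREADS * 300 )) MB"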
To optimize the access to the reference genome, you can put the reference in a NUMA domain. To get the number of NUMA domains, use lstopo.
For instance, on an AMD Milan architecture:
lstopo | grep NUMA
NUMANode L#0 (P#0 31GB)
NUMANode L#1 (P#1 31GB)
NUMANode L#2 (P#2 31GB)
NUMANode L#3 (P#3 31GB)
NUMANode L#4 (P#4 31GB)
NUMANode L#5 (P#5 31GB)
NUMANode L#6 (P#6 31GB)
NUMANode L#7 (P#7 31GB)
Each NUMANode has 16 associated cores.
We therefore have 8 NUMANodes of 31GB each on the same node, which is enough for the human genome reference. Place each MPI job in a NUMANode (mpirun -n 8 --map-by numa) and load the reference genome in the associated memory ($MPIBWA mem -t 16 -z numa -o $OUTPUT $REF $FASTQ1 $FASTQ2), as shown below.
NB: this feature has only been tested with Open MPI.
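Putting these two parts together (a sketch; $MPIBWA, $OUTPUT, $REF, $FASTQ1 and $FASTQ2 are placeholder variables):

mpirun -n 8 --map-by numa $MPIBWA mem -t 16 -z numa -o $OUTPUT $REF $FASTQ1 $FASTQ2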
The number of cores used is the number of ranks of the MPI job multiplied by the number of threads you request from bwa with the -t option. For example, the command line mpirun -n 4 mpiBWA mem -t 8 -o HCC1187C.sam hg19.small.fa examples/data/HCC1187C_R1_10K.fastq examples/data/HCC1187C_R2_10K.fastq
will use 32 cores (4 x 8).
This section provides some guidelines to benchmark mpiBWA and bwa on your infrastructure. It is intended to help you assess the best use case and configuration to efficiently benefit from multithreading with MPI, depending on your computing cluster. We strongly recommend that you read this section carefully before running mpiBWA on your cluster.
1 thread on 1 node:
bwa mem -t 1
[M::mem_process_seqs] Processed 40246 reads in 23.303 CPU sec, 23.367 real sec
8 threads on 1 node:
bwa mem -t 8
[M::mem_process_seqs] Processed 322302 reads in 199.261 CPU sec, 25.104 real sec
16 threads on 1 node:
bwa mem -t 16
[M::mem_process_seqs] Processed 644448 reads in 413.054 CPU sec, 26.000 real sec
Start with one node and one bwa thread:
1 thread on 1 node:
mpirun -n 1 mpiBWA mem -t 1
[M::mem_process_seqs] Processed 40224 reads in 25.779 CPU sec, 25.840 real sec
Now use several nodes and one bwa thread:
1 thread per node on 8 nodes:
mpirun -N 8 -npernode 1 -n 8 mpiBWA mem -t 1
[M::mem_process_seqs] Processed 40244 reads in 24.416 CPU sec, 24.475 real sec
So far, the walltime for both bwa and mpiBWA is pretty much the same (~25 sec) whatever the configuration.
mpiBWA MEM + multithreads
Now we go further with the parallelization and increase the number of bwa threads per MPI job.
10 threads on 1 node:
mpirun -N 1 -n 1 mpiBWA mem -t 10
[M::mem_process_seqs] Processed 402610 reads in 257.005 CPU sec, 25.803 real sec
20 threads on 1 node:
mpirun -N 1 -n 1 mpiBWA mem -t 20
[M::mem_process_seqs] Processed 804856 reads in 546.449 CPU sec, 27.477 real sec
mpiBWA MEM + multithreads + multi nodes
Now you can increase the number of nodes.
20 threads on 2 nodes:
mpirun -N 2 -npernode 1 -n 2 --bind-to socket mpiBWA mem -t 10
[M::mem_process_seqs] Processed 403144 reads in 260.081 CPU sec, 26.114 real sec
40 threads on 2 nodes:
mpirun -N 2 -npernode 1 -n 2 --bind-to socket mpiBWA mem -t 20
[M::mem_process_seqs] Processed 805198 reads in 549.086 CPU sec, 27.610 real sec
Still, the walltime is pretty much the same (~25 sec) whether you use 1 node or 2 nodes. You can then repeat the experiment with 3 or more nodes.
In this benchmark example, running mpiBWA with 10 threads per MPI process seems to be a good configuration.
We notice a small increase in walltime when we use all the cores of a node. Therefore, we recommend leaving some cores free for the system.
Explore the mpirun options, as we do with --bind-to socket; this can help remove contention such as NUMA effects.
Also test the setup with mpiBWAByChr.
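As a sketch of how this scaling experiment can be scripted (REF, FASTQ1 and FASTQ2 are placeholder variables, and it assumes the bwa log messages end up on stderr as with plain bwa):

# Re-run the same alignment with an increasing number of bwa threads per MPI process
# and extract the walltime reported by bwa mem from the logs.
for T in 1 10 20; do
    mpirun -N 1 -n 1 mpiBWA mem -t ${T} -o bench_t${T}.sam ${REF} ${FASTQ1} ${FASTQ2} 2> bench_t${T}.log
    grep mem_process_seqs bench_t${T}.log
done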
There are many ways to distribute and bind MPI jobs according to your architecture. We provide below several examples to launch MPI jobs in a standard manner or with a job scheduling system such as Slurm or PBS/Torque.
Two toy datasets (fastq files trimmed or not) are provided in the examples/data folder for testing the program.
mpirun can be launched in a standard manner, without using any job scheduling system. For example:
mpirun -n 4 mpiBWA mem -t 8 -o ${HOME}/mpiBWAExample/HCC1187C.sam ${HOME}/mpiBWAExample/hg19.small.fa examples/data/HCC1187C_R1_10K.fastq examples/data/HCC1187C_R2_10K.fastq
If needed, a file with the server names can be provided to mpirun with the -host option. We invite you to read the mpirun documentation for more details.
You can go to the examples directory and execute the standard.sh script to test the program.
In order to submit a job using Slurm, you can write a shell script as follows:
#! /bin/bash
#SBATCH -J MPIBWA_32_JOBS
#SBATCH -N 2 # Ask 2 nodes
#SBATCH -n 2 # total number of mpi jobs
#SBATCH -c 16 # use 16 cores per mpi job
#SBATCH --tasks-per-node=1 # Ask 1 mpi jobs per node
#SBATCH --mem-per-cpu=${MEM} # See the memory paragraph in the Informatic resources section
#SBATCH -t 01:00:00
#SBATCH -o STDOUT_FILE.%j.o
#SBATCH -e STDERR_FILE.%j.e
mpirun mpiBWA mem -t 16 -o ${HOME}/mpiBWAExample/HCC1187C.sam ${HOME}/mpiBWAExample/hg19.small.fa examples/data/HCC1187C_R1_10K.fastq examples/data/HCC1187C_R2_10K.fastq
You can go to the examples directory and submit the job with sbatch using the slurm.sh script to test the program.
In order to submit a job using PBS/Torque, you can write a shell script as follows:
#! /bin/bash
#PBS -N MPIBWA_32_JOBS
#PBS -l nodes=2:ppn=16:mem=${MEM} # Ask for 2 nodes and 16 cores per node
#PBS -l walltime=24:00:00
#PBS -o STDOUT_FILE.%j.o
#PBS -e STDERR_FILE.%j.e
mpirun -n 2 mpiBWA mem -t 16 -o ${HOME}/mpiBWAExample/HCC1187C.sam ${HOME}/mpiBWAExample/hg19.small.fa examples/data/HCC1187C_R1_10K.fastq examples/data/HCC1187C_R2_10K.fastq
You can go to the examples directory and submit the job with the qsub command using the pbs.sh script to test the program.
This software needs a parallel filesystem for execution. The program has been tested with Lustre and BeeGFS.
WARNING: be aware that the flock mode (i.e. file locking) on parallel filesystems (Lustre, BeeGFS, etc.) must be enabled, otherwise the reproducibility is not guaranteed.
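On Lustre, for instance, file locking appears as a mount option, so a quick sanity check is to look for flock in the mount flags (a sketch; on other parallel filesystems such as BeeGFS, locking is enabled in the filesystem configuration instead):

mount -t lustre | grep -q flock && echo "flock enabled" || echo "flock NOT enabled"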
The algorithm consists of 3 parts:
- MPI jobs create chunks of reads. All these chunks contain the same number of nucleotides.
- MPI jobs call the aligner jobs. This part invokes the BWA-MEM algorithm.
- MPI jobs write the alignment results in a SAM file or in individual chromosome SAM files. This part uses shared file pointers.
Because of DEFLATE, gunzip cannot be parallelized and is therefore a bottleneck.
One option is to use bzip2 for a better compression of the fastq files (on average 20% better) and pbzip2, mpibzip2 or unpigz to stage in the fastq files in parallel.
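For example, a staging step of this kind decompresses bzip2-compressed fastq files in parallel to a local scratch directory before running mpiBWA (a sketch; the file names, thread count and scratch path are placeholders):

pbzip2 -d -c -p16 sample_R1.fastq.bz2 > /scratch/sample_R1.fastq   # parallel bzip2 decompression to stdout
pbzip2 -d -c -p16 sample_R2.fastq.bz2 > /scratch/sample_R2.fastq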
Here is an example tested on a 2-socket AMD EPYC with 8 NUMA nodes of 16 cores each:
#MSUB -r MPI.MPIBWA.1NODE.128CPU
#MSUB -N 1
#MSUB -n 8
#MSUB --tasks-per-node=8
#MSUB -c 16
#MSUB -t 6000
#MSUB -x
#MSUB -o test.HG002.HiSeq30x.mpi.bwa.%I.o
#MSUB -e test.HG002.HiSeq30x.mpi.bwa.%I.e
mpirun --map-by 'numa:PE=16' --bind-to core numactl -N 0,1,2,3,4,5,6,7 -l mpiBWA mem -t 16 -f -z numa -Y -K 100000000 -o HG002.HiSeq30x hg19.fasta HG002.HiSeq30x.subsampled.R1.fastq HG002.HiSeq30x.subsampled.R2.fastq
This work is based on the original bwa aligner (Burrows-Wheeler Aligner for short-read alignment) written by Li et al.:
- Li H. and Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 2010 (pdf).
- Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997v1, 2013 (pdf).
Shared memory with MPI:
- Latham R. et al. Implementing MPI-IO Atomic Mode and Shared File Pointers Using MPI One-Sided Communication, 2007.