Skip to content

An example of real assembly

Dinghua Li edited this page Aug 10, 2015 · 6 revisions

We use a human gut dataset (SRR341725) in this tutorial.

1. Install MEGAHIT

# clone MEGAHIT to the folder "megahit"
git clone https://github.com/voutcn/megahit.git
# compile MEGAHIT
cd megahit
make -j
# test MEGAHIT
make test
# go back to the original directory 
cd ..

2. Download the data

wget http://www.bio8.cs.hku.hk/dataset/MEGAHIT/SRR341725_1.fastq.gz
wget http://www.bio8.cs.hku.hk/dataset/MEGAHIT/SRR341725_2.fastq.gz

3. Assemble

megahit/megahit -1 SRR341725_1.fastq.gz -2 SRR341725_2.fastq.gz -o SRR341725.megahit_asm

When the assembly finishes, you will see a screen message like:

--- [Wed Aug 5 15:27:33 2015] Merging to output final contigs ---

--- [STAT] 70771 contigs, total 117990591 bp, min 200 bp, max 217183 bp, avg 1667 bp, N50 3897 bp

--- [Wed Aug 5 15:27:35 2015] ALL DONE. Time elapsed: 1992.627266 seconds ---

Assembled contigs are located in SRR341725.megahit_asm/final.contigs.fa.

Tips:

  • Single-end reads could be specified by -r
  • Interleaved PE reads could be specified by --12
  • -r, --12, -1 and -2 could be specified for multiple times to assemble multiple libraries; be careful that files in -1 and -2 should be paired and in order

4. Calculate contig coverage and extract unassembled reads

Here we only demonstrate how to calculate the read coverage per contig and extract unassembled reads using BBMap and samtools. You may pass the contigs your own subsequent analyses.

4.1 Download & install BBMap and samtools:

# BBmap
wget http://downloads.sourceforge.net/project/bbmap/BBMap_35.24.tar.gz
tar zvxf BBMap_35.24.tar.gz
# samtools
wget https://github.com/samtools/samtools/releases/download/1.2/samtools-1.2.tar.bz2
tar xvjf samtools-1.2.tar.bz2
cd samtools-1.2 && make -j && cd ..

4.2 Align reads with bbwrap.sh (which accepts multiple read files):

bbmap/bbwrap.sh ref=SRR341725.megahit_asm/final.contigs.fa in=SRR341725_1.fastq.gz in2=SRR341725_2.fastq.gz out=aln.sam.gz

4.3 Output per contig coverage to cov.txt with pileup.sh:

bbmap/pileup.sh in=aln.sam.gz out=cov.txt

4.4 Extract unmapped reads (SE to unmapped.se.fq and PE to unmapped.pe.fq):

samtools-1.2/samtools view -u -f4 aln.sam.gz | samtools-1.2/samtools bam2fq -s unmapped.se.fq - > unmapped.pe.fq
Clone this wiki locally