Tutorial (medium)

Ryan Wick edited this page Sep 14, 2022 · 19 revisions

Welcome to the MEDIUM version of the tutorial. Here you will be given:

  • Moderately detailed instructions on what to do.
  • Goals for each step in the process.
  • Expected results after each step.
  • Tips and guidelines along the way.

Required files

If you haven't already, download the sample data, a hybrid Illumina+ONT read set to assemble:

  • Paired-end Illumina reads in FASTQ format:
    • S_aureus_JKD6159_Illumina_1.fastq.gz: 3.4 million reads, 499 Mbp
    • S_aureus_JKD6159_Illumina_2.fastq.gz: 3.4 million reads, 499 Mbp
  • Basecalled ONT R10.4 reads in FASTQ format:
    • S_aureus_JKD6159_ONT_R10.4_guppy_v6.1.7.fastq.gz: 1.8 million reads, 5.6 Gbp

Read QC

The goal of read QC is to discard low-quality reads and/or trim off low-quality regions of reads. This will make them easier to use in later steps (assembly and polishing).

For Illumina read QC, use fastp to remove adapters and trim off low-quality bases. Its default settings work well, so you just need to give it input and output files. Note that some paired-end reads become orphaned during QC, i.e. their corresponding read is discarded so they are no longer part of a pair. There shouldn't be very many of these, so I like to save the orphaned reads into a file, confirm that they make up a small proportion of the reads, then discard them.
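One possible way to script the Illumina step is sketched below in Python. It builds a fastp call with default QC settings and then reports what fraction of the input reads ended up orphaned. The output and orphan file names are hypothetical placeholders, and the read counting assumes ordinary four-line FASTQ records, so treat it as a starting point, not the tutorial's official answer.

```python
import gzip
import subprocess


def fastp_command(in1, in2, out1, out2, orphans1, orphans2):
    """Build a fastp call with default QC settings for one read pair."""
    return [
        "fastp",
        "--in1", in1, "--in2", in2,            # raw paired reads
        "--out1", out1, "--out2", out2,        # reads still paired after QC
        "--unpaired1", orphans1,               # orphaned reads from file 1
        "--unpaired2", orphans2,               # orphaned reads from file 2
    ]


def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def orphan_fraction(orphan_files, input_files):
    """Fraction of the input reads that ended up orphaned."""
    orphans = sum(count_fastq_reads(f) for f in orphan_files)
    total = sum(count_fastq_reads(f) for f in input_files)
    return orphans / total


def illumina_qc():
    """Run fastp on the tutorial reads; all output names are hypothetical."""
    raw = ["S_aureus_JKD6159_Illumina_1.fastq.gz",
           "S_aureus_JKD6159_Illumina_2.fastq.gz"]
    orphans = ["illumina_orphans_1.fastq.gz", "illumina_orphans_2.fastq.gz"]
    cmd = fastp_command(raw[0], raw[1],
                        "illumina_qc_1.fastq.gz", "illumina_qc_2.fastq.gz",
                        orphans[0], orphans[1])
    subprocess.run(cmd, check=True)  # raises if fastp exits with an error
    print(f"Orphaned reads: {orphan_fraction(orphans, raw):.2%}")
```

Calling `illumina_qc()` in the directory containing the downloaded reads runs the whole step. If the printed proportion is small, the orphan files can simply be deleted.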

This ONT read set has a poor N50 (4.2 kbp). Throwing out shorter reads will improve the N50 at the cost of depth, but since this read set is so deep (5.6 Gbp), that trade-off is worth it. Run Filtlong with --min_length 6000 to discard reads shorter than 6 kbp. You can then run Filtlong again with --keep_percent 90 to throw out the worst 10% of reads. After these QC steps, you should be left with an ONT read set with a much better N50 (15 kbp) but still plenty of depth (1.8 Gbp).
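To make the N50-versus-depth trade-off concrete, here is a small Python sketch. It computes N50 for a made-up list of read lengths before and after a minimum-length cut, then wraps the two Filtlong passes described above. The toy lengths, the helper names and the output file names are all illustrative assumptions. Filtlong writes its reads to standard output, so the helper compresses that stream itself.

```python
import gzip
import shutil
import subprocess


def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no reads given")


# Toy read set (made-up numbers): many short reads drag the N50 down.
before = [2000] * 6 + [4000] * 2 + [10000]
after = [length for length in before if length >= 4000]  # minimum-length cut
print(n50(before), sum(before))  # N50 and total bases before filtering
print(n50(after), sum(after))    # higher N50, fewer total bases


def run_filtlong(reads_in, reads_out, *options):
    """Run Filtlong with the given options and gzip its stdout into reads_out."""
    cmd = ["filtlong", *options, reads_in]
    with gzip.open(reads_out, "wb") as out:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
        shutil.copyfileobj(proc.stdout, out)
        if proc.wait() != 0:
            raise subprocess.CalledProcessError(proc.returncode, cmd)


def ont_qc():
    """Two Filtlong passes; intermediate and final file names are hypothetical."""
    run_filtlong("S_aureus_JKD6159_ONT_R10.4_guppy_v6.1.7.fastq.gz",
                 "ont_min6k.fastq.gz", "--min_length", "6000")
    run_filtlong("ont_min6k.fastq.gz",
                 "ont_qc.fastq.gz", "--keep_percent", "90")
```

In the toy example the cut raises the N50 from 4000 to 10000 bp while the total drops from 30000 to 18000 bp, which is the same kind of trade the real read set makes on a much larger scale. Calling `ont_qc()` runs the actual filtering and needs Filtlong on your PATH.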

At this point you should have post-QC Illumina reads (in two FASTQ files) and post-QC ONT reads (in one FASTQ file).