Manually curated assembly

Ryan Wick edited this page Jan 14, 2025 · 29 revisions

This page follows the same steps as the Fully Automated Assembly page but adds optional manual steps for inspecting intermediate outputs and making adjustments, which helps ensure that the final consensus assembly is as accurate as possible.

Steps 1 and 2: subsample reads and generate input assemblies

reads=ont.fastq.gz  # your read set goes here
threads=16  # set as appropriate for your system
genome_size=$(genome_size_raven.sh "$reads" "$threads")  # can set this manually if you know the value

autocycler subsample --reads "$reads" --out_dir subsampled_reads --genome_size "$genome_size"

mkdir assemblies
for assembler in canu flye miniasm necat nextdenovo raven; do
    for i in 01 02 03 04; do
        "$assembler".sh subsampled_reads/sample_"$i".fastq assemblies/"$assembler"_"$i" "$threads" "$genome_size"
    done
done

# Optional step: remove the subsampled reads to save space
rm subsampled_reads/*.fastq

Manual step: curate input assemblies

At this stage, you can inspect each input assembly and decide whether you want to delete or modify it before continuing with Autocycler. See the Generating input assemblies page for more details.
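As a starting point for this inspection, the following minimal sketch prints the contig count and total length of each input assembly, so that outliers (e.g. an assembly far larger or smaller than the rest) stand out. The function name `summarise_assemblies` is hypothetical, and the sketch assumes the assemblies are uncompressed files ending in `.fasta`.

```bash
# Hypothetical helper (not part of Autocycler): print file name, contig count
# and total bases for each FASTA file in a directory, tab-separated.
# Assumes uncompressed FASTA files with a .fasta extension.
summarise_assemblies() {
    for f in "$1"/*.fasta; do
        [ -e "$f" ] || continue  # skip if the glob matched nothing
        awk -v name="$(basename "$f")" '
            /^>/ { contigs++; next }                        # header line
            { gsub(/[ \t\r]/, ""); bases += length($0) }    # sequence line
            END { printf "%s\t%d\t%d\n", name, contigs, bases }
        ' "$f"
    done
}

# Example usage:
#   summarise_assemblies assemblies
```

An assembly whose total length differs markedly from the others may be worth a closer look before continuing.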

Steps 3 and 4: compress and cluster input assemblies

autocycler compress -i assemblies -a autocycler_out
autocycler cluster -a autocycler_out

Manual step: curate clusters

At this stage, you can inspect the clustering and, if desired, modify it before continuing with Autocycler. See the Autocycler cluster page for more details.
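One quick way to get an overview before opening individual clusters is to list each cluster with the size of its `1_untrimmed.gfa` file. The sketch below does this for any directory of clusters. The function name `cluster_sizes` is hypothetical, and the `qc_fail` directory in the usage comment is an assumption about the output layout (only `qc_pass` appears in the commands on this page).

```bash
# Hypothetical helper (not part of Autocycler): for each cluster_* directory
# under the given path, print the cluster name and the size in bytes of its
# 1_untrimmed.gfa file, tab-separated.
cluster_sizes() {
    for c in "$1"/cluster_*; do
        [ -f "$c"/1_untrimmed.gfa ] || continue  # skip clusters without the file
        bytes=$(wc -c <"$c"/1_untrimmed.gfa)
        # $((bytes)) strips any padding that wc adds on some systems
        printf '%s\t%s\n' "$(basename "$c")" "$((bytes))"
    done
}

# Example usage:
#   cluster_sizes autocycler_out/clustering/qc_pass
#   cluster_sizes autocycler_out/clustering/qc_fail   # assumed directory name
```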

Steps 5 and 6: trim and resolve each QC-pass cluster

for c in autocycler_out/clustering/qc_pass/cluster_*; do
    autocycler trim -c "$c"
    if [[ $(wc -c <"$c"/1_untrimmed.gfa) -lt 1000000 ]]; then
        autocycler dotplot -i "$c"/1_untrimmed.gfa -o "$c"/1_untrimmed.png
        autocycler dotplot -i "$c"/2_trimmed.gfa -o "$c"/2_trimmed.png
    fi
    autocycler resolve -c "$c"
done

The above loop also runs Autocycler dotplot on clusters smaller than ~1 Mbp, for both the untrimmed and trimmed sequences. The size check uses the byte count of 1_untrimmed.gfa as a rough proxy for sequence length. The limit exists because Autocycler dotplot is fast on small sequences (e.g. plasmids) but can take a while to finish on longer sequences (e.g. chromosomes).
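The `wc -c` test in the loop measures the GFA file size in bytes, which includes names, tabs and link lines as well as sequence. If a closer estimate is wanted, the sequence lengths can be summed directly from the GFA segment (`S`) lines, where the third tab-separated field holds the sequence. The sketch below shows one way to do this. The function name `gfa_total_bp` is hypothetical, and the result is the total sequence in the graph rather than the length of any single contig.

```bash
# Hypothetical helper (not part of Autocycler): sum the sequence lengths of
# all segment (S) lines in a GFA file. In GFA, field 3 of an S line is the
# segment sequence.
gfa_total_bp() {
    awk -F'\t' '$1 == "S" { total += length($3) } END { printf "%d\n", total }' "$1"
}

# Example usage, as a drop-in alternative to the wc -c test:
#   if [ "$(gfa_total_bp "$c"/1_untrimmed.gfa)" -lt 1000000 ]; then ...
```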

Manual step: examine dotplots

After trimming, you can visually inspect each cluster's dotplots, which can show the effects of trimming and reveal potential structural issues. See the Autocycler dotplot page for more information.

Manual step: examine Autocycler bridging

In this step, you can review how Autocycler has bridged the sequences to form a consensus. This can be useful for identifying regions where sequence ambiguity remains. In particular, it can be helpful to examine each cluster's 4_merged.gfa file to see if there is structural heterogeneity or conflicts between assemblies, which may suggest areas to review or adjust manually.
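To decide which clusters deserve a closer look, it can help to count the segments and links in each `4_merged.gfa` file. The following is a minimal sketch with hypothetical function names (`gfa_counts`, `report_merged`). A graph with more than one segment is not necessarily a problem, but it may be a reasonable candidate for viewing in a graph viewer.

```bash
# Hypothetical helper: print the number of segment (S) and link (L) lines
# in a GFA file, tab-separated.
gfa_counts() {
    awk -F'\t' '
        $1 == "S" { s++ }
        $1 == "L" { l++ }
        END { printf "%d\t%d\n", s, l }
    ' "$1"
}

# Hypothetical helper: report cluster name, segment count and link count for
# every cluster_* directory that contains a 4_merged.gfa file.
report_merged() {
    for c in "$1"/cluster_*; do
        [ -f "$c"/4_merged.gfa ] || continue
        printf '%s\t%s\n' "$(basename "$c")" "$(gfa_counts "$c"/4_merged.gfa)"
    done
}

# Example usage:
#   report_merged autocycler_out/clustering/qc_pass
```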

Step 7: combine resolved clusters into a final assembly

autocycler combine -a autocycler_out -i autocycler_out/clustering/qc_pass/cluster_*/5_final.gfa

The final consensus assembly will be saved as autocycler_out/consensus_assembly.fasta.
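Because the combine command takes its inputs from a glob, a cluster that has no `5_final.gfa` file (for example, because the resolve step was not run for it) is silently left out. As a sanity check before combining, the sketch below lists any cluster directories that lack the file. The function name `missing_final` is hypothetical.

```bash
# Hypothetical helper (not part of Autocycler): print the path of every
# cluster_* directory under the given path that has no 5_final.gfa file.
# Prints nothing if all clusters have the file.
missing_final() {
    for c in "$1"/cluster_*; do
        [ -d "$c" ] || continue
        [ -f "$c"/5_final.gfa ] || printf '%s\n' "$c"
    done
}

# Example usage:
#   missing_final autocycler_out/clustering/qc_pass
```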

Manual step: remove any extraneous sequences

If the consensus assembly is not fully resolved, viewing the assembly graph (consensus_assembly.gfa) in Bandage can reveal the problematic parts of the assembly. It may then be possible to use Autocycler clean to remove unwanted tigs, allowing for a fully resolved assembly.