
Why didn't my assembly go well?

Ryan Wick edited this page Jan 14, 2025 · 8 revisions

There are two common reasons for Autocycler to fail to produce a completely resolved assembly:

  1. The input assemblies were low quality.
  2. The genome contains one or more linear sequences.

Low quality input assemblies

To generate a complete and clean consensus assembly, Autocycler requires that most input assemblies are complete, i.e. each sequence in the genome (chromosome or plasmid) is assembled into a single contig. If this is not the case, Autocycler is unlikely to produce a good result. For example, if your bacterial genome has a 5 Mbp chromosome but each input assembly has the chromosome fragmented into a 2 Mbp piece and a 3 Mbp piece, then Autocycler will not be able to create a complete 5 Mbp consensus sequence for the chromosome.

The most common reason for low-quality input assemblies is an insufficient long read set: either the depth is too low or the reads are too short. Ideally, the depth will be 100× or more, but sometimes good assemblies can be made with <50× depth. For read length, it is important that there are plenty of reads longer than the longest repeat in the genome. For many bacterial genomes, the longest repeat is ~5–6 kbp (the rRNA operon), so a long read set with an N50 of 8 kbp or more will be sufficient. However, some bacterial genomes have much longer repeats (e.g. multiple copies of a prophage) necessitating much longer reads to get a complete assembly.
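As a quick check of the depth and read-length guidance above, the sketch below computes read N50 and mean depth from a FASTQ file. It is not part of Autocycler. The function names are hypothetical, it assumes standard four-line FASTQ records (no wrapped sequence lines), and it assumes you supply an approximate genome size yourself.

```python
def read_lengths_from_fastq(lines):
    """Yield read lengths from an iterable of FASTQ lines.

    Assumes four-line records: header, sequence, '+', qualities.
    """
    for i, line in enumerate(lines):
        if i % 4 == 1:  # the second line of each record is the sequence
            yield len(line.strip())


def n50(lengths):
    """Return the length L such that reads of length >= L hold half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # no reads


def mean_depth(lengths, genome_size):
    """Return total read bases divided by an (assumed) genome size in bp."""
    return sum(lengths) / genome_size
```

For an uncompressed file, something like `lengths = list(read_lengths_from_fastq(open("reads.fastq")))` followed by `n50(lengths)` and `mean_depth(lengths, 5_000_000)` gives numbers to compare against the ~8 kbp N50 and 100× depth figures mentioned above. The file name and genome size here are placeholders.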

Linear sequences

Autocycler can struggle to fully resolve linear sequences for a few reasons:

  • Input contigs can extend past hairpin ends, leading to erratic contig lengths. The Autocycler trim step can often, but not always, repair this.
  • Input contigs can be inconsistent regarding where blunt ends terminate, leading to unresolved sequences in the Autocycler resolve step.

See the Linear sequences and Autocycler clean pages for a more thorough description of these problems.

Now what do I do?

If your Autocycler assembly went poorly due to low-quality input assemblies, you can try the following:

  • If the consensus_assembly.gfa file (made by Autocycler combine) is nearly complete, you might be able to use Autocycler clean to finish it. This is often required for linear sequences.
  • Try using different assemblers to generate your input assemblies. While Autocycler comes with helper scripts for some common ones, any long-read assembler can potentially work.
  • Try different parameters when making your input assemblies. Some assemblers (e.g. Canu) have a large number of parameters that can influence the result.
  • Manually curate your input assemblies before using them with Autocycler. Specifically, discard any assemblies that appear to be incomplete.
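The manual curation step can be partly automated with a simple screen. The sketch below is one possible heuristic, not Autocycler's own logic: it flags assemblies whose longest contig is noticeably shorter than the median longest contig across all input assemblies, which would catch the fragmented-chromosome case described earlier. The function names and the 10% tolerance are assumptions chosen for illustration, and the heuristic only makes sense when most input assemblies are complete.

```python
from statistics import median


def fasta_contig_lengths(lines):
    """Return a list of contig lengths from an iterable of FASTA lines."""
    lengths = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)          # start a new contig
        elif line and lengths:
            lengths[-1] += len(line)   # sequence may be wrapped over many lines
    return lengths


def flag_incomplete(assemblies, tolerance=0.1):
    """Return sorted names of assemblies that look fragmented.

    assemblies: dict mapping assembly name -> list of contig lengths.
    An assembly is flagged when its longest contig is shorter than
    (1 - tolerance) times the median longest contig (assumed heuristic).
    """
    longest = {name: max(lengths, default=0) for name, lengths in assemblies.items()}
    if not longest:
        return []
    typical = median(longest.values())
    return sorted(name for name, size in longest.items()
                  if size < (1 - tolerance) * typical)
```

Flagged assemblies are candidates for a closer look, not automatic removal. This screen only considers the largest contig, so a fragmented plasmid would go unnoticed.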

If none of the above works well, then your read set is likely insufficient, in which case you may need to sequence again, aiming for greater depth and longer reads.

If your Autocycler assembly went poorly due to a linear sequence, you can try the following:

  • In the input assemblies, manually trim the linear sequence to a consistent point. If each input contig for the linear sequence starts and ends at the same position, Autocycler will have an easier time generating a consensus.
  • Use Bandage to view and manipulate the graphs made by Autocycler resolve for the linear sequence. For example, the 5_final.gfa graph will contain the most-resolved version of the sequence, but there may still be unresolved parts (multiple unitigs) at the sequence ends. You can then decide which unitigs to keep/discard for the consensus sequence.
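Alongside Bandage, a plain-text summary of a graph can help you see how many unitigs remain and how they connect. The sketch below reads GFA version 1 segment (S) and link (L) lines and reports each segment's length and how many link ends touch it. It is an illustrative helper with a hypothetical function name, not an Autocycler command, and it assumes the graph file follows the standard tab-delimited GFA1 layout.

```python
def summarise_gfa(lines):
    """Return {segment_name: {"length": int, "links": int}} from GFA1 lines."""
    segments = {}
    link_counts = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            name, seq = fields[1], fields[2]
            length = len(seq)
            if seq == "*":
                # Sequence omitted: fall back to the optional LN:i: tag.
                length = 0
                for tag in fields[3:]:
                    if tag.startswith("LN:i:"):
                        length = int(tag[5:])
            segments[name] = length
        elif fields[0] == "L" and len(fields) >= 5:
            # Count one link end for each segment named on the link line
            # (a self-link therefore counts twice for that segment).
            for name in (fields[1], fields[3]):
                link_counts[name] = link_counts.get(name, 0) + 1
    return {name: {"length": length, "links": link_counts.get(name, 0)}
            for name, length in segments.items()}
```

Running this on a file such as 5_final.gfa and printing the result gives a list of unitigs to cross-check against what you see in Bandage. Short segments near the ends of a linear sequence are the ones most likely to need a keep/discard decision.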

Manually resolving linear sequences in an Autocycler assembly is currently a cumbersome process, and future versions of Autocycler will aim to improve behaviour on linear sequences.
