Autocycler cluster

Basics

The autocycler cluster command groups input contigs into clusters. A cluster is a group of contigs which represent the same genomic sequence. It also decides which of these clusters should be included in the final assembly (QC-pass) and which should not (QC-fail).

For example, in a bacterial genome with a chromosome and a plasmid, each input assembly should ideally contain a contig for the complete chromosome and a contig for the complete plasmid. This step should produce a cluster of chromosome contigs and a cluster of plasmid contigs, both of which should pass QC. Any additional contigs which do not cluster well (e.g. fragmented pieces of the genome from an input assembly that failed to complete) will end up in clusters that fail QC.

Example command

autocycler cluster -a autocycler_out

This command takes a directory created by Autocycler compress (named autocycler in the example) which must contain an input_assemblies.gfa file. It will create a subdirectory named clustering which in turn contains a qc_pass directory for the good clusters and a qc_fail directory for the bad clusters.

Full usage

Usage: autocycler cluster [OPTIONS] --autocycler_dir <AUTOCYCLER_DIR>

Options:
  -a, --autocycler_dir <AUTOCYCLER_DIR>
          Autocycler directory containing input_assemblies.gfa file (required)
      --cutoff <CUTOFF>
          cutoff distance threshold for hierarchical clustering [default: 0.2]
      --min_assemblies <MIN_ASSEMBLIES>
          exclude clusters with fewer than this many assemblies [default: automatic]
      --max_contigs <MAX_CONTIGS>
          refuse to run if mean contigs per assembly exceeds this value [default: 25]
      --manual <MANUAL>
          manually define clusters using tree node numbers [default: automatic]
  -h, --help
          Print help
  -V, --version
          Print version

Notes

The --max_contigs option exists to catch obviously bad input data. If the mean number of contigs per input assemblies exceeds this value (default of 25), Autocycler cluster will refuse to run and display an error message. For example, if you give Autocycler 10 input assemblies with a total of 1000 contigs, that is an average of 100 contigs per assembly, which almost certainly means that they are fragmented or contaminated and thus not appropriate for Autocycler.

Implementation details

Cluster QC

There are two reasons that a cluster might fail QC:

Present in too few assemblies: the cluster is only found in a small number of assemblies. This threshold is controlled by the --min_assemblies option. If not explicitly set, --min_assemblies is set with the following logic: If there is just 1 input assembly (unusual) then set to 1. Otherwise, set to one-quarter of the assembly count (rounded) but no less than 2.
Contained within another cluster: the cluster is contained within another larger cluster which has passed QC. This is to exclude clusters which contain a fragmented piece of a sequence. For example, consider input assemblies where most contain a complete chromosome but some contain the chromosome fragmented into two contigs. This will likely result in three clusters: complete chromosome, chromosome piece 1 and chromosome piece 2. The two chromosome piece clusters will fail QC because their sequences are contained within the larger complete chromosome cluster.

Distance calculation

To calculate the pairwise distance between contig A and contig B, Autocycler uses this formula: 1 - (A ∩ B) / A, where A ∩ B is the total length of all unitigs shared by both contigs and A is the total length of all unitigs in the first contig.. This means that the distances are asymmetric: the A-vs-B distance is likely different from the B-vs-A distance. You can see this if you view the clustering/pairwise_distances.phylip file, for example:

3
assembly_01.fasta contig_a (1044289 bp)   0.00000000   0.00052292   0.31332140
assembly_02.fasta contig_b (1044294 bp)   0.00052388   0.00000000   0.31329124
assembly_03.fasta contig_c (713618 bp)    0.00047100   0.00042614   0.00000000

In this case, contig_a-vs-contig_c has a large distance of 0.31, but contig_c-vs-contig_a has a small distance of 0.00047. This is because contig_c is contained within contig_a.

Autocycler uses this asymmetry to see whether a contig is contained within another contig, which is used for QCing clusters (see above). Before running UPGMA clustering, the matrix is made symmetrical by taking the mean distance for each pair.

Cluster refinement

Clusters are first defined using a threshold on the UPGMA tree, set using the --cutoff option. They are then are iteratively refined: Autocycler tries splitting each cluster to see if the overall clustering score improves (see Clustering metrics for an explanation of the score).

This cluster-splitting refinement can potentially improve the clustering score in two main ways:

If the genome contains two very similar sequences (e.g. two plasmids which share a lot of sequence), they may initially be grouped into a single cluster. Cluster refinement can divide such a fused cluster into two separate clusters.
A cluster may contain an outlier contig, e.g. most contigs contain a full chromosomal sequence but one contig only contains 90% of the chromosome. Cluster refinement can exclude the outlier in such a case to produce a tighter cluster.

Optional manual intervention

Subsequent Autocycler steps operate only on QC-pass clusters and ignore QC-fail clusters. You can therefore move cluster directories between the clustering/qc_pass and clustering/qc_fail directories to include/exclude them from the consensus assembly.

Autocycler cluster saves its UPGMA clustering tree in the clustering/clustering.newick file. You can view this tree (e.g. with Taxonium) to better understand the clusters. If you would like to manually specify which clades of the tree should be used to define clusters, their numbers can be given using the --manual option, e.g. --manual 31,38. Hovering over a clade root in Taxonium will display its number.

When examining the clustering, you might find that particular assemblers consistently struggle with a given genome (e.g. miniasm and Raven always produce fragmented assemblies). When this happens, you can go back to the Generating input assemblies and change which assemblers you use (e.g. replace miniasm and Raven with alternative assemblers).

Toy example

(The toy example is introduced on the Autocycler compress page.)

Running autocycler cluster on the toy example's unitig graph produces the following distance matrix for all input contigs:

autocycler cluster matrix

These distances are made by comparing each contig's set of unitigs in the graph, not by comparing sequences directly, making it very fast to compute. Each distance is calculated as 1 - (A ∩ B) / A, where A ∩ B is the total length of all unitigs shared by both contigs and A is the total length of all unitigs in the first contig. Note that this formula creates an asymmetric distance matrix which is useful because it allows us to distinguish when one contig is contained within another. For example, all of d2's unitigs are contained in b2's path but b2 also contains unitigs that are not in d2's path. These means that the d2-vs-b2 distance is 0 while the b2-vs-d2 distance is 0.06.

These distances are then used to perform UPGMA clustering:

autocycler cluster tree

Contigs a1, b1, c1 and d1 are all close in the UPGMA tree, producing cluster 1. Contigs a2, b2 and d2 also form a close group (cluster 2), despite not having a contig from all of the assemblies.

Contig d3 has no close neighbours in the tree, so it forms cluster 3 by itself. Since this cluster only contains a contig from one assembly (below the --min_assemblies threshold), it fails QC and will not be used in subsequent steps.

For each cluster, its unitigs are saved as a separate graph named 1_untrimmed.gfa. Here are the graphs for the two QC-pass clusters in this example:

untrimmed cluster graphs

The next two steps in the process (Autocycler trim and Autocycler resolve) operate on a per-cluster basis using these graphs.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Autocycler cluster

Basics

Example command

Full usage

Notes

Implementation details

Cluster QC

Distance calculation

Cluster refinement

Optional manual intervention

Toy example

Overview

Quick start

Main Autocycler steps

Assessing Autocycler assemblies

Additional commands

Other pages

Clone this wiki locally