
Metrics

Ryan Wick edited this page Oct 29, 2024 · 26 revisions

During an assembly, Autocycler generates a number of metric-containing YAML files. These can be saved to a TSV file using Autocycler table or viewed directly.

The full list of metrics and their descriptions can be found below, but some important ones (the default metrics for Autocycler table) are:

  • overall_clustering_score: a relative metric of how well the input assembly contigs clustered. Ranges from 0–1, with higher values being better.
  • consensus_assembly_fully_resolved: whether or not each cluster has resolved to a single sequence. 'true' is good and 'false' is bad.

It can be difficult to generalise about what constitutes 'good' or 'bad' values for many of these metrics, because they are very dependent on the genome being assembled. However, if you are performing many assemblies of the same species, then outlier values could be red flags. For example, imagine that you performed an Autocycler assembly on each of 100 S. aureus genomes, and most of these had an input_assemblies_compressed_unitig_count of approximately 2000–4000 unitigs, but one genome produced 10000 unitigs. That might indicate a problem with that genome's data.
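One way to spot such outliers is sketched below. This is not part of Autocycler: the function name `flag_outliers`, the genome names and the counts are all hypothetical, and the modified z-score (median and median absolute deviation, with the commonly used cutoff of 3.5) is just one reasonable choice of outlier test. It assumes you have already collected one metric value per genome, e.g. from a table made with Autocycler table.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return the names whose value is a robust outlier.

    `values` maps a genome name to one metric value. The test is the
    modified z-score: 0.6745 * |x - median| / MAD. Both the test and
    the threshold are assumptions, not something Autocycler defines.
    """
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        # No spread at all: anything that differs from the median stands out.
        return [name for name, v in values.items() if v != med]
    return [
        name
        for name, v in values.items()
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical input_assemblies_compressed_unitig_count values.
unitig_counts = {
    "genome_01": 2500,
    "genome_02": 3100,
    "genome_03": 3900,
    "genome_04": 2200,
    "genome_05": 10000,
}
print(flag_outliers(unitig_counts))  # ['genome_05']
```

A flagged genome is not necessarily bad. It is only a prompt to look more closely at that genome's reads and input assemblies.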

Read subsampling metrics

These metrics are created by Autocycler subsample and can be found in subsample.yaml (one file for each assembly):

  • input_reads: details for the input read set:
    • count: the number of reads in the read set (positive integer).
    • bases: the number of bases in the read set (positive integer).
    • n50: the N50 read length for the read set (positive integer). Reads of this length or longer contain at least 50% of the bases in the read set.
  • output_reads: the same details (count, bases and n50) for each of the output read sets.
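To make the three read set values concrete, here is a minimal sketch that computes them from a list of read lengths. It follows the N50 definition given above and is not Autocycler's own code. The function names `n50` and `summarise_reads` are hypothetical.

```python
def n50(lengths):
    """Return the N50: the largest length L such that reads of length
    L or longer contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty read set


def summarise_reads(lengths):
    """Return the count, bases and n50 values for one read set."""
    return {"count": len(lengths), "bases": sum(lengths), "n50": n50(lengths)}


print(summarise_reads([10, 8, 6, 4, 2]))
# {'count': 5, 'bases': 30, 'n50': 8}
```

In this example the two longest reads (10 and 8) hold 18 of the 30 bases, which is the first point at which half of the bases are covered, so the N50 is 8.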

Input assembly metrics

These metrics are created by Autocycler compress and can be found in input_assemblies.yaml (one file for each assembly):

  • input_assemblies_count: the number of input assemblies used to build Autocycler's compacted De Bruijn graph (positive integer).
  • input_assemblies_total_contigs: the total number of contigs in all input assemblies (positive integer).
  • input_assemblies_total_length: the sum of the length of contigs in all input assemblies (positive integer).
  • input_assemblies_compressed_unitig_count: the number of unitigs in Autocycler's compacted De Bruijn graph (positive integer).
  • input_assemblies_compressed_unitig_total_length: the sum of the length of unitigs in Autocycler's compacted De Bruijn graph (positive integer).

Clustering metrics

These metrics are created by Autocycler cluster and can be found in clustering/clustering.yaml (one file for each assembly):

  • pass_cluster_count: the number of clusters which passed Autocycler cluster's QC (positive integer). This should ideally match the number of replicons in the genome. For example, if a genome has one chromosome and two plasmids, then an ideal value would be 3.
  • fail_cluster_count: the number of clusters which failed Autocycler cluster's QC (positive integer). Lower is better, but having some QC-fail clusters is normal and not a cause for concern.
  • pass_contig_count: the number of contigs in all of the QC-pass clusters (positive integer). This should ideally be close to the input assembly count times the pass cluster count, i.e. each input assembly produced one contig for each QC-pass cluster. However, it is often smaller, especially if the genome contains small plasmids (which are often omitted in long-read assemblies).
  • fail_contig_count: the number of contigs in all of the QC-fail clusters (positive integer).
  • pass_contig_fraction: the fraction of contigs which ended up in a QC-pass cluster (floating point from 0–1). Ideally, this value should be high (close to one), which indicates that the input assemblies were high quality and consistent.
  • fail_contig_fraction: the fraction of contigs which ended up in a QC-fail cluster (floating point from 0–1). This value and the previous value sum to 1. Ideally, this value should be low (close to zero).
  • cluster_balance_score: a value indicating how balanced the clustering was (floating point from 0–1). Ideally, this value should be high (close to one). A perfect score (1) indicates that each input assembly contributed one contig to each cluster. A low score indicates that input assemblies were uneven: contributing no contigs to some clusters and/or contributing multiple contigs to the same cluster.
  • cluster_tightness_score: a value indicating how tight the clustering was (floating point from 0–1). Ideally, this value should be high (close to one), which indicates that the sequences in each cluster are very similar to each other. A lower score indicates that some clusters have diverging sequences.
  • overall_clustering_score: a mean of the previous two scores: balance and tightness (floating point from 0–1). Ideally, this value should be high (close to one), which indicates that the input assemblies were consistent and clustered well.
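The relationships between the contig counts above can be sketched as follows. This only illustrates the arithmetic described in the list and is not Autocycler's implementation. The function name `contig_summary` and the example numbers are hypothetical, and returning 0.0 for both fractions when there are no contigs is an assumption made to avoid dividing by zero.

```python
def contig_summary(input_assemblies_count, pass_cluster_count,
                   pass_contig_count, fail_contig_count):
    """Derive the contig fractions and the ideal QC-pass contig count."""
    total = pass_contig_count + fail_contig_count
    # Assumption: report 0.0 for both fractions if there are no contigs.
    pass_fraction = pass_contig_count / total if total else 0.0
    fail_fraction = fail_contig_count / total if total else 0.0
    return {
        "pass_contig_fraction": pass_fraction,
        "fail_contig_fraction": fail_fraction,
        # Ideal case: every input assembly gives one contig per QC-pass cluster.
        "ideal_pass_contig_count": input_assemblies_count * pass_cluster_count,
    }


# Hypothetical values: 12 input assemblies, 3 QC-pass clusters.
print(contig_summary(12, 3, pass_contig_count=30, fail_contig_count=10))
# {'pass_contig_fraction': 0.75, 'fail_contig_fraction': 0.25,
#  'ideal_pass_contig_count': 36}
```

Here pass_contig_count (30) falls short of the ideal (36), which is the common situation described above, e.g. when some input assemblies are missing a small plasmid.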

Untrimmed cluster metrics

These metrics are created by Autocycler cluster and can be found in clustering/qc*/cluster_*/1_untrimmed.yaml (one file for each cluster):

  • untrimmed_cluster_size
  • untrimmed_sequence_lengths
  • untrimmed_sequence_length_mad
  • untrimmed_cluster_distance

Trimmed cluster metrics

These metrics are created by Autocycler trim and can be found in clustering/qc*/cluster_*/2_trimmed.yaml (one file for each cluster):

  • trimmed_cluster_size
  • trimmed_sequence_lengths
  • trimmed_sequence_length_mad

Consensus assembly metrics

These metrics are created by Autocycler combine and can be found in consensus_assembly.yaml (one file for each assembly):

  • consensus_assembly_total_length: the total number of bases in the consensus assembly (positive integer).
  • consensus_assembly_total_unitigs: the total number of unitigs in the consensus assembly (positive integer). Ideally, this will have the same value as pass_cluster_count, and if so, the following metric will be 'true'.
  • consensus_assembly_fully_resolved: whether or not each cluster has resolved to a single unitig (boolean). Will be 'true' if all clusters consist of only one unitig, 'false' if any of the clusters have more than one unitig.
  • consensus_assembly_clusters: details for each of the clusters:
    • length: the number of bases in the cluster (positive integer).
    • unitigs: the number of unitigs in the cluster (positive integer), ideally 1.
    • topology: the large-scale structure of the cluster. Will have one of the following values: 'circular' (one unitig with a circularising link), 'linear_blunt_blunt' (one unitig with two blunt ends, i.e. no links), 'linear_blunt_hairpin' (one unitig with a hairpin link on one end), 'linear_hairpin_hairpin' (one unitig with hairpin links on both ends), 'fragmented' (more than one unitig), 'other' (one unitig with unusual links, e.g. both circularising and hairpin).
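The topology values and the consensus_assembly_fully_resolved rule can be read as a small decision procedure, sketched below. This is an interpretation of the descriptions above, not Autocycler's code. The function names and the inputs (`unitigs`, `circular_link`, `hairpin_ends`) are hypothetical simplifications of the information held in the graph's links.

```python
def classify_topology(unitigs, circular_link=False, hairpin_ends=0):
    """Map a cluster's structure to one of the topology values.

    unitigs: number of unitigs in the cluster.
    circular_link: True if the single unitig has a circularising link.
    hairpin_ends: number of unitig ends (0, 1 or 2) with a hairpin link.
    """
    if unitigs > 1:
        return "fragmented"
    if circular_link and hairpin_ends == 0:
        return "circular"
    if not circular_link:
        linear = {
            0: "linear_blunt_blunt",
            1: "linear_blunt_hairpin",
            2: "linear_hairpin_hairpin",
        }
        return linear.get(hairpin_ends, "other")
    # One unitig with unusual links, e.g. both circularising and hairpin.
    return "other"


def fully_resolved(cluster_unitig_counts):
    """True if every cluster consists of exactly one unitig."""
    return all(n == 1 for n in cluster_unitig_counts)


print(classify_topology(1, circular_link=True))                   # circular
print(classify_topology(1, hairpin_ends=1))                       # linear_blunt_hairpin
print(classify_topology(3))                                       # fragmented
print(classify_topology(1, circular_link=True, hairpin_ends=1))   # other
print(fully_resolved([1, 1, 1]))                                  # True
print(fully_resolved([1, 4]))                                     # False
```

Note that a cluster is 'fragmented' exactly when it has more than one unitig, so any 'fragmented' cluster makes consensus_assembly_fully_resolved 'false'.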