How variants are chosen for the consensus sequence

Ryan Wick edited this page Jan 5, 2021 · 15 revisions

This page goes into more detail on how Trycycler produces a consensus sequence. Specifically, when faced with multiple different variants of a sequence, how does it choose which one is best?

Breaking the sequence into chunks

Take this hypothetical MSA as an input to Trycycler consensus:

GGAGGAGCTTTT-CGCCGCAGTCAACGAA-TAGCGTCTGAAAACGTGTATCATATCTTGCCTCGAAAAGCCGCACT
GGAGGAGCTTTTTCGCCGCAGTCAAC--A-TAGCGTCTGAAAACGTGTATCATCTCTTGCCTCGAAAATCCTCACT
GGAGGAGCTTTTTCGCCGCAGTCAAC--ATTAGCGTCTGAAAACGTGTATCATGTCTTGCCTCGAAAATCCTCACT
GGAGGAGCTTTTTCGCCGCAGTCAAC--A-TAGCGTCTGAAAACGTGTATCATCTCTTGCCTCGAAAAGCCGCACT
GGAGGAGCTTTT-CGCCGCAGTCAAC--A-TAGCGTCTGAAAACGTGTATCATGTCTTGCCTCGAAAATCCGCACT

Trycycler first divides the MSA into 'same' and 'different' chunks:

GGAGGAGCTTTT   -   CGCCGCAGTCAAC   GAA-   TAGCGTCTGAAAACGTGTATCAT   A   TCTTGCCTCGAAAA   GCCG   CACT
GGAGGAGCTTTT   T   CGCCGCAGTCAAC   --A-   TAGCGTCTGAAAACGTGTATCAT   C   TCTTGCCTCGAAAA   TCCT   CACT
GGAGGAGCTTTT   T   CGCCGCAGTCAAC   --AT   TAGCGTCTGAAAACGTGTATCAT   G   TCTTGCCTCGAAAA   TCCT   CACT
GGAGGAGCTTTT   T   CGCCGCAGTCAAC   --A-   TAGCGTCTGAAAACGTGTATCAT   C   TCTTGCCTCGAAAA   GCCG   CACT
GGAGGAGCTTTT   -   CGCCGCAGTCAAC   --A-   TAGCGTCTGAAAACGTGTATCAT   G   TCTTGCCTCGAAAA   TCCG   CACT
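The splitting step can be sketched in Python. This is only an illustration, not Trycycler's actual code. Note that in the example above, short identical stretches (the lone A inside GAA-, the CC inside GCCG) are kept inside a 'different' chunk rather than becoming their own 'same' chunk. The sketch mimics that with a hypothetical `min_same_length` parameter; the value 4 is an assumption chosen because it reproduces this example, not a documented Trycycler setting.

```python
def split_into_chunks(msa, min_same_length=4):
    """Split aligned sequences into alternating 'same'/'different' chunks.

    min_same_length is a hypothetical parameter (an assumption, not a
    Trycycler option): a run of identical columns shorter than this that
    sits between two differing columns is absorbed into the surrounding
    'different' chunk.
    """
    length = len(msa[0])
    assert all(len(seq) == length for seq in msa), "sequences must be aligned"

    # Label each column: True if every sequence has the same character.
    same = [len({seq[i] for seq in msa}) == 1 for i in range(length)]

    # Group consecutive columns with the same label into [label, start, end) runs.
    runs = []
    for i, is_same in enumerate(same):
        if runs and runs[-1][0] == is_same:
            runs[-1][2] = i + 1
        else:
            runs.append([is_same, i, i + 1])

    # Relabel short interior 'same' runs as 'different'.
    for k in range(1, len(runs) - 1):
        is_same, start, end = runs[k]
        if is_same and end - start < min_same_length:
            runs[k][0] = False

    # Merge neighbouring runs that now share a label.
    merged = []
    for is_same, start, end in runs:
        if merged and merged[-1][0] == is_same:
            merged[-1][2] = end
        else:
            merged.append([is_same, start, end])

    return [("same" if is_same else "different",
             [seq[start:end] for seq in msa])
            for is_same, start, end in merged]


msa = [
    "GGAGGAGCTTTT-CGCCGCAGTCAACGAA-TAGCGTCTGAAAACGTGTATCATATCTTGCCTCGAAAAGCCGCACT",
    "GGAGGAGCTTTTTCGCCGCAGTCAAC--A-TAGCGTCTGAAAACGTGTATCATCTCTTGCCTCGAAAATCCTCACT",
    "GGAGGAGCTTTTTCGCCGCAGTCAAC--ATTAGCGTCTGAAAACGTGTATCATGTCTTGCCTCGAAAATCCTCACT",
    "GGAGGAGCTTTTTCGCCGCAGTCAAC--A-TAGCGTCTGAAAACGTGTATCATCTCTTGCCTCGAAAAGCCGCACT",
    "GGAGGAGCTTTT-CGCCGCAGTCAAC--A-TAGCGTCTGAAAACGTGTATCATGTCTTGCCTCGAAAATCCGCACT",
]

chunks = split_into_chunks(msa)
for kind, pieces in chunks:
    print(kind, pieces)
```

With these inputs the sketch yields nine chunks, and the four 'different' chunks hold the same variants as the columns shown above.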

You can also think of it like a variant sequence graph, which splits at points of variation:

             ↗ - ↘               ↗ GAA- ↘                         ↗ A ↘                ↗ GCCG ↘     
             ↗ T ↘               ↗ --A- ↘                         ↗ C ↘                ↗ TCCT ↘     
GGAGGAGCTTTT → T → CGCCGCAGTCAAC → --AT → TAGCGTCTGAAAACGTGTATCAT → G → TCTTGCCTCGAAAA → TCCT → CACT
             ↘ T ↗               ↘ --A- ↗                         ↘ C ↗                ↘ GCCG ↗     
             ↘ - ↗               ↘ --A- ↗                         ↘ G ↗                ↘ TCCG ↗     

A decision must now be made for each 'different' chunk: which variant should go in the consensus?

Minimum total Hamming distance

The first thing Trycycler uses to choose variants is minimum total Hamming distance to the other variants.

Different chunk #1

Let's work through the first point of variation, where there are five options (-, T, T, T, -), two of which are unique (-, T). For each unique option, we sum its distances to all five options (its distance to itself is zero). Since our sequences are already aligned (they came from an MSA), we can use Hamming distance, which simply counts the positions at which two equal-length sequences differ.

Here are the values in a table, where the different options are in columns, the unique options are in rows, the values show Hamming distances, and the total Hamming distance for each unique option is in the rightmost column:

variant   -   T   T   T   -   total
-         0   1   1   1   0   3
T         1   0   0   0   1   2

Since T's total Hamming distance of 2 is lower than -'s total distance of 3, the preferred variant is T.

You can see that for simple cases like this, choosing the variant with the minimum total Hamming distance is equivalent to choosing the most common variant.
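The calculation for this chunk can be sketched in a few lines of Python. The function names are hypothetical and this is not Trycycler's own implementation. It covers only the minimum-total-Hamming-distance step and does nothing special about ties (`min` simply returns the first lowest-scoring variant it encounters).

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    assert len(a) == len(b), "Hamming distance needs equal-length strings"
    return sum(x != y for x, y in zip(a, b))


def total_hamming_distances(options):
    """Map each unique option to the sum of its distances to all options."""
    unique = list(dict.fromkeys(options))  # de-duplicate, keep first-seen order
    return {u: sum(hamming_distance(u, o) for o in options) for u in unique}


options = ["-", "T", "T", "T", "-"]
totals = total_hamming_distances(options)
best = min(totals, key=totals.get)

print(totals)  # {'-': 3, 'T': 2}
print(best)    # T
```

The totals match the rightmost column of the table above: 3 for - and 2 for T.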

Different chunk #2

Let's apply the same logic to the second point of variation, which has three unique options:
