How variants are chosen for the consensus sequence
This page goes into more detail on how Trycycler produces a consensus sequence. Specifically, when faced with multiple different variants of a sequence, how does it choose which one is best?
Take this hypothetical MSA as an input to Trycycler consensus:
GGAGGAGCTTTT-CGCCGCAGTCAACGAA-TAGCGTCTGAAAACGTGTATCATATCTTGCCTCGAAAAGCCGCACT
GGAGGAGCTTTTTCGCCGCAGTCAAC--A-TAGCGTCTGAAAACGTGTATCATCTCTTGCCTCGAAAATCCTCACT
GGAGGAGCTTTTTCGCCGCAGTCAAC--ATTAGCGTCTGAAAACGTGTATCATGTCTTGCCTCGAAAATCCTCACT
GGAGGAGCTTTTTCGCCGCAGTCAAC--A-TAGCGTCTGAAAACGTGTATCATCTCTTGCCTCGAAAAGCCGCACT
GGAGGAGCTTTT-CGCCGCAGTCAAC--A-TAGCGTCTGAAAACGTGTATCATGTCTTGCCTCGAAAATCCGCACT
Trycycler first divides the MSA into 'same' and 'different' chunks:
GGAGGAGCTTTT - CGCCGCAGTCAAC GAA- TAGCGTCTGAAAACGTGTATCAT A TCTTGCCTCGAAAA GCCG CACT
GGAGGAGCTTTT T CGCCGCAGTCAAC --A- TAGCGTCTGAAAACGTGTATCAT C TCTTGCCTCGAAAA TCCT CACT
GGAGGAGCTTTT T CGCCGCAGTCAAC --AT TAGCGTCTGAAAACGTGTATCAT G TCTTGCCTCGAAAA TCCT CACT
GGAGGAGCTTTT T CGCCGCAGTCAAC --A- TAGCGTCTGAAAACGTGTATCAT C TCTTGCCTCGAAAA GCCG CACT
GGAGGAGCTTTT - CGCCGCAGTCAAC --A- TAGCGTCTGAAAACGTGTATCAT G TCTTGCCTCGAAAA TCCG CACT
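The chunking step can be sketched in Python. This is a minimal illustration, not Trycycler's actual code: the function name `split_into_chunks` and the `min_same` threshold are assumptions. The threshold is needed because the 'different' chunks in the example contain a few identical columns (the shared `A` in `GAA-`/`--A-`/`--AT` and the shared `CC` in `GCCG`/`TCCT`), so nearby differences are evidently grouped into one chunk. The value 3 was picked only because it reproduces the chunks shown above.

```python
from itertools import groupby


def split_into_chunks(msa, min_same=3):
    """Split aligned sequences into alternating 'same' and 'different' chunks.

    min_same is a hypothetical threshold (an assumption of this sketch): an
    interior run of identical columns shorter than this is absorbed into the
    surrounding 'different' chunk.
    """
    assert len({len(seq) for seq in msa}) == 1, "MSA rows must have equal length"

    # True where every sequence has the same character in that column.
    same = [len(set(column)) == 1 for column in zip(*msa)]

    # Relabel short interior runs of identical columns as 'different'.
    pos = 0
    for is_same, run in groupby(list(same)):  # iterate over a copy
        run_len = len(list(run))
        interior = pos > 0 and pos + run_len < len(same)
        if is_same and interior and run_len < min_same:
            same[pos:pos + run_len] = [False] * run_len
        pos += run_len

    # Cut every sequence at the boundaries between runs.
    chunks, pos = [], 0
    for is_same, run in groupby(same):
        run_len = len(list(run))
        kind = "same" if is_same else "different"
        chunks.append((kind, [seq[pos:pos + run_len] for seq in msa]))
        pos += run_len
    return chunks


msa = [
    "GGAGGAGCTTTT-CGCCGCAGTCAACGAA-TAGCGTCTGAAAACGTGTATCATATCTTGCCTCGAAAAGCCGCACT",
    "GGAGGAGCTTTTTCGCCGCAGTCAAC--A-TAGCGTCTGAAAACGTGTATCATCTCTTGCCTCGAAAATCCTCACT",
    "GGAGGAGCTTTTTCGCCGCAGTCAAC--ATTAGCGTCTGAAAACGTGTATCATGTCTTGCCTCGAAAATCCTCACT",
    "GGAGGAGCTTTTTCGCCGCAGTCAAC--A-TAGCGTCTGAAAACGTGTATCATCTCTTGCCTCGAAAAGCCGCACT",
    "GGAGGAGCTTTT-CGCCGCAGTCAAC--A-TAGCGTCTGAAAACGTGTATCATGTCTTGCCTCGAAAATCCGCACT",
]
chunks = split_into_chunks(msa)
for kind, pieces in chunks:
    print(kind, pieces)
```

With the example MSA this yields nine chunks, alternating 'same' and 'different', and the four 'different' chunks hold the variant sets shown above.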
You can also think of it like a variant sequence graph, which splits at points of variation:
             ↗ - ↘               ↗ GAA- ↘                         ↗ A ↘                ↗ GCCG ↘
             ↗ T ↘               ↗ --A- ↘                         ↗ C ↘                ↗ TCCT ↘
GGAGGAGCTTTT → T → CGCCGCAGTCAAC → --AT → TAGCGTCTGAAAACGTGTATCAT → G → TCTTGCCTCGAAAA → TCCT → CACT
             ↘ T ↗               ↘ --A- ↗                         ↘ C ↗                ↘ GCCG ↗
             ↘ - ↗               ↘ --A- ↗                         ↘ G ↗                ↘ TCCG ↗
A decision must now be made for each 'different' chunk: which variant should go in the consensus?
The first thing Trycycler uses to choose variants is minimum total Hamming distance to the other variants.
Let's work through the first point of variation, where there are five options (`-`, `T`, `T`, `T`, `-`) and two unique options (`-`, `T`). For each unique option, we sum its distances to all five options. Since our sequences are already aligned (they came from an MSA), we can use Hamming distance, which keeps the calculation simple.
Here are the values in a table, where the different options are in columns, the unique options are in rows, the values show Hamming distances, and the total Hamming distance for each unique option is in the rightmost column:
| | `-` | `T` | `T` | `T` | `-` | total |
|---|---|---|---|---|---|---|
| `-` | 0 | 1 | 1 | 1 | 0 | 3 |
| `T` | 1 | 0 | 0 | 0 | 1 | 2 |
Since `T`'s total Hamming distance of 2 is lower than `-`'s total distance of 3, the preferred variant is `T`.
You can see that for simple cases like this, choosing the variant with the minimum total Hamming distance is equivalent to choosing the most common variant.
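The equivalence can be checked directly for single-character variants. There, the Hamming distance between two options is 0 if they match and 1 otherwise, so an option's total distance is the number of options minus the number of copies of itself. The short Python sketch below uses illustrative variable names and is not taken from Trycycler:

```python
from collections import Counter

options = ["-", "T", "T", "T", "-"]
counts = Counter(options)

# For one-character options, the Hamming distance is 1 on a mismatch, else 0.
totals = {u: sum(u != o for o in options) for u in counts}

# Total distance equals (number of options) - (copies of this option).
assert all(totals[u] == len(options) - counts[u] for u in counts)

best = min(totals, key=totals.get)           # lowest total distance
most_common = counts.most_common(1)[0][0]    # highest count
print(totals, best, most_common)
```

Here both `best` and `most_common` come out as `T`. The sketch does not handle ties, where `min` simply returns the first option it encounters.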
Let's apply the same logic to the second point of variation which has three unique options:
| | `GAA-` | `--A-` | `--AT` | `--A-` | `--A-` | total |
|---|---|---|---|---|---|---|
| `GAA-` | 0 | 2 | 3 | 2 | 2 | 9 |
| `--A-` | 2 | 0 | 1 | 0 | 0 | 3 |
| `--AT` | 3 | 1 | 0 | 1 | 1 | 6 |
The minimum total Hamming distance is for the `--A-` variant, which once again happens to be the most common option. So far so good!
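For multi-character variants, the same calculation can be written as a small Python sketch that builds one table row per unique option. The helper names `hamming` and `distance_table` are illustrative assumptions, not Trycycler's own functions:

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    assert len(a) == len(b), "Hamming distance needs equal-length strings"
    return sum(x != y for x, y in zip(a, b))


def distance_table(options):
    """Return {unique option: [distance to each option..., total]}."""
    table = {}
    for unique in dict.fromkeys(options):  # de-duplicate, keep first-seen order
        row = [hamming(unique, other) for other in options]
        table[unique] = row + [sum(row)]
    return table


options = ["GAA-", "--A-", "--AT", "--A-", "--A-"]
table = distance_table(options)
for variant, row in table.items():
    print(variant, row)
# GAA- [0, 2, 3, 2, 2, 9]
# --A- [2, 0, 1, 0, 0, 3]
# --AT [3, 1, 0, 1, 1, 6]

# Pick the variant whose total (last entry of its row) is smallest.
best = min(table, key=lambda v: table[v][-1])
print(best)  # --A-
```

Gap characters are compared like any other character here, which matches the worked values in the table.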