-
Notifications
You must be signed in to change notification settings - Fork 28
Multiple sequence alignment
Before running this step, you'll need to have completed the previous one (reconciling contigs). I.e. you should have a Trycycler output directory (which I'll assume is called trycycler
) with subdirectories for each of your good clusters, each of which contains a 1_contigs
subdirectory and a 2_all_seqs.fasta
file.
Trycycler msa uses MUSCLE, so that will need to be installed before running (see Requirements).
This step takes the reconciled contig sequences (2_all_seqs.fasta
) and runs a multiple sequence alignment.
For example, it would take sequences like this:
GGCAGAGCGACGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGACGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGACTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG
And produce an alignment like this:
GGCAGAG--CGACGTAAA-TTACGAGT-AAAGGAGGGGA-GAGCATTAAG-CATGCCTAAACTG
GGCAGAGCGCGACGTAAA-TTACGAGTAAAAGGA-GGGAGGAGCATTAAGCCATGCCT--ACTG
GGCAGAGCGCGAC-TAAATTTACGAGT-AAAGGA-GGGAGGAGCAT--AGCCATGCCTAAACTG
The Trycycler msa command must be run separately for each of your good clusters. Assuming your good clusters are numbers 1, 7 and 8, these are the commands you would run:
trycycler msa --cluster_dir trycycler/cluster_001
trycycler msa --cluster_dir trycycler/cluster_007
trycycler msa --cluster_dir trycycler/cluster_008
Unlike in previous steps of Trycycler, the msa step is hands-off. I.e. no manual intervention is required – just run it and wait for it to finish.
Trycycler msa will typically take a few minutes to complete. Longer sequences and larger numbers of sequences will be slower.
Trycycler msa's has the following parameters you can adjust:
-
--kmer
: the k-mer size used for sequence partitioning (default = 32). -
--step
: the step size used for sequence partitioning (default = 1000). -
--lookahead
: the look-ahead margin used for sequence partitioning (default = 10000). -
--threads
: this is how many parallel instances of MUSCLE will be used when aligning the sequence partitions. It will only affect the speed performance, so you'll probably want to use as many threads as you have available.
You probably won't need to adjust the partitioning parameters (--kmer
, --step
and --lookahead
) and can just leave them as the defaults. If you're curious exactly what they are used for, see How it works.
When finished, Trycycler reconcile will make 3_msa.fasta
in the cluster directory, a FASTA-formatted multiple sequence alignment of the contigs ready for use in generating a consensus.
Since MUSCLE cannot handle very long input sequences, Trycycler msa first partitions the sequence into pieces, then it conducts a multiple sequence alignment on each, and finally it stitches the alignments together:
Input sequences:
GGCAGAGCGACGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGACGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGACTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG
Partitioned sequences:
GGCAGAGCGA CGTAAATTAC GAGTAAAGGAGGGGAGAGCATTAAGCATGC CTAAACTG
GGCAGAGCGCGA CGTAAATTAC GAGTAAAAGGAGGGAGGAGCATTAAGCCATGC CTACTG
GGCAGAGCGCGA CTAAATTTAC GAGTAAAGGAGGGAGGAGCATAGCCATGC CTAAACTG
Aligned partitions:
GGCAGAG--CGA CGTAAA-TTAC GAGT-AAAGGAGGGGA-GAGCATTAAG-CATGC CTAAACTG
GGCAGAGCGCGA CGTAAA-TTAC GAGTAAAAGGA-GGGAGGAGCATTAAGCCATGC CT--ACTG
GGCAGAGCGCGA C-TAAATTTAC GAGT-AAAGGA-GGGAGGAGCAT--AGCCATGC CTAAACTG
Merged alignments:
GGCAGAG--CGACGTAAA-TTACGAGT-AAAGGAGGGGA-GAGCATTAAG-CATGCCTAAACTG
GGCAGAGCGCGACGTAAA-TTACGAGTAAAAGGA-GGGAGGAGCATTAAGCCATGCCT--ACTG
GGCAGAGCGCGAC-TAAATTTACGAGT-AAAGGA-GGGAGGAGCAT--AGCCATGCCTAAACTG
The sequence partitioning is done by grabbing a k-mer (size determined by --kmer
) in the first sequence and finding it each of the other sequences (looking a margin ahead as determined by --lookahead
). If the k-mer is found once and only once in each sequence, then it forms the boundary of a partition. Then the position jumps ahead (determined by --step
) and the process repeats.
To work through an example, imagine we had these input sequences:
GGCAGAGCGACGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGACGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGACTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG
and parameters of --kmer 4 --step 10 --lookahead 20
(unrealistically small but appropriate for a toy example).
The first k-mer (GCGA
)is taken 10 bp (from --step
) into the first sequence:
GGCAGAGCGACGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
And then that k-mer is searched for in each of the sequences (only looking in a chunk of each sequence defined by the lookahead margin):
GGCAGAGCGACGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG GGCAGAGCGCGACGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG GGCAGAGCGCGACTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG -------------------- lookahead
And since that k-mer was found once and only once in each sequence, it is considered a safe place to divide the sequences:
GGCAGAGCGA CGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGA CGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGA CTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG
We then repeat the process, grabbing a k-mer 10 bp further on in the first sequence:
GGCAGAGCGA CGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
And search for it in all sequences:
GGCAGAGCGA CGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG GGCAGAGCGCGA CGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG GGCAGAGCGCGA CTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG -------------------- lookahead
Once again, it was found once and only once, so it is a safe division point:
GGCAGAGCGA CGTAAATTAC GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGA CGTAAATTAC GAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGA CTAAATTTAC GAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG
And again, grab a k-mer from the first sequence:
GGCAGAGCGA CGTAAATTAC GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
And search for it in all sequences:
GGCAGAGCGA CGTAAATTAC GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG GGCAGAGCGCGA CGTAAATTAC GAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG GGCAGAGCGCGA CTAAATTTAC GAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG -------------------- lookahead
But this time the k-mer was not found once and only once in each sequence, so it's not a safe division point. We jump ahead 10 bp and try again with a new k-mer:
GGCAGAGCGA CGTAAATTAC GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
And search for it in all sequences:
GGCAGAGCGA CGTAAATTAC GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG GGCAGAGCGCGA CGTAAATTAC GAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG GGCAGAGCGCGA CTAAATTTAC GAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG ------------------------------ lookahead
Once again we failed to find a unique hit, so we do another 10 bp jump to get a new k-mer:
GGCAGAGCGA CGTAAATTAC GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
And search:
GGCAGAGCGA CGTAAATTAC GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG GGCAGAGCGCGA CGTAAATTAC GAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG GGCAGAGCGCGA CTAAATTTAC GAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG -------------------------------------- lookahead
And now we've finally found a once-in-each-sequence hit so we have a safe division point:
GGCAGAGCGA CGTAAATTAC GAGTAAAGGAGGGGAGAGCATTAAGCATGC CTAAACTG
GGCAGAGCGCGA CGTAAATTAC GAGTAAAAGGAGGGAGGAGCATTAAGCCATGC CTACTG
GGCAGAGCGCGA CTAAATTTAC GAGTAAAGGAGGGAGGAGCATAGCCATGC CTAAACTG
In this way, Trycycler msa partitions the sequence into separate chunks while avoiding putting chunk boundaries in ambiguous or repetitive parts of the sequence.
- Home
- Software requirements
- Installation
-
How to run Trycycler
- Quick start
- Step 1: Generating assemblies
- Step 2: Clustering contigs
- Step 3: Reconciling contigs
- Step 4: Multiple sequence alignment
- Step 5: Partitioning reads
- Step 6: Generating a consensus
- Step 7: Polishing after Trycycler
- Illustrated pipeline overview
- Demo datasets
- Implementation details
- FAQ and miscellaneous tips
- Other pages
- Guide to bacterial genome assembly (choose your own adventure)
- Accuracy vs depth