Skip to content

Multiple sequence alignment

Ryan Wick edited this page May 19, 2020 · 21 revisions

Requirements

Before running this step, you'll need to have completed the previous one (reconciling contigs). I.e. you should have a Trycycler output directory (which I'll assume is called trycycler) with subdirectories for each of your good clusters, each of which contains a 1_contigs subdirectory and a 2_all_seqs.fasta file.

Trycycler msa uses MUSCLE, so that will need to be installed before running (see Requirements).

Concept

This step takes the reconciled contig sequences (2_all_seqs.fasta) and runs a multiple sequence alignment.

For example, it would take sequences like this:

GGCAGAGCGACGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGACGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGACTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG

And produce an alignment like this:

GGCAGAG--CGACGTAAA-TTACGAGT-AAAGGAGGGGA-GAGCATTAAG-CATGCCTAAACTG
GGCAGAGCGCGACGTAAA-TTACGAGTAAAAGGA-GGGAGGAGCATTAAGCCATGCCT--ACTG
GGCAGAGCGCGAC-TAAATTTACGAGT-AAAGGA-GGGAGGAGCAT--AGCCATGCCTAAACTG

Running Trycycler msa

The Trycycler msa command must be run separately for each of your good clusters. Assuming your good clusters are numbers 1, 7 and 8, these are the commands you would run:

trycycler msa --cluster_dir trycycler/cluster_001
trycycler msa --cluster_dir trycycler/cluster_007
trycycler msa --cluster_dir trycycler/cluster_008

Unlike in previous steps of Trycycler, the msa step is hands-off. I.e. no manual intervention is required – just run it and wait for it to finish.

Trycycler msa will typically take a few minutes to complete. Longer sequences and larger numbers of sequences will be slower.

Settings

Trycycler msa's has the following parameters you can adjust:

  • --kmer: the k-mer size used for sequence partitioning (default = 32).
  • --step: the step size used for sequence partitioning (default = 1000).
  • --lookahead: the look-ahead margin used for sequence partitioning (default = 10000).
  • --threads: this is how many parallel instances of MUSCLE will be used when aligning the sequence partitions. It will only affect the speed performance, so you'll probably want to use as many threads as you have available.

You probably won't need to adjust the partitioning parameters (--kmer, --step and --lookahead) and can just leave them as the defaults. If you're curious exactly what they are used for, see How it works.

Output

When finished, Trycycler reconcile will make 3_msa.fasta in the cluster directory, a FASTA-formatted multiple sequence alignment of the contigs ready for use in generating a consensus.

How it works

Since MUSCLE cannot handle very long input sequences, Trycycler msa first partitions the sequence into pieces, then it conducts a multiple sequence alignment on each, and finally it stitches the alignments together:

Input sequences:
  GGCAGAGCGACGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
  GGCAGAGCGCGACGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
  GGCAGAGCGCGACTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG

Partitioned sequences:
  GGCAGAGCGA    CGTAAATTAC  GAGTAAAGGAGGGGAGAGCATTAAGCATGC    CTAAACTG
  GGCAGAGCGCGA  CGTAAATTAC  GAGTAAAAGGAGGGAGGAGCATTAAGCCATGC  CTACTG
  GGCAGAGCGCGA  CTAAATTTAC  GAGTAAAGGAGGGAGGAGCATAGCCATGC     CTAAACTG

Aligned partitions:
  GGCAGAG--CGA  CGTAAA-TTAC  GAGT-AAAGGAGGGGA-GAGCATTAAG-CATGC  CTAAACTG
  GGCAGAGCGCGA  CGTAAA-TTAC  GAGTAAAAGGA-GGGAGGAGCATTAAGCCATGC  CT--ACTG
  GGCAGAGCGCGA  C-TAAATTTAC  GAGT-AAAGGA-GGGAGGAGCAT--AGCCATGC  CTAAACTG

Merged alignments:
  GGCAGAG--CGACGTAAA-TTACGAGT-AAAGGAGGGGA-GAGCATTAAG-CATGCCTAAACTG
  GGCAGAGCGCGACGTAAA-TTACGAGTAAAAGGA-GGGAGGAGCATTAAGCCATGCCT--ACTG
  GGCAGAGCGCGAC-TAAATTTACGAGT-AAAGGA-GGGAGGAGCAT--AGCCATGCCTAAACTG

The sequence partitioning is done by grabbing a k-mer (size determined by --kmer) in the first sequence and finding it each of the other sequences (looking a margin ahead as determined by --lookahead). If the k-mer is found once and only once in each sequence, then it forms the boundary of a partition. Then the position jumps ahead (determined by --step) and the process repeats.

Example partitioning

To work through an example, imagine we had these input sequences:

GGCAGAGCGACGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGACGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGACTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG

and parameters of --kmer 4 --step 10 --lookahead 20 (unrealistically small but appropriate for a toy example).

The first k-mer (GCGA)is taken 10 bp (from --step) into the first sequence:

GGCAGAGCGACGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG

And then that k-mer is searched for in each of the sequences (only looking in a chunk of each sequence defined by the lookahead margin):

GGCAGAGCGACGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGACGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGACTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG
--------------------
     lookahead

And since that k-mer was found once and only once in each sequence, it is considered a safe place to divide the sequences:

GGCAGAGCGA    CGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGA  CGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGA  CTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG

We then repeat the process, grabbing a k-mer 10 bp further on in the first sequence:

GGCAGAGCGA    CGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG

And search for it in all sequences:

GGCAGAGCGA    CGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGA  CGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGA  CTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG
              --------------------
                   lookahead

Once again, it was found once and only once, so it is a safe division point:

GGCAGAGCGA    CGTAAATTAC  GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGA  CGTAAATTAC  GAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGA  CTAAATTTAC  GAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG

And again, grab a k-mer from the first sequence:

GGCAGAGCGA    CGTAAATTAC  GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG

And search for it in all sequences:

GGCAGAGCGA    CGTAAATTAC  GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGA  CGTAAATTAC  GAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGA  CTAAATTTAC  GAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG
                          --------------------
                               lookahead

But this time the k-mer was not found once and only once in each sequence, so it's not a safe division point. We jump ahead 10 bp and try again with a new k-mer:

GGCAGAGCGA    CGTAAATTAC  GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG

And search for it in all sequences:

GGCAGAGCGA    CGTAAATTAC  GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGA  CGTAAATTAC  GAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGA  CTAAATTTAC  GAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG
                          ------------------------------
                                    lookahead

Once again we failed to find a unique hit, so we do another 10 bp jump to get a new k-mer:

GGCAGAGCGA    CGTAAATTAC  GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG

And search:

GGCAGAGCGA    CGTAAATTAC  GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGA  CGTAAATTAC  GAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGA  CTAAATTTAC  GAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG
                          --------------------------------------
                                        lookahead

And now we've finally found a once-in-each-sequence hit so we have a safe division point:

GGCAGAGCGA    CGTAAATTAC  GAGTAAAGGAGGGGAGAGCATTAAGCATGC    CTAAACTG
GGCAGAGCGCGA  CGTAAATTAC  GAGTAAAAGGAGGGAGGAGCATTAAGCCATGC  CTACTG
GGCAGAGCGCGA  CTAAATTTAC  GAGTAAAGGAGGGAGGAGCATAGCCATGC     CTAAACTG

In this way, Trycycler msa partitions the sequence into separate chunks while avoiding putting chunk boundaries in ambiguous or repetitive parts of the sequence.

Clone this wiki locally