-
Notifications
You must be signed in to change notification settings - Fork 9
8: Post Alignment Tasks
- Overview
- Relabel Sequence Records
- Trim Sequence Alignments
- Calculate Sequence Length Heterogeneity Statistics
- Convert to Nexus and Phylip Format
- Concatenate Alignments
After multiple sequence alignment, there are several tasks that can help prepare datasets for downstream analyses. One important task involves relabeling sequences, such that sequence labels are composed of taxon labels, accession numbers, voucher codes, or some combination of the above. This allows sequences to be readily identified with a unique name, making them compatible with concatenation. Details on sequence relabeling are provided in the Relabel Sequence Records section. After relabeling, alignment trimming can be performed. Multiple options are available for trimming, with details provided in the Trim Sequence Alignments section. Files with relabeled sequences can be readily converted into other common formats (such as nexus and phylip), and supermatrices can be created by concatenating alignments. Details for these tasks can be found in the Convert to Nexus and Phylip Format and Concatenate Alignments sections below.
NEW - Alignments can be assessed for sequence length variation. For each alignment, the sequence lengths are used to calculate the mean, minimum, maximum, standard deviation, and coefficient of variation. These metrics can be calculated from the aligned sequence lengths (start and stop position in the alignment), or from the unaligned sequence lengths (just the number of base pairs in a sequence). The aligned sequence lengths coefficient of variation (ASL-CV; Portik & Wiens, in review) is a standardized measure for comparing length variation across alignments, with higher ASL-CV values indicating greater variability in sequence lengths. More details are provided in the Calculate Sequence Length Heterogeneity Statistics section.
Up to this point, the sequences used in other SuperCRUNCH steps have been labeled using modified versions of the original description line (containing an accession, taxon label, gene description, voucher code, etc.). Many downstream programs are unable to use these long labels, and so they require conversion to a simpler format. Relabeling can be performed in SuperCRUNCH using several different options. For example, sequences can be labeled using:
-
A taxon name. This option can include species (two-part) and subspecies (three-part) names. This will allow alignments to be concatenated.
-
An accession number. This will NOT allow alignments to be concatenated, but guarantees all sequences present in a particular alignment will have a unique name.
-
A combination of a taxon name (species, subspecies) and an accession number. This will NOT allow alignments to be concatenated, but guarantees all sequences present in a particular alignment will have a unique name.
-
A combination of a voucher code and any of the above three options. The
taxon + voucher
names could allow for concatenation (e.g., for phylogeographic studies), but theaccession + voucher
names orvoucher + taxon + accession
names will not.
The relabeling strategy selected will depend on the type of dataset being produced. For example, a species-level supermatrix would require all sequences be labeled with a taxon name to allow for concatenation. A population-level dataset for a single species could be labeled by accession number, or by taxon + accession number. This will ensure all sequences within an alignment are given a unique name, but it will not allow them to be linked across loci. To link samples in a population-level dataset, the voucher option can be used in combination with the taxon name option. Given that there are many possible ways to relabel sequences, instructions for usage and clear examples of all options are provided in the section below.
The Fasta_Relabel_Seqs.py
module will relabel the sequences within fasta files based on the relabeling option chosen. Three base options are available, and are used with the -r
flag:
-
-r species
- Relabel using a taxon name. This option can include species (two-part) and subspecies (three-part) names. This will allow alignments to be concatenated. Example:Hyperolius_viridiflavus
,Hyperolius_viridiflavus_pantherinus
. -
-r accession
- Relabel using an accession number. This will NOT allow alignments to be concatenated, but guarantees all sequences present in a particular alignment will have a unique name. Example:AF338336.1
. -
-r species_acc
- Relabel using a combination of a taxon name (species, subspecies) and an accession number. This will NOT allow alignments to be concatenated, but guarantees all sequences present in a particular alignment will have a unique name. Example:Draco_volans_AB023748.1
.
By default, only species (two-part) names are constructed for the -r species
and -r species_acc
options. Subspecies can be included by using the -s
flag. The full path to a text file containing the subspecies to include must be provided with this flag. This can be the same taxon list file used in previous steps - it can include a full list of taxa (species and subspecies). However, due to how this feature is implemented, only the subspecies names are pulled from this taxon list file.
The --voucherize
flag can be used to append a voucher code (if present) to the end of the name produced by any of the three base options listed above. This option is generally recommended only if you used the --vouchered
feature in the Filter_Seqs_and_Species.py
module. This previous step would have only allowed sequences that had voucher information to be selected. This guarantees all sequences have a voucher code. If sequences have the same voucher code, using the --voucherize
flag with the -r species
option for relabeling will produce a consistent name for a given sample across all loci it is found in. In other words, this feature can be used to link samples in phylogeographic datasets. This can allow for concatenation downstream, or sample cohesion in various phylogeographic analyses. Here are some examples of what the --voucherize
feature can produce, based on a record with an accession of GU931545.1
, taxon name of Trachylepis aurata
, and voucher ID tag of Voucher_ZFMK75837
:
-
-r species
+--voucherize
:Trachylepis_aurata_ZFMK75837
-
-r accession
+--voucherize
:GU931545.1_ZFMK75837
-
-r species_acc
+--voucherize
:Trachylepis_aurata_GU931545.1_ZFMK75837
More empirical examples of the relabeling options are provided below.
Regardless of the relabeling scheme selected, an output key (tab-delimited table) is generated that shows the original description line for a sequence and its relabeled name. This allows you to lookup relabeled sequences to find all the information present in the original description. This could be particularly useful if you decide to relabel sequences using accession numbers, which are devoid of all taxonomic and voucher information.
Fasta_Relabel_Seqs.py
is designed to work with a directory of alignment files in fasta format. To be recognized, the input fasta files should be labeled as NAME.fasta
or NAME.fa
. The NAME
portion should not contain any periods or spaces, but can contain underscores.
python Fasta_Relabel_Seqs.py -i <input directory> -o <output directory> -r <relabeling option>
Required: The full path to a directory which contains the input fasta files.
Required: The full path to an existing directory to write output files.
Required: The strategy for relabeling sequence records. Choices = species, accession, species_acc.
Optional: Allows subspecies names to be included in the relabeling. This flag requires a full path to a text file containing all subspecies names to cross-reference in the fasta file(s). This can be the same taxon list file used in previous steps, which likely contains a mix of species and subspecies labels.
Optional: If a
Voucher_[ID]
entry is included in the sequence description, append the ID component to the end of the species/subspecies label (e.g.,Hyperolius_nitidulus_MVZ236473
). TheVoucher_[ID]
field is generated by theParse_Loci.py
module.
python Fasta_Relabel_Seqs.py -i bin/Relabel/Input -o bin/Relabel/Output -r species -s bin/Relabel/Taxon-list.txt
Above command will relabel sequences in all the alignment files present in directory
bin/Relabel/Input
using the species option (taxon names). If one of the subspecies labels present in theTaxon-list.txt
file is found for a given sequence, that subspecies label will be used (rather than a species label).
python Fasta_Relabel_Seqs.py -i bin/Relabel/Input -o bin/Relabel/Output -r accession
Above command will relabel sequences in all the alignment files present in directory
bin/Relabel/Input
using the accession option (taxon names).
python Fasta_Relabel_Seqs.py -i bin/Relabel/Input -o bin/Relabel/Output -r species -s bin/Relabel/Taxon-list.txt --voucherize
Above command will relabel sequences in all the alignment files present in directory
bin/Relabel/Input
using the species option (taxon names). If one of the subspecies labels present in theTaxon-list.txt
file is found for a given sequence, that subspecies label will be used (rather than a species label). For all sequences with aVoucher_[ID]
field, the voucher code will be appended to the end of the species or subspecies label.
Several outputs are created in the output directory specified. A main directory is created, and the name depends on the -r
option used:
-
Directory
Relabeled-by-species
- Results from using the-r species
option. -
Directory
Relabeled-by-accession
- Results from using the-r accession
option. -
Directory
Relabeled-by-species_acc
- Results from using the-r species_acc
option.
The reason these directories are labeled by the -r
option is so that you can run all three options using the same output directory - the outputs won't be overwritten.
For any of the above directories created, two subdirectories will also be created inside:
-
Directory
Relabeled-fasta-files
- This directory will contain all the alignment files with sequences relabeled, calledNAME_relabeled.fasta
. -
Directory
Relabeling-key-files
- This directory will contain all the relabeling keys, calledNAME_label_key.txt
. This summary file contains three columns:-
Accession - The accession number of the sequence.
-
Taxon - The taxon name used for relabeling - note that this could be different for a given sample when the subspecies option is included (
-s
). -
Description - The original description of the sequence (minus accession number) that was used for relabeling.
-
Here is an example of the contents of a NAME_label_key.txt
file:
Accession Taxon Description
AF215137.1 Brookesia peyrierasi Brookesia peyrierasi DESCRIPTION 12s ribosomal rna gene partial sequence mitochondrial
GQ921666.1 Brookesia superciliaris Brookesia superciliaris Voucher_ZCMV150 DESCRIPTION voucher zcmv150 12s ribosomal rna rrns gene partial sequence mitochondrial
GQ921671.1 Brookesia therezieni Brookesia therezieni Voucher_DRV5924 DESCRIPTION voucher drv5924 12s ribosomal rna rrns gene partial sequence mitochondrial
To illustrate the effect of using the various relabeling options, two datasets will be shown. The first is a species-level dataset, which is intended to be used to create a supermatrix, and the second is a vouchered population-level dataset, which is intended to be used for phylogeographic analyses.
This dataset contains the following sequences in an alignment (16S):
>JX564853.1 Arthroleptis poecilonotus DESCRIPTION mitochondrion partial genome
---TATAAGAGGCCCAGCCTGCCCAGTGATAATTTCAACGGCCGCGGTATCCTAACCGTGCAAAGGTAGCACAATAACTTGT-TCTTTC
>EF621774.1 Cardioglossa gracilis Voucher_MCZ_A136796 DESCRIPTION voucher mcz a136796 16s ribosomal rna gene partial sequence mitochondrial
---TACAAAAGGCCTAGCCTGCCCAGTGATTATTTCAACGGCCGCGGTACCCTAACCGTGCAAAGGTAGCACAATCACTTGT-TCTTTAA
>JQ711164.1 Cardioglossa melanogaster Voucher_4628 DESCRIPTION isolate 4628 16s ribosomal rna gene partial sequence mitochondrial
---TATAAGAGGCCCCGCCTGCCCAGTGATTATTTCAACGGCCGCAGTAACCTAACCGCGCAAAGGTAGCACAATCACTTGT-TCTCTAA
>EF621776.1 Cardioglossa elegans Voucher_AMCC_117611 DESCRIPTION voucher amcc 117611 16s ribosomal rna gene partial sequence mitochondrial
----ATAAGAGGCCACGCCTGCCCAGTGATCATTTCAACGGCCACGGTACCCTAACCGCGCGAAGGTAGCACAATCACTTGT-TCTCTAA
>JX564856.1 Cardioglossa leucomystax DESCRIPTION mitochondrion partial genome
---TATAAGAGGCCCCGCCTGCCCAGTGATTATTTCAACGGCCACGGTACCCTAACCGCGCGAAGGTAGCACAATCACTTGT-TCTCTAA
For simplicity, the sequence descriptions are shown here without the intervening sequence data:
>JX564853.1 Arthroleptis poecilonotus DESCRIPTION mitochondrion partial genome
>EF621774.1 Cardioglossa gracilis Voucher_MCZ_A136796 DESCRIPTION voucher mcz a136796 16s ribosomal rna gene partial sequence mitochondrial
>JQ711164.1 Cardioglossa melanogaster Voucher_4628 DESCRIPTION isolate 4628 16s ribosomal rna gene partial sequence mitochondrial
>EF621776.1 Cardioglossa elegans Voucher_AMCC_117611 DESCRIPTION voucher amcc 117611 16s ribosomal rna gene partial sequence mitochondrial
>JX564856.1 Cardioglossa leucomystax DESCRIPTION mitochondrion partial genome
Three of the five sequences have voucher numbers. Here are the outcomes of each relabeling option (note - sequence data not shown for simplicity!):
>Arthroleptis_poecilonotus
>Cardioglossa_gracilis
>Cardioglossa_melanogaster
>Cardioglossa_elegans
>Cardioglossa_leucomystax
These taxon names will be consistent across loci, and will allow these alignments to be concatenated. Because this is a species-level dataset (one sequence per taxon), there will not be any duplicate names WITHIN any given alignment.
>JX564853.1
>EF621774.1
>JQ711164.1
>EF621776.1
>JX564856.1
Using accession numbers guarantees there will not be any duplicate names WITHIN an alignment, but we cannot concatenate alignments labeled this way.
>Arthroleptis_poecilonotus_JX564853.1
>Cardioglossa_gracilis_EF621774.1
>Cardioglossa_melanogaster_JQ711164.1
>Cardioglossa_elegans_EF621776.1
>Cardioglossa_leucomystax_JX564856.1
Here we see the taxon name, followed by the accession number. This produces a unique name for each sequence WITHIN an alignment, but we cannot concatenate alignments labeled this way.
>Arthroleptis_poecilonotus
>Cardioglossa_gracilis_MCZ_A136796
>Cardioglossa_melanogaster_4628
>Cardioglossa_elegans_AMCC_117611
>Cardioglossa_leucomystax
Here we see the taxon name, followed by the voucher number if it is available. We might be able to concatenate alignments labeled this way - but it requires:
- a taxon to be missing voucher information for all sequences or
- a taxon to have the exact same voucher number for all sequences Even one taxon with a mismatch of voucher numbers would interfere with concatenation. So, this is kind of a special case scenario.
In general, for a species-level dataset produced from GenBank there is no expectation that the sequences for each taxon come from the same sample. The --voucherize
option in this case would prevent alignment concatenation, because sequences of the same taxon would likely be labeled differently. If concatenation is not the goal, then labeling species with voucher codes could be useful (e.g., for constructing a particular gene tree). If you are using local sequences and all samples have consistent voucher codes, then this labeling option should work great for concatenation.
>JX564853.1
>EF621774.1_MCZ_A136796
>JQ711164.1_4628
>EF621776.1_AMCC_117611
>JX564856.1
Here we see the accession, followed by the voucher number if it is available. I'm not sure why you would want to do it this way, but it's available!
>Arthroleptis_poecilonotus_JX564853.1
>Cardioglossa_gracilis_EF621774.1_MCZ_A136796
>Cardioglossa_melanogaster_JQ711164.1_4628
>Cardioglossa_elegans_EF621776.1_AMCC_117611
>Cardioglossa_leucomystax_JX564856.1
Here we see the taxon name, followed by the accession number, followed by the voucher number if it is available. I'm not sure why you would want to do it this way, but it's available!
This dataset contains the following sequences in an alignment (ND2):
>GU931545.1 Trachylepis aurata Voucher_ZFMK75837 DESCRIPTION isolate zfmk75837 nadh dehydrogenase subunit 2 nd2 gene complete cds mitochondrial
CCTTCCTATACTAATGAATCCACTTATATATTCACTTATCATATCGAGCATTGCAATGGGAACAATTATTACAATATCAAGCTTCCACTG
>GU931546.1 Trachylepis aurata Voucher_ZFMK84085 DESCRIPTION transcaucasica isolate zfmk84085 nadh dehydrogenase subunit 2 nd2 gene complete cds mitochondrial
CCTTCCTATACTAATGAACCCACTTATATATTCACTCATTCTGTCAAGCATTGCAACAGGAACAATTATTACAATATCAAGTTTCCACTG
>GU931601.1 Trachylepis varia Voucher_KHH003 DESCRIPTION isolate khh003 nadh dehydrogenase subunit 2 nd2 gene complete cds mitochondrial
ATCCTTCCTATACTAATGAACCCTATTATGTCCTCCATCATCCTATCAAGCGTAGCAATTGGTACAATTATTACAATATCAAGCTATCAC
>GU931602.1 Trachylepis varia Voucher_KHH006 DESCRIPTION isolate khh006 nadh dehydrogenase subunit 2 nd2 gene complete cds mitochondrial
ATCCTTCCTATACTAATGAACCCTATTATGTCCTCCATCATCCTATCAAGCGTAGCAATTGGTACAATTATTACAATATCAAGCTATCAC
>GU931603.1 Trachylepis varia Voucher_MCZZ23277 DESCRIPTION isolate mczz23277 nadh dehydrogenase subunit 2 nd2 gene complete cds mitochondrial
ATCCTTCCTATACTAATGAACCCTATTATGTCCTCCATCATCCTATCAAGCGTAGCAATTGGTACAATTATTACAATATCAAGCTATCAC
For simplicity, the sequence descriptions are shown here without the intervening sequence data:
>GU931545.1 Trachylepis aurata Voucher_ZFMK75837 DESCRIPTION isolate zfmk75837 nadh dehydrogenase subunit 2 nd2 gene complete cds mitochondrial
>GU931546.1 Trachylepis aurata Voucher_ZFMK84085 DESCRIPTION transcaucasica isolate zfmk84085 nadh dehydrogenase subunit 2 nd2 gene complete cds mitochondrial
>GU931601.1 Trachylepis varia Voucher_KHH003 DESCRIPTION isolate khh003 nadh dehydrogenase subunit 2 nd2 gene complete cds mitochondrial
>GU931602.1 Trachylepis varia Voucher_KHH006 DESCRIPTION isolate khh006 nadh dehydrogenase subunit 2 nd2 gene complete cds mitochondrial
>GU931603.1 Trachylepis varia Voucher_MCZZ23277 DESCRIPTION isolate mczz23277 nadh dehydrogenase subunit 2 nd2 gene complete cds mitochondrial
These sequences come from a single phylogeographic study. They all have voucher IDs, and were filtered accordingly using the Filter_Seqs_and_Species.py
module. For this project, we know each sample can be linked across loci using their unique voucher code. Here are the outcomes of each relabeling option (note - sequence data not shown for simplicity!):
>Trachylepis_aurata
>Trachylepis_aurata
>Trachylepis_varia
>Trachylepis_varia
>Trachylepis_varia
The taxon name is written for each sequence. However, because multiple sequences come from each taxon, this will create duplicate names in the file. This is problematic for many programs (including RAxML), and makes it impossible to identify which sequence belongs to which sample within a given species.
>GU931545.1
>GU931546.1
>GU931601.1
>GU931602.1
>GU931603.1
The accession numbers are written for each sequence. This solves the problem of duplicate names within the file, but there is no way to link samples across loci.
>Trachylepis_aurata_GU931545.1
>Trachylepis_aurata_GU931546.1
>Trachylepis_varia_GU931601.1
>Trachylepis_varia_GU931602.1
>Trachylepis_varia_GU931603.1
Here we see the taxon name, followed by the accession number. This solves the problem of duplicate names within the file, but there is no way to link samples across loci.
>Trachylepis_aurata_ZFMK75837
>Trachylepis_aurata_ZFMK84085
>Trachylepis_varia_KHH003
>Trachylepis_varia_KHH006
>Trachylepis_varia_MCZZ23277
Here we see the taxon name, followed by the voucher number if it is available. This will allow identification of each unique sample, and allow the samples to be linked across loci. For a vouchered population-level dataset, this relabeling scheme will allow for alignment concatenation.
>GU931545.1_ZFMK75837
>GU931546.1_ZFMK84085
>GU931601.1_KHH003
>GU931602.1_KHH006
>GU931603.1_MCZZ23277
Here we see the accession number, followed by the voucher number if it is available. This won't allow concatenation, but all sequence IDs will be unique WITHIN an alignment.
>Trachylepis_aurata_GU931545.1_ZFMK75837
>Trachylepis_aurata_GU931546.1_ZFMK84085
>Trachylepis_varia_GU931601.1_KHH003
>Trachylepis_varia_GU931602.1_KHH006
>Trachylepis_varia_GU931603.1_MCZZ23277
Here we see the taxon name, followed by the accession number, followed by the voucher number if it is available. This won't allow concatenation, but all sequence IDs will be unique WITHIN an alignment.
Hopefully the above examples illustrate how the relabeling options work, and why you might choose one over another. In general, names within an alignment must be unique. Whether or not you want names to be consistent across loci is another topic, and depends on your goals. For a species-level dataset being used to construct a supermatrix, you will want to relabel by taxon (-r species
). This will make the names unique within an alignment and consistent across alignments, allowing for concatenation. For a vouchered population-level dataset, you may want to relabel by taxon and voucher (-r species
+ --voucherize
). This will also make the names unique within an alignment and consistent across alignments, allowing for concatenation or samples to be linked in other analyses.
For other datasets, it may not be necessary or possible to link names across loci. A good example is a population-level dataset that is produced from many studies. There may be great disparity in the number of samples sequenced for different loci, and it may be that each study used a separate set of samples. In this case, it is more important that sample names be unique within alignments, as it is unlikely that any name combinations will be consistent across loci. For this type of dataset, the -r accession
or -r species_acc
options would work well.
Trimming alignments can remove poorly aligned and gap-rich regions, and may be desirable under some circumstances. There are a variety of methods that perform some type of automated trimming. SuperCRUNCH offers two different approaches for trimming. The first approach uses trimAl to trim alignments (Trim_Alignments_Trimal.py), and specifically implements three of the options in trimAl (gap-threshold trimming, the gappyout algorithm, and remove columns of only gaps). The second approach is more customizable (Trim_Alignments_Custom.py), and is largely based on the trimming routine included in phyluce. The implementation of this routine in SuperCRUNCH allows for edge trimming, row trimming, or a combination of edge and row trimming. Parameters controlling each of these settings can be run using default values, or tuned to particular datasets.
There is always a tradeoff between keeping missing data (and potential phylogenetically informative sites) versus eliminating missing data. Running the trimming routines available in SuperCRUNCH is fast and easy, so it is worth generating untrimmed and trimmmed alignments to investigate what the downstream effects may be.
The Trim_Alignments_Trimal.py
module relies on the program trimAl to batch trim all alignment files in a directory. There are four options (-a
) for using trimAl in this script:
-
-a gt
- Implements the gap-threshold algorithm (trimal -gt
). A gap-threshold is the minimum fraction of sequences without a gap that you require to keep a column. A value of 1 would only keep columns that are free of gaps, and a value of 0.6 would remove columns with gaps in more than 40% of the sequences. In other words, at least 60% of the sequences must have data in the column for it to be kept. The default value is set to 0.05, meaning any column with gaps in more than 95% of the sequences will be removed. This default value is intended to remove very poorly aligned regions and long, overhanging edges produced by a few sequences in an alignment. You can change the value assigned by using the optional--gt
flag with any value from 0-1. -
-a noallgaps
- Implements the noallgaps algorithm (trimal -noallgaps
). This simply removes all columns that contain only gaps (100% missing data). This is equivalent to running the gap-threshold algorithm with a value of 1. -
-a both
- Runs the gap-threshold algorithm, followed by the noallgaps algorithm. This ensures that there will not be any columns composed of only gaps after trimming. The optional--gt
flag can be used with this option. -
-a gappyout
- Implements the gappyout algorithm (trimal -gappyout
). This automated method calculates a minimum gap-score cut-off based on the input alignment characteristics and trims all alignment columns falling below the threshold value. This method adapts parameters for each input alignment, rather than apply the same fixed parameters to all alignments (like the gap-threshold method).
Trim_Alignments_Trimal.py
is designed to work with a directory of alignment files. Unlike most other steps in SuperCRUNCH, multiple types of input formats can be used. The input alignment file formats are auto-detected and can be include fasta, nexus, or phylip formats. The input alignment files must be labeled with one of the following extensions to be read: NAME.fasta
, NAME.fa
, NAME.nexus
, NAME.nex
, NAME.phylip
, or NAME.phy
. The output format must be specified using the -f
flag, and includes fasta, nexus, and phylip formats.
It is essential that you relabel your sequences prior to trimming. If sequences have not been relabeled, trimAl will label output sequences using only their accession number. Please use the Fasta_Relabel_Seqs.py module for this task.
python Trim_Alignments_Trimal.py -i <input directory> -o <output directory> -a <trimming method in trimal> -f <output format>
Required: The full path to a directory which contains the input alignment files. File formats are auto-detected and can be include fasta, nexus, or phylip formats, but input alignment files must be labeled with one of the following extensions:
NAME.fasta
,NAME.fa
,NAME.nexus
,NAME.nex
,NAME.phylip
, orNAME.phy
.
Required: The full path to an existing directory to write output files.
Required: Specify the output file format for trimmed alignment. Choices = fasta, nexus, phylip.
Required: Specify the trimal method for trimming alignments. Choices = gt, noallgaps, both, gappyout.
Optional: Specify the gt (gap threshold) value for trimal - the minimum fraction of sequences without a gap. Default = 0.05.
python Trim_Alignments.py -i /bin/Trim/input -o /bin/Trim/output -f fasta -a gt
Above command will trim all alignments present in the directory
/bin/Trim/input
using the gap threshold method with a default value of 0.05, and output files in fasta format.
python Trim_Alignments.py -i /bin/Trim/input -o /bin/Trim/output -f nexus -a gt --gt 0.3
Above command will trim all alignments present in the directory
/bin/Trim/input
using the gap threshold method with a value of 0.3, and output files in nexus format.
python Trim_Alignments.py -i /bin/Trim/input -o /bin/Trim/output -f phylip -a noallgaps
Above command will trim alignments present in the directory
/bin/Trim/input
using the noallgaps method and output files in phylip format.
python Trim_Alignments.py -i /bin/Trim/input -o /bin/Trim/output -f nexus -a both --gt 0.08
Above command will trim alignments present in the directory
/bin/Trim/input
using the gap threshold method with a value of 0.08, then run the noallgaps method, and output files in nexus format.
python Trim_Alignments.py -i /bin/Trim/input -o /bin/Trim/output -f fasta -a gappyout
Above command will trim alignments present in the directory
/bin/Trim/input
using the gappyout method and output files in fasta format.
Outputs are created in the output directory specified, and depend on the output file format selected. The output directories can include:
-
Directory
/Trimmed-fasta-Files
- A directory containing a fasta-formatted trimmed alignment for each input alignment, labeled asNAME_trimmed.fasta
. -
Directory
/Trimmed-nexus-Files
- A directory containing a nexus-formatted trimmed alignment for each input alignment, labeled asNAME_trimmed.nex
. -
Directory
/Trimmed-phylip-Files
- A directory containing a phylip-formatted trimmed alignment for each input alignment, labeled asNAME_trimmed.phy
.
The Trim_Alignments_Custom.py
module uses a custom trimming function to batch trim all input alignments. This custom trimming function is a heavily modified version of the generic_align.py
module of phyluce. There are three main options, which can be specified using the required -a
flag:
-
-a edges_internal
- Executes edge (block) trimming and internal (row) trimming. -
-a edges
- Executes edge (block) trimming only. -
-a internal
- Executes internal (row) trimming only.
The trimming for each of these options is guided by the optional flags:
-
-w
- window: Sliding window size for trimming. Default is 20 base pairs. -
-p
- proportion: The proportion of taxa required to have sequence at alignment ends. Default is 0.65. -
-t
- threshold: The proportion of residues required across the window in proportion of taxa. Default is 0.65. -
-d
- max_divergence: The max proportion of sequence divergence allowed between any row of the alignment and the consensus sequence. Default is 0.20. -
-l
- min_length: The minimum length of alignments to keep. Default is 100 base pairs.
These optional flags have set default values, but they can be changed by using the relevant flags with a user-supplied value. After custom trimming, the alignments are 'cleaned up' using trimAl.
The output format must be specified using the required -f
flag, and includes fasta, nexus, and phylip).
In general, this trimming function is very aggressive in removing columns, particularly when sequence length heterogeneity is relatively high. An output file called Trimming_Summary.txt
is created, which has information about the starting lengths and trimmed lengths for every input alignment. If too many columns are being removed, you can try adjusting the optional arguments and inspect this output file to see the effect on the number of columns removed, and further adjust parameters accordingly. If changing parameter settings still results in the removal of too many columns, you should consider using the Trim_Alignments_Trimal.py module instead.
Trim_Alignments_Custom.py
is designed to work with a directory of fasta-formatted alignment files. To be recognized, the input fasta files should be labeled as NAME.fasta
or NAME.fa
. The NAME
portion should not contain any periods or spaces, but can contain underscores.
It is essential that you relabel your sequences prior to trimming. Please use the Fasta_Relabel_Seqs.py module for this task.
python Trim_Alignments_Custom.py -i <input directory> -o <output directory> -a <custom trimming method> -f <output format>
Required: The full path to a directory which contains the aligned and relabeled fasta files to trim
Required: The full path to an existing directory to write output files.
Required: Specify the method for trimming alignments. Choices = edges_internal, edges, internal.
Required: Specify the output file format for trimmed alignment. Choices = fasta, nexus, phylip.
Optional: Sliding window size for trimming. Default = 20.
Optional: The proportion of taxa required to have sequence at alignment ends. Default = 0.65.
Optional: The proportion of residues required across the window in proportion of taxa. Default = 0.65.
Optional: The max proportion of sequence divergence allowed between any row of the alignment and the consensus sequence. Default = 0.20.
Optional: The minimum length of alignments to keep. Default = 100.
python Trim_Alignments_Custom.py -i bin/Trim/input -o bin/Trim/output -a edges_internal -f fasta
Above command will trim all alignments present in the directory
/bin/Trim/input
using the edge and internal trimming methods with all default values, and produce output files in fasta format.
python Trim_Alignments_Custom.py -i bin/Trim/input -o bin/Trim/output -a edges -f nexus
Above command will trim all alignments present in the directory
/bin/Trim/input
using only the edge trimming method with all default values, and produce output files in nexus format.
python Trim_Alignments_Custom.py -i bin/Trim/input -o bin/Trim/output -a internal -f phylip
Above command will trim all alignments present in the directory
/bin/Trim/input
using only the internal trimming method with all default values, and produce output files in phylip format.
python Trim_Alignments_Custom.py -i bin/Trim/input -o bin/Trim/output -a edges_internal -f fasta -w 50 -d 0.3 -l 500
Above command will trim all alignments present in the directory
/bin/Trim/input
using the edge and internal trimming methods. Some default values are changed. The window size is set to 50 (vs. 20), the max divergence between seqs and consensus is set to 0.3 (vs. 0.2), and the minimum length of alignments must be 500 bp to keep them. This will produce output files in fasta format.
Outputs are created in the output directory specified, and depend on the output file format selected. The output directories can include:
-
Directory
/Trimmed-fasta-Files
- A directory containing a fasta-formatted trimmed alignment in for each input alignment, labeled asNAME_trimmed.fasta
. -
Directory
/Trimmed-nexus-Files
- A directory containing a nexus-formatted trimmed alignment in for each input alignment, labeled asNAME_trimmed.nex
. -
Directory
/Trimmed-phylip-Files
- A directory containing a phylip-formatted trimmed alignment in for each input alignment, labeled asNAME_trimmed.phy
.
In addition, the following directory will always be created:
-
Directory
Summary-File
- A directory containing a single summary file calledTrimming_Summary.txt
. This summary file contains four columns:-
Fasta - The name of the input fasta file.
-
Start_Length - The starting length of the alignment.
-
Trimmed_Length - The length of the alignment after trimming.
-
Bases_Trimmed - The number of columns trimmed from the alignment.
-
Here is an example of the contents of Trimming_Summary.txt
:
Fasta Start_Length Trimmed_Length Bases_Trimmed
12S_CLUSTALO_Aligned_relabeled.fasta 1435 696 739
16S_CLUSTALO_Aligned_relabeled.fasta 1829 **FAILED** **NA**
ADNP_MACSE_Aligned_relabeled.fasta 918 561 357
AHR_MACSE_Aligned_relabeled.fasta 648 424 224
AKAP9_MACSE_Aligned_relabeled.fasta 1479 458 1021
BACH2_MACSE_Aligned_relabeled.fasta 1446 1437 9
BDNF_MACSE_Aligned_relabeled.fasta 897 624 273
CO1_CLUSTALO_Aligned_relabeled.fasta 1602 **FAILED** **NA**
CXCR4_MACSE_Aligned_relabeled.fasta 720 529 191
CYTB_CLUSTALO_Aligned_relabeled.fasta 1183 **FAILED** **NA**
Note that any alignment with a **Failed**
tag in the Trimmed_Length column failed the minimum length filter. The above example was run using -a edges_internal
with the default parameter values. In this example, the 1,829 base pair 16S alignment was reduced to less than 100 bp. Similarly, the 1,602 bp CO1 and the 1,183 bp CYTB alignments were also reduced to less than 100 bp. Many of the nuclear loci alignments were also reduced to half of their starting length (or more!). This is what I mean when I say that this trimming function is rather aggressive in removing columns, especially for divergent or hard to align sequences (such as the mtDNA loci here). This is the built-in trimming routine (with identical default values) used for UCE loci processed using phyluce. I strongly recommend manipulating the optional parameter values to see their effects, and also comparing the results of Trim_Alignments_Custom.py
to Trim_Alignments_Trimal.py - the trimAl methods are likely to be far less aggressive in removing columns.
Alignments can be assessed for sequence length variation. For each alignment, the sequence lengths are used to calculate the mean, minimum, maximum, standard deviation, and coefficient of variation (the ratio of the standard deviation to the mean). These metrics can be calculated from:
-
Unaligned sequence lengths - For each sequence, this is the number of actual bases in the sequence. This obtained by stripping all gaps from the sequence, and simply counting the number of bases. Although obtained from an alignment, this is essentially calculating length variation in a set of unaligned sequences.
-
Aligned sequence lengths - For each sequence, this is the difference between the start and stop position of the sequence in the alignment. This can include gapped positions in the length calculation. This is calculating length variation in a sequence alignment.
The aligned sequence lengths coefficient of variation (ASL-CV; Portik & Wiens, in review) is a standardized measure for comparing length variation across alignments, with higher ASL-CV values indicating greater variability in sequence lengths.
The Seq_Length_Heterogeneity.py
module calculates several metrics in an effort to quantify sequence length variation within alignments. The user must select the sequence type to calculate metrics from (-s
), with two choices:
-
-s bases
- Only actual nucleotides are counted for a sequence (gaps ignored). The sequence length is obtained from the sum of the number of real bases present. -
-s alns
- The position number of the first nucleotide and last nucleotide in the sequence alignment are used to calculate the length of a sequence. This can include gapped positions in the length calculation.
From the lengths obtained for all the sequences in an alignment, the following metrics are calculated: mean, minimum, maximum, standard deviation, and the coefficient of variation.
The results for all alignments are written to a tab-delimited output file in the output directory specified using the -o
flag.
Seq_Length_Heterogeneity.py
is designed to work for a directory that contains either phylip or fasta alignment files, with the format specified by the -f
flag. The input alignment files must be labeled with one of the following extensions to be read: NAME.fasta
, NAME.fa
, NAME.phylip
, or NAME.phy
.
python Seq_Length_Heterogeneity.py -i <input directory> -o <output directory> -f <input format> -s <sequence length type>
Required: The full path to a directory which contains the input alignment files (fasta or phylip format).
Required: The full path to an existing directory to write output files.
Required: The file format of the INPUT alignments. Choices = fasta, phylip.
Required: Use de-aligned sequence lengths (
bases
) or aligned sequence lengths (alns
) for calculations. Choices = bases, alns.
python Seq_Length_Heterogeneity.py -i bin/SeqHet/input -o bin/SeqHet/output -f fasta -s bases
Above command will calculate sequence length heterogeneity metrics for all fasta files in directory
bin/SeqHet/input
based on the actual number of bases in each sequence.
python Seq_Length_Heterogeneity.py -i bin/SeqHet/input -o bin/SeqHet/output -f fasta -s alns
Above command will calculate sequence length heterogeneity metrics for all fasta files in directory
bin/SeqHet/input
based on the start and stop position of each sequence in the alignment.
Outputs are created in the output directory specified, and depend on the length method selected. The outputs can include:
-
File
Sequence_Length_Heterogeneity_Results_BasePairLengths.txt
- The output file created when option-s bases
is used. -
File
Sequence_Length_Heterogeneity_Results_AlignedLengths.txt
- The output file created when option-s alns
is used.
Both files will contain the following columns:
-
Alignment_File - The name of the input alignment file.
-
MeanLength - The average length of the sequences contained in the alignment.
-
MinLength - The minimum sequence length contained in the alignment.
-
MaxLength - The maximum sequence length contained in the alignment.
-
SD - The standard deviation of the sequence lengths.
-
CoV - The coefficient of variation (the ratio of the standard deviation to the mean) of the sequence lengths.
Here is an example of the contents:
Alignment_File MeanLength MinLength MaxLength SD CoV
UCE-1003_trimmed.phy 486.6 203 989 300.9 0.618
UCE-1004_trimmed.phy 510.3 202 1073 366.2 0.718
UCE-1005_trimmed.phy 517.8 208 1039 308.8 0.596
UCE-1008_trimmed.phy 583.7 198 1079 313.4 0.537
UCE-1010_trimmed.phy 526.1 228 1081 338.2 0.643
UCE-1012_trimmed.phy 540.6 213 1103 340.5 0.63
UCE-1013_trimmed.phy 1082.7 883 1121 58.7 0.054
UCE-1014_trimmed.phy 537.4 211 1078 351.2 0.653
UCE-1017_trimmed.phy 500.9 200 1084 332.2 0.663
UCE-1018_trimmed.phy 555.0 200 1087 377.1 0.68
Fasta-formatted alignment files can be quickly converted into nexus and phylip formats using SuperCRUNCH. These are two commonly used formats for many phylogenetic and population-genetic analyses.
The Fasta_Convert.py
module converts a directory of aligned fasta files into both phylip and nexus formats. This conversion should only be performed after records have been relabeled, as the original longer description lines containing whitespace will cause problems. Please use the Fasta_Relabel_Seqs.py module for this task.
Fasta_Convert.py
is designed to work with a directory of fasta files. To be recognized, the input fasta files should be labeled as NAME.fasta
or NAME.fa
. The NAME
portion should not contain any periods or spaces, but can contain underscores.
python Fasta_Convert.py -i <input directory> -o <output directory>
Required: The full path to a directory which contains the fasta format alignments to convert.
Required: The full path to an existing directory to write output files.
python Fasta_Convert.py -i bin/Convert/input -o bin/Convert/output
Above command will convert all aligned fasta files found in the directory
bin/Convert/input
into nexus and phylip format, and write them in corresponding output directories in thebin/Convert/output
directory.
Several outputs are created in the output directory specified.
-
Directory
Nexus-Files
- For each input fasta alignment file, this directory will contain an corresponding nexus file namedNAME.nex
. -
Directory
Phylip-Files
- For each input fasta alignment file, this directory will contain an corresponding phylip file namedNAME.phy
.
Let's assume a file called ND2_Trachy.fasta
is included in the input directory. The contents contain four aligned sequences in fasta format:
>Trachylepis_aurata
--CCTTCCTATACTAATGAATCCACTTATATATTCACTTATCATATCGAGCATTGCAATGGGAACAATTATTACAATATCAAGCTTCCAC
>Trachylepis_punctulata
GTCCTTCCTATACTAATGAACCGTATCATATACTCAATTATCTTATCAAACGTAGCAACCGGCACAATTATTACAATAACAAGCTTCCAT
>Trachylepis_sulcata
ATCCTTCCTATACTAATGAACCCCACCATATATTCCATTATCCTTTCAAATGTAGCAATCGGCACCATCATTACAATAACAAGCCACCAC
>Trachylepis_varia
ATCCTTCCTATACTAATGAACCCTATTATGTCCTCCATCATCCTATCAAGCGTAGCAATTGGTACAATTATTACAATATCAAGCTATCAC
Running Fasta_Convert.py
would produce a corresponding phylip file ND2_Trachy.phy
in the Phylip-Files
directory. The contents will look like this:
4 90
Trachylepis_aurata --CCTTCCTATACTAATGAATCCACTTATATATTCACTTATCATATCGAGCATTGCAATGGGAACAATTATTACAATATCAAGCTTCCAC
Trachylepis_punctulata GTCCTTCCTATACTAATGAACCGTATCATATACTCAATTATCTTATCAAACGTAGCAACCGGCACAATTATTACAATAACAAGCTTCCAT
Trachylepis_sulcata ATCCTTCCTATACTAATGAACCCCACCATATATTCCATTATCCTTTCAAATGTAGCAATCGGCACCATCATTACAATAACAAGCCACCAC
Trachylepis_varia ATCCTTCCTATACTAATGAACCCTATTATGTCCTCCATCATCCTATCAAGCGTAGCAATTGGTACAATTATTACAATATCAAGCTATCAC
And a corresponding nexus file ND2_Trachy.nex
would be written to the Phylip-Files
directory. The contents will look like this:
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=4 NCHAR=90;
FORMAT DATATYPE=DNA MISSING=N GAP=-;
MATRIX
Trachylepis_aurata --CCTTCCTATACTAATGAATCCACTTATATATTCACTTATCATATCGAGCATTGCAATGGGAACAATTATTACAATATCAAGCTTCCAC
Trachylepis_punctulata GTCCTTCCTATACTAATGAACCGTATCATATACTCAATTATCTTATCAAACGTAGCAACCGGCACAATTATTACAATAACAAGCTTCCAT
Trachylepis_sulcata ATCCTTCCTATACTAATGAACCCCACCATATATTCCATTATCCTTTCAAATGTAGCAATCGGCACCATCATTACAATAACAAGCCACCAC
Trachylepis_varia ATCCTTCCTATACTAATGAACCCTATTATGTCCTCCATCATCCTATCAAGCGTAGCAATTGGTACAATTATTACAATATCAAGCTATCAC
;
End;
A traditional supermatrix is created by combining the sequence alignments from all loci into a single sequence alignment, a process called concatenation. The concatenated alignment is created by matching the names of sequences across loci. SuperCRUNCH can be used to create a concatenated alignment, and offers some unique features for concatenation.
The Concatenation.py
module can be used to combine multiple sequence alignments into a single concatenated alignment. The input file format can be fasta or phylip, and is selected using the --informat
flag. The input alignment files must be labeled with one of the following extensions to be read: NAME.fasta
, NAME.fa
, NAME.phylip
, or NAME.phy
. The NAME
portion should not contain any periods or spaces, but can contain underscores.
The complete set of sequence names (generally corresponding to taxon or taxon + voucher labels) that will be used to create the final alignment is gathered from all alignments. All sequence labels must be unique within each alignment (no duplicate names) or an error will be thrown. If a sequence label is absent from an alignment, a missing sequence is generated using the symbol selected with the -s
flag (N
, -
, or ?
). The output format must be specified as fasta, phylip, nexus, or interleaved nexus using the --outformat
flag. Output files are written to the specified output directory specified with the -o
.
Concatenation will only work properly if samples have names that are consistent across loci. This includes species-level supermatrix datasets, where sequences are labeled using a taxon name, and vouchered population-level datasets, where sequences are labeled using a taxon name + voucher code. For more details about how this works, please read through the Relabel Sequence Records and Fasta_Relabel_Seqs.py sections.
python Concatenation.py i <input directory> -o <output directory> --informat <format> --outformat <format> -s <missing data symbol>
Required: The full path to a directory which contains the input fasta files.
Required: The full path to an existing directory to write output files.
Required: The file format of the INPUT alignments. Choices = fasta, phylip.
Required: The file format for the OUTPUT concatenated alignment. Choices = fasta, phylip, nexus, interleaved_nexus.
Required: A base pair symbol used to represent missing data when sequences are not available for a taxon. Choices = dash, N, ?.
Optional: For use with interleaved_nexus format option. The number of bp characters to include per line before splitting to an additional line. Default = 100.
python Concatenation.py -i /bin/Concat/Input -o /bin/Concat/output --in_format fasta --out_format phylip -s dash
Above command will concatenate all the fasta alignments in the directory
/bin/Concat/Input
and produce a concatenated phylip alignment in/bin/Concat/output
. Missing data are represented with the-
symbol.
python Concatenation.py -i /bin/Concat/Input -o /bin/Concat/output --in_format phylip --out_format fasta -s ?
Above command will concatenate all the phylip alignments in the directory
/bin/Concat/Input
and produce a concatenated fasta alignment in/bin/Concat/output
. Missing data are represented with the?
symbol.
Several outputs are created in the output directory specified.
-
File
Concatenated_Alignment.fasta
,Concatenated_Alignment.phylip
,Concatenated_Alignment.nex
, orConcatenated_Alignment_Interleaved.nex
- The final concatenated alignment in the output format specified. -
File
Data_Partitions.txt
- A text file displaying the order in which loci were concatenated and their corresponding base pairs within the alignment. These are the data partitions or charsets, and can be used for model testing or partitioning in analyses. You could copy/past these contents to a corresponding nexus file as charset blocks for programs such as MrBayes or RevBayes. Here is an example of the contents:
[data_blocks]
12S_CLUSTALO_Aligned_relabeled_trimmed.fasta = 1-985;
16S_CLUSTALO_Aligned_relabeled_trimmed.fasta = 986-2416;
ADNP_MACSE_Aligned_relabeled_trimmed.fasta = 2417-3259;
AHR_MACSE_Aligned_relabeled_trimmed.fasta = 3260-3856;
AKAP9_MACSE_Aligned_relabeled_trimmed.fasta = 3857-5335;
BACH1_MACSE_Aligned_relabeled_trimmed.fasta = 5336-6589;
BACH2_MACSE_Aligned_relabeled_trimmed.fasta = 6590-8020;
BDNF_MACSE_Aligned_relabeled_trimmed.fasta = 8021-8725;
BHLHB2_MACSE_Aligned_relabeled_trimmed.fasta = 8726-9493;
BMP2_MACSE_Aligned_relabeled_trimmed.fasta = 9494-10183;
-
File
Taxa_Loci_Count.log
- A count of the number of sequences that were available for each sequence label. in other words the number of alignments the sequence label was present in. If a species-level dataset was used, sequences are labeled using taxon names and this file will be a full list of the species included in the concatenated alignment. If a vouchered population-level dataset was used, it will be a list of all the samples (taxon name + voucher code) in the concatenated alignment.
Here is an example of the contents from a species-level supermatrix:
Taxon Loci
Acanthocercus_adramitanus 4
Acanthocercus_annectens 6
Acanthocercus_atricollis 6
Acanthocercus_cyanogaster 6
Acanthocercus_yemensis 8
Acanthosaura_armata 7
Acanthosaura_capra 2
Acanthosaura_crucigera 5
Acanthosaura_lepidogaster 44
Agama_aculeata 7
Agama_africana 5
Agama_agama 51
Agama_anchietae 9
Agama_armata 6
Agama_atra 8
Agama_boensis 5
Agama_bottegi 5
Agama_boueti 7
Agama_boulengeri 7
Here is an example of the contents from a vouchered population-level supermatrix:
Taxon Loci
Trachylepis_aurata_ZFMK75837 4
Trachylepis_aurata_ZFMK84085 4
Trachylepis_punctulata_MBUR00322 4
Trachylepis_punctulata_MCZA38906 4
Trachylepis_punctulata_MCZA38924 4
Trachylepis_sulcata_AMB4288 4
Trachylepis_sulcata_AMB4291 1
Trachylepis_sulcata_AMB4596 4
Trachylepis_sulcata_AMB4620 4
Trachylepis_sulcata_AMB4693 3
Trachylepis_sulcata_AMB4766 4
Trachylepis_sulcata_AMB4767 4
Trachylepis_sulcata_AMB4782 4
Trachylepis_sulcata_AMB4790 3
Trachylepis_sulcata_AMB6837 2
Trachylepis_sulcata_AMB6838 4
Trachylepis_sulcata_AMB6839 4
Last updated: June, 2020
For SuperCRUNCH v1.2