8: Post Alignment Tasks

Post-Alignment Tasks

Overview
Relabel Sequence Records
- Fasta_Relabel_Seqs.py
Trim Sequence Alignments
- Trim_Alignments_Trimal.py
- Trim_Alignments_Custom.py
Calculate Sequence Length Heterogeneity Statistics
- Seq_Length_Heterogeneity.py
Convert to Nexus and Phylip Format
- Fasta_Convert.py
Concatenate Alignments
- Concatenation.py

Overview

After multiple sequence alignment, there are several tasks that can help prepare datasets for downstream analyses. One important task involves relabeling sequences, such that sequence labels are composed of taxon labels, accession numbers, voucher codes, or some combination of the above. This allows sequences to be readily identified with a unique name, making them compatible with concatenation. Details on sequence relabeling are provided in the Relabel Sequence Records section. After relabeling, alignment trimming can be performed. Multiple options are available for trimming, with details provided in the Trim Sequence Alignments section. Files with relabeled sequences can be readily converted into other common formats (such as nexus and phylip), and supermatrices can be created by concatenating alignments. Details for these tasks can be found in the Convert to Nexus and Phylip Format and Concatenate Alignments sections below.

NEW - Alignments can be assessed for sequence length variation. For each alignment, the sequence lengths are used to calculate the mean, minimum, maximum, standard deviation, and coefficient of variation. These metrics can be calculated from the aligned sequence lengths (start and stop position in the alignment), or from the unaligned sequence lengths (just the number of base pairs in a sequence). The aligned sequence lengths coefficient of variation (ASL-CV; Portik & Wiens, in review) is a standardized measure for comparing length variation across alignments, with higher ASL-CV values indicating greater variability in sequence lengths. More details are provided in the Calculate Sequence Length Heterogeneity Statistics section.

Relabel Sequence Records

Up to this point, the sequences used in other SuperCRUNCH steps have been labeled using modified versions of the original description line (containing an accession, taxon label, gene description, voucher code, etc.). Many downstream programs are unable to use these long labels, and so they require conversion to a simpler format. Relabeling can be performed in SuperCRUNCH using several different options. For example, sequences can be labeled using:

A taxon name. This option can include species (two-part) and subspecies (three-part) names. This will allow alignments to be concatenated.
An accession number. This will NOT allow alignments to be concatenated, but guarantees all sequences present in a particular alignment will have a unique name.
A combination of a taxon name (species, subspecies) and an accession number. This will NOT allow alignments to be concatenated, but guarantees all sequences present in a particular alignment will have a unique name.
A combination of a voucher code and any of the above three options. The taxon + voucher names could allow for concatenation (e.g., for phylogeographic studies), but the accession + voucher names or voucher + taxon + accession names will not.

The relabeling strategy selected will depend on the type of dataset being produced. For example, a species-level supermatrix would require all sequences be labeled with a taxon name to allow for concatenation. A population-level dataset for a single species could be labeled by accession number, or by taxon + accession number. This will ensure all sequences within an alignment are given a unique name, but it will not allow them to be linked across loci. To link samples in a population-level dataset, the voucher option can be used in combination with the taxon name option. Given that there are many possible ways to relabel sequences, instructions for usage and clear examples of all options are provided in the section below.

Back to top

Fasta_Relabel_Seqs.py

The Fasta_Relabel_Seqs.py module will relabel the sequences within fasta files based on the relabeling option chosen. Three base options are available, and are used with the -r flag:

-r species - Relabel using a taxon name. This option can include species (two-part) and subspecies (three-part) names. This will allow alignments to be concatenated. Example: Hyperolius_viridiflavus, Hyperolius_viridiflavus_pantherinus.
-r accession - Relabel using an accession number. This will NOT allow alignments to be concatenated, but guarantees all sequences present in a particular alignment will have a unique name. Example: AF338336.1.
-r species_acc - Relabel using a combination of a taxon name (species, subspecies) and an accession number. This will NOT allow alignments to be concatenated, but guarantees all sequences present in a particular alignment will have a unique name. Example: Draco_volans_AB023748.1.

By default, only species (two-part) names are constructed for the -r species and -r species_acc options. Subspecies can be included by using the -s flag. The full path to a text file containing the subspecies to include must be provided with this flag. This can be the same taxon list file used in previous steps - it can include a full list of taxa (species and subspecies). However, due to how this feature is implemented, only the subspecies names are pulled from this taxon list file.

The --voucherize flag can be used to append a voucher code (if present) to the end of the name produced by any of the three base options listed above. This option is generally recommended only if you used the --vouchered feature in the Filter_Seqs_and_Species.py module. This previous step would have only allowed sequences that had voucher information to be selected. This guarantees all sequences have a voucher code. If sequences have the same voucher code, using the --voucherize flag with the -r species option for relabeling will produce a consistent name for a given sample across all loci it is found in. In other words, this feature can be used to link samples in phylogeographic datasets. This can allow for concatenation downstream, or sample cohesion in various phylogeographic analyses. Here are some examples of what the --voucherize feature can produce, based on a record with an accession of GU931545.1, taxon name of Trachylepis aurata, and voucher ID tag of Voucher_ZFMK75837:

-r species + --voucherize: Trachylepis_aurata_ZFMK75837
-r accession + --voucherize: GU931545.1_ZFMK75837
-r species_acc + --voucherize: Trachylepis_aurata_GU931545.1_ZFMK75837

More empirical examples of the relabeling options are provided below.

Regardless of the relabeling scheme selected, an output key (tab-delimited table) is generated that shows the original description line for a sequence and its relabeled name. This allows you to lookup relabeled sequences to find all the information present in the original description. This could be particularly useful if you decide to relabel sequences using accession numbers, which are devoid of all taxonomic and voucher information.

Fasta_Relabel_Seqs.py is designed to work with a directory of alignment files in fasta format. To be recognized, the input fasta files should be labeled as NAME.fasta or NAME.fa. The NAME portion should not contain any periods or spaces, but can contain underscores.

Basic Usage:

python Fasta_Relabel_Seqs.py -i <input directory> -o <output directory> -r <relabeling option>

Argument Explanations:

`-i <path-to-directory>` or `--indir <path-to-directory>`

Required: The full path to a directory which contains the input fasta files.

`-o <path-to-directory>` or `--outdir <path-to-directory>`

Required: The full path to an existing directory to write output files.

`-r <choice>` or `--relabel <choice>`

Required: The strategy for relabeling sequence records. Choices = species, accession, species_acc.

`-s <path-to-file>` or `--subspecies <path-to-file>`

Optional: Allows subspecies names to be included in the relabeling. This flag requires a full path to a text file containing all subspecies names to cross-reference in the fasta file(s). This can be the same taxon list file used in previous steps, which likely contains a mix of species and subspecies labels.

`--voucherize`

Optional: If a Voucher_[ID] entry is included in the sequence description, append the ID component to the end of the species/subspecies label (e.g., Hyperolius_nitidulus_MVZ236473). The Voucher_[ID] field is generated by the Parse_Loci.py module.

Example Usage:

python Fasta_Relabel_Seqs.py -i bin/Relabel/Input -o bin/Relabel/Output -r species -s bin/Relabel/Taxon-list.txt

Above command will relabel sequences in all the alignment files present in directory bin/Relabel/Input using the species option (taxon names). If one of the subspecies labels present in the Taxon-list.txt file is found for a given sequence, that subspecies label will be used (rather than a species label).

python Fasta_Relabel_Seqs.py -i bin/Relabel/Input -o bin/Relabel/Output -r accession

Above command will relabel sequences in all the alignment files present in directory bin/Relabel/Input using the accession option (taxon names).

python Fasta_Relabel_Seqs.py -i bin/Relabel/Input -o bin/Relabel/Output -r species -s bin/Relabel/Taxon-list.txt --voucherize

Above command will relabel sequences in all the alignment files present in directory bin/Relabel/Input using the species option (taxon names). If one of the subspecies labels present in the Taxon-list.txt file is found for a given sequence, that subspecies label will be used (rather than a species label). For all sequences with a Voucher_[ID] field, the voucher code will be appended to the end of the species or subspecies label.

Outputs:

Several outputs are created in the output directory specified. A main directory is created, and the name depends on the -r option used:

Directory Relabeled-by-species - Results from using the -r species option.
Directory Relabeled-by-accession - Results from using the -r accession option.
Directory Relabeled-by-species_acc - Results from using the -r species_acc option.

The reason these directories are labeled by the -r option is so that you can run all three options using the same output directory - the outputs won't be overwritten.

For any of the above directories created, two subdirectories will also be created inside:

Directory Relabeled-fasta-files - This directory will contain all the alignment files with sequences relabeled, called NAME_relabeled.fasta.
Directory Relabeling-key-files - This directory will contain all the relabeling keys, called NAME_label_key.txt. This summary file contains three columns:
- Accession - The accession number of the sequence.
- Taxon - The taxon name used for relabeling - note that this could be different for a given sample when the subspecies option is included (-s ).
- Description - The original description of the sequence (minus accession number) that was used for relabeling.

Here is an example of the contents of a NAME_label_key.txt file:

Accession	Taxon	Description
AF215137.1	Brookesia peyrierasi	Brookesia peyrierasi DESCRIPTION 12s ribosomal rna gene partial sequence mitochondrial
GQ921666.1	Brookesia superciliaris	Brookesia superciliaris Voucher_ZCMV150 DESCRIPTION voucher zcmv150 12s ribosomal rna rrns gene partial sequence mitochondrial
GQ921671.1	Brookesia therezieni	Brookesia therezieni Voucher_DRV5924 DESCRIPTION voucher drv5924 12s ribosomal rna rrns gene partial sequence mitochondrial

Usage Examples

To illustrate the effect of using the various relabeling options, two datasets will be shown. The first is a species-level dataset, which is intended to be used to create a supermatrix, and the second is a vouchered population-level dataset, which is intended to be used for phylogeographic analyses.

Relabeling a species-level dataset

This dataset contains the following sequences in an alignment (16S):

>JX564853.1 Arthroleptis poecilonotus DESCRIPTION mitochondrion partial genome
---TATAAGAGGCCCAGCCTGCCCAGTGATAATTTCAACGGCCGCGGTATCCTAACCGTGCAAAGGTAGCACAATAACTTGT-TCTTTC
>EF621774.1 Cardioglossa gracilis Voucher_MCZ_A136796 DESCRIPTION voucher mcz a136796 16s ribosomal rna gene partial sequence mitochondrial
---TACAAAAGGCCTAGCCTGCCCAGTGATTATTTCAACGGCCGCGGTACCCTAACCGTGCAAAGGTAGCACAATCACTTGT-TCTTTAA
>JQ711164.1 Cardioglossa melanogaster Voucher_4628 DESCRIPTION isolate 4628 16s ribosomal rna gene partial sequence mitochondrial
---TATAAGAGGCCCCGCCTGCCCAGTGATTATTTCAACGGCCGCAGTAACCTAACCGCGCAAAGGTAGCACAATCACTTGT-TCTCTAA
>EF621776.1 Cardioglossa elegans Voucher_AMCC_117611 DESCRIPTION voucher amcc 117611 16s ribosomal rna gene partial sequence mitochondrial
----ATAAGAGGCCACGCCTGCCCAGTGATCATTTCAACGGCCACGGTACCCTAACCGCGCGAAGGTAGCACAATCACTTGT-TCTCTAA
>JX564856.1 Cardioglossa leucomystax DESCRIPTION mitochondrion partial genome
---TATAAGAGGCCCCGCCTGCCCAGTGATTATTTCAACGGCCACGGTACCCTAACCGCGCGAAGGTAGCACAATCACTTGT-TCTCTAA

For simplicity, the sequence descriptions are shown here without the intervening sequence data:

>JX564853.1 Arthroleptis poecilonotus DESCRIPTION mitochondrion partial genome

>EF621774.1 Cardioglossa gracilis Voucher_MCZ_A136796 DESCRIPTION voucher mcz a136796 16s ribosomal rna gene partial sequence mitochondrial

>JQ711164.1 Cardioglossa melanogaster Voucher_4628 DESCRIPTION isolate 4628 16s ribosomal rna gene partial sequence mitochondrial

>EF621776.1 Cardioglossa elegans Voucher_AMCC_117611 DESCRIPTION voucher amcc 117611 16s ribosomal rna gene partial sequence mitochondrial

>JX564856.1 Cardioglossa leucomystax DESCRIPTION mitochondrion partial genome

Three of the five sequences have voucher numbers. Here are the outcomes of each relabeling option (note - sequence data not shown for simplicity!):

Species-level dataset: `-r species`

>Arthroleptis_poecilonotus

>Cardioglossa_gracilis

>Cardioglossa_melanogaster

>Cardioglossa_elegans

>Cardioglossa_leucomystax

These taxon names will be consistent across loci, and will allow these alignments to be concatenated. Because this is a species-level dataset (one sequence per taxon), there will not be any duplicate names WITHIN any given alignment.

Species-level dataset: `-r accession`

>JX564853.1

>EF621774.1

>JQ711164.1

>EF621776.1

>JX564856.1

Using accession numbers guarantees there will not be any duplicate names WITHIN an alignment, but we cannot concatenate alignments labeled this way.

Species-level dataset: `-r species_acc`

>Arthroleptis_poecilonotus_JX564853.1

>Cardioglossa_gracilis_EF621774.1

>Cardioglossa_melanogaster_JQ711164.1

>Cardioglossa_elegans_EF621776.1

>Cardioglossa_leucomystax_JX564856.1

Here we see the taxon name, followed by the accession number. This produces a unique name for each sequence WITHIN an alignment, but we cannot concatenate alignments labeled this way.

Species-level dataset: `-r species` + `--voucherize`

>Arthroleptis_poecilonotus

>Cardioglossa_gracilis_MCZ_A136796

>Cardioglossa_melanogaster_4628

>Cardioglossa_elegans_AMCC_117611

>Cardioglossa_leucomystax

Here we see the taxon name, followed by the voucher number if it is available. We might be able to concatenate alignments labeled this way - but it requires:

a taxon to be missing voucher information for all sequences or
a taxon to have the exact same voucher number for all sequences Even one taxon with a mismatch of voucher numbers would interfere with concatenation. So, this is kind of a special case scenario.

In general, for a species-level dataset produced from GenBank there is no expectation that the sequences for each taxon come from the same sample. The --voucherize option in this case would prevent alignment concatenation, because sequences of the same taxon would likely be labeled differently. If concatenation is not the goal, then labeling species with voucher codes could be useful (e.g., for constructing a particular gene tree). If you are using local sequences and all samples have consistent voucher codes, then this labeling option should work great for concatenation.

Species-level dataset: `-r accession` + `--voucherize`

>JX564853.1

>EF621774.1_MCZ_A136796

>JQ711164.1_4628

>EF621776.1_AMCC_117611

>JX564856.1

Here we see the accession, followed by the voucher number if it is available. I'm not sure why you would want to do it this way, but it's available!

Species-level dataset: `-r species_acc` + `--voucherize`

>Arthroleptis_poecilonotus_JX564853.1

>Cardioglossa_gracilis_EF621774.1_MCZ_A136796

>Cardioglossa_melanogaster_JQ711164.1_4628

>Cardioglossa_elegans_EF621776.1_AMCC_117611

>Cardioglossa_leucomystax_JX564856.1

Here we see the taxon name, followed by the accession number, followed by the voucher number if it is available. I'm not sure why you would want to do it this way, but it's available!

Relabeling a vouchered population-level dataset

This dataset contains the following sequences in an alignment (ND2):

>GU931545.1 Trachylepis aurata Voucher_ZFMK75837 DESCRIPTION isolate zfmk75837 nadh dehydrogenase subunit 2 nd2 gene complete cds mitochondrial
CCTTCCTATACTAATGAATCCACTTATATATTCACTTATCATATCGAGCATTGCAATGGGAACAATTATTACAATATCAAGCTTCCACTG
>GU931546.1 Trachylepis aurata Voucher_ZFMK84085 DESCRIPTION transcaucasica isolate zfmk84085 nadh dehydrogenase subunit 2 nd2 gene complete cds mitochondrial
CCTTCCTATACTAATGAACCCACTTATATATTCACTCATTCTGTCAAGCATTGCAACAGGAACAATTATTACAATATCAAGTTTCCACTG
>GU931601.1 Trachylepis varia Voucher_KHH003 DESCRIPTION isolate khh003 nadh dehydrogenase subunit 2 nd2 gene complete cds mitochondrial
ATCCTTCCTATACTAATGAACCCTATTATGTCCTCCATCATCCTATCAAGCGTAGCAATTGGTACAATTATTACAATATCAAGCTATCAC
>GU931602.1 Trachylepis varia Voucher_KHH006 DESCRIPTION isolate khh006 nadh dehydrogenase subunit 2 nd2 gene complete cds mitochondrial
ATCCTTCCTATACTAATGAACCCTATTATGTCCTCCATCATCCTATCAAGCGTAGCAATTGGTACAATTATTACAATATCAAGCTATCAC
>GU931603.1 Trachylepis varia Voucher_MCZZ23277 DESCRIPTION isolate mczz23277 nadh dehydrogenase subunit 2 nd2 gene complete cds mitochondrial
ATCCTTCCTATACTAATGAACCCTATTATGTCCTCCATCATCCTATCAAGCGTAGCAATTGGTACAATTATTACAATATCAAGCTATCAC

For simplicity, the sequence descriptions are shown here without the intervening sequence data:

>GU931545.1 Trachylepis aurata Voucher_ZFMK75837 DESCRIPTION isolate zfmk75837 nadh dehydrogenase subunit 2 nd2 gene complete cds mitochondrial

>GU931546.1 Trachylepis aurata Voucher_ZFMK84085 DESCRIPTION transcaucasica isolate zfmk84085 nadh dehydrogenase subunit 2 nd2 gene complete cds mitochondrial

>GU931601.1 Trachylepis varia Voucher_KHH003 DESCRIPTION isolate khh003 nadh dehydrogenase subunit 2 nd2 gene complete cds mitochondrial

>GU931602.1 Trachylepis varia Voucher_KHH006 DESCRIPTION isolate khh006 nadh dehydrogenase subunit 2 nd2 gene complete cds mitochondrial

>GU931603.1 Trachylepis varia Voucher_MCZZ23277 DESCRIPTION isolate mczz23277 nadh dehydrogenase subunit 2 nd2 gene complete cds mitochondrial

These sequences come from a single phylogeographic study. They all have voucher IDs, and were filtered accordingly using the Filter_Seqs_and_Species.py module. For this project, we know each sample can be linked across loci using their unique voucher code. Here are the outcomes of each relabeling option (note - sequence data not shown for simplicity!):

Vouchered population-level dataset: `-r species`

>Trachylepis_aurata

>Trachylepis_aurata

>Trachylepis_varia

>Trachylepis_varia

>Trachylepis_varia

The taxon name is written for each sequence. However, because multiple sequences come from each taxon, this will create duplicate names in the file. This is problematic for many programs (including RAxML), and makes it impossible to identify which sequence belongs to which sample within a given species.

Vouchered population-level dataset: `-r accession`

>GU931545.1

>GU931546.1

>GU931601.1

>GU931602.1

>GU931603.1

The accession numbers are written for each sequence. This solves the problem of duplicate names within the file, but there is no way to link samples across loci.

Vouchered population-level dataset: `-r species_acc`

>Trachylepis_aurata_GU931545.1

>Trachylepis_aurata_GU931546.1

>Trachylepis_varia_GU931601.1

>Trachylepis_varia_GU931602.1

>Trachylepis_varia_GU931603.1

Here we see the taxon name, followed by the accession number. This solves the problem of duplicate names within the file, but there is no way to link samples across loci.

Vouchered population-level dataset: `-r species` + `--voucherize`

>Trachylepis_aurata_ZFMK75837

>Trachylepis_aurata_ZFMK84085

>Trachylepis_varia_KHH003

>Trachylepis_varia_KHH006

>Trachylepis_varia_MCZZ23277

Here we see the taxon name, followed by the voucher number if it is available. This will allow identification of each unique sample, and allow the samples to be linked across loci. For a vouchered population-level dataset, this relabeling scheme will allow for alignment concatenation.

Vouchered population-level dataset: `-r accession` + `--voucherize`

>GU931545.1_ZFMK75837

>GU931546.1_ZFMK84085

>GU931601.1_KHH003

>GU931602.1_KHH006

>GU931603.1_MCZZ23277

Here we see the accession number, followed by the voucher number if it is available. This won't allow concatenation, but all sequence IDs will be unique WITHIN an alignment.

Vouchered population-level dataset: `-r species_acc` + `--voucherize`

>Trachylepis_aurata_GU931545.1_ZFMK75837

>Trachylepis_aurata_GU931546.1_ZFMK84085

>Trachylepis_varia_GU931601.1_KHH003

>Trachylepis_varia_GU931602.1_KHH006

>Trachylepis_varia_GU931603.1_MCZZ23277

Here we see the taxon name, followed by the accession number, followed by the voucher number if it is available. This won't allow concatenation, but all sequence IDs will be unique WITHIN an alignment.

Relabeling recommendations

Hopefully the above examples illustrate how the relabeling options work, and why you might choose one over another. In general, names within an alignment must be unique. Whether or not you want names to be consistent across loci is another topic, and depends on your goals. For a species-level dataset being used to construct a supermatrix, you will want to relabel by taxon (-r species). This will make the names unique within an alignment and consistent across alignments, allowing for concatenation. For a vouchered population-level dataset, you may want to relabel by taxon and voucher (-r species + --voucherize). This will also make the names unique within an alignment and consistent across alignments, allowing for concatenation or samples to be linked in other analyses.

For other datasets, it may not be necessary or possible to link names across loci. A good example is a population-level dataset that is produced from many studies. There may be great disparity in the number of samples sequenced for different loci, and it may be that each study used a separate set of samples. In this case, it is more important that sample names be unique within alignments, as it is unlikely that any name combinations will be consistent across loci. For this type of dataset, the -r accession or -r species_acc options would work well.

Back to top

Trim Sequence Alignments

Trimming alignments can remove poorly aligned and gap-rich regions, and may be desirable under some circumstances. There are a variety of methods that perform some type of automated trimming. SuperCRUNCH offers two different approaches for trimming. The first approach uses trimAl to trim alignments (Trim_Alignments_Trimal.py), and specifically implements three of the options in trimAl (gap-threshold trimming, the gappyout algorithm, and remove columns of only gaps). The second approach is more customizable (Trim_Alignments_Custom.py), and is largely based on the trimming routine included in phyluce. The implementation of this routine in SuperCRUNCH allows for edge trimming, row trimming, or a combination of edge and row trimming. Parameters controlling each of these settings can be run using default values, or tuned to particular datasets.

There is always a tradeoff between keeping missing data (and potential phylogenetically informative sites) versus eliminating missing data. Running the trimming routines available in SuperCRUNCH is fast and easy, so it is worth generating untrimmed and trimmmed alignments to investigate what the downstream effects may be.

Back to top

Trim_Alignments_Trimal.py

The Trim_Alignments_Trimal.py module relies on the program trimAl to batch trim all alignment files in a directory. There are four options (-a) for using trimAl in this script:

-a gt - Implements the gap-threshold algorithm (trimal -gt). A gap-threshold is the minimum fraction of sequences without a gap that you require to keep a column. A value of 1 would only keep columns that are free of gaps, and a value of 0.6 would remove columns with gaps in more than 40% of the sequences. In other words, at least 60% of the sequences must have data in the column for it to be kept. The default value is set to 0.05, meaning any column with gaps in more than 95% of the sequences will be removed. This default value is intended to remove very poorly aligned regions and long, overhanging edges produced by a few sequences in an alignment. You can change the value assigned by using the optional --gt flag with any value from 0-1.
-a noallgaps - Implements the noallgaps algorithm (trimal -noallgaps). This simply removes all columns that contain only gaps (100% missing data). This is equivalent to running the gap-threshold algorithm with a value of 1.
-a both - Runs the gap-threshold algorithm, followed by the noallgaps algorithm. This ensures that there will not be any columns composed of only gaps after trimming. The optional --gt flag can be used with this option.
-a gappyout - Implements the gappyout algorithm (trimal -gappyout). This automated method calculates a minimum gap-score cut-off based on the input alignment characteristics and trims all alignment columns falling below the threshold value. This method adapts parameters for each input alignment, rather than apply the same fixed parameters to all alignments (like the gap-threshold method).

Trim_Alignments_Trimal.py is designed to work with a directory of alignment files. Unlike most other steps in SuperCRUNCH, multiple types of input formats can be used. The input alignment file formats are auto-detected and can be include fasta, nexus, or phylip formats. The input alignment files must be labeled with one of the following extensions to be read: NAME.fasta, NAME.fa, NAME.nexus, NAME.nex, NAME.phylip, or NAME.phy. The output format must be specified using the -f flag, and includes fasta, nexus, and phylip formats.

It is essential that you relabel your sequences prior to trimming. If sequences have not been relabeled, trimAl will label output sequences using only their accession number. Please use the Fasta_Relabel_Seqs.py module for this task.

Basic Usage:

python Trim_Alignments_Trimal.py -i <input directory> -o <output directory> -a <trimming method in trimal> -f <output format>

Argument Explanations:

`-i <path-to-directory>` or `--indir <path-to-directory>`

Required: The full path to a directory which contains the input alignment files. File formats are auto-detected and can be include fasta, nexus, or phylip formats, but input alignment files must be labeled with one of the following extensions: NAME.fasta, NAME.fa, NAME.nexus, NAME.nex, NAME.phylip, or NAME.phy.

`-o <path-to-directory>` or `--outdir <path-to-directory>`

Required: The full path to an existing directory to write output files.

`-f <choice>` or `--format <choice>`

Required: Specify the output file format for trimmed alignment. Choices = fasta, nexus, phylip.

`-a <choice>` or `--analysis <choice>`

Required: Specify the trimal method for trimming alignments. Choices = gt, noallgaps, both, gappyout.

`--gt <float>`

Optional: Specify the gt (gap threshold) value for trimal - the minimum fraction of sequences without a gap. Default = 0.05.

Example Usage:

python Trim_Alignments.py -i /bin/Trim/input -o /bin/Trim/output -f fasta -a gt

Above command will trim all alignments present in the directory /bin/Trim/input using the gap threshold method with a default value of 0.05, and output files in fasta format.

python Trim_Alignments.py -i /bin/Trim/input -o /bin/Trim/output -f nexus -a gt --gt 0.3

Above command will trim all alignments present in the directory /bin/Trim/input using the gap threshold method with a value of 0.3, and output files in nexus format.

python Trim_Alignments.py -i /bin/Trim/input -o /bin/Trim/output -f phylip -a noallgaps

Above command will trim alignments present in the directory /bin/Trim/input using the noallgaps method and output files in phylip format.

python Trim_Alignments.py -i /bin/Trim/input -o /bin/Trim/output -f nexus -a both --gt 0.08

Above command will trim alignments present in the directory /bin/Trim/input using the gap threshold method with a value of 0.08, then run the noallgaps method, and output files in nexus format.

python Trim_Alignments.py -i /bin/Trim/input -o /bin/Trim/output -f fasta -a gappyout

Above command will trim alignments present in the directory /bin/Trim/input using the gappyout method and output files in fasta format.

Outputs:

Outputs are created in the output directory specified, and depend on the output file format selected. The output directories can include:

Directory /Trimmed-fasta-Files - A directory containing a fasta-formatted trimmed alignment for each input alignment, labeled as NAME_trimmed.fasta.
Directory /Trimmed-nexus-Files - A directory containing a nexus-formatted trimmed alignment for each input alignment, labeled as NAME_trimmed.nex.
Directory /Trimmed-phylip-Files - A directory containing a phylip-formatted trimmed alignment for each input alignment, labeled as NAME_trimmed.phy.

Back to top

Trim_Alignments_Custom.py

The Trim_Alignments_Custom.py module uses a custom trimming function to batch trim all input alignments. This custom trimming function is a heavily modified version of the generic_align.py module of phyluce. There are three main options, which can be specified using the required -a flag:

-a edges_internal - Executes edge (block) trimming and internal (row) trimming.
-a edges - Executes edge (block) trimming only.
-a internal - Executes internal (row) trimming only.

The trimming for each of these options is guided by the optional flags:

-w - window: Sliding window size for trimming. Default is 20 base pairs.
-p - proportion: The proportion of taxa required to have sequence at alignment ends. Default is 0.65.
-t - threshold: The proportion of residues required across the window in proportion of taxa. Default is 0.65.
-d - max_divergence: The max proportion of sequence divergence allowed between any row of the alignment and the consensus sequence. Default is 0.20.
-l - min_length: The minimum length of alignments to keep. Default is 100 base pairs.

These optional flags have set default values, but they can be changed by using the relevant flags with a user-supplied value. After custom trimming, the alignments are 'cleaned up' using trimAl.

The output format must be specified using the required -f flag, and includes fasta, nexus, and phylip).

In general, this trimming function is very aggressive in removing columns, particularly when sequence length heterogeneity is relatively high. An output file called Trimming_Summary.txt is created, which has information about the starting lengths and trimmed lengths for every input alignment. If too many columns are being removed, you can try adjusting the optional arguments and inspect this output file to see the effect on the number of columns removed, and further adjust parameters accordingly. If changing parameter settings still results in the removal of too many columns, you should consider using the Trim_Alignments_Trimal.py module instead.

Trim_Alignments_Custom.py is designed to work with a directory of fasta-formatted alignment files. To be recognized, the input fasta files should be labeled as NAME.fasta or NAME.fa. The NAME portion should not contain any periods or spaces, but can contain underscores.

It is essential that you relabel your sequences prior to trimming. Please use the Fasta_Relabel_Seqs.py module for this task.

Basic Usage:

python Trim_Alignments_Custom.py -i <input directory> -o <output directory> -a <custom trimming method> -f <output format>

Argument Explanations:

`-i <path-to-directory>` or `--indir <path-to-directory>`

Required: The full path to a directory which contains the aligned and relabeled fasta files to trim

`-o <path-to-directory>` or `--outdir <path-to-directory>`

Required: The full path to an existing directory to write output files.

`-a <choice>` or `--analysis <choice>`

Required: Specify the method for trimming alignments. Choices = edges_internal, edges, internal.

`-f <choice>` or `--format <choice>`

Required: Specify the output file format for trimmed alignment. Choices = fasta, nexus, phylip.

`-w <integer>` or `--window <integer>`

Optional: Sliding window size for trimming. Default = 20.

`-p <float>` or `--proportion <float>`

Optional: The proportion of taxa required to have sequence at alignment ends. Default = 0.65.

`-t <float>` or `--threshold <float>`

Optional: The proportion of residues required across the window in proportion of taxa. Default = 0.65.

`-d <float>` or --max_divergence `

Optional: The max proportion of sequence divergence allowed between any row of the alignment and the consensus sequence. Default = 0.20.

`-l <integer>` or `--min_length <integer>`

Optional: The minimum length of alignments to keep. Default = 100.

Example Usage:

python Trim_Alignments_Custom.py -i bin/Trim/input -o bin/Trim/output -a edges_internal -f fasta

Above command will trim all alignments present in the directory /bin/Trim/input using the edge and internal trimming methods with all default values, and produce output files in fasta format.

python Trim_Alignments_Custom.py -i bin/Trim/input -o bin/Trim/output -a edges -f nexus

Above command will trim all alignments present in the directory /bin/Trim/input using only the edge trimming method with all default values, and produce output files in nexus format.

python Trim_Alignments_Custom.py -i bin/Trim/input -o bin/Trim/output -a internal -f phylip

Above command will trim all alignments present in the directory /bin/Trim/input using only the internal trimming method with all default values, and produce output files in phylip format.

python Trim_Alignments_Custom.py -i bin/Trim/input -o bin/Trim/output -a edges_internal -f fasta -w 50 -d 0.3 -l 500

Above command will trim all alignments present in the directory /bin/Trim/input using the edge and internal trimming methods. Some default values are changed. The window size is set to 50 (vs. 20), the max divergence between seqs and consensus is set to 0.3 (vs. 0.2), and the minimum length of alignments must be 500 bp to keep them. This will produce output files in fasta format.

Outputs:

Outputs are created in the output directory specified, and depend on the output file format selected. The output directories can include:

Directory /Trimmed-fasta-Files - A directory containing a fasta-formatted trimmed alignment in for each input alignment, labeled as NAME_trimmed.fasta.
Directory /Trimmed-nexus-Files - A directory containing a nexus-formatted trimmed alignment in for each input alignment, labeled as NAME_trimmed.nex.
Directory /Trimmed-phylip-Files - A directory containing a phylip-formatted trimmed alignment in for each input alignment, labeled as NAME_trimmed.phy.

In addition, the following directory will always be created:

Directory Summary-File - A directory containing a single summary file called Trimming_Summary.txt. This summary file contains four columns:
- Fasta - The name of the input fasta file.
- Start_Length - The starting length of the alignment.
- Trimmed_Length - The length of the alignment after trimming.
- Bases_Trimmed - The number of columns trimmed from the alignment.

Here is an example of the contents of Trimming_Summary.txt:

Fasta	Start_Length	Trimmed_Length	Bases_Trimmed
12S_CLUSTALO_Aligned_relabeled.fasta	1435	696	739
16S_CLUSTALO_Aligned_relabeled.fasta	1829	**FAILED**	**NA**
ADNP_MACSE_Aligned_relabeled.fasta	918	561	357
AHR_MACSE_Aligned_relabeled.fasta	648	424	224
AKAP9_MACSE_Aligned_relabeled.fasta	1479	458	1021
BACH2_MACSE_Aligned_relabeled.fasta	1446	1437	9
BDNF_MACSE_Aligned_relabeled.fasta	897	624	273
CO1_CLUSTALO_Aligned_relabeled.fasta	1602	**FAILED**	**NA**
CXCR4_MACSE_Aligned_relabeled.fasta	720	529	191
CYTB_CLUSTALO_Aligned_relabeled.fasta	1183	**FAILED**	**NA**

Note that any alignment with a **Failed** tag in the Trimmed_Length column failed the minimum length filter. The above example was run using -a edges_internal with the default parameter values. In this example, the 1,829 base pair 16S alignment was reduced to less than 100 bp. Similarly, the 1,602 bp CO1 and the 1,183 bp CYTB alignments were also reduced to less than 100 bp. Many of the nuclear loci alignments were also reduced to half of their starting length (or more!). This is what I mean when I say that this trimming function is rather aggressive in removing columns, especially for divergent or hard to align sequences (such as the mtDNA loci here). This is the built-in trimming routine (with identical default values) used for UCE loci processed using phyluce. I strongly recommend manipulating the optional parameter values to see their effects, and also comparing the results of Trim_Alignments_Custom.py to Trim_Alignments_Trimal.py - the trimAl methods are likely to be far less aggressive in removing columns.

Back to top

Calculate Sequence Length Heterogeneity Statistics

Alignments can be assessed for sequence length variation. For each alignment, the sequence lengths are used to calculate the mean, minimum, maximum, standard deviation, and coefficient of variation (the ratio of the standard deviation to the mean). These metrics can be calculated from:

Unaligned sequence lengths - For each sequence, this is the number of actual bases in the sequence. This obtained by stripping all gaps from the sequence, and simply counting the number of bases. Although obtained from an alignment, this is essentially calculating length variation in a set of unaligned sequences.
Aligned sequence lengths - For each sequence, this is the difference between the start and stop position of the sequence in the alignment. This can include gapped positions in the length calculation. This is calculating length variation in a sequence alignment.

The aligned sequence lengths coefficient of variation (ASL-CV; Portik & Wiens, in review) is a standardized measure for comparing length variation across alignments, with higher ASL-CV values indicating greater variability in sequence lengths.

Back to top

Seq_Length_Heterogeneity.py

The Seq_Length_Heterogeneity.py module calculates several metrics in an effort to quantify sequence length variation within alignments. The user must select the sequence type to calculate metrics from (-s), with two choices:

-s bases - Only actual nucleotides are counted for a sequence (gaps ignored). The sequence length is obtained from the sum of the number of real bases present.
-s alns - The position number of the first nucleotide and last nucleotide in the sequence alignment are used to calculate the length of a sequence. This can include gapped positions in the length calculation.

From the lengths obtained for all the sequences in an alignment, the following metrics are calculated: mean, minimum, maximum, standard deviation, and the coefficient of variation.

The results for all alignments are written to a tab-delimited output file in the output directory specified using the -o flag.

Seq_Length_Heterogeneity.py is designed to work for a directory that contains either phylip or fasta alignment files, with the format specified by the -f flag. The input alignment files must be labeled with one of the following extensions to be read: NAME.fasta, NAME.fa, NAME.phylip, or NAME.phy.

Basic Usage:

python Seq_Length_Heterogeneity.py -i <input directory> -o <output directory> -f <input format> -s <sequence length type>

Argument Explanations:

`-i <path-to-directory>` or `--indir <path-to-directory>`

Required: The full path to a directory which contains the input alignment files (fasta or phylip format).

`-o <path-to-directory>` or `--outdir <path-to-directory>`

Required: The full path to an existing directory to write output files.

`-f <choice>` or `--format <choice>`

Required: The file format of the INPUT alignments. Choices = fasta, phylip.

`-s <choice>` or `--seqtype <choice>`

Required: Use de-aligned sequence lengths (bases) or aligned sequence lengths (alns) for calculations. Choices = bases, alns.

Example Usage:

python Seq_Length_Heterogeneity.py -i bin/SeqHet/input -o bin/SeqHet/output -f fasta -s bases

Above command will calculate sequence length heterogeneity metrics for all fasta files in directory bin/SeqHet/input based on the actual number of bases in each sequence.

python Seq_Length_Heterogeneity.py -i bin/SeqHet/input -o bin/SeqHet/output -f fasta -s alns

Above command will calculate sequence length heterogeneity metrics for all fasta files in directory bin/SeqHet/input based on the start and stop position of each sequence in the alignment.

Outputs:

Outputs are created in the output directory specified, and depend on the length method selected. The outputs can include:

File Sequence_Length_Heterogeneity_Results_BasePairLengths.txt - The output file created when option -s bases is used.
File Sequence_Length_Heterogeneity_Results_AlignedLengths.txt - The output file created when option -s alns is used.

Both files will contain the following columns:

Alignment_File - The name of the input alignment file.
MeanLength - The average length of the sequences contained in the alignment.
MinLength - The minimum sequence length contained in the alignment.
MaxLength - The maximum sequence length contained in the alignment.
SD - The standard deviation of the sequence lengths.
CoV - The coefficient of variation (the ratio of the standard deviation to the mean) of the sequence lengths.

Here is an example of the contents:

Alignment_File	MeanLength	MinLength	MaxLength	SD	CoV
UCE-1003_trimmed.phy	486.6	203	989	300.9	0.618
UCE-1004_trimmed.phy	510.3	202	1073	366.2	0.718
UCE-1005_trimmed.phy	517.8	208	1039	308.8	0.596
UCE-1008_trimmed.phy	583.7	198	1079	313.4	0.537
UCE-1010_trimmed.phy	526.1	228	1081	338.2	0.643
UCE-1012_trimmed.phy	540.6	213	1103	340.5	0.63
UCE-1013_trimmed.phy	1082.7	883	1121	58.7	0.054
UCE-1014_trimmed.phy	537.4	211	1078	351.2	0.653
UCE-1017_trimmed.phy	500.9	200	1084	332.2	0.663
UCE-1018_trimmed.phy	555.0	200	1087	377.1	0.68

Back to top

Convert to Nexus and Phylip Format

Fasta-formatted alignment files can be quickly converted into nexus and phylip formats using SuperCRUNCH. These are two commonly used formats for many phylogenetic and population-genetic analyses.

Back to top

Fasta_Convert.py

The Fasta_Convert.py module converts a directory of aligned fasta files into both phylip and nexus formats. This conversion should only be performed after records have been relabeled, as the original longer description lines containing whitespace will cause problems. Please use the Fasta_Relabel_Seqs.py module for this task.

Fasta_Convert.py is designed to work with a directory of fasta files. To be recognized, the input fasta files should be labeled as NAME.fasta or NAME.fa. The NAME portion should not contain any periods or spaces, but can contain underscores.

Basic Usage:

python Fasta_Convert.py -i <input directory> -o <output directory>

Argument Explanations:

`-i <path-to-directory>` or `--indir <path-to-directory>`

Required: The full path to a directory which contains the fasta format alignments to convert.

`-o <path-to-directory>` or `--outdir <path-to-directory>`

Required: The full path to an existing directory to write output files.

Example Usage:

python Fasta_Convert.py -i bin/Convert/input -o bin/Convert/output

Above command will convert all aligned fasta files found in the directory bin/Convert/input into nexus and phylip format, and write them in corresponding output directories in the bin/Convert/output directory.

Outputs:

Several outputs are created in the output directory specified.

Directory Nexus-Files - For each input fasta alignment file, this directory will contain an corresponding nexus file named NAME.nex.
Directory Phylip-Files - For each input fasta alignment file, this directory will contain an corresponding phylip file named NAME.phy.

Let's assume a file called ND2_Trachy.fasta is included in the input directory. The contents contain four aligned sequences in fasta format:

>Trachylepis_aurata
--CCTTCCTATACTAATGAATCCACTTATATATTCACTTATCATATCGAGCATTGCAATGGGAACAATTATTACAATATCAAGCTTCCAC
>Trachylepis_punctulata
GTCCTTCCTATACTAATGAACCGTATCATATACTCAATTATCTTATCAAACGTAGCAACCGGCACAATTATTACAATAACAAGCTTCCAT
>Trachylepis_sulcata
ATCCTTCCTATACTAATGAACCCCACCATATATTCCATTATCCTTTCAAATGTAGCAATCGGCACCATCATTACAATAACAAGCCACCAC
>Trachylepis_varia
ATCCTTCCTATACTAATGAACCCTATTATGTCCTCCATCATCCTATCAAGCGTAGCAATTGGTACAATTATTACAATATCAAGCTATCAC

Running Fasta_Convert.py would produce a corresponding phylip file ND2_Trachy.phy in the Phylip-Files directory. The contents will look like this:

4 90
Trachylepis_aurata --CCTTCCTATACTAATGAATCCACTTATATATTCACTTATCATATCGAGCATTGCAATGGGAACAATTATTACAATATCAAGCTTCCAC
Trachylepis_punctulata GTCCTTCCTATACTAATGAACCGTATCATATACTCAATTATCTTATCAAACGTAGCAACCGGCACAATTATTACAATAACAAGCTTCCAT
Trachylepis_sulcata ATCCTTCCTATACTAATGAACCCCACCATATATTCCATTATCCTTTCAAATGTAGCAATCGGCACCATCATTACAATAACAAGCCACCAC
Trachylepis_varia ATCCTTCCTATACTAATGAACCCTATTATGTCCTCCATCATCCTATCAAGCGTAGCAATTGGTACAATTATTACAATATCAAGCTATCAC

And a corresponding nexus file ND2_Trachy.nex would be written to the Phylip-Files directory. The contents will look like this:

#NEXUS 
BEGIN DATA;
	DIMENSIONS  NTAX=4 NCHAR=90;
	FORMAT DATATYPE=DNA  MISSING=N GAP=-;
MATRIX

Trachylepis_aurata --CCTTCCTATACTAATGAATCCACTTATATATTCACTTATCATATCGAGCATTGCAATGGGAACAATTATTACAATATCAAGCTTCCAC
Trachylepis_punctulata GTCCTTCCTATACTAATGAACCGTATCATATACTCAATTATCTTATCAAACGTAGCAACCGGCACAATTATTACAATAACAAGCTTCCAT
Trachylepis_sulcata ATCCTTCCTATACTAATGAACCCCACCATATATTCCATTATCCTTTCAAATGTAGCAATCGGCACCATCATTACAATAACAAGCCACCAC
Trachylepis_varia ATCCTTCCTATACTAATGAACCCTATTATGTCCTCCATCATCCTATCAAGCGTAGCAATTGGTACAATTATTACAATATCAAGCTATCAC
;
End;

Back to top

Concatenate Alignments

A traditional supermatrix is created by combining the sequence alignments from all loci into a single sequence alignment, a process called concatenation. The concatenated alignment is created by matching the names of sequences across loci. SuperCRUNCH can be used to create a concatenated alignment, and offers some unique features for concatenation.

Back to top

Concatenation.py

The Concatenation.py module can be used to combine multiple sequence alignments into a single concatenated alignment. The input file format can be fasta or phylip, and is selected using the --informat flag. The input alignment files must be labeled with one of the following extensions to be read: NAME.fasta, NAME.fa, NAME.phylip, or NAME.phy. The NAME portion should not contain any periods or spaces, but can contain underscores.

The complete set of sequence names (generally corresponding to taxon or taxon + voucher labels) that will be used to create the final alignment is gathered from all alignments. All sequence labels must be unique within each alignment (no duplicate names) or an error will be thrown. If a sequence label is absent from an alignment, a missing sequence is generated using the symbol selected with the -s flag (N, -, or ?). The output format must be specified as fasta, phylip, nexus, or interleaved nexus using the --outformat flag. Output files are written to the specified output directory specified with the -o.

Concatenation will only work properly if samples have names that are consistent across loci. This includes species-level supermatrix datasets, where sequences are labeled using a taxon name, and vouchered population-level datasets, where sequences are labeled using a taxon name + voucher code. For more details about how this works, please read through the Relabel Sequence Records and Fasta_Relabel_Seqs.py sections.

Basic Usage:

python Concatenation.py i <input directory> -o <output directory> --informat <format> --outformat <format> -s <missing data symbol>

Argument Explanations:

`-i <path-to-directory>` or `--indir <path-to-directory>`

Required: The full path to a directory which contains the input fasta files.

`-o <path-to-directory>` or `--outdir <path-to-directory>`

Required: The full path to an existing directory to write output files.

`--informat <choice>`

Required: The file format of the INPUT alignments. Choices = fasta, phylip.

`--outformat <choice>`

Required: The file format for the OUTPUT concatenated alignment. Choices = fasta, phylip, nexus, interleaved_nexus.

`-s <choice>` or `--symbol <choice>`

Required: A base pair symbol used to represent missing data when sequences are not available for a taxon. Choices = dash, N, ?.

`--seqwrap <integer>`

Optional: For use with interleaved_nexus format option. The number of bp characters to include per line before splitting to an additional line. Default = 100.

Example Usage:

python Concatenation.py -i /bin/Concat/Input -o /bin/Concat/output --in_format fasta --out_format phylip -s dash

Above command will concatenate all the fasta alignments in the directory /bin/Concat/Input and produce a concatenated phylip alignment in /bin/Concat/output. Missing data are represented with the - symbol.

python Concatenation.py -i /bin/Concat/Input -o /bin/Concat/output --in_format phylip --out_format fasta -s ?

Above command will concatenate all the phylip alignments in the directory /bin/Concat/Input and produce a concatenated fasta alignment in /bin/Concat/output. Missing data are represented with the ? symbol.

Outputs:

Several outputs are created in the output directory specified.

File Concatenated_Alignment.fasta, Concatenated_Alignment.phylip, Concatenated_Alignment.nex, or Concatenated_Alignment_Interleaved.nex - The final concatenated alignment in the output format specified.
File Data_Partitions.txt - A text file displaying the order in which loci were concatenated and their corresponding base pairs within the alignment. These are the data partitions or charsets, and can be used for model testing or partitioning in analyses. You could copy/past these contents to a corresponding nexus file as charset blocks for programs such as MrBayes or RevBayes. Here is an example of the contents:

[data_blocks]
12S_CLUSTALO_Aligned_relabeled_trimmed.fasta = 1-985;
16S_CLUSTALO_Aligned_relabeled_trimmed.fasta = 986-2416;
ADNP_MACSE_Aligned_relabeled_trimmed.fasta = 2417-3259;
AHR_MACSE_Aligned_relabeled_trimmed.fasta = 3260-3856;
AKAP9_MACSE_Aligned_relabeled_trimmed.fasta = 3857-5335;
BACH1_MACSE_Aligned_relabeled_trimmed.fasta = 5336-6589;
BACH2_MACSE_Aligned_relabeled_trimmed.fasta = 6590-8020;
BDNF_MACSE_Aligned_relabeled_trimmed.fasta = 8021-8725;
BHLHB2_MACSE_Aligned_relabeled_trimmed.fasta = 8726-9493;
BMP2_MACSE_Aligned_relabeled_trimmed.fasta = 9494-10183;

File Taxa_Loci_Count.log - A count of the number of sequences that were available for each sequence label. in other words the number of alignments the sequence label was present in. If a species-level dataset was used, sequences are labeled using taxon names and this file will be a full list of the species included in the concatenated alignment. If a vouchered population-level dataset was used, it will be a list of all the samples (taxon name + voucher code) in the concatenated alignment.

Here is an example of the contents from a species-level supermatrix:

Taxon	Loci
Acanthocercus_adramitanus	4
Acanthocercus_annectens	6
Acanthocercus_atricollis	6
Acanthocercus_cyanogaster	6
Acanthocercus_yemensis	8
Acanthosaura_armata	7
Acanthosaura_capra	2
Acanthosaura_crucigera	5
Acanthosaura_lepidogaster	44
Agama_aculeata	7
Agama_africana	5
Agama_agama	51
Agama_anchietae	9
Agama_armata	6
Agama_atra	8
Agama_boensis	5
Agama_bottegi	5
Agama_boueti	7
Agama_boulengeri	7

Here is an example of the contents from a vouchered population-level supermatrix:

Taxon	Loci
Trachylepis_aurata_ZFMK75837	4
Trachylepis_aurata_ZFMK84085	4
Trachylepis_punctulata_MBUR00322	4
Trachylepis_punctulata_MCZA38906	4
Trachylepis_punctulata_MCZA38924	4
Trachylepis_sulcata_AMB4288	4
Trachylepis_sulcata_AMB4291	1
Trachylepis_sulcata_AMB4596	4
Trachylepis_sulcata_AMB4620	4
Trachylepis_sulcata_AMB4693	3
Trachylepis_sulcata_AMB4766	4
Trachylepis_sulcata_AMB4767	4
Trachylepis_sulcata_AMB4782	4
Trachylepis_sulcata_AMB4790	3
Trachylepis_sulcata_AMB6837	2
Trachylepis_sulcata_AMB6838	4
Trachylepis_sulcata_AMB6839	4

Back to top

Last updated: June, 2020

For SuperCRUNCH v1.2

8: Post Alignment Tasks

Post-Alignment Tasks

Overview

Relabel Sequence Records

Fasta_Relabel_Seqs.py

Basic Usage:

Argument Explanations:

-i <path-to-directory> or --indir <path-to-directory>

-o <path-to-directory> or --outdir <path-to-directory>

-r <choice> or --relabel <choice>

-s <path-to-file> or --subspecies <path-to-file>

--voucherize

Example Usage:

Outputs:

Usage Examples

Relabeling a species-level dataset

Species-level dataset: -r species

Species-level dataset: -r accession

Species-level dataset: -r species_acc

Species-level dataset: -r species + --voucherize

Species-level dataset: -r accession + --voucherize

Species-level dataset: -r species_acc + --voucherize

Relabeling a vouchered population-level dataset

Vouchered population-level dataset: -r species

Vouchered population-level dataset: -r accession

Vouchered population-level dataset: -r species_acc

Vouchered population-level dataset: -r species + --voucherize

Vouchered population-level dataset: -r accession + --voucherize

Vouchered population-level dataset: -r species_acc + --voucherize

Relabeling recommendations

Trim Sequence Alignments

Trim_Alignments_Trimal.py

Basic Usage:

Argument Explanations:

-i <path-to-directory> or --indir <path-to-directory>

-o <path-to-directory> or --outdir <path-to-directory>

-f <choice> or --format <choice>

-a <choice> or --analysis <choice>

--gt <float>

Example Usage:

Outputs:

Trim_Alignments_Custom.py

Basic Usage:

Argument Explanations:

-i <path-to-directory> or --indir <path-to-directory>

-o <path-to-directory> or --outdir <path-to-directory>

-a <choice> or --analysis <choice>

-f <choice> or --format <choice>

-w <integer> or --window <integer>

-p <float> or --proportion <float>

-t <float> or --threshold <float>

-d <float> or --max_divergence `

-l <integer> or --min_length <integer>

Example Usage:

Outputs:

Calculate Sequence Length Heterogeneity Statistics

Seq_Length_Heterogeneity.py

Basic Usage:

Argument Explanations:

-i <path-to-directory> or --indir <path-to-directory>

-o <path-to-directory> or --outdir <path-to-directory>

-f <choice> or --format <choice>

-s <choice> or --seqtype <choice>

Example Usage:

Outputs:

Convert to Nexus and Phylip Format

Fasta_Convert.py

Basic Usage:

Argument Explanations:

-i <path-to-directory> or --indir <path-to-directory>

-o <path-to-directory> or --outdir <path-to-directory>

Example Usage:

Outputs:

Concatenate Alignments

Concatenation.py

Basic Usage:

Argument Explanations:

-i <path-to-directory> or --indir <path-to-directory>

-o <path-to-directory> or --outdir <path-to-directory>

--informat <choice>

`-i <path-to-directory>` or `--indir <path-to-directory>`

`-o <path-to-directory>` or `--outdir <path-to-directory>`

`-r <choice>` or `--relabel <choice>`

`-s <path-to-file>` or `--subspecies <path-to-file>`

`--voucherize`

Species-level dataset: `-r species`

Species-level dataset: `-r accession`

Species-level dataset: `-r species_acc`

Species-level dataset: `-r species` + `--voucherize`

Species-level dataset: `-r accession` + `--voucherize`

Species-level dataset: `-r species_acc` + `--voucherize`

Vouchered population-level dataset: `-r species`

Vouchered population-level dataset: `-r accession`

Vouchered population-level dataset: `-r species_acc`

Vouchered population-level dataset: `-r species` + `--voucherize`

Vouchered population-level dataset: `-r accession` + `--voucherize`

Vouchered population-level dataset: `-r species_acc` + `--voucherize`

`-i <path-to-directory>` or `--indir <path-to-directory>`

`-o <path-to-directory>` or `--outdir <path-to-directory>`

`-f <choice>` or `--format <choice>`

`-a <choice>` or `--analysis <choice>`

`--gt <float>`

`-i <path-to-directory>` or `--indir <path-to-directory>`

`-o <path-to-directory>` or `--outdir <path-to-directory>`

`-a <choice>` or `--analysis <choice>`

`-f <choice>` or `--format <choice>`

`-w <integer>` or `--window <integer>`

`-p <float>` or `--proportion <float>`

`-t <float>` or `--threshold <float>`

`-d <float>` or --max_divergence `

`-l <integer>` or `--min_length <integer>`

`-i <path-to-directory>` or `--indir <path-to-directory>`

`-o <path-to-directory>` or `--outdir <path-to-directory>`

`-f <choice>` or `--format <choice>`

`-s <choice>` or `--seqtype <choice>`

`-i <path-to-directory>` or `--indir <path-to-directory>`

`-o <path-to-directory>` or `--outdir <path-to-directory>`

`-i <path-to-directory>` or `--indir <path-to-directory>`

`-o <path-to-directory>` or `--outdir <path-to-directory>`

`--informat <choice>`

`--outformat <choice>`

`-s <choice>` or `--symbol <choice>`

`--seqwrap <integer>`