09. Assessing Evolutionary Linkage of BGCs with their Genome wide Contexts
Inspired by studies on BGC evolution in Salinispora by the Jensen lab at UCSD, we implemented `lsaBGC-Divergence.py` to measure the sequence divergence of GCFs relative to their genomic backgrounds. The program calculates the beta-rd statistic between pairs of samples, which is the sequence similarity across shared homolog groups normalized by the estimated genome-wide sequence similarity. The genome-wide similarity can be computed from the single copy core gene alignments produced by GToTree, which is now part of `lsaBGC-Ready.py`.
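As a rough illustration of the statistic described above, the Python sketch below computes beta-rd for one sample pair as the ratio of GCF sequence similarity to genome-wide sequence similarity. The function name `beta_rd` and its arguments are hypothetical and are not taken from the lsaBGC code base; the program itself may select alignment positions and handle edge cases differently.

```python
def beta_rd(gcf_seq_sim: float, gw_seq_sim: float) -> float:
    """Ratio of GCF-wide similarity to genome-wide similarity for a sample pair.

    Both inputs are assumed to be on the same scale (e.g. both fractions
    between 0 and 1). This is an illustrative helper, not lsaBGC's own code.
    """
    if gw_seq_sim <= 0:
        raise ValueError("genome-wide similarity must be positive")
    return gcf_seq_sim / gw_seq_sim


# Example: a GCF that is less conserved than the genome-wide background
# yields a value below 1.
example = beta_rd(gcf_seq_sim=0.90, gw_seq_sim=0.99)
print(round(example, 3))
```

Under this reading, values near 1 indicate that the GCF has diverged about as much as the rest of the genome, while values well below or above 1 indicate that the GCF is more or less divergent than expected for that pair of samples.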
The output report, called `Relative_Divergence_Report.txt`, is relatively simple and consists of 7 columns:
Column | Description |
---|---|
gcf_id | The GCF identifier |
sample_1 | 1st paired sample identifier |
sample_2 | 2nd paired sample identifier |
beta_rd | The beta-rd statistic value |
gw_seq_sim | The genome-wide sequence similarity. Originally computed using ANI/AAI estimates from MASH/FastANI/CompareM; since the incorporation of GToTree, it is computed as the pairwise sequence similarity of the single copy core genes used for phylogeny construction. |
gcf_seq_sim | The GCF-wide sequence similarity for positions along shared homolog groups where one of the samples has a valid allele. |
gcf_content_sim | The Jaccard index: the number of homolog groups found in both samples divided by the total number of homolog groups observed in either sample. |
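
The `gcf_content_sim` column can be illustrated with a minimal sketch of the Jaccard index over two sets of homolog group identifiers. The function name `jaccard_index` and the identifiers `hg1` to `hg4` are hypothetical. Whether the report stores this value as a fraction or a percentage is not stated here, so the sketch returns a fraction.

```python
def jaccard_index(hgs_a, hgs_b) -> float:
    """Shared homolog groups divided by all homolog groups seen in either sample."""
    a, b = set(hgs_a), set(hgs_b)
    union = a | b
    if not union:
        # Assumption for this sketch: two empty profiles have similarity 0.
        return 0.0
    return len(a & b) / len(union)


# Two samples sharing 2 of 4 distinct homolog groups.
sample_1_hgs = {"hg1", "hg2", "hg3"}
sample_2_hgs = {"hg2", "hg3", "hg4"}
print(jaccard_index(sample_1_hgs, sample_2_hgs))  # 0.5
```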
If run through `lsaBGC-AutoAnalyze.py`, an automatic visualization of the `lsaBGC-Divergence.py` results is generated at the end. It depicts the spread of beta-rd across GCFs, including for sample pairs whose GCF homolog group profiles are similar at Jaccard index thresholds of 90, 75, and 50.
usage: lsaBGC-Divergence.py [-h] -g GCF_LISTING -l INPUT_LISTING -a CODON_ALIGNMENTS -w EXPECTED_SIMILARITIES [-i GCF_ID] -o
OUTPUT_DIRECTORY [-k SAMPLE_SET] [-n USE_CODON] [-c CPUS]
Program: lsaBGC-Divergence.py
Author: Rauf Salamzade
Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology
This program will calculate Beta-RD, the ratio of the estimated amino acid distances between orthologous BGCs from
two samples to the expected differences based on core protein alignments performed by requesting GToTree analysis in
lsaBGC-Ready, for all pairs of samples featuring a BGC belonging to a focal GCF of interest.
optional arguments:
-h, --help show this help message and exit
-g GCF_LISTING, --gcf_listing GCF_LISTING
BGC specifications file. Tab delimited: 1st column contains path to BGC Genbank and 2nd column contains sample name.
-l INPUT_LISTING, --input_listing INPUT_LISTING
Path to tab delimited file listing: (1) sample name (2) path to Prokka Genbank and (3) path to Prokka predicted proteome. This file is produced by lsaBGC-Process.py.
-a CODON_ALIGNMENTS, --codon_alignments CODON_ALIGNMENTS
File listing the codon alignments for each homolog group in the GCF. Can be found as part of PopGene output.
-w EXPECTED_SIMILARITIES, --expected_similarities EXPECTED_SIMILARITIES
Path to file listing expected similarities between genomes/samples. This is
computed most easily by running lsaBGC-Ready.py with '-t' specified, which will estimate
sample to sample similarities based on alignment used to create species phylogeny.
-i GCF_ID, --gcf_id GCF_ID
GCF identifier.
-o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
Prefix for output files.
-k SAMPLE_SET, --sample_set SAMPLE_SET
Sample set to keep in analysis. Should be file with one sample id per line.
-n USE_CODON, --use_codon USE_CODON
Expected sample to sample similarities are reflective of DNA distances instead of protein distances (e.g. if FastANI or MASH were used in computeGenomeWideDistances.py).
-c CPUS, --cpus CPUS The number of cpus to use.