09. Assessing Evolutionary Linkage of BGCs with their Genome wide Contexts

lsaBGC-Divergence.py

Inspired by studies on BGC evolution in Salinispora by the Jensen lab at UCSD, we implemented lsaBGC-Divergence.py to measure the sequence divergence of GCFs relative to their genomic backgrounds. For each pair of samples, the program calculates the beta-rd statistic: the sequence similarity across shared homolog groups of the GCF, normalized by the estimated genome-wide sequence similarity. The genome-wide similarity can be computed from the single copy core gene alignments produced by GToTree, which is now part of lsaBGC-Ready.py.
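The normalization described above can be sketched in a few lines of Python. This is a minimal illustration that assumes beta-rd is the plain ratio of the GCF-wide similarity to the genome-wide similarity for one sample pair; the function name and the example values are hypothetical and are not taken from the lsaBGC code base.

```python
def beta_rd(gcf_seq_sim: float, gw_seq_sim: float) -> float:
    """Ratio of GCF-wide to genome-wide sequence similarity for one sample pair.

    Assumption: both inputs are similarities on the same scale (e.g. 0 to 1).
    """
    if gw_seq_sim <= 0:
        raise ValueError("genome-wide similarity must be positive")
    return gcf_seq_sim / gw_seq_sim


# Hypothetical pair: the GCF is 90% similar, the genomic background 96% similar.
example = beta_rd(0.90, 0.96)
print(f"beta-rd = {example:.4f}")
```

Under this reading, a value below 1 means the GCF is more divergent between the two samples than their genomic background, and a value above 1 means it is more similar than the background.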

Output

The output report, called Relative_Divergence_Report.txt, consists of 7 columns:

Column	Description
gcf_id	The GCF identifier
sample_1	1st paired sample identifier
sample_2	2nd paired sample identifier
beta_rd	The beta-rd statistic value
gw_seq_sim	The genome-wide sequence similarity. Originally computed from the ANI/AAI estimates of MASH/FastANI/CompareM; since the incorporation of GToTree, it is computed from the pairwise sequence similarity of the single copy core genes used for phylogeny construction.
gcf_seq_sim	The GCF-wide sequence similarity for positions along shared homolog groups where one of the samples has a valid allele.
gcf_content_sim	The Jaccard index of homolog group content: the number of homolog groups found in both samples divided by the total number of homolog groups observed in either sample.
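To make the gcf_content_sim column concrete, here is a small sketch of the Jaccard index over two sets of homolog groups. The function name and the homolog group identifiers are made up for illustration, and whether the report stores the value as a fraction or as a percentage is not stated here, so the 0 to 1 scale is an assumption.

```python
def jaccard_index(groups_a: set, groups_b: set) -> float:
    """Shared homolog groups divided by all homolog groups seen in either sample."""
    union = groups_a | groups_b
    if not union:
        # Neither sample carries any homolog group: define the similarity as 0.
        return 0.0
    return len(groups_a & groups_b) / len(union)


# Hypothetical homolog group sets for the same GCF in two samples.
sample_1_groups = {"OG_1", "OG_2", "OG_3", "OG_4"}
sample_2_groups = {"OG_2", "OG_3", "OG_4", "OG_5"}

# Three groups are shared and five are observed in total.
content_sim = jaccard_index(sample_1_groups, sample_2_groups)
print(content_sim)
```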

Visualization Across GCFs by `lsaBGC-AutoAnalyze.py`

If lsaBGC-Divergence.py is run through lsaBGC-AutoAnalyze.py, a visualization of its results is generated automatically at the end. It depicts the spread of beta-rd across GCFs, including for sample pairs whose GCF homolog group profiles are similar at Jaccard index thresholds of 90%, 75%, and 50%.
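For readers who want to explore the report themselves, the sketch below groups beta-rd values per GCF after filtering sample pairs by content similarity. It is not the plotting code used by lsaBGC-AutoAnalyze.py. It assumes the report is tab-delimited with a header row carrying the column names listed in the Output section, and that gcf_content_sim is stored as a fraction between 0 and 1; both assumptions should be checked against a real report. The rows in `report_text` are invented toy values.

```python
import csv
import io
from statistics import median


def beta_rd_by_gcf(report_text: str, min_content_sim: float = 0.0) -> dict:
    """Collect beta_rd values per GCF for pairs at or above a content-similarity cutoff."""
    values = {}
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    for row in reader:
        if float(row["gcf_content_sim"]) >= min_content_sim:
            values.setdefault(row["gcf_id"], []).append(float(row["beta_rd"]))
    return values


# Toy report content (made-up values, assumed header row).
header = ["gcf_id", "sample_1", "sample_2", "beta_rd",
          "gw_seq_sim", "gcf_seq_sim", "gcf_content_sim"]
rows = [
    ["GCF_1", "A", "B", "1.02", "0.95", "0.969", "0.95"],
    ["GCF_1", "A", "C", "0.90", "0.96", "0.864", "0.80"],
    ["GCF_2", "A", "B", "1.04", "0.95", "0.988", "0.60"],
    ["GCF_2", "B", "C", "0.70", "0.96", "0.672", "0.40"],
]
report_text = "\n".join("\t".join(fields) for fields in [header] + rows)

# Summarise the beta-rd spread at each Jaccard cutoff.
for cutoff in (0.90, 0.75, 0.50):
    per_gcf = beta_rd_by_gcf(report_text, cutoff)
    summary = {gcf: median(vals) for gcf, vals in per_gcf.items()}
    print(cutoff, summary)
```

To use it on a real run, read Relative_Divergence_Report.txt into a string and pass it in place of `report_text`.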

Usage

usage: lsaBGC-Divergence.py [-h] -g GCF_LISTING -l INPUT_LISTING -a CODON_ALIGNMENTS -w EXPECTED_SIMILARITIES [-i GCF_ID] -o
                            OUTPUT_DIRECTORY [-k SAMPLE_SET] [-n USE_CODON] [-c CPUS]

        Program: lsaBGC-Divergence.py
        Author: Rauf Salamzade
        Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

        This program will calculate Beta-RD, the ratio of the estimated amino acid distances between orthologous BGCs from
        two samples to the expected differences based on core protein alignments performed by requesting GToTree analysis in
        lsaBGC-Ready, for all pairs of samples featuring a BGC belonging to a focal GCF of interest.


optional arguments:
  -h, --help            show this help message and exit
  -g GCF_LISTING, --gcf_listing GCF_LISTING
                        BGC specifications file. Tab delimited: 1st column contains path to BGC Genbank and 2nd column contains sample name.
  -l INPUT_LISTING, --input_listing INPUT_LISTING
                        Path to tab delimited file listing: (1) sample name (2) path to Prokka Genbank and (3) path to Prokka predicted proteome. This file is produced by lsaBGC-Process.py.
  -a CODON_ALIGNMENTS, --codon_alignments CODON_ALIGNMENTS
                        File listing the codon alignments for each homolog group in the GCF. Can be found as part of PopGene output.
  -w EXPECTED_SIMILARITIES, --expected_similarities EXPECTED_SIMILARITIES
                        Path to file listing expected similarities between genomes/samples. This is
                        computed most easily by running lsaBGC-Ready.py with '-t' specified, which will estimate
                        sample to sample similarities based on alignment used to create species phylogeny.
  -i GCF_ID, --gcf_id GCF_ID
                        GCF identifier.
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Prefix for output files.
  -k SAMPLE_SET, --sample_set SAMPLE_SET
                        Sample set to keep in analysis. Should be file with one sample id per line.
  -n USE_CODON, --use_codon USE_CODON
                        Expected sample to sample similarities are reflective of DNA distances instead of protein distances (e.g. if FastANI or MASH were used in computeGenomeWideDistances.py).
  -c CPUS, --cpus CPUS  The number of cpus to use.
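
As an illustration of how the arguments in the usage text fit together, the following sketch assembles a command line in Python. Only flags shown in the usage text are used; every file path, the output directory, and the GCF identifier are hypothetical placeholders to be replaced with the outputs of your own lsaBGC run.

```python
import shlex

# All paths and the GCF identifier below are hypothetical placeholders.
args = [
    "lsaBGC-Divergence.py",
    "-g", "GCF_1.txt",                       # BGC listing for the focal GCF
    "-l", "sample_annotation_listing.txt",   # sample listing
    "-a", "codon_alignments_listing.txt",    # codon alignments listing (PopGene output)
    "-w", "expected_similarities.txt",       # expected sample to sample similarities
    "-i", "GCF_1",                           # GCF identifier
    "-o", "Divergence_Results/",             # output directory
    "-c", "4",                               # number of cpus
]

# Print a shell-safe version of the command for inspection.
command = shlex.join(args)
print(command)

# To execute it, one option is: subprocess.run(args, check=True)
```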
