09. Assessing Evolutionary Linkage of BGCs with their Genome wide Contexts

Rauf Salamzade edited this page Jun 21, 2023 · 11 revisions

lsaBGC-Divergence.py

Inspired by studies on BGC evolution in Salinispora by the Jensen lab at UCSD, we implemented lsaBGC-Divergence.py to measure the sequence divergence of GCFs relative to their genomic backgrounds. The program calculates the beta-rd statistic between pairs of samples: the sequence similarity across shared homolog groups, normalized by the estimated genome-wide sequence similarity. The genome-wide similarity can be computed from the single-copy core gene alignments produced by GToTree, which is now part of lsaBGC-Ready.py.
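As a minimal sketch of the normalization described above, beta-rd for one pair of samples can be written as a simple ratio. The function name and the example values are hypothetical, and the actual implementation in lsaBGC-Divergence.py may handle edge cases differently:

```python
def beta_rd(gcf_seq_sim: float, gw_seq_sim: float) -> float:
    """Ratio of GCF-wide to genome-wide sequence similarity for one sample pair.

    Illustrative only: follows the description in the text, not the lsaBGC source.
    """
    if gw_seq_sim <= 0:
        raise ValueError("genome-wide similarity must be positive")
    return gcf_seq_sim / gw_seq_sim


# Hypothetical similarity values for two different sample pairs.
above_background = beta_rd(gcf_seq_sim=0.99, gw_seq_sim=0.90)
below_background = beta_rd(gcf_seq_sim=0.80, gw_seq_sim=0.95)

print(above_background > 1.0, below_background < 1.0)
```

Under this definition, a value above 1 means the GCF is more similar between the two samples than their genomes are overall, and a value below 1 means it is less similar.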

Output

The output report, Relative_Divergence_Report.txt, consists of 7 columns:

Column Description
gcf_id The GCF identifier
sample_1 1st paired sample identifier
sample_2 2nd paired sample identifier
beta_rd The beta-rd statistic value
gw_seq_sim The genome-wide sequence similarity. Originally computed from ANI/AAI estimates (MASH, FastANI, or CompareM); since the incorporation of GToTree, computed as the pairwise sequence similarity of the single-copy core genes used for phylogeny construction.
gcf_seq_sim The GCF-wide sequence similarity for positions along shared homolog groups where one of the samples has a valid allele.
gcf_content_sim The Jaccard index of homolog group content: the number of homolog groups found in both samples divided by the total number of homolog groups observed in either sample.
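The report can be loaded for downstream filtering with a few lines of Python. This sketch assumes the file is tab-delimited with a header row matching the column names above; neither assumption is stated on this page, and the sample names and values below are made up for illustration:

```python
import csv
import io

# Hypothetical report contents (assumed layout: tab-delimited with a header row).
example_report = (
    "gcf_id\tsample_1\tsample_2\tbeta_rd\tgw_seq_sim\tgcf_seq_sim\tgcf_content_sim\n"
    "GCF_1\tsampleA\tsampleB\t1.05\t0.92\t0.966\t0.90\n"
    "GCF_1\tsampleA\tsampleC\t0.85\t0.95\t0.8075\t0.60\n"
)

NUMERIC_COLUMNS = ("beta_rd", "gw_seq_sim", "gcf_seq_sim", "gcf_content_sim")


def read_divergence_report(handle):
    """Parse the report into a list of dicts, converting numeric columns to float."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for column in NUMERIC_COLUMNS:
            row[column] = float(row[column])
        rows.append(row)
    return rows


rows = read_divergence_report(io.StringIO(example_report))

# Sample pairs whose GCF is less similar than their genome-wide background.
below_background = [(r["sample_1"], r["sample_2"]) for r in rows if r["beta_rd"] < 1.0]
print(below_background)
```

For a real run, replace the `io.StringIO` object with an open file handle to Relative_Divergence_Report.txt.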

Visualization Across GCFs by lsaBGC-AutoAnalyze.py

If lsaBGC-Divergence.py is run through lsaBGC-AutoAnalyze.py, a visualization of its results is generated automatically at the end. It depicts the spread of beta-rd across GCFs, including for sample pairs whose GCF homolog group profiles are similar at Jaccard index thresholds of 90%, 75%, and 50%.
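To make the thresholds concrete, the sketch below computes a Jaccard index for two sets of homolog groups and reports which of the three thresholds the pair meets. It assumes the thresholds correspond to 0.90, 0.75, and 0.50 on a 0 to 1 scale and that a pair counts toward every threshold it reaches; the homolog group identifiers are hypothetical, and the plotting code in lsaBGC-AutoAnalyze.py may bin pairs differently:

```python
def jaccard_index(groups_1: set, groups_2: set) -> float:
    """Shared homolog groups divided by all homolog groups seen in either sample."""
    union = groups_1 | groups_2
    if not union:
        return 0.0
    return len(groups_1 & groups_2) / len(union)


# Assumed thresholds, expressed as fractions rather than percentages.
THRESHOLDS = (0.90, 0.75, 0.50)


def thresholds_met(content_sim: float) -> list:
    """Return every threshold that the given Jaccard index reaches."""
    return [t for t in THRESHOLDS if content_sim >= t]


# Hypothetical homolog group profiles for one GCF in two samples.
sample_1_groups = {"HG_1", "HG_2", "HG_3", "HG_4"}
sample_2_groups = {"HG_2", "HG_3", "HG_4", "HG_5"}

content_sim = jaccard_index(sample_1_groups, sample_2_groups)  # 3 shared / 5 total
print(content_sim, thresholds_met(content_sim))
```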

Usage

usage: lsaBGC-Divergence.py [-h] -g GCF_LISTING -l INPUT_LISTING -a CODON_ALIGNMENTS -w EXPECTED_SIMILARITIES [-i GCF_ID] -o
                            OUTPUT_DIRECTORY [-k SAMPLE_SET] [-n USE_CODON] [-c CPUS]

        Program: lsaBGC-Divergence.py
        Author: Rauf Salamzade
        Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

        This program will calculate Beta-RD, the ratio of the estimated amino acid distances between orthologous BGCs from
        two samples to the expected differences based on core protein alignments performed by requesting GToTree analysis in
        lsaBGC-Ready, for all pairs of samples featuring a BGC belonging to a focal GCF of interest.


optional arguments:
  -h, --help            show this help message and exit
  -g GCF_LISTING, --gcf_listing GCF_LISTING
                        BGC specifications file. Tab delimited: 1st column contains path to BGC Genbank and 2nd column contains sample name.
  -l INPUT_LISTING, --input_listing INPUT_LISTING
                        Path to tab delimited file listing: (1) sample name (2) path to Prokka Genbank and (3) path to Prokka predicted proteome. This file is produced by lsaBGC-Process.py.
  -a CODON_ALIGNMENTS, --codon_alignments CODON_ALIGNMENTS
                        File listing the codon alignments for each homolog group in the GCF. Can be found as part of PopGene output.
  -w EXPECTED_SIMILARITIES, --expected_similarities EXPECTED_SIMILARITIES
                        Path to file listing expected similarities between genomes/samples. This is
                        computed most easily by running lsaBGC-Ready.py with '-t' specified, which will estimate
                        sample to sample similarities based on alignment used to create species phylogeny.
  -i GCF_ID, --gcf_id GCF_ID
                        GCF identifier.
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Prefix for output files.
  -k SAMPLE_SET, --sample_set SAMPLE_SET
                        Sample set to keep in analysis. Should be file with one sample id per line.
  -n USE_CODON, --use_codon USE_CODON
                        Expected sample to sample similarities are reflective of DNA distances instead of protein distances (e.g. if FastANI or MASH were used in computeGenomeWideDistances.py).
  -c CPUS, --cpus CPUS  The number of cpus to use.
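
When scripting runs across several GCFs, it can help to assemble the command from the flags listed above. The following is a sketch only: the helper name and all file paths are hypothetical placeholders, and the command is printed rather than executed:

```python
import shlex


def build_divergence_command(gcf_listing, input_listing, codon_alignments,
                             expected_similarities, output_directory,
                             gcf_id=None, cpus=None):
    """Assemble an lsaBGC-Divergence.py argument list from the documented flags."""
    cmd = [
        "lsaBGC-Divergence.py",
        "-g", gcf_listing,
        "-l", input_listing,
        "-a", codon_alignments,
        "-w", expected_similarities,
        "-o", output_directory,
    ]
    if gcf_id is not None:
        cmd += ["-i", gcf_id]
    if cpus is not None:
        cmd += ["-c", str(cpus)]
    return cmd


# All paths and the GCF identifier below are placeholders, not real lsaBGC outputs.
cmd = build_divergence_command(
    gcf_listing="GCF_1.txt",
    input_listing="sample_listing.txt",
    codon_alignments="codon_alignments.txt",
    expected_similarities="expected_similarities.txt",
    output_directory="Divergence_Results/",
    gcf_id="GCF_1",
    cpus=4,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the placeholder paths point at real files.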