Rauf Salamzade edited this page Aug 4, 2021 · 35 revisions

lsaBGC Suite Overview

lsaBGC consists of several individual programs that together provide a broad suite of functions for comparative analysis of biosynthetic gene clusters (BGCs) across a single focal lineage or taxon (recommended and tested at the species or genus level). The suite can be used to understand the allelic variability observed for BGC genes and to mine for novel SNVs within such genes that represent previously unidentified allelic variants.

Installation

To learn more about the installation of lsaBGC and its dependencies, please take a look at the Installation wiki page.

Background / Introduction

What functionalities does lsaBGC offer to users? Learn more about the suite's intended uses and where it should not be used, along with recommendations of other great software for exploring and wrangling comparative analyses of secondary metabolite genetic architectures, on the Background wiki page!

Walk-through / Tutorial - Exploring the Biosynthetic Potential of Micrococcus luteus

Micrococcus luteus is a common constituent of the skin microbiome and harbors several BGCs across its compact genome. We use the publicly available genomes of M. luteus as a small and simple test set to demonstrate the exploratory power of lsaBGC. Please have a look at the Tutorial wiki page for further details!

Main Programs

Many of the main programs use an object-oriented infrastructure for processing and analysis. More information on this infrastructure can be found on the OOP Framework wiki page.

lsaBGC comprises 7 primary programs:

lsaBGC-Cluster.py
Description: Takes the comprehensive list of BGCs and clusters them into gene cluster families (GCFs) using MCL.
Input:
  • List of all AntiSMASH BGC Genbanks
  • OrthoFinder Homolog Group vs. Sample Matrix
Output:
  • Summary of GCFs
  • Automated report to inform on best clustering parameter choices (if requested)
  • List for each GCF of BGC members

lsaBGC-Refiner.py
Description: Refines boundaries of BGCs belonging to a single GCF according to user specifications.
Input:
  • AntiSMASH BGC Genbanks for a single GCF
  • OrthoFinder Homolog Group vs. Sample Matrix
  • Boundary Homolog Group ID #1
  • Boundary Homolog Group ID #2
Output:
  • New list of refined AntiSMASH Genbanks for BGCs belonging to the GCF

lsaBGC-Expansion.py
Description: Constructs HMMs for each homolog group observed in a GCF and finds additional instances in new genomes.
Input:
  • AntiSMASH BGC Genbanks for a single GCF
  • Genomic assemblies (comprehensive, including draft)
Output:
  • Expanded list of BGCs belonging to the GCF
  • Expanded OrthoFinder Homolog Group vs. Sample Matrix

lsaBGC-See.py
Description: For a single GCF, visualizes each BGC across a phylogeny (also modifies the phylogeny if a sample has multiple BGCs in the GCF).
Input:
  • AntiSMASH BGC Genbanks for a single GCF
  • Species phylogeny (optional)
Output:
  • Modified species phylogeny that expands samples featuring multiple BGCs for the GCF (if a species phylogeny was provided)
  • Single-copy-core phylogeny of the GCF (if possible and requested)
  • Automated visualization of BGC gene architectures across the species or BGC phylogeny in PDF format
  • Track file for visualization of gene architecture for BGCs in the GCF, to be input into iTOL (for interactive visualization)

lsaBGC-Divergence.py
Description: Determines the 𝜷-RT statistic for assessing BGC divergence relative to genome-wide divergence between isolate pairs.
Input:
  • AntiSMASH BGC Genbanks for a single GCF
  • Genomic assemblies
Output:
  • Report with the 𝜷-RT statistic, the ratio of the genome-wide similarity to the GCF-specific similarity between pairs of isolates with the GCF

lsaBGC-PopGene.py
Description: Looks at sequence conservation and performs population genetic analyses for each homolog group found in the GCF.
Input:
  • AntiSMASH BGC Genbanks for a single GCF
  • Expanded OrthoFinder Homolog Group vs. Sample Matrix
Output:
  • Report with conservation and population-genetic statistics for each homolog group associated with the GCF
  • Automated visualization of the genetic variability present in the lineage for each homolog group, in PDF format

lsaBGC-DiscoVary.py
Description: Looks for base-resolution novelty in genes found in the GCF directly from raw sequencing data, allowing for rapid detection without the need for culturing.
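
To make the 𝜷-RT description above concrete, here is a minimal Python sketch that takes the statistic exactly as worded in the table: the genome-wide similarity divided by the GCF-specific similarity for each pair of isolates. The isolate names, the similarity values, and the function name `beta_rt` are all hypothetical and chosen only for illustration. This is not lsaBGC-Divergence.py's actual implementation, and how the program measures similarity is not covered here.

```python
def beta_rt(genome_wide_similarity: float, gcf_similarity: float) -> float:
    """Ratio of genome-wide to GCF-specific similarity for one isolate pair."""
    if gcf_similarity <= 0:
        raise ValueError("GCF-specific similarity must be positive")
    return genome_wide_similarity / gcf_similarity


# Hypothetical pairwise similarities (fractions between 0 and 1).
# These numbers are made up for the example; they are not lsaBGC output.
genome_wide = {("A", "B"): 0.98, ("A", "C"): 0.90, ("B", "C"): 0.91}
gcf_specific = {("A", "B"): 0.98, ("A", "C"): 0.99, ("B", "C"): 0.80}

# One ratio per isolate pair that carries the GCF.
report = {pair: beta_rt(genome_wide[pair], gcf_specific[pair]) for pair in genome_wide}

for (first, second), value in sorted(report.items()):
    print(f"{first}\t{second}\t{value:.3f}")
```

Under this definition, a ratio near 1 means the GCF is about as similar between the two isolates as their genomes are overall. A ratio above 1 means the GCF is less similar than the genome-wide background, and a ratio below 1 means it is more similar.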

Also provided are three workflow programs, lsaBGC-AutoProcess.py, lsaBGC-AutoExpansion.py, and lsaBGC-AutoAnalyze.py, which simplify the generation of inputs necessary for the lsaBGC framework and automate the processing of each GCF through standard analyses after clustering:

lsaBGC-AutoProcess.py
Description: Automatically runs Prokka, AntiSMASH, and OrthoFinder.
Input:
  • Genomic assemblies (High Quality / Completed)
Output:
  • AntiSMASH BGC Genbanks
  • OrthoFinder Homolog Group vs. Sample Matrix
  • Species Tree

lsaBGC-AutoExpansion.py
Description: Automatically runs Prokka, AntiSMASH, and OrthoFinder.
Input:
  • Genomic assemblies (High Quality / Completed)
Output:
  • AntiSMASH BGC Genbanks
  • OrthoFinder Homolog Group vs. Sample Matrix
  • Species Tree

lsaBGC-AutoAnalyze.py
Description: Automatically runs lsaBGC-See.py, lsaBGC-PopGene.py, lsaBGC-Divergence.py, and lsaBGC-DiscoVary.py for each GCF.
Input:
  • Genomic assemblies (High Quality / Completed)
Output:
  • AntiSMASH BGC Genbanks
  • OrthoFinder Homolog Group vs. Sample Matrix
  • Species Tree
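
One way to see how the programs fit together is to treat each listed output as a possible input to a later program and derive a run order from that. The sketch below does so with Python's standard `graphlib` module (Python 3.9+). The short artifact labels such as `gcf_bgc_lists` are shorthand invented for this example, not lsaBGC file names or options. It also assumes that the "AntiSMASH BGC Genbanks for a single GCF" input corresponds to the per-GCF BGC lists written by lsaBGC-Cluster.py. lsaBGC-Refiner.py and lsaBGC-DiscoVary.py are left out for brevity.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# program -> (inputs, outputs), paraphrased from the tables above.
# Artifact labels are hypothetical shorthand used only in this sketch.
PROGRAMS = {
    "lsaBGC-AutoProcess.py": ({"assemblies"},
                              {"bgc_genbanks", "hg_matrix", "species_tree"}),
    "lsaBGC-Cluster.py": ({"bgc_genbanks", "hg_matrix"}, {"gcf_bgc_lists"}),
    "lsaBGC-Expansion.py": ({"gcf_bgc_lists", "assemblies"},
                            {"expanded_gcf_bgc_lists", "expanded_hg_matrix"}),
    "lsaBGC-See.py": ({"gcf_bgc_lists", "species_tree"}, {"gcf_visualization"}),
    "lsaBGC-Divergence.py": ({"gcf_bgc_lists", "assemblies"}, {"beta_rt_report"}),
    "lsaBGC-PopGene.py": ({"gcf_bgc_lists", "expanded_hg_matrix"}, {"popgen_report"}),
}

# Map each artifact to the program that produces it.
producer = {
    artifact: program
    for program, (_, outputs) in PROGRAMS.items()
    for artifact in outputs
}

# A program depends on whichever programs produce its inputs.
# Inputs nobody produces (for example the assemblies) come from the user.
dependencies = {
    program: {producer[artifact] for artifact in inputs if artifact in producer}
    for program, (inputs, _) in PROGRAMS.items()
}

# Any topological order of this graph is a valid run order.
order = list(TopologicalSorter(dependencies).static_order())
print("\n".join(order))
```

The exact order among independent programs (for example lsaBGC-See.py and lsaBGC-Divergence.py) may vary. What the sketch guarantees is only that each program appears after the programs whose outputs it consumes.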