Rauf Salamzade edited this page Mar 13, 2022 · 35 revisions

lsaBGC Suite Overview

lsaBGC consists of several individual programs that together provide a broad suite of functions for comparative analysis of biosynthetic gene clusters (BGCs) across a single focal lineage or taxon (recommended and tested at the species or genus level). The suite helps to understand the allelic variability observed in BGC genes and to mine for novel SNVs within such genes that represent previously unidentified allelic variants.

Installation

To learn more about the installation of lsaBGC and its dependencies, please take a look at the Installation wiki page.

Background / Introduction

What functionalities does lsaBGC offer to users? Learn more about the suite's intended uses and where it should not be used, along with recommendations of other great software for exploring and wrangling comparative analyses of secondary metabolite genetic architectures, on the Background wiki page!

Tutorial - Exploring the Biosynthetic Potential of Micrococcus luteus

Micrococcus luteus is a common member of the skin microbiome and harbors several BGCs across its compact genome. We use the publicly available genomes of M. luteus as a small and simple test set to demonstrate the exploratory power of lsaBGC. Please have a look at the Tutorial wiki page for further details!

Main Programs

lsaBGC comprises 7 primary programs, listed below. Many of them use an object-oriented infrastructure for processing and analysis; more information on this infrastructure can be found on the OOP Framework wiki page.

Program: lsaBGC-Cluster.py
Description: Takes the comprehensive list of BGCs and clusters them into gene cluster families (GCFs) using MCL.
Input:
  • List of all AntiSMASH BGC Genbanks
  • OrthoFinder Homolog Group vs. Sample Matrix
Output:
  • Summary of GCFs
  • Automated report to inform on best clustering parameter choices (if requested)
  • List for each GCF of BGC members

Program: lsaBGC-Refiner.py
Description: Refines boundaries of BGCs belonging to a single GCF according to user specifications.
Input:
  • AntiSMASH BGC Genbanks for a single GCF
  • OrthoFinder Homolog Group vs. Sample Matrix
  • Boundary Homolog Group ID #1
  • Boundary Homolog Group ID #2
Output:
  • New list of refined AntiSMASH Genbanks for BGCs belonging to the GCF

Program: lsaBGC-Expansion.py
Description: Constructs HMMs for each homolog group observed in a GCF and finds additional instances in new genomes.
Input:
  • AntiSMASH BGC Genbanks for a single GCF
  • Genomic assemblies (comprehensive, including draft)
Output:
  • Expanded list of BGCs belonging to the GCF
  • Expanded OrthoFinder Homolog Group vs. Sample Matrix

Program: lsaBGC-See.py
Description: For a single GCF, visualizes each BGC across a phylogeny (also modifies the phylogeny if a sample has multiple BGCs in the GCF).
Input:
  • AntiSMASH BGC Genbanks for a single GCF
  • Species phylogeny (optional)
Output:
  • Modified species phylogeny that expands samples featuring multiple BGCs for the GCF (if a species phylogeny was provided)
  • Single-copy-core phylogeny of the GCF (if possible and requested)
  • Automated visualization of BGC gene architectures across the species or BGC phylogeny in PDF format
  • Track file for visualization of gene architecture for BGCs in the GCF, to be input into iTOL (for interactive visualization)

Program: lsaBGC-Divergence.py
Description: Determines the 𝜷-RT statistic for assessing BGC divergence relative to genome-wide divergence between isolate pairs.
Input:
  • AntiSMASH BGC Genbanks for a single GCF
  • Genomic assemblies
Output:
  • Report with the 𝜷-RT statistic, showcasing the ratio of the genome-wide similarity to the GCF-specific similarity between pairs of isolates with the GCF

Program: lsaBGC-PopGene.py
Description: Looks at sequence conservation and performs population genetic analyses for each homolog group found in the GCF.
Input:
  • AntiSMASH BGC Genbanks for a single GCF
  • Expanded OrthoFinder Homolog Group vs. Sample Matrix
Output:
  • Report with conservation and population-genetic statistics for each homolog group associated with the GCF
  • Automated visualization of the genetic variability present in the lineage for each homolog group in PDF format

Program: lsaBGC-DiscoVary.py
Description: Identifies GCF instances in metagenomes and looks for base-resolution novelty within genes, using raw sequencing data, that is not observed in genomic assemblies for the taxonomy.
Input:
  • BGC Genbank instances for the GCF
  • Metagenomic/sequencing readsets
  • Codon alignments for homolog groups in the GCF
Output:
  • Listing of which metagenomic/sequencing readsets are predicted to contain the GCF
  • Table report with novel variants never previously observed in genomic assemblies
  • (Optional) Phased homolog group alleles found in metagenomic/sequencing data [uses DESMAN]
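
To make the clustering step of lsaBGC-Cluster.py more concrete, here is a minimal sketch of how BGC-to-BGC edges could be prepared for MCL. It is an illustration under stated assumptions, not lsaBGC's actual implementation: the use of a Jaccard index over shared homolog groups, the `min_similarity` cutoff, and the names `jaccard`, `abc_edges`, and the example BGC and homolog group IDs are all hypothetical. The only external detail relied on is MCL's label ("abc") input format, which is three columns: label, label, weight.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets of homolog group IDs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def abc_edges(bgc_to_groups, min_similarity=0.5):
    """Return tab-separated 'label label weight' lines for BGC pairs.

    bgc_to_groups maps a BGC name to the set of homolog groups it contains
    (hypothetical input; the cutoff is an assumed parameter).
    """
    edges = []
    for (bgc_a, groups_a), (bgc_b, groups_b) in combinations(
        sorted(bgc_to_groups.items()), 2
    ):
        similarity = jaccard(groups_a, groups_b)
        if similarity >= min_similarity:
            edges.append(f"{bgc_a}\t{bgc_b}\t{similarity:.3f}")
    return edges


# Toy data: two BGCs sharing most homolog groups, one unrelated BGC.
example = {
    "bgc_1": {"OG0001", "OG0002", "OG0003"},
    "bgc_2": {"OG0001", "OG0002", "OG0004"},
    "bgc_3": {"OG0009"},
}
edges = abc_edges(example)
```

Each resulting line describes one weighted edge between two BGCs; a graph clustering tool such as MCL would then group connected BGCs into GCFs.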

Also provided are three important workflow programs, lsaBGC-AutoProcess.py, lsaBGC-AutoExpansion.py, and lsaBGC-AutoAnalyze.py, which simplify the generation of inputs necessary for the lsaBGC framework and allow for the automatic processing of each GCF post-clustering through standard analyses:

Program: lsaBGC-AutoProcess.py
Description: Automatically runs Prokka, AntiSMASH, and OrthoFinder.
Input:
  • Genomic assemblies (High Quality / Completed)
Output:
  • AntiSMASH BGC Genbanks
  • OrthoFinder Homolog Group vs. Sample Matrix
  • Species Tree

Program: lsaBGC-AutoExpansion.py
Description: Automatically runs Prokka, AntiSMASH, and OrthoFinder.
Input:
  • Genomic assemblies (High Quality / Completed)
Output:
  • AntiSMASH BGC Genbanks
  • OrthoFinder Homolog Group vs. Sample Matrix
  • Species Tree

Program: lsaBGC-AutoAnalyze.py
Description: Automatically runs lsaBGC-See.py, lsaBGC-PopGene.py, lsaBGC-Divergence.py, and lsaBGC-DiscoVary.py for each GCF.
Input:
  • Genomic assemblies (High Quality / Completed)
Output:
  • AntiSMASH BGC Genbanks
  • OrthoFinder Homolog Group vs. Sample Matrix
  • Species Tree
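
The way the programs feed into one another can be sketched as a small dependency graph. This is only an illustration, assuming dependencies read off the inputs and outputs listed on this page (for example, lsaBGC-PopGene.py consumes the expanded matrix produced by lsaBGC-Expansion.py). The `DEPENDS_ON` mapping and `run_order` helper are hypothetical names, and the real workflows may order or combine steps differently.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each program maps to the programs whose outputs it is assumed to consume.
DEPENDS_ON = {
    "lsaBGC-AutoProcess.py": set(),
    "lsaBGC-Cluster.py": {"lsaBGC-AutoProcess.py"},
    "lsaBGC-Refiner.py": {"lsaBGC-Cluster.py"},
    "lsaBGC-Expansion.py": {"lsaBGC-Cluster.py"},
    "lsaBGC-See.py": {"lsaBGC-Cluster.py"},
    "lsaBGC-Divergence.py": {"lsaBGC-Cluster.py"},
    "lsaBGC-PopGene.py": {"lsaBGC-Expansion.py"},
    "lsaBGC-DiscoVary.py": {"lsaBGC-Cluster.py"},
}


def run_order(depends_on):
    """Return one valid execution order in which every program follows its prerequisites."""
    return list(TopologicalSorter(depends_on).static_order())


order = run_order(DEPENDS_ON)
for step, program in enumerate(order, start=1):
    print(step, program)
```

A workflow framework would express the same idea declaratively: each step states what it consumes and produces, and the engine derives the execution order.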

Future plans include rewriting these workflows in a workflow DSL framework such as Nextflow.