Skip to content

08. High throughput Detection of New GCF Instances Across Draft Genome Assemblies

Rauf Salamzade edited this page Jan 23, 2023 · 41 revisions

lsaBGC-Expansion.py

The program lsaBGC-Expansion.py serves a critical role in the lsaBGC suite because it allows for rapidly identifying instances of GCFs across draft genomes - avoiding the need to run OrthoFinder and AntiSMASH on and across these additional genomes (thereby saving significant computational runtime and resources). Comprehensive identification of GCF instances across available assemblies for a lineage is a key necessity in allowing us to later identify novel SNVs in GCF related genes directly from raw sequencing data, including metagenomic data.

An Overview of the Algorithm

1. Homolog Group Identification Through Profile HMMs

To identify new instances of GCFs in draft assemblies, lsaBGC-Expansion.py begins by first creating protein alignments for homolog groups observed in known instances of the GCF, e.g. from the BGCs identified in high quality completed genomic assemblies used on the initial run through OrthoFinder and AntiSMASH. After the protein alignments are constructed using MAFFT local alignment with standard settings, it then uses the HMMER3 software to construct profile HMMs for each homolog group. The profile HMMs are then used to search the predicted proteomes of the very genomes used to create them to identify the largest (least significant) E-value for a gene used to create the profile HMM [true positives] as well as the smallest (most significant) E-value for a homologous gene not used in creating the HMM [false negative]. If the least significant E-value of a true-hit is more significant than the most significant hit for a false-hit, this is noted and the threshold used for detection of the profile HMM in new genomes (from the expansion set) will be the E-value of the most significant false positive hit multiplied by a factor of 1E-5. If however E-values of true-hits and false-hits are intermingled, then a default threshold of 1E-10 is used for detection of the profile HMM. Additionally, a predicted protein identified as matching a homolog group profile HMM needs to have a length similar to the proteins from the high-quality assemblies known to belong to the homolog group. Specifically, ORFs corresponding to matching proteins must be at most 1.5 times the max(25, median absolute divergence) base-pairs shorter or longer than the median ORF length of genes belonging to the homolog group.

IMPORTANT NOTE: Requiring homolog groups to be detected at an E-value threshold determined from the input set of genomes requires the target genomes to be phylogenetically interspersed among the input genomes. That is you cannot apply HMM on an entire genus after simply training on a subset of genomes which merely represent a single species.

Very fast homology detection using Diamond BLASTp

Inspired by the approach employed by Melnyk et al. 2019, we also offer an option to greatly boost the speed of this step which performs Diamond alignment of consensus sequences emitted from profile-HMMs for each homolog group. While HMMER3 is perhaps the most reliable in detecting homology to a particular homolog group, if expansion is aimed to look for GCFs in 500+ genomes, it is recommended to use this alternate Diamond-alignment-based mode, through specifying the --quick_mode flag.

2. Detection of Neighborhoods of Proteins Matching GCF Associated Homolog Groups using a Classical HMM

A classical HMM is used to scan across draft assembly scaffolds (for samples in the expansion set) with characters corresponding to predicted ORFs and states to either the GCF or Background. Emission probabilities are based on homology ORFs depict to homolog groups of the GCF in question.

Emission Probabilities

Once homolog group profile HMMs are detected using established thresholds in the predicted proteome of a sample in the expansion set, we estimate the emission probability of each homolog group based on whether it's profile HMM is able to distinguish true hits from false hits from reflexive alignments (see section above). If it is able to distinguish true-hits from false-hits, then detecting a homolog group at the E-value threshold described above is very likely to represent a predicted protein belongs to the GCF and the homolog group will have an emission probability of 0.99 for the "GCF State" and a emission probability of 0.01 for the "Background State". If the profile HMM is unable to distinguish true-hits from false-hits, then the emission probability for the "Background State" is set to a maximum of either: (i) 1.0 - (# of GCF instances / # of total instances) or (ii) 0.2. The emission probability for the "GCF State" in this case will be set to the complement of the "Background State" probability.

Transition Probabilities

The transition probabilities between states are set to 0.1 by default and to 0.9 for transitioning to the same state. The start and end of a scaffold are equally likely to correspond to either the states of GCF or Background. In summary, the transition probabilities are by default the following, however they can be configured by users as optional parameters in lsaBGC-Expansion:

3. Conditional Checks on Detected Neighborhoods to Curb False Positive Detection

Each genomic neighborhood of predicted ORFs detected by the HMM as corresponding to the GCF (herein referred to as "segments") are next conditionally assessed to avoid reporting false positive neighborhoods which are likely not related to the GCF in question.

Each segment must feature at least 3 homolog groups found in a known instance of the GCF. Additionally, all segments must display reasonable syntenic similarity (gene positioning) to a known instance of the GCF found in the high quality genomic assemblies. Unlike the syntenic similarity filtering applied in the clustering of BGCs into GCFs, here we use Pearson's correlation instead of Spearman's correlation to more reliably assess syntenic similarity of shorter segments to known GCF instances. Correlations are only calculated if the genes in the segment under consideration and the comparing known BGC instance have genes in the same relative direction. A default correlation of 0.8 to at least one known GCF instance is required by each segment (with p-value < 0.1).

Inclusion Criteria 1: "Smoking guns"

Segments are reported as part of the GCF automatically if any of the following criteria are met:

  1. The segment features >= 5 homolog groups and segment features >= 3 "core" homolog groups. "Core" homolog groups are those which are observed in all known instances of a GCF from the high-quality genomic assemblies. The thresholds here are possible to configure by the user as optional parameters.
  2. The segment features a GCF "specific" homolog group. These are homolog groups which is only observed within the GCF in the high-quality genome assemblies, thus there presence is a good indication the segment belongs to the GCF and is not a false positive.
  3. The segment features homolog groups which overlap with the core of protoclusters in GCFs as delineated by AntiSMASH.

Inclusion Criteria 2: "Edge" cases

If segments do not meet the "smoking guns" criteria above, they can still be reported if they are on the "edge" of scaffolds, specifically include an ORF within 2000 bp from the end of a scaffold. These segments do not need to meet (1) individually, but do need to feature >=5 homolog groups and >=3 core homolog groups of the GCF in unison. Additionally, only one edge segment per scaffold is allowed.

4. Final Polishing of Segment Homolog Group Assignment

To ensure we are not missing alternate variants of homolog groups not represented in the initial lsaBGC analysis, we perform a final "polishing" step for segments which are considered to belong to a GCF. This involves reassessing genes on the segment which are unassigned to any homolog group and additionally searching for surrounding genes (up to 5 genes on each side) which display homology to GCF associated homolog groups. A gene only needs to display homology at a level of less than 1E-10 Evalue to be assigned to a homolog group. Critically, we only assign these genes to homolog groups after the segment is determined as belonging to the focal GCF to ensure our detection of segments is based on more concrete/significant homology-based evidence.

Illustrative Overview of the lsaBGC-Expansion Algorithm

Usage

usage: lsaBGC-Expansion.py [-h] -g GCF_LISTING -m ORTHOFINDER_MATRIX -l INITIAL_LISTING -e EXPANSION_LISTING -o
                           OUTPUT_DIRECTORY [-i GCF_ID] [-p BGC_PREDICTION_SOFTWARE] [-ms MIN_SEGMENT_SIZE]
                           [-mcs MIN_SEGMENT_CORE_SIZE] [-sct SYNTENIC_CORRELATION_THRESHOLD]
                           [-tg TRANSITION_FROM_GCF_TO_GCF] [-tb TRANSITION_FROM_BG_TO_BG] [-c CPUS] [-q] [-no] [-w]
                           [-ph PROTOCORE_HOMOLOGS [PROTOCORE_HOMOLOGS ...]] [-ap] [-z PICKLE_EXPANSION_ANNOTATION_DATA]

        Program: lsaBGC-Expansion.py
        Author: Rauf Salamzade
        Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

        This program takes in a list of representative BGCs belonging to a single GCF (for instance those identified in
        high quality genomic assemblies) and then searches for new instances of the GCF in an expansion set of draft
        assemblies.

        The program builds profile HMMs for homolog groups identified by OrthoFinder from the representative BGCs,
        searches for their presence in the comprehensive set of BGCs (from draft genomes), applys a classical HMM to
        identify neighborhoods of genes on draft assembly scaffolds which match the GCF profile, and performs conditional
        checks to filter out false-positive findings.


optional arguments:
  -h, --help            show this help message and exit
  -g GCF_LISTING, --gcf_listing GCF_LISTING
                        BGC listings file for a gcf. Tab delimited: 1st column lists sample
                        name while the 2nd column is the path to a BGC prediction in Genbank format.
  -m ORTHOFINDER_MATRIX, --orthofinder_matrix ORTHOFINDER_MATRIX
                        OrthoFinder matrix.
  -l INITIAL_LISTING, --initial_listing INITIAL_LISTING
                        Tab delimited text file for primary samples with three columns: (1) sample name
                        (2) path to whole-genome generated Genbank file (*.gbk), and (3)path to whole-genome generated
                        predicted-proteome file (*.faa).
  -e EXPANSION_LISTING, --expansion_listing EXPANSION_LISTING
                        Tab delimited text file for additional/draft samples in the expansion set with three columns: (1) sample name
                        (2)path to whole-genome Genbank file (*.gbk), and (3)path to whole-genome
                        predicted-proteome file (*.faa).
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Path to output directory.
  -i GCF_ID, --gcf_id GCF_ID
                        GCF identifier.
  -p BGC_PREDICTION_SOFTWARE, --bgc_prediction_software BGC_PREDICTION_SOFTWARE
                        Software used to predict BGCs (Options: antiSMASH, DeepBGC, GECCO).
                        Default is antiSMASH.
  -ms MIN_SEGMENT_SIZE, --min_segment_size MIN_SEGMENT_SIZE
                        The minimum number of homolog groups (>3) needed to report discrete segments of the GCF. Ignored if GCF specific or functionally core (e.g. harboring key domain for detection of BGC) homolog group is found in segment. Default is 5.
  -mcs MIN_SEGMENT_CORE_SIZE, --min_segment_core_size MIN_SEGMENT_CORE_SIZE
                        The minimum number of core (to the known instances of the GCF) homololog groups needed
                        to report discrete segments of the GCF. Default is 3.
  -sct SYNTENIC_CORRELATION_THRESHOLD, --syntenic_correlation_threshold SYNTENIC_CORRELATION_THRESHOLD
                        The minimum syntenic correlation needed to at least one known
                        GCF instance to report segment.
  -tg TRANSITION_FROM_GCF_TO_GCF, --transition_from_gcf_to_gcf TRANSITION_FROM_GCF_TO_GCF
                        GCF to GCF state transition probability for HMM. Should be between
                        0.0 and 1.0. Default is 0.9.
  -tb TRANSITION_FROM_BG_TO_BG, --transition_from_bg_to_bg TRANSITION_FROM_BG_TO_BG
                        Background to background state transition probability for HMM. Should be
                        between 0.0 and 1.0. Default is 0.9.
  -c CPUS, --cpus CPUS  The number of cpus to use.
  -q, --quick_mode      Run in quick-mode. Instead of running HMMScan for each homolog group, a consensus
                        sequence is emitted and Diamond is used for searching instead. Method inspired by Melnyk et al. 2019
  -no, --no_orthogroup_matrix
                        Avoid writing the updated OrthoFinder matrix at the end.
  -w, --loose           Remove requirement for proto-core/rule-based homolog group being detected in a single
                        neighborhood for GCF to be reported as present.
  -ph PROTOCORE_HOMOLOGS [PROTOCORE_HOMOLOGS ...], --protocore_homologs PROTOCORE_HOMOLOGS [PROTOCORE_HOMOLOGS ...]
                        List of homolog group identifiers comprising the core of the BGC/GCF of which at
                        least one is required for GCF to be reported. Please provide as space-seperated list
                        with quotes surrounding: "OG1 OG2 ..."
  -ap, --all_primary    Treat all known GCF instances as primary (use if BGC Genbanks were not processed through lsaBGC-Ready).
  -z PICKLE_EXPANSION_ANNOTATION_DATA, --pickle_expansion_annotation_data PICKLE_EXPANSION_ANNOTATION_DATA
                        Pickle file with serialization of annotation data in the expansion listing file.

lsaBGC-AutoExpansion.py

To run lsaBGC-Expansion.py on all GCFs across an expansion set of draft quality genomic assemblies, we have created a wrapper workflow program called lsaBGC-AutoExpansion.py. After running lsaBGC-Expansion.py individually per GCF, it consolidates results and resolves any conflict in genes from the same genomic assembly being assigned to multiple GCFs.

It resolves conflicts, where coordinates of two distinct GCFs overlap in a draft assembly by performing pairwise comparisons of the gene sets between every BGC with each other. Overlap between two BGCs from distinct GCFs is considered a conflict if they overlap with more than 5% of the number genes of one of the BGCs. If this is the case, the sum of the exponents E-values of each BGC's genes to their respective homolog group profile HMMs are compared and the BGC with the lower sum (indicating more genes in the BGC match a GCF profile or that genes match better to the GCF profile) is retained while the other is discarded.

lsaBGC-AutoExpansion.py also creates consolidated result files, including:

  1. An updated/expanded sample vs. homolog group gene matrix
  2. An updated/expanded sample listings file (sample Genbank assembly file predicted proteome file)
  3. Updated listing files of BGCs for each GCF

Usage

usage: lsaBGC-AutoExpansion.py [-h] -g GCF_LISTING_DIR -m ORTHOFINDER_MATRIX -l INITIAL_LISTING -e EXPANSION_LISTING -o
                               OUTPUT_DIRECTORY [-p BGC_PREDICTION_SOFTWARE] [-q] [-z PICKLE_EXPANSION_ANNOTATION_DATA]
                               [-c CPUS] [-ph PROTOCORE_HOMOLOGS]

        Program: lsaBGC-AutoExpansion.py
        Author: Rauf Salamzade
        Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

        Program to run lsaBGC-Expansion on each GCF, resolve conflicts across GCFs, and then consolidate result files when
        overlapping BGC predictions for different GCFs exist.


optional arguments:
  -h, --help            show this help message and exit
  -g GCF_LISTING_DIR, --gcf_listing_dir GCF_LISTING_DIR
                        Directory with GCF listing files.
  -m ORTHOFINDER_MATRIX, --orthofinder_matrix ORTHOFINDER_MATRIX
                        OrthoFinder homolog group by sample matrix.
  -l INITIAL_LISTING, --initial_listing INITIAL_LISTING
                        Path to tab delimited text file for samples with three columns: (1) sample name (2) Prokka generated Genbank file (*.gbk), and (3) Prokka generated predicted-proteome file (*.faa). Please remove troublesome characters in the sample name.
  -e EXPANSION_LISTING, --expansion_listing EXPANSION_LISTING
                        Path to tab delimited file listing: (1) sample name (2) path to Prokka Genbank and (3) path to Prokka predicted proteome. This file is produced by lsaBGC-AutoProcess.py.
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Parent output/workspace directory.
  -p BGC_PREDICTION_SOFTWARE, --bgc_prediction_software BGC_PREDICTION_SOFTWARE
                        Software used to predict BGCs (Options: antiSMASH, DeepBGC, GECCO).
                        Default is antiSMASH.
  -q, --quick_mode      Whether to run lsaBGC-Expansion in quick mode?
  -z PICKLE_EXPANSION_ANNOTATION_DATA, --pickle_expansion_annotation_data PICKLE_EXPANSION_ANNOTATION_DATA
                        Pickle file with serialization of annotation data in the expansion listing file.
  -c CPUS, --cpus CPUS  Total number of cpus to use.
  -ph PROTOCORE_HOMOLOGS, --protocore_homologs PROTOCORE_HOMOLOGS
                        File with manual listings of proto-core homolog groups.
                        This should be provided as 2 column tab-delmited file: (1) GCF id and
                        (2) space delmited listing of homolog groups.
Clone this wiki locally