Skip to content

2. more info on fai

Rauf Salamzade edited this page Dec 2, 2024 · 24 revisions

fai

fai allows for identification of homologous instances of a query or set of query gene-cluster(s). It has many options, including two different approaches for delineation of gene-cluster boundaries, requesting filtering of paralogous instances of gene-clusters, and piecing together gene-clusters which are fragmented across scaffolds in genomic assemblies.

The final set of homologous gene clusters from target genomes in GenBank format can be found in the fai results subdirectory: /path/to/fai_Results/Homologous_GenBanks_Directory/

Accounting for paralogous gene-clusters: -fp / --filter_paralogs

While paralogy is often thought off with regards to individual genes, full gene-clusters can also be paralogous within individual genomes. We thus allow filtering of paralogous gene-clusters if more than two distinct reference proteins/homolog groups are shared - suggesting paralogy beyond fragmentation that might have split a gene in two.

Working with poor quality assemblies or MAGs? Consider: -dm / --draft_mode

Similar to lsaBGC-Expansion.py, fai allows for detection of gene-clusters fragmented across multiple scaffolds by accounting for proximity to scaffold edges. If --draft_mode is requested, thresholds needed for discovery of gene-clusters can simply be met in aggregate, putting together other homologous gene-clusters. Note, each individual gene-cluster segment must still contain three distinct query homolog groups.

Modes for Gene-Cluster Discovery

Similar to lsaBGC-Expansion and cblaster, fai looks for homologs in target genomes by "BLASTp"ing proteins to predicted proteomes using DIAMOND. Afterwards, the programs differ in how they identify candidate/valid homologous gene-clusters.

cblaster uses a maximum distance parameter (default of 20kb) to determine whether genes should be grouped together in one gene-cluster segment. The cblaster suite also offers an intuitive approach to select the optimal value for this parameter. Similarly, fai offers "Gene-Clumper" mode (the current default) which simply groups genes together if they are within 5 genes of each other. fai also offers an "HMM" based approach which can be used to identify stringent blocks of gene-sets and then merged together into larger blocks through the same parameter, --max_genes_disconnect, used by "Gene-Clumper" mode to group individual genes. Unlike the "HMM" based approach in lsaBGC-Expansion, the emission probability parameters are not automatically determined based on reflexive alignment of the query gene-cluster proteins to the background genome from which it was extracted to determine cutoffs to distinguish paralogous hits from orthologous hits. Similar to lsaBGC-Expansion though, HMM probability parameters are user-adjustable.

Visuals to Assess Quality of Detection of Homologous Gene-Clusters and Guide User Parameter Adjustment

fai will by default produce a "Tiny-AAI" plot which depicts the average-amino acid identity of genes from homologous gene clusters in target genomes to the query gene cluster (x-axis) and the proportion of query genes/ortholog groups found.

image

fai can also produce a multi-page PDF with plots such as the following for showing the quality of homologous gene-clusters detected. This report can be requested via the -gp or --generate_plots argument.

These plots showcase syntenic order and similarity to reference genes (height is the CDS to reference protein ratio - ideal match should be at 1, indicating they are the same length) and colored on a scale from 0 (grey) to 100 (red) corresponding to percent identity. Black borders indicate key query proteins - if provided by the user.

image

Explanation of Report

Column Description Notes
sample The identifier of the target genome.
gene-cluster-id The identifier of a discrete neighborhood of genes identified as homologous to the query gene cluster. Only in the "Gene Cluster Instance - Report" tab
aggregate-bitscore The aggregate bitscore of hits to the query gene cluster genes. Only the best hit for each query gene/ortholog-group is retained (based on bitscore).
aai-to-query The average amino-acid identity of the proteins in the target genome to the query gene cluster genes. Only the best hit for each query gene/ortholog-group is retained (based on bitscore).
mean-sequence-to-query-ratio The average sequence-to-query ratio of the proteins. Only the best hit for each query gene/ortholog-group is retained (based on bitscore).
proportion-query-genes-found The proportion of query genes/ortholog-groups found in homologous gene clusters across the target genome ("Genome Wide - Report") or in a specific discrete neighborhood ("Gene Cluster Instance - Report")
avg-syntenic-correlation Pearson product-moment correlation coefficient for global syntenic similarity of a specific discrete neighborhood to the query gene cluster or the average of these values across all discrete neighborhoods which meet user defined filters.
number-background-genes The number of background genes in the delineated region within query gene hits which are not represented in the query.
number-gene-clusters The number of discrete gene neighborhoods which individually meet reporting criteria. Only in the "Genome Wide - Report" tab
copy-counts A string separated by commas listing the copy-count of individual query genes/ortholog-groups.

Usage

Usage: fai [-h] -tg TARGET_GENOMES_DB [-i QUERY_INPUTS [QUERY_INPUTS ...]] [-r REFERENCE_GENOME] [-rc REFERENCE_CONTIG] [-rs REFERENCE_START] [-re REFERENCE_END]
           [-pq PROTEIN_QUERIES] [-sq SINGLE_QUERY] -o OUTPUT_DIR [-dm] [-fp] [-rgv] [-e EVALUE_CUTOFF] [-m MIN_PROP] [-kpq KEY_PROTEIN_QUERIES]
           [-kpe KEY_PROTEIN_EVALUE_CUTOFF] [-kpm KEY_PROTEIN_MIN_PROP] [-sct SYNTENIC_CORRELATION_THRESHOLD] [-gdm GC_DELINEATION_MODE] [-f FLANKING_CONTEXT]
           [-mgd MAX_GENES_DISCONNECT] [-gt GC_TRANSITION] [-bt BG_TRANSITION] [-ge GC_EMISSION] [-be BG_EMISSION] [-gp] [-ds DIAMOND_SENSITIVITY]
           [-st SPECIES_TREE] [-phl PHYLOHEATMAP_LENGTH] [-phw PHYLOHEATMAP_WIDTH] [-c THREADS] [-cl] [-mm MAX_MEMORY] [-v]

        Program: fai
        Author: Rauf Salamzade
        Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

                 .o88o.            o8o
                 888 `"            `"'
                o888oo   .oooo.   oooo
                 888    `P  )88b  `888
                 888     .oP"888   888
                 888    d8(  888   888
                o888o   `Y888""8o o888o

    MODES OF INPUT:
    *********************************************************************************

    Type 1: Directory of Homologous Gene-Cluster GenBanks
    (GenBanks must have CDS features with locus_tag or protein_id names)
        -----------------------------------------------------
    $ fai -i Known_GeneCluster.gbk -tg prepTG_Database/ -o fai_Results/

    Type 2: Reference Genome with Gene-Cluster/Locus Coordinates
    (proteins are collapsed for high-similarity using cd-hit)
    -----------------------------------------------------
    $ fai -r Reference.fasta -rc scaffold01 -rs 40201 -re 45043 \
          -tg prepTG_Database/ -o fai_Results/

    Type 3: Set of Query Proteins (not compatible with the syntenic similarity cutoff)
    (proteins are collapsed for high-similarity using cd-hit)
    Similar to input for cblaster
    -----------------------------------------------------
    $ fai -pq Gene-Cluster_Query_Proteins.faa -tg prepTG_Database/ -o fai_Results/

    Type 4: Single Query (provide the amino acid sequence directly)
    Similar to CORASON
    ----------------------------------------------------
    $ fai -sq Single_Query_Protein.fa -tg prepTG_Database/ -o fai_Results/

    The final set of homologous gene cluster instances within target genomes which meet
    the specified criteria can be found in the subdirectory named:
        Final_Results/Homologous_Gene_Cluster_GenBanks/

Options:
  -h, --help            show this help message and exit
  -tg, --target-genomes-db TARGET_GENOMES_DB
                        Result directory from running prepTG for target genomes
                        of interest.
  -i, --query-inputs QUERY_INPUTS [QUERY_INPUTS ...]
                        Paths to locus-specific GenBank(s)  to use as
                        queries for searching for homologous/orthologous instances
                        in target genomes. Files must end with ".gbk", ".gbff",
                        or ".genbank".
  -r, --reference-genome REFERENCE_GENOME
                        Path to reference genome in FASTA or GenBank format on
                        which to define query gene cluster coordinates.
  -rc, --reference-contig REFERENCE_CONTIG
                        Scaffold name (up to first space) in refernece genome
                        which features the query gene cluster.
  -rs, --reference-start REFERENCE_START
                        Start position of query gene cluster on scaffold in
                        reference genome.
  -re, --reference-end REFERENCE_END
                        End position of query gene cluster on scaffold in
                        reference genome.
  -pq, --protein-queries PROTEIN_QUERIES
                        Path to protein multi-FASTA file to use as query.
  -sq, --single-query SINGLE_QUERY
                        Path to protein FASTA file containing a single protein
                        to use as a query.
  -o, --output-dir OUTPUT_DIR
                        Output directory.
  -dm, --draft-mode     Run in draft-mode to also report segments on scaffold
                        edges which in aggregate (with other such segments) they
                        meet criteria for reporting.
  -fp, --filter-paralogs
                        Filter paralogous instances of gene-cluster identified
                        in a single target genome.
  -rgv, --use-prodigal-gv-for-reference
                        If the --reference-genome option is used to define gene
                        cluster coordinates, use prodigal-gv instead of pyrodigal
                        to call genes in the region.
  -e, --evalue-cutoff EVALUE_CUTOFF
                        E-value cutoff for DIAMOND blastp to consider a gene in a
                        target genome a hit to a query protein. [Default
                        is 1e-10].
  -m, --min-prop MIN_PROP
                        The minimum proportion of distinct proteins/ortholog groups
                        needed to report discrete segments of the gene-cluster.
                        Note, that a minimum of 3 distinct query proteins/homolog
                        groups are needed per segment reported [Default is 0.5].
  -kpq, --key-protein-queries KEY_PROTEIN_QUERIES
                        Path to protein multi-FASTA file containing key query
                        sequences which some proportin of are required to be
                        present in a gene cluster at a specific E-value cutoff.
  -kpe, --key-protein-evalue-cutoff KEY_PROTEIN_EVALUE_CUTOFF
                        E-value cutoff for finding key query sequences in putative
                        gene cluster homolog segments. [Default
                        is 1e-20]. Disregarded if less strict than the
                        general --evalue-cutoff.
  -kpm, --key-protein-min-prop KEY_PROTEIN_MIN_PROP
                        The minimum proportion of distinct ortholog groups matching
                        key proteins needed to report a homologous gene-cluster.
                        [Default is 0.0].
  -sct, --syntenic-correlation-threshold SYNTENIC_CORRELATION_THRESHOLD
                        The minimum syntenic correlation needed to at least
                        one known GCF instance to report segment. [Default
                        is 0.0; no syntenic assessment performed]
  -gdm, --gc-delineation-mode GC_DELINEATION_MODE
                        Method/mode for delineation of gene-cluster boundaries.
                        Options are "Gene-Clumper" or "HMM" [Default is
                        Gene-Clumper].
  -f, --flanking-context FLANKING_CONTEXT
                        Number of bases to append to the beginning/end of the
                        gene-cluster segments identified [Default is 1000].
  -mgd, --max-genes-disconnect MAX_GENES_DISCONNECT
                        Maximum number of genes between gene cluster segments
                        detected by HMM to merge segments together. Alternatively
                        the number of genes separating hits if Gene-Clumper mode
                        is used. Allows for more inclusivity of novel auxiliary
                        genes. [Default is 5].
  -gt, --gc-transition GC_TRANSITION
                        Probability for gene-cluster to gene cluster transition
                        in HMM. Should be between 0.0 and 1.0. [Default
                        is 0.9].
  -bt, --bg-transition BG_TRANSITION
                        Probability for background to background transition
                        in HMM. Should be between
                        0.0 and 1.0. [Default
                        is 0.9].
  -ge, --gc-emission GC_EMISSION
                        Emission probability of gene being in gene cluster
                        state assuming a orthologis found at the e-value cutoff.
                        [Default is 0.95].
  -be, --bg-emission BG_EMISSION
                        Emission probability of gene being in gene cluster
                        state assuming no homolog is
                        found at the e-value cutoff.
                        [Default is 0.2].
  -gp, --generate-plots
                        Generate red barplots for assessing gene cluster
                        segments identified.
  -ds, --diamond-sensitivity DIAMOND_SENSITIVITY
                        DIAMOND alignment sensitivity. Options include: fast,
                        mid-sensitive, sensitive, more-sensitive, very-sensitive,
                        and ultra-sensitive [Default is very-sensitive].
  -st, --species-tree SPECIES_TREE
                        Phylogeny in Newick format - with names matching
                        target genomes db [Optional]. Will be used for
                        creating an extra visual.
  -phl, --phyloheatmap-length PHYLOHEATMAP_LENGTH
                        Specify the height/length of the phylo-heatmap
                        plot [Default is 7].
  -phw, --phyloheatmap-width PHYLOHEATMAP_WIDTH
                        Specify the width of the phylo-heatmap plot
                        [Default is 10].
  -c, --threads THREADS
                        The number of threads to use [Default is 1].
  -cl, --clean-up       Clean up disk-heavy files/folders.
  -mm, --max-memory MAX_MEMORY
                        Uses resource module to set soft memory limit. Provide
                        in Giga-bytes. Configured in the shell environment
                        [Default is None; experimental].
  -v, --version         Get version and exit.

Detailed Usage on Major Options:

Option (Short) Option (Long) Description
-i --query-inputs Paths to locus-specific GenBank(s) to use for queries for searching for homologous/orthologous instances in the target genome. Note this is optional, users might prefer to specify the query gene cluster in some other form (position along reference or protein FASTA file), see example commands for running fai in the help function or the basic usage examples. Files must end with ".gbk", ".gbff", or ".genbank".
-r --reference-genome Path to a reference genome in FASTA or GenBank format to use for specifying the query gene cluster. This is optional and there are other options/ways for specifying the gene cluster, see example commands for running fai in the help function or the basic usage examples.
-rc --reference-contig The scaffold name (up to the first space character) which features the query gene cluster of interest.
-rs --reference-start The start position of gene-cluster on the scaffold in the reference genome.
-re --reference-end The end position of gene-cluster on scaffold in the reference genome.
-pq --protein-queries Path to protein multi-FASTA file containing to use as queries.
-sq --single-query Path to protein FASTA file containing a single protein to use as a query. This option is inspired by CORASON and is intended to allow users to apply fai to pull out genomic contexts of the query protein of interest.
-tg --target-genomes-db Result directory from running prepTG for target genomes of interest.
-o --output-dir Parent output/workspace directory. Is required.
-st --species-tree Phylogeny in Newick format - with names matching target genomes. Will show detected gene clusters along the tree - not a great visual - but can be used to quickly check if there is some sort of phylogenetic pattern.
-dm --draft-mode Run in draft-mode to also report segments on scaffold edges which in aggregate (with other such segments) meet criteria for reporting.
-fp --filter-paralogs Request for paralogous or secondary instances of gene-cluster identified in a single target genome to be filtered. This results in retention of only the best matching gene cluster instance to the query per target genome.
-rgv --use-prodigal-gv-for-reference If the --reference_genome option is used to define some coordinate along a reference genome as the query, and the query is a virus, use prodigal-gv instead of pyrodigal to call genes in the region.
-e --evalue-cutoff E-value cutoff for DIAMOND blastp to consider a gene in a target genome a hit to a query protein. [Default is 1e-10].
-m --min-prop The minimum proportion of distinct proteins/ortholog groups needed to report discrete segments of the gene-cluster. Note, that a minimum of 3 distinct query proteins/homolog groups are needed per segment reported [Default is 0.5]. If a syntenic threshold (off by default in newer versions of zol) is specified, a minimum of 3 single-copy distinct query proteins/homolog groups are needed.
-kpq -key-protein-queries Path to protein multi-FASTA file containing key query sequences which some proportion of are required to be present in a gene cluster at a specific E-value cutoff.
-kpe --key-protein-evalue-cutoff E-value cutoff for finding key query sequences in putative gene cluster homolog segments [Default is 1e-20]. The value is disregarded if less strict than the general --evalue cutoff.
-kpm --key-protein-min-prop The minimum proportion of distinct ortholog groups matching key proteins needed to report a homologous gene-cluster [Default is 0.0].
-sct --syntenic-correlation-threshold The minimum syntenic correlation coefficient needed by a target gene cluster instance to the query gene cluster for reporting [Default is 0.0]. Possible values range from 0.0 to 1.0, where 1.0 indicates a perfect order match between the query and target gene cluster instances and a value of 0 turns off syntenic similarity assessment entirely. A correlation is computed based on gene coordinates using single-copy shared orthologs only. The absolute value of the correlation is used to allow for flipped instances of the query gene cluster in the target genome.
-gdm --gc-delineation-mode Method/mode for delineation of gene-cluster boundaries. Options are "Gene-Clumper" or "HMM". Default is Gene-Clumper. See section above for more details on the two methods.
-f --flanking-context Number of bases to append to the beginning/end of the gene-cluster segments identified. [Default is 1000]. This is particularly of interest if you are using the single-query option or planning on using zol downstream, where such boundary genes will also be considered.
-mgd --max-genes-disconnect The number of genes separating hits if Gene-Clumper mode is used. Allows for more inclusivity of novel auxiliary genes. Alternatively, the maximum number of genes between gene-cluster segments detected by HMM to merge segments together. [Default is 5].
-gt --gc-transition The probability for gene-cluster to gene-cluster transition in HMM. Should be between 0.0 and 1.0. [Default is 0.9].
-bt --bg-transition Probability for background to background transition in HMM. Should be between 0.0 and 1.0. [Default is 0.9].
-ge --gc-emission Emission probability of gene being in gene-cluster state assuming a ortholog is found at the e-value cutoff. [Default is 0.95].
-be --bg-emission Emission probability of gene being in gene-cluster state assuming no homolog is found at the e-value cutoff. [Default is 0.2].
-gp --generate-plots Generate plots for assessing gene-cluster segments identified. Note, this can add substantial runtime and other plots are produced regardless of this option being specified or not. This option specifically produces the red barplot described above on this page.
-ds --diamond-sensitivity DIAMOND alignment sensitivity. Options include: fast, mid-sensitive, sensitive, more-sensitive, very-sensitive, and ultra-sensitive [Default is very-sensitive]. If working with a single species mid-sensitive or sensitive might be more appropriate and quicker.
-cl --clean-up Clean up disk-heavy files/folders.