-
Notifications
You must be signed in to change notification settings - Fork 4
2. more info on fai
fai allows for identification of homologous instances of a query or set of query gene-cluster(s). It has many options, including two different approaches for delineation of gene-cluster boundaries, requesting filtering of paralogous instances of gene-clusters, and piecing together gene-clusters which are fragmented across scaffolds in genomic assemblies.
The final set of homologous gene clusters from target genomes in GenBank format can be found in the fai results subdirectory: /path/to/fai_Results/Homologous_GenBanks_Directory/
While paralogy is often thought off with regards to individual genes, full gene-clusters can also be paralogous within individual genomes. We thus allow filtering of paralogous gene-clusters if more than two distinct reference proteins/homolog groups are shared - suggesting paralogy beyond fragmentation that might have split a gene in two.
Similar to lsaBGC-Expansion.py, fai allows for detection of gene-clusters fragmented across multiple scaffolds by accounting for proximity to scaffold edges. If --draft_mode
is requested, thresholds needed for discovery of gene-clusters can simply be met in aggregate, putting together other homologous gene-clusters. Note, each individual gene-cluster segment must still contain three distinct query homolog groups.
Similar to lsaBGC-Expansion and cblaster, fai looks for homologs in target genomes by "BLASTp"ing proteins to predicted proteomes using DIAMOND. Afterwards, the programs differ in how they identify candidate/valid homologous gene-clusters.
cblaster uses a maximum distance parameter (default of 20kb) to determine whether genes should be grouped together in one gene-cluster segment. The cblaster suite also offers an intuitive approach to select the optimal value for this parameter. Similarly, fai offers "Gene-Clumper" mode (the current default) which simply groups genes together if they are within 5 genes of each other. fai also offers an "HMM" based approach which can be used to identify stringent blocks of gene-sets and then merged together into larger blocks through the same parameter, --max_genes_disconnect
, used by "Gene-Clumper" mode to group individual genes. Unlike the "HMM" based approach in lsaBGC-Expansion, the emission probability parameters are not automatically determined based on reflexive alignment of the query gene-cluster proteins to the background genome from which it was extracted to determine cutoffs to distinguish paralogous hits from orthologous hits. Similar to lsaBGC-Expansion though, HMM probability parameters are user-adjustable.
Visuals to Assess Quality of Detection of Homologous Gene-Clusters and Guide User Parameter Adjustment
fai will by default produce a "Tiny-AAI" plot which depicts the average-amino acid identity of genes from homologous gene clusters in target genomes to the query gene cluster (x-axis) and the proportion of query genes/ortholog groups found.
fai can also produce a multi-page PDF with plots such as the following for showing the quality of homologous gene-clusters detected. This report can be requested via the -gp
or --generate_plots
argument.
These plots showcase syntenic order and similarity to reference genes (height is the CDS to reference protein ratio - ideal match should be at 1, indicating they are the same length) and colored on a scale from 0 (grey) to 100 (red) corresponding to percent identity. Black borders indicate key query proteins - if provided by the user.
Column | Description | Notes |
---|---|---|
sample | The identifier of the target genome. | |
gene-cluster-id | The identifier of a discrete neighborhood of genes identified as homologous to the query gene cluster. | Only in the "Gene Cluster Instance - Report" tab |
aggregate-bitscore | The aggregate bitscore of hits to the query gene cluster genes. | Only the best hit for each query gene/ortholog-group is retained (based on bitscore). |
aai-to-query | The average amino-acid identity of the proteins in the target genome to the query gene cluster genes. | Only the best hit for each query gene/ortholog-group is retained (based on bitscore). |
mean-sequence-to-query-ratio | The average sequence-to-query ratio of the proteins. | Only the best hit for each query gene/ortholog-group is retained (based on bitscore). |
proportion-query-genes-found | The proportion of query genes/ortholog-groups found in homologous gene clusters across the target genome ("Genome Wide - Report") or in a specific discrete neighborhood ("Gene Cluster Instance - Report") | |
avg-syntenic-correlation | Pearson product-moment correlation coefficient for global syntenic similarity of a specific discrete neighborhood to the query gene cluster or the average of these values across all discrete neighborhoods which meet user defined filters. | |
number-background-genes | The number of background genes in the delineated region within query gene hits which are not represented in the query. | |
number-gene-clusters | The number of discrete gene neighborhoods which individually meet reporting criteria. | Only in the "Genome Wide - Report" tab |
copy-counts | A string separated by commas listing the copy-count of individual query genes/ortholog-groups. |
Usage: fai [-h] -tg TARGET_GENOMES_DB [-i QUERY_INPUTS [QUERY_INPUTS ...]] [-r REFERENCE_GENOME] [-rc REFERENCE_CONTIG] [-rs REFERENCE_START] [-re REFERENCE_END]
[-pq PROTEIN_QUERIES] [-sq SINGLE_QUERY] -o OUTPUT_DIR [-dm] [-fp] [-rgv] [-e EVALUE_CUTOFF] [-m MIN_PROP] [-kpq KEY_PROTEIN_QUERIES]
[-kpe KEY_PROTEIN_EVALUE_CUTOFF] [-kpm KEY_PROTEIN_MIN_PROP] [-sct SYNTENIC_CORRELATION_THRESHOLD] [-gdm GC_DELINEATION_MODE] [-f FLANKING_CONTEXT]
[-mgd MAX_GENES_DISCONNECT] [-gt GC_TRANSITION] [-bt BG_TRANSITION] [-ge GC_EMISSION] [-be BG_EMISSION] [-gp] [-ds DIAMOND_SENSITIVITY]
[-st SPECIES_TREE] [-phl PHYLOHEATMAP_LENGTH] [-phw PHYLOHEATMAP_WIDTH] [-c THREADS] [-cl] [-mm MAX_MEMORY] [-v]
Program: fai
Author: Rauf Salamzade
Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology
.o88o. o8o
888 `" `"'
o888oo .oooo. oooo
888 `P )88b `888
888 .oP"888 888
888 d8( 888 888
o888o `Y888""8o o888o
MODES OF INPUT:
*********************************************************************************
Type 1: Directory of Homologous Gene-Cluster GenBanks
(GenBanks must have CDS features with locus_tag or protein_id names)
-----------------------------------------------------
$ fai -i Known_GeneCluster.gbk -tg prepTG_Database/ -o fai_Results/
Type 2: Reference Genome with Gene-Cluster/Locus Coordinates
(proteins are collapsed for high-similarity using cd-hit)
-----------------------------------------------------
$ fai -r Reference.fasta -rc scaffold01 -rs 40201 -re 45043 \
-tg prepTG_Database/ -o fai_Results/
Type 3: Set of Query Proteins (not compatible with the syntenic similarity cutoff)
(proteins are collapsed for high-similarity using cd-hit)
Similar to input for cblaster
-----------------------------------------------------
$ fai -pq Gene-Cluster_Query_Proteins.faa -tg prepTG_Database/ -o fai_Results/
Type 4: Single Query (provide the amino acid sequence directly)
Similar to CORASON
----------------------------------------------------
$ fai -sq Single_Query_Protein.fa -tg prepTG_Database/ -o fai_Results/
The final set of homologous gene cluster instances within target genomes which meet
the specified criteria can be found in the subdirectory named:
Final_Results/Homologous_Gene_Cluster_GenBanks/
Options:
-h, --help show this help message and exit
-tg, --target-genomes-db TARGET_GENOMES_DB
Result directory from running prepTG for target genomes
of interest.
-i, --query-inputs QUERY_INPUTS [QUERY_INPUTS ...]
Paths to locus-specific GenBank(s) to use as
queries for searching for homologous/orthologous instances
in target genomes. Files must end with ".gbk", ".gbff",
or ".genbank".
-r, --reference-genome REFERENCE_GENOME
Path to reference genome in FASTA or GenBank format on
which to define query gene cluster coordinates.
-rc, --reference-contig REFERENCE_CONTIG
Scaffold name (up to first space) in refernece genome
which features the query gene cluster.
-rs, --reference-start REFERENCE_START
Start position of query gene cluster on scaffold in
reference genome.
-re, --reference-end REFERENCE_END
End position of query gene cluster on scaffold in
reference genome.
-pq, --protein-queries PROTEIN_QUERIES
Path to protein multi-FASTA file to use as query.
-sq, --single-query SINGLE_QUERY
Path to protein FASTA file containing a single protein
to use as a query.
-o, --output-dir OUTPUT_DIR
Output directory.
-dm, --draft-mode Run in draft-mode to also report segments on scaffold
edges which in aggregate (with other such segments) they
meet criteria for reporting.
-fp, --filter-paralogs
Filter paralogous instances of gene-cluster identified
in a single target genome.
-rgv, --use-prodigal-gv-for-reference
If the --reference-genome option is used to define gene
cluster coordinates, use prodigal-gv instead of pyrodigal
to call genes in the region.
-e, --evalue-cutoff EVALUE_CUTOFF
E-value cutoff for DIAMOND blastp to consider a gene in a
target genome a hit to a query protein. [Default
is 1e-10].
-m, --min-prop MIN_PROP
The minimum proportion of distinct proteins/ortholog groups
needed to report discrete segments of the gene-cluster.
Note, that a minimum of 3 distinct query proteins/homolog
groups are needed per segment reported [Default is 0.5].
-kpq, --key-protein-queries KEY_PROTEIN_QUERIES
Path to protein multi-FASTA file containing key query
sequences which some proportin of are required to be
present in a gene cluster at a specific E-value cutoff.
-kpe, --key-protein-evalue-cutoff KEY_PROTEIN_EVALUE_CUTOFF
E-value cutoff for finding key query sequences in putative
gene cluster homolog segments. [Default
is 1e-20]. Disregarded if less strict than the
general --evalue-cutoff.
-kpm, --key-protein-min-prop KEY_PROTEIN_MIN_PROP
The minimum proportion of distinct ortholog groups matching
key proteins needed to report a homologous gene-cluster.
[Default is 0.0].
-sct, --syntenic-correlation-threshold SYNTENIC_CORRELATION_THRESHOLD
The minimum syntenic correlation needed to at least
one known GCF instance to report segment. [Default
is 0.0; no syntenic assessment performed]
-gdm, --gc-delineation-mode GC_DELINEATION_MODE
Method/mode for delineation of gene-cluster boundaries.
Options are "Gene-Clumper" or "HMM" [Default is
Gene-Clumper].
-f, --flanking-context FLANKING_CONTEXT
Number of bases to append to the beginning/end of the
gene-cluster segments identified [Default is 1000].
-mgd, --max-genes-disconnect MAX_GENES_DISCONNECT
Maximum number of genes between gene cluster segments
detected by HMM to merge segments together. Alternatively
the number of genes separating hits if Gene-Clumper mode
is used. Allows for more inclusivity of novel auxiliary
genes. [Default is 5].
-gt, --gc-transition GC_TRANSITION
Probability for gene-cluster to gene cluster transition
in HMM. Should be between 0.0 and 1.0. [Default
is 0.9].
-bt, --bg-transition BG_TRANSITION
Probability for background to background transition
in HMM. Should be between
0.0 and 1.0. [Default
is 0.9].
-ge, --gc-emission GC_EMISSION
Emission probability of gene being in gene cluster
state assuming a orthologis found at the e-value cutoff.
[Default is 0.95].
-be, --bg-emission BG_EMISSION
Emission probability of gene being in gene cluster
state assuming no homolog is
found at the e-value cutoff.
[Default is 0.2].
-gp, --generate-plots
Generate red barplots for assessing gene cluster
segments identified.
-ds, --diamond-sensitivity DIAMOND_SENSITIVITY
DIAMOND alignment sensitivity. Options include: fast,
mid-sensitive, sensitive, more-sensitive, very-sensitive,
and ultra-sensitive [Default is very-sensitive].
-st, --species-tree SPECIES_TREE
Phylogeny in Newick format - with names matching
target genomes db [Optional]. Will be used for
creating an extra visual.
-phl, --phyloheatmap-length PHYLOHEATMAP_LENGTH
Specify the height/length of the phylo-heatmap
plot [Default is 7].
-phw, --phyloheatmap-width PHYLOHEATMAP_WIDTH
Specify the width of the phylo-heatmap plot
[Default is 10].
-c, --threads THREADS
The number of threads to use [Default is 1].
-cl, --clean-up Clean up disk-heavy files/folders.
-mm, --max-memory MAX_MEMORY
Uses resource module to set soft memory limit. Provide
in Giga-bytes. Configured in the shell environment
[Default is None; experimental].
-v, --version Get version and exit.
Option (Short) | Option (Long) | Description |
---|---|---|
-i |
--query-inputs |
Paths to locus-specific GenBank(s) to use for queries for searching for homologous/orthologous instances in the target genome. Note this is optional, users might prefer to specify the query gene cluster in some other form (position along reference or protein FASTA file), see example commands for running fai in the help function or the basic usage examples. Files must end with ".gbk", ".gbff", or ".genbank". |
-r |
--reference-genome |
Path to a reference genome in FASTA or GenBank format to use for specifying the query gene cluster. This is optional and there are other options/ways for specifying the gene cluster, see example commands for running fai in the help function or the basic usage examples. |
-rc |
--reference-contig |
The scaffold name (up to the first space character) which features the query gene cluster of interest. |
-rs |
--reference-start |
The start position of gene-cluster on the scaffold in the reference genome. |
-re |
--reference-end |
The end position of gene-cluster on scaffold in the reference genome. |
-pq |
--protein-queries |
Path to protein multi-FASTA file containing to use as queries. |
-sq |
--single-query |
Path to protein FASTA file containing a single protein to use as a query. This option is inspired by CORASON and is intended to allow users to apply fai to pull out genomic contexts of the query protein of interest. |
-tg |
--target-genomes-db |
Result directory from running prepTG for target genomes of interest. |
-o |
--output-dir |
Parent output/workspace directory. Is required. |
-st |
--species-tree |
Phylogeny in Newick format - with names matching target genomes. Will show detected gene clusters along the tree - not a great visual - but can be used to quickly check if there is some sort of phylogenetic pattern. |
-dm |
--draft-mode |
Run in draft-mode to also report segments on scaffold edges which in aggregate (with other such segments) meet criteria for reporting. |
-fp |
--filter-paralogs |
Request for paralogous or secondary instances of gene-cluster identified in a single target genome to be filtered. This results in retention of only the best matching gene cluster instance to the query per target genome. |
-rgv |
--use-prodigal-gv-for-reference |
If the --reference_genome option is used to define some coordinate along a reference genome as the query, and the query is a virus, use prodigal-gv instead of pyrodigal to call genes in the region. |
-e |
--evalue-cutoff |
E-value cutoff for DIAMOND blastp to consider a gene in a target genome a hit to a query protein. [Default is 1e-10]. |
-m |
--min-prop |
The minimum proportion of distinct proteins/ortholog groups needed to report discrete segments of the gene-cluster. Note, that a minimum of 3 distinct query proteins/homolog groups are needed per segment reported [Default is 0.5]. If a syntenic threshold (off by default in newer versions of zol) is specified, a minimum of 3 single-copy distinct query proteins/homolog groups are needed. |
-kpq |
-key-protein-queries |
Path to protein multi-FASTA file containing key query sequences which some proportion of are required to be present in a gene cluster at a specific E-value cutoff. |
-kpe |
--key-protein-evalue-cutoff |
E-value cutoff for finding key query sequences in putative gene cluster homolog segments [Default is 1e-20]. The value is disregarded if less strict than the general --evalue cutoff. |
-kpm |
--key-protein-min-prop |
The minimum proportion of distinct ortholog groups matching key proteins needed to report a homologous gene-cluster [Default is 0.0]. |
-sct |
--syntenic-correlation-threshold |
The minimum syntenic correlation coefficient needed by a target gene cluster instance to the query gene cluster for reporting [Default is 0.0]. Possible values range from 0.0 to 1.0, where 1.0 indicates a perfect order match between the query and target gene cluster instances and a value of 0 turns off syntenic similarity assessment entirely. A correlation is computed based on gene coordinates using single-copy shared orthologs only. The absolute value of the correlation is used to allow for flipped instances of the query gene cluster in the target genome. |
-gdm |
--gc-delineation-mode |
Method/mode for delineation of gene-cluster boundaries. Options are "Gene-Clumper" or "HMM". Default is Gene-Clumper. See section above for more details on the two methods. |
-f |
--flanking-context |
Number of bases to append to the beginning/end of the gene-cluster segments identified. [Default is 1000]. This is particularly of interest if you are using the single-query option or planning on using zol downstream, where such boundary genes will also be considered. |
-mgd |
--max-genes-disconnect |
The number of genes separating hits if Gene-Clumper mode is used. Allows for more inclusivity of novel auxiliary genes. Alternatively, the maximum number of genes between gene-cluster segments detected by HMM to merge segments together. [Default is 5]. |
-gt |
--gc-transition |
The probability for gene-cluster to gene-cluster transition in HMM. Should be between 0.0 and 1.0. [Default is 0.9]. |
-bt |
--bg-transition |
Probability for background to background transition in HMM. Should be between 0.0 and 1.0. [Default is 0.9]. |
-ge |
--gc-emission |
Emission probability of gene being in gene-cluster state assuming a ortholog is found at the e-value cutoff. [Default is 0.95]. |
-be |
--bg-emission |
Emission probability of gene being in gene-cluster state assuming no homolog is found at the e-value cutoff. [Default is 0.2]. |
-gp |
--generate-plots |
Generate plots for assessing gene-cluster segments identified. Note, this can add substantial runtime and other plots are produced regardless of this option being specified or not. This option specifically produces the red barplot described above on this page. |
-ds |
--diamond-sensitivity |
DIAMOND alignment sensitivity. Options include: fast, mid-sensitive, sensitive, more-sensitive, very-sensitive, and ultra-sensitive [Default is very-sensitive]. If working with a single species mid-sensitive or sensitive might be more appropriate and quicker. |
-cl |
--clean-up |
Clean up disk-heavy files/folders. |