strainFlye is a pipeline for calling, analyzing, and phasing rare single-nucleotide mutations in metagenome-assembled genomes produced from long and accurate reads—for example, PacBio HiFi reads.
The main inputs to strainFlye are 1) reads and 2) contigs. However, most steps in the pipeline can be "jumped to" if you already have other files prepared (e.g. an alignment of reads to contigs, or existing single-nucleotide mutation calls within contigs). See below for a flow chart of strainFlye's pipeline (created with Mermaid); if you have trouble viewing this flow chart, here are some other ways to view it.
flowchart LR
classDef art fill:#acfeb1,stroke:#086408,color:black
classDef pgm fill:#88bbff,stroke:#3333aa,color:black
classDef ext fill:#ffbb88,stroke:#755237,color:black
0(("Contigs<br/>(FASTA)")):::art --> A:::pgm;
1(("Reads<br/>((gzipped) FASTA<br/>or FASTQ)")):::art ---> A
2(("Assembly graph<br/>(GFA 1)")):::art -.->|"(Optional)"| A
A[strainFlye align] --> 3(("Alignment of<br/>reads to contigs<br/>(Indexed BAM)")):::art
3 --> C["strainFlye call p-mutation<br/>(Call mutations<br/>using frequencies)"]:::pgm
0 --> C
3 --> D["strainFlye call r-mutation<br/>(Call mutations<br/>using read counts)"]:::pgm
0 --> D
C --> 4(("Called mutations<br/>(Indexed BCF)")):::art
C --> 5(("Diversity indices<br/>(TSV)")):::art
D --> 4
D --> 5
0 --> E[strainFlye fdr estimate]:::pgm
3 --> E
4 --> E
5 -..->|"(Optional)"| E
E --> 6(("FDR estimates<br/>(TSV)")):::art
E --> 7(("Number of<br/>mutations / Mb<br/>(TSV)")):::art
7 --> Z{"Plotting FDR curves<br/>(see tutorial)"}:::ext
6 --> Z
6 --> F[strainFlye fdr fix]:::pgm
4 --> F --> 11(("Mutations with<br/>fixed FDR<br/>(Indexed BCF)")):::art
11 --> G[strainFlye spot hot-features]:::pgm
8(("Contig #quot;features#quot;<br/>[probably genes]<br/>(GFF3)")):::art --> G
G --> 9(("Hotspot features<br/>(TSV)")):::art
11 --> H[strainFlye spot cold-gaps]:::pgm --> 10(("Coldspot gaps<br/>(TSV)")):::art
4 -.->|If desired, you can skip FDR estimation and fixing and use the unfiltered BCF file for downstream analyses| 11
11 --> I[strainFlye smooth create]:::pgm --> 12(("Smoothed and<br/>virtual reads<br/>(gzipped FASTA)")):::art
0 --> I
3 --> I
5 -.->|"(Optional)"| I
12 --> J[strainFlye smooth assemble]:::pgm --> 13(("LJA assemblies")):::art
11 --> K
0 --> K[strainFlye link nt]:::pgm --> 14(("Nucleotide<br/>(co-)occurrence<br/>information<br/>(Pickle)")):::art
3 --> K
14 --> L[strainFlye link graph]:::pgm --> 15(("Link graphs<br/>(DOT or Pickle)")):::art
0 --> M[strainFlye matrix count]:::pgm
3 --> M
8 ---> M
M --> 16(("3-mer count<br/>information<br/>(Pickle)")):::art
16 --> N[strainFlye matrix fill]:::pgm --> 17(("Mutation matrices<br/>(TSV or JSON)")):::art
0 --> O[strainFlye dynam covskew]:::pgm --> 18(("Coverage and<br/>skew statistics<br/>(TSV)")):::art
3 --> O
2 --> P[strainFlye utils gfa-to-fasta]:::pgm --> 0
Long story short, strainFlye is an ordinary Python package (currently supporting Python 3.6 and 3.7). strainFlye depends on a few non-Python tools (e.g. minimap2, Prodigal, SAMtools), also.
The simplest way to install strainFlye is by using conda (or mamba). strainFlye is available through the bioconda channel:
conda install -c bioconda strainflye
-
If you run into dependency conflicts when installing strainFlye into an existing conda environment, you may want to just create a new conda environment and install strainFlye into that. (See the "workaround solution" given in this issue for an example.)
-
If that still doesn't work, you can try installing from source -- see below.
-
And if that doesn't work, no worries -- installing this stuff can be finicky. Feel free to open an issue and I'll try to help out.
Please see the development documentation (CONTRIBUTING.md
) for instructions.
strainFlye's smooth
module includes two commands. The first,
strainFlye smooth create
, creates smoothed and virtual reads for each contig;
the second, strainFlye smooth assemble
, assembles these reads using
LJA. LJA is not installed using the
conda installation instructions above, so—in order to run the
strainFlye smooth assemble
command—you will need to
install the LJA software (in particular, the
simple_ec
branch of
LJA).
Please see LJA's manual for the most up-to-date installation instructions. Assuming that you have all of LJA's requirements installed, something like the following should work:
git clone https://github.com/AntonBankevich/LJA.git
cd LJA
git checkout simple_ec
cmake .
make
... but this is subject to change as LJA is updated.
This Jupyter Notebook provides a tutorial demonstrating how to use strainFlye.
First, make sure you've activated the conda environment into which you installed strainFlye. From there, you can use strainFlye like any other command-line tool:
Usage: strainFlye [OPTIONS] COMMAND [ARGS]...
Pipeline for the analysis of rare mutations in metagenomes.
Please consult https://github.com/fedarko/strainFlye if you have any
questions, comments, etc. about strainFlye. Thank you for using this tool!
Options:
-v, --version Show the version and exit.
-h, --help Show this message and exit.
Commands:
align Align reads to contigs, and filter the resulting alignment.
call [+] Call mutations in contigs naïvely & compute diversity indices.
fdr [+] Estimate and fix FDRs for contigs' naïve mutation calls.
spot [+] Identify putative mutational hotspots or coldspots.
smooth [+] Create and assemble smoothed and virtual reads.
link [+] Create link graphs showing co-occurring alleles.
matrix [+] Create codon and amino acid mutation matrices.
dynam [+] Compute simple information about growth dynamics.
utils [+] Miscellaneous utility commands provided with strainFlye.
Click on the headers below (e.g. "Alignment") to show details about the corresponding command(s).
Input files / folders are listed in parentheses next to each command; optional
inputs are given in [square brackets]. Parameters are omitted for the sake of brevity.
For more details, try running these commands from the terminal (e.g. strainFlye align --help
).
Alignment
-
strainFlye align
(contigs, reads, [GFA]) → BAM- Aligns reads to contigs using minimap2.
- Filters this alignment to remove certain types of problematic alignments.
Naïve mutation identification, FDR estimation, and FDR fixing
-
strainFlye call p-mutation
andstrainFlye call r-mutation
(contigs, BAM) → (BCF, diversity indices)- Naïvely calls either p-mutations (frequency-based) or r-mutations (read-count-based).
- Computes diversity indices.
-
strainFlye fdr estimate
(contigs, BAM, BCF, [diversity indices]) → (FDR estimates, # mutations / Mb)- Estimates the FDRs of naïve mutation calls from
strainFlye call
using the target-decoy approach. - Computes FDR estimates, and the number of naïvely called mutations per megabase, for each "target" (non-decoy) contig. These values can be used to plot FDR curves showing how the FDR estimate for a target contig varies as p or r varies.
- Estimates the FDRs of naïve mutation calls from
-
strainFlye fdr fix
(BCF, FDR estimates) → BCF- Given a fixed FDR (e.g. f = 1%), returns a filtered BCF file with all target contigs' mutations filtered to use a threshold of p or r yielding an FDR estimate ≤ f.
Hotspot and coldspot identification
-
strainFlye spot hot-features
(BCF, GFF3) → hotspot TSV- Identifies "hotspot" features from the GFF3 file, based on simple configurable thresholds (e.g. to be a "hotspot," a feature must have at least N mutations).
-
strainFlye spot cold-gaps
(BCF) → coldspot TSV- Identifies "coldspot" gaps between mutations, considering a gap to be a coldspot if it is at least a certain length.
- Assigns a p-value to the longest coldspot gap seen in each contig. (Note that the computation of this p-value makes a few simplifying assumptions; see the supplemental material of our paper for details.)
Phasing
-
strainFlye smooth create
(contigs, BAM, BCF, [diversity indices]) → smoothed and virtual reads- Creates smoothed and virtual reads for each contig. These correspond to the original reads, but modified to remove all differences from the assembled contigs (aside from the mutations in the BCF file).
-
strainFlye smooth assemble
(smoothed and virtual reads) → assemblies- Uses LJA to assemble smoothed and virtual reads.
- See notes above about installing the
simple_ec
branch of LJA, which we rely on here. - For information on LJA's outputs (assembled sequences, assembly graph, etc.) please see its manual.
-
strainFlye link nt
(contigs, BAM, BCF) → nucleotide (co-)occurrence information- Computes nucleotide (co-)occurrence information for each contig.
- This information is stored in pickle files.
-
strainFlye link graph
(nucleotide (co-)occurrence information) → link graphs- Converts the output of
strainFlye link nt
into link graphs (one per contig), representing which alleles of a contig tend to co-occur. - Depending on the
-f
/--output-format
parameter, link graphs are saved to either DOT files or to pickle files (each pickle file corresponds to a serialized NetworkX representation of a link graph).
- Converts the output of
Mutation matrices
-
strainFlye matrix count
(contigs, BAM, GFF3) → 3-mer count information- Counts 3-mers aligned to the codons in the coding sequences (
CDS
orSO:0000316
features in the GFF3 file) in each contig. - This information is stored in pickle files.
- Counts 3-mers aligned to the codons in the coding sequences (
-
strainFlye matrix fill
(3-mer count information) → mutation matrices- Converts the output of
strainFlye matrix count
into codon and amino acid mutation matrices. - Each contig is represented by its own directory, containing four files: a codon mutation matrix, an amino acid mutation matrix, codon reference counts (the number of times we saw a codon in all the coding sequences of a contig, ignoring mutations), and amino acid reference counts.
- Depending on the
-f
/--output-format
parameter, these four files are saved as either TSV or JSON files.
- Converts the output of
Growth dynamics
-
strainFlye dynam covskew
(contigs, BAM) → coverage and skew TSVs- Computes information about binned coverages and GC skews in each contig.
Miscellaneous utilities
-
strainFlye utils gfa-to-fasta
(GFA) → FASTA- Converts a GFA 1 file to a FASTA file (and performs some sanity checking on it).
If you're interested in making changes to strainFlye's code, please see
CONTRIBUTING.md
for some tips on getting started.
An article describing strainFlye is available in Genome Research at this link. If you use strainFlye in your work, please cite it! The BibTeX for this article is:
@article{fedarko2022,
author={Fedarko, Marcus W and Kolmogorov, Mikhail and Pevzner, Pavel A},
title={Analyzing rare mutations in metagenomes assembled using long and accurate reads},
journal={Genome Research},
publisher={Cold Spring Harbor Laboratory Press},
volume={32},
number={11-12},
pages={2119--2133},
year={2022},
doi={10.1101/gr.276917.122},
url={http://genome.cshlp.org/content/32/11-12/2119.abstract},
eprint={http://genome.cshlp.org/content/32/11-12/2119.full.pdf+html}
}
-
If you use
strainFlye align
, please also cite minimap2 (Li 2018), which is used by this strainFlye command to perform alignment. (Depending on the version of minimap2 you use, you may also want to cite Li 2021.) -
If you use
strainFlye smooth assemble
, please also cite LJA (Bankevich et al., 2022), which is used by this strainFlye command to perform assembly.
Some of strainFlye's commands use Python's
subprocess
module to run
non-Python software: minimap2, samtools, bcftools, Prodigal, LJA, etc.
Most of the time, we do this using subprocess.run()
with shell=False
:
long story short, this helps prevent the problem of
shell injection.
However, as of writing, there are two places where strainFlye uses
subprocess.run()
with shell=True
: in strainFlye align
(when running
minimap2 and samtools), and in strainFlye smooth assemble
(when running LJA).
This is for convenience's sake, since we allow the user to pass in extra
parameters to these commands (the --minimap2-params
option for strainFlye align
, and the --lja-params
option for strainFlye smooth assemble
).
Our use of shell=True
in these two cases means that it's possible to make
these commands do unexpected things (see
Python's documentation here for details).
This should not be a problem if you are running strainFlye directly. However,
if you decide to host strainFlye on a server somewhere (and you allow users to
upload files, specify parameters, etc.) then you should be careful about
preventing shell injection in these cases. Feel free to open an issue if you
have any questions about this.
sample1.gfa
(located in strainflye/tests/inputs/
)
was downloaded from the gfalint
repository. The other GFA files in this folder beginning with sample1
are
also based on this GFA file.
The strainFlye tutorial (linked above, and located in docs/
) uses the "SheepGut" dataset shown in the strainFlye paper. This dataset was previously described in Kolmogorov et al., 2020 and Bickhart & Kolmogorov et al., 2022.
If you have any questions, suggestions, comments, etc. about strainFlye, please feel free to open an issue in this repository!