RecurM is based on the idea of recurrent assembly, a novel method for plasmid and phage discovery in metagenomic sequence data. The foundation of this method is a high-throughput all-to-all comparison between assembled sequences from different metagenomic samples. Repeated assembly of the same sequence (contig) across multiple independent datasets provides evidence that it may represent an intact, complete underlying biological unit. The more prevalent that contig is, the higher the confidence in the integrity of the sequence.
A sequence is considered to be recurrently assembled if near-identical contigs are assembled from multiple metagenomes and meet length ratio (LR), alignment fraction (AF) and average nucleotide identity (ANI) cutoffs. By default these are 90%. Contigs that meet these alignment thresholds are grouped into discrete clusters, with each cluster expected to represent multiple instances of a discrete plasmid or phage being assembled across multiple samples. The higher the number of contigs within a cluster, the higher the confidence that the cluster represents a real, distinct mobile genetic element such as a phage or plasmid. The mapping of the alignments between contigs within a cluster also allows RecurM to identify putative circular, linear and ambiguous (‘imperfect’) sequence structure.
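The thresholding and grouping described above can be illustrated with a minimal Python sketch. This is not RecurM's actual code: the `Alignment` record, the exact formulas used for LR, AF and ANI, and the use of connected components (union-find) for clustering are all assumptions made for illustration. Only the 90% defaults and the minimum cluster size of 3 come from this README.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical record for one pairwise contig alignment."""
    query: str
    target: str
    query_len: int
    target_len: int
    aligned_bases: int  # bases of the shorter contig covered by the alignment
    matches: int        # identical bases within the aligned region


def passes_cutoffs(a, min_lr=0.90, min_af=0.90, min_ani=0.90):
    """Assumed definitions: LR = shorter/longer length,
    AF = aligned bases / shorter length, ANI = matches / aligned bases."""
    shorter = min(a.query_len, a.target_len)
    longer = max(a.query_len, a.target_len)
    lr = shorter / longer
    af = a.aligned_bases / shorter
    ani = a.matches / a.aligned_bases if a.aligned_bases else 0.0
    return lr >= min_lr and af >= min_af and ani >= min_ani


def cluster(alignments, min_cluster_size=3):
    """Group contigs linked by passing alignments (connected components)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a in alignments:
        if passes_cutoffs(a):
            parent[find(a.query)] = find(a.target)

    groups = {}
    for contig in list(parent):
        groups.setdefault(find(contig), set()).add(contig)
    return [g for g in groups.values() if len(g) >= min_cluster_size]


# Made-up example data: three near-identical contigs from three samples,
# one pair with a poor length ratio, and one pair that is too small to keep.
example = [
    Alignment("s1_c1", "s2_c1", 5000, 5100, 4950, 4900),
    Alignment("s2_c1", "s3_c1", 5100, 5050, 5000, 4980),
    Alignment("s1_c1", "s4_c9", 5000, 9000, 4900, 4890),  # fails LR
    Alignment("s5_c2", "s6_c2", 3000, 3000, 2990, 2985),  # cluster of 2 only
]
clusters = cluster(example)
```

In this toy input only the three-contig group survives, because the pair from samples 5 and 6 falls below the minimum cluster size.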
RecurM’s algorithm is implemented in Python and can be broken down into three separate stages: alignment; parsing and clustering; and cluster classification. RecurM also has a number of options to improve accuracy by detecting and removing spurious clusters generated by the preceding three steps that may correspond to incompletely assembled elements and/or genomic fragments.
By default, RecurM reduces the quadratic runtime of an all-vs-all contig alignment by binning contigs into slightly overlapping size groups and only aligning within those groups (e.g. first aligning all 3kb-8kb contigs against each other, then all 7kb-12kb, etc.). RecurM will automatically detect the size distribution of your input assemblies and create size bins accordingly.
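The size-binning idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fixed bin width (5 kb) and overlap (1 kb) are taken from the 3kb-8kb / 7kb-12kb example above, whereas RecurM derives its bins from the input size distribution, and the function names are hypothetical.

```python
def overlapping_size_bins(lengths, width=5000, overlap=1000):
    """Return inclusive (low, high) length ranges covering all contig lengths.
    Consecutive ranges share `overlap` bases of length range."""
    if not lengths:
        return []
    step = width - overlap
    start = min(lengths)
    longest = max(lengths)
    bins = []
    while True:
        bins.append((start, start + width))
        if start + width >= longest:
            break
        start += step
    return bins


def assign_to_bins(contig_lengths, bins):
    """Map each bin to the contigs whose length falls inside it.
    A contig in an overlap region lands in both neighbouring bins."""
    assignment = {b: [] for b in bins}
    for name, length in contig_lengths.items():
        for lo, hi in bins:
            if lo <= length <= hi:
                assignment[(lo, hi)].append(name)
    return assignment


# Made-up contig lengths for illustration.
contigs = {"a": 3000, "b": 7500, "c": 12000, "d": 5000}
bins = overlapping_size_bins(list(contigs.values()))
binned = assign_to_bins(contigs, bins)
```

Because alignments are only attempted within a bin, contigs of very different lengths (which could not satisfy the length ratio cutoff anyway) are never compared, while the overlap keeps similar-length contigs near a bin boundary together in at least one bin.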
To run RecurM, you first need to set up a conda environment with all the prerequisites:
Retrieve the GitHub files:
git clone --recursive https://github.com/chklovski/recurm.git && cd recurm
Then create a conda environment using the recurm.yaml file:
conda env create -n recurm -f recurm.yaml
conda activate recurm
You can run recurm from the folder:
RecurM/bin/recurm -h
The basic workflow is to run the RecurM clustering algorithm on a set of metagenomic assemblies (or MAGs). To do this, you can run:
RecurM/bin/recurm cluster -t <threads> -i <folder with all assemblies in .fasta format> -o <output folder> --min-contig-len <minimum contig length to consider (default 2500)> --min-cluster-size <minimum number of contigs to form a cluster (default: 3)>
RecurM is able to detect both circular and linear clusters. However, some recurrently assembled elements may be integrative elements or genomic fragments (especially if short-read assemblies are used as input). It is therefore highly recommended to use the --collapse_against_assembly argument. This removes any resulting clusters that map against longer contigs in the input, which drastically cuts down on false positives with minimal cost to recall. The only case where this option is NOT recommended is if you are specifically interested in integrative circular elements and are using long-read assemblies.
If you have a huge input dataset (> 1,000 metagenomic assemblies), it is recommended to use the --fast option to cut down on alignment time. This should speed up RecurM's runtime substantially with relatively few downsides. This option changes the minimap2 settings to: -k16 -Xw5 -g1000 -r1000 --max-chain-skip 500 --max-chain-iter 1000
so that chaining exits early and avoids quadratic time complexity in the worst case. This may miss a small number of alignments, but that should only matter for ultra-rare occurrences (e.g. an occasional plasmid present in 3 out of 10,000 samples).
RecurM writes a 'clusters' folder, where each file is a FASTA sequence representative of a circular, linear or imperfect cluster. RecurM also writes a 'results' folder with cluster information (cluster_information.tsv), a full list of which contigs in which sample were found in which cluster (cluster_contigs_information.tsv), and a separate list of all unclustered contigs (leftover_contigs_information.tsv).
The cluster information file has the topology of the cluster ("Circular/Linear/Imperfect"), the number of contigs in the cluster ("Total_Contigs"), and counts of edges within the cluster for further information ("Circular_Count/Linear_Count/Imperfect_Count"). Note that the edge counts will sum to more than the number of contigs in the cluster, as they represent edges within the graph rather than contigs.
RecurM also uses fastANI to calculate if any clusters are related to each other, and puts that information in the 'Group' column, as well as the mean ANI between clusters in a given group ('Average_ANI' column).
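As a rough illustration of how the cluster information file might be post-processed, the Python sketch below keeps well-supported clusters and lists them by related group. The 'Total_Contigs' and 'Group' column names come from the description above; the 'Cluster_ID' column name, the `min_contigs` threshold and the sample rows are hypothetical, so check the header of your own cluster_information.tsv before adapting it.

```python
import csv
import io
from collections import defaultdict


def well_supported_by_group(tsv_handle, min_contigs=5):
    """Return {group: [cluster ids]} for clusters with enough contigs.
    'Cluster_ID' is an assumed column name, not confirmed by the README."""
    groups = defaultdict(list)
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        if int(row["Total_Contigs"]) >= min_contigs:
            groups[row["Group"]].append(row["Cluster_ID"])
    return dict(groups)


# Made-up rows standing in for results/cluster_information.tsv.
sample_tsv = io.StringIO(
    "Cluster_ID\tTotal_Contigs\tGroup\tAverage_ANI\n"
    "c1\t12\tg1\t96.5\n"
    "c2\t3\tg1\t96.5\n"
    "c3\t7\tg2\t99.1\n"
)
supported = well_supported_by_group(sample_tsv)
```

For a real run, replace the `io.StringIO` object with an opened file handle for the TSV in your output folder.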
Please be aware that RecurM is still in beta, so there may be bugs, and results may change significantly as the algorithm changes.
Use recurm -h for further information.