small README update

moritzbuck · Jul 7, 2021 · 70710cc
1 parent 909afd4
commit 70710cc
Showing 1 changed file with 17 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -1,13 +1,15 @@
 # mOTUlizer
 
-Utility to analyse a group of closely related MAGs/Genomes/bins/SUBs of more or less dubious origin. Right now it is composed of three programs:
+Utility to analyse a group of closely related MAGs/Genomes/bins/SUBs of more or less dubious origin. Right now it is composed of a number of programs:
 
 * `mOTUlize.py` takes a set of genomes (I use the term genome as shorthand for a set of nucleotide sequences that presumably come from the same organism/population; it can be incomplete, redundant or contaminated) and clusters them into metagenomic Operational Taxonomic Units (mOTUs). Using similarity scores (by default ANI as computed by fastANI, but the user can provide other similarities), a network is built from (user-defined) better-quality genomes (for historical reasons called MAGs) by thresholding the similarities at a specific value (95% by default). The connected components of this graph are the mOTUs. Additionally, lower-quality genomes (SUBs) are recruited to the mOTU of whichever MAG they are most similar to, if the similarity is above the threshold.
 
 * `mOTUpan.py` computes the likelihood that a gene-encoded trait is expected in all of a set of genomes, e.g. that a trait is in the core genome of a set of genomes (of possibly varying quality). You provide `mOTUpan` with the proteomes of your genomes of interest (for example from the same mOTU or genus) and a completeness prior for these genomes (for example [`checkm`](https://ecogenomics.github.io/CheckM/) output or a fixed value), and it computes gene clusters using [`mmseqs2`](https://github.com/soedinglab/MMseqs2). You can also provide your own genome-encoded traits as either a `JSON` file or a `TAB`-separated file (see example files). For each of these gene clusters it then computes the likelihood of it being in the core versus the likelihood of it not being in the core; the ratio of these likelihoods determines whether a trait is considered core. This new partitioning can be used to update the completeness prior, and the computation is repeated iteratively until convergence.
 
 * `mOTUconvert.py` converts the output of diverse programs into input files for `mOTUpan.py`. It currently includes methods for [`mmseqs2`](https://github.com/soedinglab/MMseqs2), [`roary`](https://sanger-pathogens.github.io/Roary/), [`PPanGGOLiN`](https://github.com/labgem/PPanGGOLiN), [`eggNOGmapper`](https://github.com/eggnogdb/eggnog-mapper) and [`anvio`](https://merenlab.org/software/anvio/) pangenome databases.
 
+* **experimental** `anvi-run-motupan.py`: an anvi'o-compatible version of `mOTUpan.py`. It has fewer options right now, but runs directly on an anvi'o pangenome database.
+
 A number of example files can be found in the `example_files` folder; the `fasta` and `gff` files are the ones used for all the other files, and were generated by the always fantastic [`prokka`](example_files/fnas/). There is also some reading material in `mOTUlizer/doc` (a poster, a presentation and a very early paper draft, but at least it has the maths in it); the paper will eventually be available there!
 
 ## INSTALL
@@ -87,6 +89,20 @@ Check all flags in with `--help`, but here are some keys ones a bit more explain
 
 * `--max_iter`: maximum number of iterations for the iterative aspect of mOTUpan. You might want to set it to `1` if you have only a few traits, which would not be sufficient to estimate completeness.
 
+### anvi-run-motupan
+
+You need an anvi'o pangenome database and, ideally, the genome storage (used for completenesses). Then simply:
+
+```
+# if you just want a tsv:
+
+anvi-run-motupan.py -p MYPANGENOME-PAN.db -g MYGENOMES.db -o MY_OUTPUT.tsv
+
+# if you want to update the db, so it shows up in anvi-display-pan:
+
+anvi-run-motupan.py -p MYPANGENOME-PAN.db -g MYGENOMES.db --store-in-db
+
+```
 
 ### mOTUconvert