Arianna I. Krinos, Margaret Mars Brisbin, Sarah K. Hu, Natalie R. Cohen, Tatiana A. Rynearson, Michael J. Follows, Frederik Schulz, Harriet Alexander
Taxonomic annotation is a critical problem in environmental microbial meta-omics. In protists (single-celled microbial eukaryotes) in particular, complex genomes and incomplete databases threaten accurate interpretation. We conducted a careful analysis of protistan meta-omic datasets to quantify the extent of this problem. We also propose a two-stage approach that supports more accurate estimation of uncertainty in microbial meta-omics.
This work would not have been possible without many very useful software tools, including but not limited to
And a couple of our own tools
These workflows are deployed on the cluster for heavier-lift parts of this analysis. The outputs of these workflows are often used in the analysis notebooks.
01-scale-genus_eukulele
- run EUKulele against the Phaeocystis databases (stored on Zenodo) for Scale 1 of the paper as written on bioRxiv
01-scale-genus_functional
- run eggnog-mapper to functionally annotate Phaeocystis sequences from the Tara Oceans metagenomes
01-scale-genus_tree
- run alignment and phylogenetic tree tools for the Phaeocystis references
02-scale-family_eukulele
- run EUKulele against the sequences from Narragansett Bay, as appears in Figure 3 of the paper
03-scale-phylum_deepclust
- run DIAMOND DeepClust against the sequences from the BATS dataset, including/excluding the sequences from phylum Retaria as described in the paper
03-scale-phylum_eukulele
- run EUKulele against the sequences from the BATS dataset, including/excluding the sequences from phylum Retaria as described in the paper
XX-scale-all_deepclust
- run all scales of analysis through DIAMOND DeepClust to provide input to the tax-aliquots steps
Each notebook is connected to one of the main text and/or supplemental figures in the final paper. Data needed to run these notebooks can be generated by downloading source datasets and running the Snakemake workflows from the section above.
Notebooks are named according to the convention:
XXFIG_<descriptor>.ipynb
where "XX" is the figure number if the notebook corresponds to a main text figure, or the literal string "XX" if the notebook is strictly supplemental. "FIG" marks the file as a figure notebook, and the descriptor gives more detail about the notebook's objective(s).
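
As an illustration of the naming convention, the following is a minimal Python sketch that splits a notebook filename into its figure number and descriptor. It is not part of the repository. The helper name `parse_notebook_name`, the regular expression, and the example filenames are all hypothetical, and the sketch assumes that main text figure numbers are written as digits (for example `03`).

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern for XXFIG_<descriptor>.ipynb: a run of digits (main text
# figure) or the literal "XX" (supplemental), then "FIG_", then the descriptor.
NOTEBOOK_PATTERN = re.compile(
    r"^(?P<prefix>\d+|XX)FIG_(?P<descriptor>.+)\.ipynb$"
)


class NotebookName(NamedTuple):
    figure: Optional[int]  # None when the notebook is strictly supplemental
    descriptor: str


def parse_notebook_name(filename: str) -> NotebookName:
    """Parse a notebook filename that follows the XXFIG_<descriptor>.ipynb convention."""
    match = NOTEBOOK_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not follow XXFIG_<descriptor>.ipynb")
    prefix = match.group("prefix")
    figure = None if prefix == "XX" else int(prefix)
    return NotebookName(figure, match.group("descriptor"))


# Hypothetical filenames, used only to show the two cases.
main_text = parse_notebook_name("03FIG_example.ipynb")
supplemental = parse_notebook_name("XXFIG_example.ipynb")
```

Here `main_text.figure` is the integer figure number and `supplemental.figure` is `None`, so a script could, for instance, group notebooks by main text figure and list the supplemental ones separately.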