Higher-order "compare" method with zeta diversity metric #1189
Comments
Did a quick skim of the paper, and I think this is possible with the info we already output from A possible extension would be to do the zeta diversity on the hashes, and that would require a bit more code (but is doable with the LCA index, which already has the mapping from hash to which signatures contain the hash). I'm not sure how robust the results would be... but worth trying it out =] |
Hi, @luizirber, thanks for getting back to me. I am most interested in doing the calculation on the hashes because I do not want database limitations to skew results. I was able to figure out how to pull the hashes out of the signatures via the Python API and then do simple set operations on them this way:
From these intersections of multiple hash sets, I can calculate the statistics I need for the zeta diversity decay curve. I can also determine via Mind you, this approach of measuring shared species among metagenomes at different levels of intersection is analogous to the pangenome paradigm of measuring shared genes among genomes at different levels of intersection. |
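The snippet from this comment is not preserved above. As a minimal sketch of the idea, assuming the hashes have already been extracted into one Python set per metagenome (the sample names, hash values, and the function name `zeta_by_enumeration` are hypothetical), zeta diversity at a given order can be taken as the mean number of hashes shared across every combination of that many samples:

```python
from itertools import combinations


def zeta_by_enumeration(hash_sets, order):
    """Mean number of hashes shared by every combination of `order` samples."""
    names = sorted(hash_sets)
    shared_counts = [
        # Intersect the hash sets of all samples in this combination.
        len(set.intersection(*(hash_sets[name] for name in combo)))
        for combo in combinations(names, order)
    ]
    return sum(shared_counts) / len(shared_counts)


# Hypothetical toy data: sample name -> set of hash values.
hash_sets = {
    "soil_A": {1, 2, 3, 4},
    "soil_B": {2, 3, 4, 5},
    "soil_C": {3, 4, 5, 6},
}

# Zeta decay curve: zeta order -> mean number of shared hashes.
zeta_decay = {
    k: zeta_by_enumeration(hash_sets, k) for k in range(1, len(hash_sets) + 1)
}
```

The number of combinations grows quickly with the number of samples, so this enumeration is only practical for small collections.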
Never mind, @luizirber. It is much more efficient to just create a table of the metagenome signature names, the hashes, and even the hash abundances like this:
Then I can calculate all the statistics I want from the table/matrix really quickly. Using itertools to create all combinations of signatures with shared hashes is a waste of time. Am I reinventing the wheel? Does sourmash already have a method to save the signature names, hashes, and hash counts for a set of signatures to a tabular file? |
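The table-building code from this comment is not preserved above. One way such a table can replace the combination enumeration, sketched here under assumptions: the rows are `(signature name, hash, abundance)` tuples (a hypothetical layout), and zeta diversity is the mean number of hashes shared across all combinations of `order` samples. A hash present in n samples is shared by exactly C(n, order) of those combinations, so counting how many samples contain each hash is enough:

```python
from collections import Counter
from math import comb


def zeta_from_table(rows, order):
    """Zeta diversity from long-form (name, hash, abundance) rows."""
    # Count each (sample, hash) pair once, ignoring abundance.
    pairs = {(name, h) for name, h, _abundance in rows}
    n_samples = len({name for name, _h in pairs})
    # occupancy[h] = number of samples that contain hash h.
    occupancy = Counter(h for _name, h in pairs)
    # A hash found in n samples is shared by comb(n, order) combinations.
    shared = sum(comb(n, order) for n in occupancy.values())
    return shared / comb(n_samples, order)


# Hypothetical toy table: (signature name, hash, abundance).
rows = [
    ("soil_A", 1, 2), ("soil_A", 2, 1), ("soil_A", 3, 4), ("soil_A", 4, 1),
    ("soil_B", 2, 3), ("soil_B", 3, 1), ("soil_B", 4, 2), ("soil_B", 5, 1),
    ("soil_C", 3, 1), ("soil_C", 4, 5), ("soil_C", 5, 2), ("soil_C", 6, 1),
]

zeta_curve = {k: zeta_from_table(rows, k) for k in (1, 2, 3)}
```

This makes a single pass over the table per zeta order, regardless of how many sample combinations exist.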
congratulations, you have leveled up to sourmash Power User level 3.
:) :)
thank you for the code!
|
We didn't have a method before, but now we do =] This is beautiful! |
I've been converting signatures to CSVs and then merging the CSVs into a single table, backfilling with zeros when the hash is not observed in a signature. Very time consuming...this is much better!!! |
@ctb, you're welcome for the code. I'm just happy to give back in a small capacity to a community that's helping me 😄 @luizirber, ah, good! @taylorreiter oh, cool, nice to know someone else is traveling the same path; I have a спутник (a satellite, yes, but also means a fellow traveler), haha! Hey, do you know how to efficiently convert a pandas dataframe from long form like above to wide form (a matrix with hashes along one axis, sample ids along another axis, and abundances filling the matrix)? I usually just jump straight into R with tabular data, so I'm not savvy with pandas dataframes. |
@nmb85 I ended up with this implementation:
but this operation gave me a non-trivial amount of problems. With larger matrices, this code will give @luizirber and I came up with a similar implementation that uses dask:
However, this could also be problematic, so parquet could be a useful alternative. Some links: |
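The pandas and dask snippets from this comment are not preserved above. For tables small enough to fit in memory, the long-to-wide reshaping itself can be sketched with the standard library alone. This is an illustration, not the implementation discussed in the thread; the row layout `(signature name, hash, abundance)` and the function name `long_to_wide` are assumptions:

```python
def long_to_wide(rows):
    """Pivot (name, hash, abundance) rows into a hashes x samples matrix."""
    names = sorted({name for name, _h, _a in rows})
    hashes = sorted({h for _n, h, _a in rows})
    # Look up abundance by (hash, sample); missing pairs are backfilled with 0.
    abundance = {(h, name): a for name, h, a in rows}
    matrix = [[abundance.get((h, name), 0) for name in names] for h in hashes]
    return hashes, names, matrix


# Hypothetical toy table in long form.
rows = [
    ("soil_A", 11, 2),
    ("soil_A", 22, 1),
    ("soil_B", 22, 5),
    ("soil_B", 33, 1),
]

hashes, names, matrix = long_to_wide(rows)
```

Each row of `matrix` corresponds to one hash and each column to one sample, with zeros where a hash was not observed in a signature.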
Thank you, @taylorreiter! This is so useful and saves me lots of time! It looks like I'll pick up a few more Python modules over the next week! |
revisiting this - right now I'm hesitant to add pandas as an installation requirement, so am going to pass on this for 4.0. maybe for 5.0 tho! |
Hi, I am interested in using sourmash to compare entire metagenomes and hierarchically cluster them based on diversity metrics. There is new work with the "zeta-diversity" metric suggesting that changing the "zeta order" has an effect on understanding drivers of microbial biogeography and turnover of rare versus abundant species. The familiar diversity metrics are alpha diversity, which has a zeta order of 1 because it is the number of species in a single sample, and beta diversity, which has a zeta order of 2 because it is the number of species (not) shared by two samples. Diversity metrics with higher zeta orders follow this pattern for species shared by three samples, etc. There is an illustration of this in Figure 5 of this paper, which is an example of the zeta diversity metric applied to soil microbiomes (both 16S surveys and shotgun metagenomes). Is it possible to calculate diversity metrics with zeta orders higher than 2 using sourmash signatures?