Higher-order "compare" method with zeta diversity metric #1189

Closed
nmb85 opened this issue Aug 30, 2020 · 11 comments

@nmb85

nmb85 commented Aug 30, 2020

Hi, I am interested in using sourmash to compare entire metagenomes and hierarchically cluster them based on diversity metrics. There is recent work with the "zeta diversity" metric suggesting that changing the "zeta order" affects how we understand the drivers of microbial biogeography and the turnover of rare versus abundant species. The familiar diversity metrics fit this framework: alpha diversity has a zeta order of 1 because it is the number of species in a single sample, beta diversity has a zeta order of 2 because it is based on the number of species (not) shared by two samples, and metrics with higher zeta orders follow the same pattern for species shared by three or more samples. There is an illustration of this in Figure 5 of this paper, which applies the zeta diversity metric to soil microbiomes (both 16S surveys and shotgun metagenomes). Is it possible to calculate diversity metrics with zeta orders higher than 2 using sourmash signatures?
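A toy sketch of the zeta-order idea on three made-up species sets (the sample names and species below are placeholders, not sourmash output):

from itertools import combinations

# toy species sets for three samples (placeholder data)
samples = {
    "A": {"sp1", "sp2", "sp3"},
    "B": {"sp2", "sp3", "sp4"},
    "C": {"sp3", "sp4", "sp5"},
}

# zeta_i = mean number of species shared by every combination of i samples
def zeta(sets, i):
    combos = list(combinations(sets.values(), i))
    return sum(len(set.intersection(*c)) for c in combos) / len(combos)

print(zeta(samples, 1))  # order 1: mean richness per sample (alpha-like) -> 3.0
print(zeta(samples, 2))  # order 2: mean species shared by pairs of samples -> ~1.67
print(zeta(samples, 3))  # order 3: species shared by all three samples -> 1.0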

@luizirber
Member

I did a quick skim of the paper, and I think this is possible with the info we already output from sourmash gather (if you do the classification first and use the matches to calculate diversity).

A possible extension would be to do the zeta diversity calculation on the hashes themselves. That would require a bit more code, but it is doable with the LCA index, which already has the mapping from each hash to the signatures that contain it. I'm not sure how robust the results would be... but worth trying it out =]
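A minimal sketch of that hash-to-signature mapping, using a plain dictionary instead of the LCA index (here `sigs` is assumed to be a list of already-loaded signatures):

from collections import defaultdict

# map each hash to the set of signature names that contain it
# (a plain-dict stand-in for the mapping the LCA index stores)
hash_to_sigs = defaultdict(set)
for sig in sigs:  # sigs: list of loaded signatures (assumed)
    for hashval in sig.minhash.get_mins():
        hash_to_sigs[hashval].add(sig.name())

# e.g. how many hashes are shared by at least three signatures
print(sum(1 for names in hash_to_sigs.values() if len(names) >= 3))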

@luizirber added the idea label on Sep 3, 2020
@nmb85
Author

nmb85 commented Sep 3, 2020

Hi, @luizirber, thanks for getting back to me. I am most interested in doing the calculation on the hashes because I do not want database limitations to skew results. I was able to figure out how to pull the hashes out of the signatures via the Python API and then do simple set operations on them this way:

# load modules for working with signatures, building set combinations, and set operations
import glob, sourmash, itertools, time

# get signature paths
mgsig_files = glob.glob(paths_to_sigs)

# load signatures into list
mgsigs = []
for mgsig_file in mgsig_files:
    mgsigs.append(sourmash.load_one_signature(mgsig_file, ksize = 31))

# get the hash lists from the signatures and store in a dictionary
mghashsets = {}
for mgsig in mgsigs:
    mghashset = mgsig.minhash.get_hashes()
    sample = mgsig.name().split("/")[1] # this line is specific to my input file paths and just gets the metagenome sample name to use as dictionary keys
    mghashsets[sample] = set(mghashset)

# get the different combinations of hash sets
intersect_size = 3
combos = list(itertools.combinations(mghashsets, intersect_size))

# calculate intersections between sets of hashes; this takes longer with more sets per intersection (intersect_size),
# and becomes difficult on a single node at an intersect_size of about 6-8 for 40-50 metagenome signatures with < 10k hashes per signature

start = time.time()
intersects = {}

for combo in combos:
    hashsets = [ mghashsets[sample] for sample in combo ]
    intersects[combo] = set.intersection(*hashsets)
end = time.time()
print(end-start)

From these intersections of multiple hash sets, I can calculate the statistics I need for the zeta diversity decay curve. I can also determine via sourmash gather which species are shared by the signature hash sets in the intersection. Is this sound? And can you suggest a more efficient way to calculate these intersections (in Python)? Something I might be able to try myself so as not to burden you and your team?

Mind you, this approach of measuring shared species among metagenomes at different levels of intersection is analogous to the pangenome paradigm of measuring shared genes among genomes at different levels of intersection.
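One way to turn those intersections into the decay curve is to average the intersection size at each order, reusing the mghashsets dictionary from the code above (a sketch; the range of orders is arbitrary):

# zeta decay curve: mean intersection size at each order i
zeta_curve = {}
for i in range(1, 6):  # orders 1..5; higher orders get expensive quickly
    sizes = [len(set.intersection(*(mghashsets[s] for s in combo)))
             for combo in itertools.combinations(mghashsets, i)]
    zeta_curve[i] = sum(sizes) / len(sizes)

print(zeta_curve)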

@nmb85
Author

nmb85 commented Sep 4, 2020

Never mind, @luizirber. It is much more efficient to just create a table of the metagenome signature names, the hashes, and even the hash abundances, like this:

# load modules for working with signatures, building set combinations, and set operations
import glob, sourmash, pandas as pd

# get signature paths
mgsig_files = glob.glob(paths_to_sigs)

# load signatures into list
mgsigs = []
for mgsig_file in mgsig_files:
    mgsigs.append(sourmash.load_one_signature(mgsig_file, ksize = 31))

# extract the metagenome signature names, hashes, and counts into nested dictionary
sample_dict = {}
for mgsig in mgsigs:
    mins = mgsig.minhash.get_mins(with_abundance=True)
    sample = mgsig.name().split("/")[1]
    sample_dict[sample] = mins

# convert the nested dictionary into a pandas dataframe
sample_ids = []
min_list = []

for sample_id, mins in sample_dict.items():
    sample_ids.append(sample_id)
    min_list.append(pd.DataFrame.from_dict(mins, orient="index",columns=["abundance"]))

df = pd.concat(min_list, keys=sample_ids)

# reformat the pandas dataframe, then save to disk as a tab-separated file to work with later / import into R
df.reset_index(inplace=True)
df = df.rename(columns={"level_0": "experiment", "level_1": "hash"})
df.to_csv(path_to_csv_file, sep="\t", index=False)

Then I can calculate all the statistics I want from the table/matrix really quickly. Using itertools to create all combinations of signatures with shared hashes is a waste of time. Am I reinventing the wheel? Does sourmash already have a method to save the signature names, hashes, and hash counts for a set of signatures to a tabular file?
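One reason the table route is so much faster: from the long table you can count, for each hash, how many samples it occurs in, and then get every zeta order from those counts with a combinatorial identity instead of enumerating sample combinations. A sketch, assuming `df` is the long-form dataframe built above, with columns "experiment" and "hash" (`math.comb` needs Python 3.8+):

from math import comb

# occupancy: number of samples each hash occurs in
occupancy = df.groupby("hash")["experiment"].nunique()
n_samples = df["experiment"].nunique()

# zeta_i = average intersection size over all i-sample combinations
#        = sum over hashes of C(occupancy, i) / C(n_samples, i)
def zeta(i):
    return sum(comb(int(c), i) for c in occupancy) / comb(n_samples, i)

zeta_curve = {i: zeta(i) for i in range(1, n_samples + 1)}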

@ctb
Contributor

ctb commented Sep 4, 2020 via email

@luizirber
Member

> Then I can calculate all the statistics I want from the table/matrix really quickly. Using itertools to create all combinations of signatures with shared hashes is a waste of time. Am I reinventing the wheel? Does sourmash already have a method to save the signature names, hashes, and hash counts for a set of signatures to a tabular file?

We didn't have a method before, but now we do =] This is beautiful!

@taylorreiter
Contributor

I've been converting signatures to CSVs and then merging the CSVs into a single table, backfilling with zeros when the hash is not observed in a signature. Very time consuming...this is much better!!!

@nmb85
Author

nmb85 commented Sep 4, 2020

@ctb, you're welcome for the code. I'm just happy to give back in a small capacity to a community that's helping me 😄

@luizirber, ah, good!

@taylorreiter oh, cool, nice to know someone else is traveling the same path; I have a спутник (a satellite, yes, but also means a fellow traveler), haha! Hey, do you know how to efficiently convert a pandas dataframe from long form like above to wide form (a matrix with hashes along one axis, sample ids along another axis, and abundances filling the matrix)? I usually just jump straight into R with tabular data, so I'm not savvy with pandas dataframes.

@taylorreiter
Contributor

@nmb85 I ended up with this implementation:

rule make_hash_abund_table_wide:
    input: "outputs/hash_tables/normalized_abund_hashes_long.csv"
    output: "outputs/hash_tables/normalized_abund_hashes_wide.feather"
    run:
        import pandas as pd
        import feather
        
        ibd = pd.read_csv(str(input), dtype = {"minhash" : "int64", "abund" : "float64", "sample" : "object"})
        ibd_wide = ibd.pivot(index='sample', columns='minhash', values='abund')
        ibd_wide = ibd_wide.fillna(0)
        ibd_wide['sample'] = ibd_wide.index
        ibd_wide = ibd_wide.reset_index(drop=True)
        ibd_wide.columns = ibd_wide.columns.astype(str)
        ibd_wide.to_feather(str(output))

but this operation gave me a non-trivial number of problems. With larger matrices, this code fails with `Unstacked DataFrame is too big, causing int32 overflow`.

@luizirber and I came up with a similar implementation that uses dask:

import sys

import dask.dataframe as dd
import feather

ibd = dd.read_csv(sys.argv[1], dtype = {"minhash" : "category", "abund" : "float64", "sample" : "str"})
ibd["minhash"] = ibd["minhash"].cat.as_known()
ibd_wide = ibd.pivot_table(index="sample", columns="minhash", values="abund")
ibd_wide = ibd_wide.fillna(0)

ibd_wide.columns = list(ibd_wide.columns)
ibd_wide = ibd_wide.reset_index()
ibd_wide.columns = ibd_wide.columns.astype(str)
ibd_wide.compute().to_feather(sys.argv[2])

However, this could also be problematic, so parquet could be useful as an alternative output format.
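If feather turns out to be a problem, a parquet sketch along the same lines (the output paths are placeholders; both routes need pyarrow or fastparquet installed):

# write the wide table as parquet instead of feather
ibd_wide.compute().to_parquet("hashes_wide.parquet")  # single file via pandas
# ibd_wide.to_parquet("hashes_wide_parquet/")         # or partitioned, directly from dask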


@nmb85
Author

nmb85 commented Sep 4, 2020

Thank you, @taylorreiter! This is so useful and saves me lots of time! It looks like I'll pick up a few more Python modules over the next week!

@ctb
Contributor

ctb commented Sep 9, 2020

hi folks, do y'all think the code above would fit in well with an upgraded `sourmash sig export` command? See also #1098

@ctb
Contributor

ctb commented Jan 30, 2021

revisiting this - right now I'm hesitant to add pandas as an installation requirement, so am going to pass on this for 4.0. maybe for 5.0 tho!
