-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Within signature distance? #33
Comments
Sorry to raise an old issue from the dead, but this is the only mention of my question that I could find. Do the distances between hashes, i.e., the edit (hamming, levenshtein, etc) distance between, for example, 16571421240 <-> 111660087188, contain any underlying kmer <-> kmer distance information? It would be interesting to build a tree like suggested and use that for a unifrac distance calculation on a collection of signatures. |
wow, this IS an old issue! um, there is no k-mer <-> k-mer edit distance information available, I'm afraid. The hashing function explicitly breaks any link between similar k-mers, in fact! |
to answer @anmwinter question -- first, major apologies for taking so long! second, any metric that can operate on k-mers should work with k-mer hashes from scaled signatures. we've looked at things like accumulation curves (see comment) and and also @taylorreiter has thought a bit about diversity metrics for collections of k-mers. Happy to chat further if you have a specific calculation to do! |
@taylorreiter (and anyone else, haha!) When you have a moment (I realize that you and your team are busy with your main projects), I was wondering what you thought about this: Could one build a tree like the BK Tree mentioned by @anmwinter, place minhash abundances on that tree, and use the earth mover's distance to get a diversity-informed distance metric between minhash sets? Earth mover's distance is equivalent to unifrac, I think... Regardless of whether or not it is phylogenetically informative, it still would give a more information-rich distance metric between datasets than Jaccard, right? |
I don't have an intuition about whether this would work or not. Can you describe a little more how this would be more information-rich than Jaccard? Jaccard is based on presence absence of the hash, so presumably this has more information because it accounts for abundances? As Titus said above, if it works on k-mers, it should work on minhashes. It might be worth exploring a proof of concept in k-mer space? |
Hi, @taylorreiter , thanks very much for your thoughts! First re your questions:
Sure, as far as I understand it and in analogy only (I'm not fluent in symbolic math, to my chagrin): whereas Jaccard Distance tells you how many kmers/hashes are shared between samples, Earth Mover's Distance (EMD) would tell you how many kmers/hashes "piled up" at defined locations in a 2-D area (think of it as mounds, ponds, and slopes in your backyard: a landscape) that constitute one sample need to be moved to form the same landscape in another sample. So, to give a practical example that you're probably familiar with, and please ignore if you already know this: in the case of 16s rRNA sequences, earth mover's distance is the same as the weighted unifrac distance. Unifrac is based on the phylogenetic distribution of the sequences across a phylogenetic tree (which incorporates not just the classification of the kmer/sequence but also the phylogenetic distance between the kmers/sequences). The phylogenetic tree that is formed from all samples from which measurements will be made is analogous to the 2-D area across which "dirt" is distributed in samples measured by EMD. Weighted unifrac is the most information-rich distance metric that I'm aware of for phylogenetic trees that contain count information. With hashes, you can't build a phylogenetic tree of hash distances because hashes don't contain phylogenetic information per @ctb 's comment above - however - if there is still true diversity information contained in those hashes, i.e., hash similarity is proportional somehow to the similarity between the kmers represented by those hashes (which is something I don't know, and my main question to you), then the earth mover's distance of hash sets shared by samples would add at least some information about the similarity between the diversity distribution of the samples. So, as @anmwinter suggested hierarchically clustering hashes by their pairwise edit distances, it struck me that you may be able to use the diversity information contained in the BK-tree to inform an EMD between hash sets (analogously to phylogenetic trees + weighted unifrac for 16S sequences). That would provide a much more information-rich distance metric between samples than Jaccard.
Yes, plus it accounts for the difference in the structure of diversity between two samples, which is an additional source of information.
Would hierarchically clustering kmers like that be computationally practical? So, just to clarify, my main question to you in this post is: Thanks, Taylor! |
Just dropping a quick note that we actually have hash counts too, and that might be useful information to use... (Sorry for short message, will comment more next week) |
@luizirber For sure! I think count data is required for calculating the earth mover's distance (aka Wassertein or Kantorovich–Rubinstein). An alternative distance measure that is still more informative than the simple Jaccard distance, if a tree representing hash diversity is not a useful measure of diversity, is the weighted Jaccard and is proposed in this paper on the Order MinHash (OMH). Have you heard of this type of hash? It wouldn't be as informative as EMD, but it might be a nice option if the EMD is impossible to calculate from hashes. |
I played with the weighted Jaccard in another codebase, and it is fairly straighforward to do the same for Scaled MinHash sketches. And while WJ is not well defined for regular MinHash (you would need something like CWS for it), it does work for Scaled MinHash sketches. (deeper analysis on why WJ works for Scaled MinHash sketches pending =]) |
(note that we have some empirical evidence now that hash abundances are properly representative - see these (not necessarily permanent... sorry...) links, figure 8 -- hash abundance correlates with mapping for mock and high-coverage real data sets. https://ctb.github.io/2020-grist-examples/reports/report-SRR606249.html These are for sourmash gather output, not sourmash --containment output, but I can produce them for the latter easily enough should there be interest. |
I was wondering if it was worth looking at within signature differences to get an idea of the diversity of hashes? i.e using Jaccard, Hammings or Lenvestein metrics.
I need to check my understanding of how this whole hashing thing works for a metagenome:
I hacked together a BK-Tree script from someone else's code to look at which hashes might be similar within a given distance.
The text was updated successfully, but these errors were encountered: