-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Possible inconsistency in sbt_gather output #275
Comments
First things first - could you update to the latest master branch?
https://sourmash.readthedocs.io/en/latest/tutorials.html
The columns and output are much more stable now :).
On your P.S. and overlap, you are correct. I'm trying to think about
what to do - see
#266 (comment)
for a more detailed explanation.
thanks for asking!
|
Did the update and can confirm that the output is now more stable :) Thx! stdout:
Thanks to the explanation in #266, the |
Man, we really don't make the output easy to interpret, do we... :) see code for gather which (after perennial confusion on my part) I tried to simplify and sanify so that it's readable! here, 'intersect_mins' is with respect to what has not yet been matched, as you say. The column f_match should sum to 1.0 (p_match to 100%) if everything is known. I'm thinking about adding a |
Note, we are working on taxonomic summarization of the output too, so that we could e.g. group by species rather than splitting strains, but that functionality will be somewhat distinct from gather b/c it depends on NCBI metadata that not all databases will have. gather works equally well on private databases without lineage info. See #195 for that. |
Maybe it's just me being "slow" today ;) I found https://github.com/dib-lab/sourmash/blob/f76938edf1125c7aee61b9039dcf30add5b215f8/sourmash_lib/commands.py#L991 to be interesting in this respect, i.e., already found hashes are subtracted iteratively. I'll keep thinking about features that might come in handy and let you know. For now, there is another point that confused me. When I
In particular, ANEWREF was missing from the top hits in [EDIT] Not sure whether this should be a separate issue or if you'd close the current issue and we simply continue on the similarity issue herein. |
On Wed, Jun 07, 2017 at 04:44:06AM -0700, Cedric Laczny wrote:
Maybe it's just me being "slow" today ;)
hah unlikely!
It's WIP after all, so totally understand that and greatly appreciate your quick support!
welcome & thanks for the engagement!
I found https://github.com/dib-lab/sourmash/blob/f76938edf1125c7aee61b9039dcf30add5b215f8/sourmash_lib/commands.py#L991 to be interesting in this respect, i.e., already found hashes are *subtracted* iteratively.
yep. The idea is to greedily partition a metagenome into its constituent
"best matches".
For now, there is another point that confused me. When I `search` instead of `gather`, it is much slower and the similarity values do not match those in `gather`:
```
310 matches; showing first 3:
similarity match
---------- -----
77.8% SOMEREF.gz
74.4% SOMEOTHERREF.gz
73.8% ANEWREF.gz
```
In particular, ANEWREF was *missing* from the top hits in `gather`.
Could you please shed some light here too, especially, why are the similarity values "low"? :)
gather matches are meant to be disjoint; 'search' is doing a straightforward
Jaccard search and reporting anything that matches well; it should be identical
to what the mash software does.
If you use --best-only you will recover the speed of gather, but you will lose
most of the 310 matches. (Also note that 'search' on SBTs is not necessarily
100% accurate yet; see #224 for discussion and proposed fix.)
If you add --containment you should get similarity values that match those
in gather for the top hit. Briefly, when matching, 'gather' does not take
into account all of the k-mers that are in SOMEREF but are not in the query,
while 'search' does -- this latter behavior is true Jaccard similarity,
the former is what we call 'containment', and it is what gather uses.
`--containment` is theoretically guaranteed to return best match in SBTs (vice
plain ol' bugs) but will not be able to distinguish between e.g. a plasmid that
is part of a big genome and the plasmid itself, if both full genome and plasmid
are in the database.
|
That further clarifies many things, but not all unfortunately :) I gave
When I do a
Which is not what I would expect, naively. Specifically, I would expect the top hits to be the same :) |
On Wed, Jun 07, 2017 at 07:02:54AM -0700, Cedric Laczny wrote:
That further clarifies many things, but not all unfortunately :)
I had a look at the issue you mentioned, but, TBH, did not readily see what the "inaccuracy" could be and how the proposed fix looks.
Right, it's pretty technical - let's just say that we can only guarantee
best containment with our current approach, and we need to adjust to guarantee
best similarity!
I gave `search` another try and can confirm that `--best-only` speeds up things greatly.
However, the results look unexpected to me.
When I do a `search`, this is what I get on stdout:
```
310 matches; showing first 3:
similarity match
---------- -----
77.8% SOMEREF.gz
74.4% SOMEOTHERREF.gz
73.8% REF.gz
```
When I do a `search --best-only`, the stdout looks as follows:
```
(truncated search because of --best-only; only trust top result
6 matches; showing first 3:
similarity match
---------- -----
77.8% SOMEREF.gz
74.4% SOMEOTHERREF.gz
63.0% AGAINANEWREF.gz
```
Which is not what I would expect, naively. Specifically, I would expect the top hits to be the same :)
only the very top result is guaranteed (see message starting with 'truncated
search...' for my attempt to make this clear). I think in the future I will
simply limit output to 1 result...!
thanks again!
|
doooooh sorry to have missed that. Thank you very much @ctb! |
On Wed, Jun 07, 2017 at 08:13:30AM -0700, Cedric Laczny wrote:
> only the very top result is guaranteed (see message starting with 'truncated search...' for my attempt to make this clear).
*doooooh* sorry to have missed that.
That's actually pretty much what I need right now: a quick way to find the most similar reference. The SBT will help greatly here as it will avoid going over all references, as it would have to do with`mash`.
Thank you very much @ctb!
welcome! and glad to hear it!
suggest that you use both 'search' and 'search --containment' until we
get that bug fixed, just to be 100% certain. (The former should give you
correct results 99.9% of the time, the latter should give you correct results
100% of the time.)
|
Hi,
first of all, I'd like to thank you all for this nice tool.
I am playing around with sourmash from http://2017-ucsc-metagenomics.readthedocs.io/en/latest/sourmash.html, i.e., I "pip installed" it as described there as I wanted to have the
sbt_gather
function.For now, I just wanted to get things running, so I used the default parameters (
-k 31 --scaled 2000 --track-abundance
) for building the reference signatures, followed bysbt_gather
of MYQUERY (representing a de novo assembled, bacterial isolate genome).While doing so, I observed an unexpected behavior and was wondering what is going on there.
Here comes the output to stdout:
Followed by the output (
-o MYQUERY.hits
):Concretely, I am confused by the column order, as it seems switched between stdout and
-o
: (98.7% 96.2%
vs.0.9617[...],0.987[..]
), albeit the column headers' order seems to be consistent.TIA for your support.
Best,
Cedric
P.S. I am unsure on how to interpret
the recovered matches hit 100.0% of the query
when having multiple, here 3, (partially) matching references. Might this be an indication of some genetic exchange, thinking, MYQUERY "includes" the majority from SOMEREF, but also some "parts" from SOMEOTHERREF?[EDIT] Along the lines of the P.S., how is the line
4.9 Mbp 96.2% 1.1% SOMEOTHERREF
to be understood`? How can the overlap be so large, yet SOMEOTHERREF is only found with a fraction of 1.1%? For completeness, my index only includes signatures from bacterial genomes.The text was updated successfully, but these errors were encountered: