-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
new behavior for protein k-mer size calculations - gathering the issues together. #999
Comments
note that such changes would trigger a new major version, so ~now-ish is a good time... |
I just went through the issues above and tried to summarize here: desired functionality:
proposed solutions: In #574, @ctb proposed splitting dna and protein signature calculation. This could be a good cli solution: each signature method would use exact ksize, except when translating (translate In line with comments in #720, I propose changing the Note that the change in #576 (issue #575) actually removed good error checking for the current working implementation, but will be accurate for the proposed changes. |
It is enabled in the Rust code (see here), it converts the This is the commit that would change it in Python (expose the I think we should expose the def __init__(self, n, ksize, is_protein=False, dayhoff=False, hp=False, track_abundance=False, seed=MINHASH_DEFAULT_SEED, max_hash=0, mins=None, scaled=0): Change to: def __init__(self, ksize, scaled=1000, hash_function=HashFunctions.murmur64_DNA, track_abundance=False, seed=MINHASH_DEFAULT_SEED, mins=None, *, n=0, is_protein=False, dayhoff=False, hp=False): Note: these changes also
We can also potentially move the def __init__(self, ksize, *, scaled=1000, hash_function=HashFunctions.murmur64_DNA, track_abundance=False, seed=MINHASH_DEFAULT_SEED, mins=None, n=0, is_protein=False, dayhoff=False, hp=False): and so the only required argument would be |
Thanks @ctb @bluegenes @luizirber for the discussion so far! Jumping on @bluegenes's comment:
Haha, this breaks my workflow :) I do a good amount of comparisons of DNA vs protein, e.g. DNA k=33 and protein k=33 (11). Maybe something like a separate "protein ksize" parameter?
@bluegenes what do you think of the reverse complement protein k-mers? I hadn't thought about that. I've personally moved completely away from
Yes, some user feedback on "I saw some letters I didn't like" would be helpful.
I think this makes the most sense. Handling ksize and molecule/alphabet is a unique use case for proteins (though I've seen reduced nucleotide alphabets like Weak/Strong, Amino/Keto, Purine/Pyrimidine be useful for e.g. noncoding RNA homology...), I think it's worth separating to avoid confusion.
Yes to this! FWIW, my current main workflow is as in nf-core/kmermaid:
Hope that helps! Thanks for bringing these issues up. |
hi @olgabot thanks! I'm confused by two things --
I don't understand what kinds of computation or comparison you're doing here - are you doing translations of DNA sequence into aa at DNA ksize=33/aa ksize 11, as well as protein sequence at ksize=33 (aa ksize 11), and then comparing them? separately, "this breaks my workflow" - is that a negative, or a neutral? You'd be able to pin sourmash to > 2 and < 4 and still use it as normal until you migrated your workflows to sourmash 4 explicitly. |
The comparison is done on the Jaccard similarities after
Ah it was supposed to be a joking neutral. I can update the pipeline code, it already treats nucleotide vs peptide sequences differently so it's not a big deal 👍 |
OK, I did some playing around this morning and have a better understanding of just how terrible all of this is 😂 sourmash protein stuffstarting from GCA_001593925 - d__Archaea,p__Crenarchaeota,c__Bathyarchaeia,o__B26-1,f__B26-1,g__B26-1,s__B26-1 sp001593925
and GCA_001593935 - d__Archaea,p__Crenarchaeota,c__Bathyarchaeia,o__B26-1,f__B26-1,g__B26-1,s__B26-1 sp001593935
notes --
|
yeah, continued terribleness when building lots of protein signatures. We need better error checking and UX improvements around "how could you possibly have meant to do this?" We may want to change some of the default parameters when doing protein calculations. This kind of actually argues for a different command line entry point for protein computations, e.g. something other than compute. 🤔 New thought... will sit on it to see where I come out. |
some additional thoughts:
Potential solutions:
|
Does
YES YES YES to all of this. I also end up generating 1 minhash/
Here's an implementation: import math
PROTEIN_SIGMA = 20
def get_equivalent_protein_ksize(protein_ksize, sigma):
return int(math.ceil(protein_ksize * math.log(PROTEIN_SIGMA) / math.log(sigma)))
get_equivalent_protein_ksize(protein_ksize=5, sigma=6) |
Unless I'm mistaken, Though I guess I don't know what the behavior is if merging files with different ksizes (that thus have no union)... |
Random thought - what about a syntax like:
etc? |
I like it! |
after working through logic for #1037, I think there are multiple opportunities for checks on alphabet (DNA vs prot) at the API level. interim thoughts:
|
more thoughts -- I think we can proceed slowly - maybe even in another 3.x release - by adding a new top-level aasketch or whatever command. This would prototype new command line behavior, e.g. |
I'm a bit worried this will get quite hard to parse... Another idea:
In this case parallelism is achieved by piping more processes. Drawback is that it is longer and more repetitive, and depends on shell piping. |
hmm. I really like having this diversity of options, and I kinda like all of them, together with a 'cat' style command for sourmash sig to concatenate individual sig files. re
I'm more concerned about the UX! I want to make simple things simple, and hard things possible... |
musing out loud about serialization - if we change the meaning of ksize in protein signatures, how do we deal with loading existing signature files (where the ksize is 3x what it will be)? our options are to change molecule type (to protein2, or something - ugly!), or to change the sourmash signature version, I think. Changing the version seems better to me. That would also allow us to upgrade signatures with separate thought - if we have two different commands, dnasketch and aasketch, where does the translate-this-dna-sequence command go? in dnasketch because we're taking in DNA sequences, or in aasketch because we're producing protein signatures? |
To be consistent with previous versions, can the ksize variable be renamed, e.g. |
hot takes --
|
@olgabot @bluegenes @luizirber ok, in #1159 I've added (what appears to be) a fully functional
See the description there for (lots) more details! I think it's a much less confusing user experience, but I welcome your feedback, pro or con! Note that I did implement better ksize handling (for protein and translate, it's now the "true" amino acid k-mer size) but I didn't do anything magic with ksize in the internals, so if you construct a signature with |
here are some notes I took while working through #1159. Changes to signature computationSuggested changes to signature computation:
Signature JSON format changesWhat signature format changes should we do in tandem? see #268 for rollup issue
Changes to MinHash APIgeneral issue here, #338
|
also a question to ask re #1159, how easy/hard does this make snakemake recipes? |
I've been using the new sketching a fair bit now - going well so far. Re: snakemake: If you're generating a single type of signature, it's not hard. It gets more complex if you want to enable flexibility for sketching with On ksizes: The different ksizes for sketching vs internal params is a bit frustrating (still need to pass k*3 when selecting sigs) - consider this my vote in for changing that for 4.0. Thoughts on ksize, @olgabot @ctb ? Or should I move this discussion to #1169 ? |
great!
hmm, interesting. I wonder if there's a way to make this easy in a single go... will think on't.
just thinking out loud - it'd be easy to fix the CLI and the Python API, I think, without changing the saving/loading. Maybe I'll try doing that as a first step. |
The CLI and Python API were updated to use new protein ksizes in #1315. This issue needs a serious revisit to glean the remaining ideas, perhaps to be put into new issues! |
replaced by #1525 |
Note: updated in #1525
when taking in DNA sequence to translate into protein, how should sourmash treat the k-mer size?
right now,
src/core/src/sketch/minhash.rs::add_sequence()
,if self.is_dna()
block)src/core/src/sketch/minhash.rs::add_protein()
).src/core/src/sketch/minhash.rs::add_sequence()
,else
block)this is an enduring source of confusion, so should be fixed. and there are a complex set of issues in the issue tracker here...
--protein
k-size meaning #574 has a good discussion of some options. Note also Allow all ksizes for protein signature calculation if input is protein #575.add_sequence
The text was updated successfully, but these errors were encountered: