Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

new behavior for protein k-mer size calculations - gathering the issues together. #999

Closed
ctb opened this issue May 25, 2020 · 27 comments
Closed
Labels
revisit_me An issue that needs attention and clarification

Comments

@ctb
Copy link
Contributor

ctb commented May 25, 2020

Note: updated in #1525

when taking in DNA sequence to translate into protein, how should sourmash treat the k-mer size?

right now,

  1. sourmash on DNA sequence -> DNA minhash uses exact k-mer size (k=21, 31, 51, or whatever) does reverse complement calculations. (src/core/src/sketch/minhash.rs::add_sequence(), if self.is_dna() block)
  2. sourmash on protein sequence -> protein minhash uses ksize / 3, with no reverse complement. (src/core/src/sketch/minhash.rs::add_protein()).
  3. sourmash on DNA sequence -> protein minhash uses ksize / 3, with reverse complement. (src/core/src/sketch/minhash.rs::add_sequence(), else block)

this is an enduring source of confusion, so should be fixed. and there are a complex set of issues in the issue tracker here...

@ctb
Copy link
Contributor Author

ctb commented May 25, 2020

note that such changes would trigger a new major version, so ~now-ish is a good time...

@bluegenes
Copy link
Contributor

bluegenes commented May 25, 2020

I just went through the issues above and tried to summarize here:

desired functionality:

  1. Calculate protein signatures with exact ksize (not k=k/3), both at python api and on cli. Translated signatures still use k=k/3.
  2. Translated k=33 should be comparable with protein k=11 (equivalent ksizes). can we just use the hash function to flag incompatible signatures, instead of DNA/protein/etc? #751 could facilitate this. In Change --protein k-size meaning  #574, @luizirber pointed out that can we just use the hash function to flag incompatible signatures, instead of DNA/protein/etc? #751 is already enabled in the rust code?
  3. Potential translation improvements:
  4. Potentially enhance input checks for add_sequence and add_protein? From Document and/or improve protein MinHash API. #720: "in the provided example in Setting "scaled=1" doesn't retrieve all protein k-mers #701, I'm not sure why add_sequence doesn't complain when it's given a bunch of non-ACTG characters."

proposed solutions:

In #574, @ctb proposed splitting dna and protein signature calculation. This could be a good cli solution: each signature method would use exact ksize, except when translating (translate k = k/3).

In line with comments in #720, I propose changing the add_protein code to work directly on the provided ksize, so the functionality would change both at the cli and python api levels. #186 suggests an alternate method that would merge add_sequence and add_protein, but the current implementation (as separate methods) may be clearer and facilitate input checks?

Note that the change in #576 (issue #575) actually removed good error checking for the current working implementation, but will be accurate for the proposed changes.

@luizirber
Copy link
Member

  1. Translated k=33 should be comparable with protein k=11 (equivalent ksizes). can we just use the hash function to flag incompatible signatures, instead of DNA/protein/etc? #751 could facilitate this. In Change --protein k-size meaning  #574, @luizirber pointed out that can we just use the hash function to flag incompatible signatures, instead of DNA/protein/etc? #751 is already enabled in the rust code?

It is enabled in the Rust code (see here), it converts the is_protein/hp/dayhoff booleans into the right hash_function enum.

This is the commit that would change it in Python (expose the HashFunction enum in Python). It is in the rfc/751_rust branch, but can be merged without changing the public Python API.

I think we should expose the HashFunction as a parameter in MinHash.__init__, replacing the
is_protein/hp/dayhoff for 4.0, but that would break all the existing code.
Current __init__:

def __init__(self, n, ksize, is_protein=False, dayhoff=False, hp=False, track_abundance=False, seed=MINHASH_DEFAULT_SEED, max_hash=0, mins=None, scaled=0):

Change to:

def __init__(self, ksize, scaled=1000, hash_function=HashFunctions.murmur64_DNA, track_abundance=False, seed=MINHASH_DEFAULT_SEED, mins=None, *, n=0, is_protein=False, dayhoff=False, hp=False):

Note: these changes also

  • regular MinHash are not the default (n=0) anymore
  • force is_protein/hp/dayhoff to be keyword-only arguments
  • drops max_hash (it already throws a deprecation warning)

We can also potentially move the * and make all the keywords into keyword-only, like this

def __init__(self, ksize, *, scaled=1000, hash_function=HashFunctions.murmur64_DNA, track_abundance=False, seed=MINHASH_DEFAULT_SEED, mins=None, n=0, is_protein=False, dayhoff=False, hp=False):

and so the only required argument would be ksize, and all the other should be used explicitly. More info: https://www.asmeurer.com/python3-presentation/slides.html#26

@olgabot
Copy link
Collaborator

olgabot commented May 25, 2020

Thanks @ctb @bluegenes @luizirber for the discussion so far!

Jumping on @bluegenes's comment:

  1. Calculate protein signatures with exact ksize (not k=k/3), both at python api and on cli. Translated signatures still use k=k/3.

Haha, this breaks my workflow :) I do a good amount of comparisons of DNA vs protein, e.g. DNA k=33 and protein k=33 (11). Maybe something like a separate "protein ksize" parameter?

  1. Potential translation improvements:

@bluegenes what do you think of the reverse complement protein k-mers? I hadn't thought about that.

I've personally moved completely away from sourmash's translation to sencha because there's no guarantee that the 11-mer protein without stop codons is actually the "right" one. So while it's convenient to have the translation within sourmash I find myself not using it.

  1. Potentially enhance input checks for add_sequence and add_protein? From Document and/or improve protein MinHash API. #720: "in the provided example in Setting "scaled=1" doesn't retrieve all protein k-mers #701, I'm not sure why add_sequence doesn't complain when it's given a bunch of non-ACTG characters."

Yes, some user feedback on "I saw some letters I didn't like" would be helpful.

proposed solutions:

In #574, @ctb proposed splitting dna and protein signature calculation. This could be a good cli solution: each signature method would use exact ksize, except when translating (translate k = k/3).

I think this makes the most sense. Handling ksize and molecule/alphabet is a unique use case for proteins (though I've seen reduced nucleotide alphabets like Weak/Strong, Amino/Keto, Purine/Pyrimidine be useful for e.g. noncoding RNA homology...), I think it's worth separating to avoid confusion.

In line with comments in #720, I propose changing the add_protein code to work directly on the provided ksize, so the functionality would change both at the cli and python api levels. #186 suggests an alternate method that would merge add_sequence and add_protein, but the current implementation (as separate methods) may be clearer and facilitate input checks?

Yes to this!

FWIW, my current main workflow is as in nf-core/kmermaid:

  1. Preprocess reads with trimming, FASTQC, etc
  2. Translate RNA-seq reads to protein-coding sequences with sencha translate, creating coding_nucleotides.fasta and coding_peptides.fasta files for each sequence
  3. Generate nucleotide signatures of the coding nucleotides with sourmash sketch
  4. Generate protein signatures of the coding peptides with sourmash sketch, at the corresponding ksizes the nucleotides.
  5. (optional sourmash compare that I lately haven't been doing because I need to compare multiple runs together later)

Hope that helps! Thanks for bringing these issues up.

@ctb
Copy link
Contributor Author

ctb commented May 26, 2020

hi @olgabot thanks! I'm confused by two things --

Haha, this breaks my workflow :) I do a good amount of comparisons of DNA vs protein, e.g. DNA k=33 and protein k=33 (11).

I don't understand what kinds of computation or comparison you're doing here - are you doing translations of DNA sequence into aa at DNA ksize=33/aa ksize 11, as well as protein sequence at ksize=33 (aa ksize 11), and then comparing them?

separately, "this breaks my workflow" - is that a negative, or a neutral? You'd be able to pin sourmash to > 2 and < 4 and still use it as normal until you migrated your workflows to sourmash 4 explicitly.

@olgabot
Copy link
Collaborator

olgabot commented May 26, 2020

hi @olgabot thanks! I'm confused by two things --

Haha, this breaks my workflow :) I do a good amount of comparisons of DNA vs protein, e.g. DNA k=33 and protein k=33 (11).

I don't understand what kinds of computation or comparison you're doing here - are you doing translations of DNA sequence into aa at DNA ksize=33/aa ksize 11, as well as protein sequence at ksize=33 (aa ksize 11), and then comparing them?

The comparison is done on the Jaccard similarities after sourmash compare, i.e. how similar are these two transcriptomes if you use DNA or protein sequences, and if protein, what if you reencode to dayhoff or HP? How do the similarities increase/decrease? How does this track with evolutionary distance of the transcriptomes? These are the kinds of questions I've been thinking about :)

separately, "this breaks my workflow" - is that a negative, or a neutral? You'd be able to pin sourmash to > 2 and < 4 and still use it as normal until you migrated your workflows to sourmash 4 explicitly.

Ah it was supposed to be a joking neutral. I can update the pipeline code, it already treats nucleotide vs peptide sequences differently so it's not a big deal 👍

@ctb
Copy link
Contributor Author

ctb commented May 30, 2020

OK, I did some playing around this morning and have a better understanding of just how terrible all of this is 😂

sourmash protein stuff

starting from

GCA_001593925 - d__Archaea,p__Crenarchaeota,c__Bathyarchaeia,o__B26-1,f__B26-1,g__B26-1,s__B26-1 sp001593925

GCA_001593925.1_ASM159392v1_cds_from_genomic.fna.gz
GCA_001593925.1_ASM159392v1_genomic.fna.gz
GCA_001593925.1_ASM159392v1_protein.faa.gz

and

GCA_001593935 - d__Archaea,p__Crenarchaeota,c__Bathyarchaeia,o__B26-1,f__B26-1,g__B26-1,s__B26-1 sp001593935

GCA_001593935.1_ASM159393v1_cds_from_genomic.fna.gz
GCA_001593935.1_ASM159393v1_genomic.fna.gz
GCA_001593935.1_ASM159393v1_protein.faa.gz

notes --

  • at k=33 sourmash compare *v1_genomic.fna.gz.sig shows no similarity
  • sourmash compute -k 33 *.faa.gz should not have worked without error or warning! (computes DNA signatures!)
  • sourmash compute -k 33 *.faa.gz --protein --no-dna works but does the wrong thing - assumes input sequences are DNA!
  • sourmash compute -k 11,33 --input-is-protein *.faa.gz - which ksize do we end up with here?
  • sourmash compute -k 33 --protein *_cds_from_genomic.fna.gz works but does 6 frame translation. (and how would sourmash know the right frame?)

@ctb
Copy link
Contributor Author

ctb commented Jun 18, 2020

yeah, continued terribleness when building lots of protein signatures. We need better error checking and UX improvements around "how could you possibly have meant to do this?"

We may want to change some of the default parameters when doing protein calculations. This kind of actually argues for a different command line entry point for protein computations, e.g. something other than compute. 🤔 New thought... will sit on it to see where I come out.

@bluegenes
Copy link
Contributor

bluegenes commented Jun 18, 2020

some additional thoughts:

  1. protein, dayhoff, hp signatures work well at different ranges of ksizes. If you want to calculate all of them together, you end up with a series of useless signatures (e.g. want protein k=21, but hp at k=21 is relatively useless)
  2. If instead you calculate them separately, you need different files. For me this ends up practically with one file per ksize per alphabet. Not a big deal, but lots of files.

Potential solutions:

  • easy: signature cat utility that allows you to cat sigs into a single file after creating them separately (if they have identical file metadata)
  • harder, perhaps confusing at cli level: enable diff ksizes per alphabet

@olgabot
Copy link
Collaborator

olgabot commented Jun 18, 2020

some additional thoughts:

  1. protein, dayhoff, hp signatures work well at different ranges of ksizes. If you want to calculate all of them together, you end up with a series of useless signatures (e.g. want protein k=21, but hp at k=21 is relatively useless)
  2. If instead you calculate them separately, you need different files. For me this ends up practically with one file per ksize per alphabet. Not a big deal, but lots of files.

Potential solutions:

  • easy: signature cat utility that allows you to cat sigs into a single file after creating them separately (if they have identical file metadata)

Does sourmash sig merge solve this?

  • harder, perhaps confusing at cli level: enable diff ksizes per alphabet

YES YES YES to all of this. I also end up generating 1 minhash/.sig file because of the alphabet-vs-ksize usefulness. There's some log math one can do to get the equivalent ksize for a different alphabet of size sigma, relative to the protein ksize k

sigma ^ k_new = 20 ^ k
k_new * log(sigma) = k * log(20)
k_new = k * log(20) / log(sigma)

Here's an implementation:

import math

PROTEIN_SIGMA = 20

def get_equivalent_protein_ksize(protein_ksize, sigma):
    return int(math.ceil(protein_ksize * math.log(PROTEIN_SIGMA) / math.log(sigma)))

get_equivalent_protein_ksize(protein_ksize=5, sigma=6)

@bluegenes
Copy link
Contributor

Does sourmash sig merge solve this?

Unless I'm mistaken, merge doesn't solve bc it's designed to output the union of all the hashes in file1.sig and file2.sig to merged.sig

Though I guess I don't know what the behavior is if merging files with different ksizes (that thus have no union)...

@ctb
Copy link
Contributor Author

ctb commented Jun 18, 2020

Random thought - what about a syntax like:

sourmash compute --dayhoff k=19,scaled=200 k=21,scaled=200 --protein k=21,scaled=200

etc?

@olgabot
Copy link
Collaborator

olgabot commented Jun 19, 2020

I like it!

@ctb
Copy link
Contributor Author

ctb commented Jun 21, 2020

after working through logic for #1037, I think there are multiple opportunities for checks on alphabet (DNA vs prot) at the API level.

interim thoughts:

  • I think we do need both command line and API level checking. The command line can make use of additional info (filename, aggregated across sequences, etc) while the API has to do the trickier job of working with only the sequence it's given.
  • I am leaning towards add_dna_sequence and add_protein_sequence at the API level;
  • it's not 100% clear how robust it will be to check that any given k-mer is DNA vs prot;
  • one strategy might be to look at what fraction of k-mers are valid alphabet;
  • I'm strangely challenged by the notion of separating compute-with-input-is-DNA and compute-with-input-is-protein at the command line.
    • we may need a third, right? input-is-RNA? that would do translation? but we don't really know how to find the right reading frame just yet, so that should probably be punted until sencha and other approaches are fully validated.
    • I know olga has proposed changing compute to sketch in rename 'compute' to 'sketch' or compare to similarity? #663, I don't really know how to evaluate that, but it COULD be part of this change, e.g. sourmash sketch --dna and sourmash sketch --protein or something else terrible like that :). maybe dnasketch and aasketch/protsketch? to be later followed by rnasketch?
    • and/or we could limit compute to DNA in 4.0, and add a new separate command for the current --input-is-protein behavior, like aasketch or protsketch

@ctb
Copy link
Contributor Author

ctb commented Jun 21, 2020

more thoughts --

I think we can proceed slowly - maybe even in another 3.x release - by adding a new top-level aasketch or whatever command. This would prototype new command line behavior, e.g. sourmash aasketch --dayhoff k=19,scaled=200 k=21,scaled=200, while preserving old behavior, and would let us play a little bit with these suggestions before locking us into a new release. Could do same with dnasketch or similar command.

@luizirber
Copy link
Member

luizirber commented Jun 21, 2020

sourmash compute --dayhoff k=19,scaled=200 k=21,scaled=200 --protein k=21,scaled=200

I'm a bit worried this will get quite hard to parse...
A pro is that we have all the info beforehand, so it is also possible to explore more parallelism. In this example there are 3 distinct sketches being calculated, and they can all the done in parallel.

Another idea: unique-kmers.py from khmer has a passthrough option (-S, --stream-records) for using it as a filter in the middle of a pipeline. If there was a flag like that for sourmash compute it would also be possible to build many sketches by varying parameters:

sourmash compute --dayhoff -k 19,21 --scaled=200 -S -o dayhoff.sig | \
sourmash compute --protein -k=21 --scaled=200 -S -o protein.sig | \
sourmash compute --dna -k=21 --scaled=1000 -o dna.sig

In this case parallelism is achieved by piping more processes. Drawback is that it is longer and more repetitive, and depends on shell piping.

@ctb
Copy link
Contributor Author

ctb commented Jun 23, 2020

hmm. I really like having this diversity of options, and I kinda like all of them, together with a 'cat' style command for sourmash sig to concatenate individual sig files.

re

I'm a bit worried this will get quite hard to parse...

I'm more concerned about the UX! I want to make simple things simple, and hard things possible...

@ctb
Copy link
Contributor Author

ctb commented Jul 3, 2020

musing out loud about serialization - if we change the meaning of ksize in protein signatures, how do we deal with loading existing signature files (where the ksize is 3x what it will be)? our options are to change molecule type (to protein2, or something - ugly!), or to change the sourmash signature version, I think. Changing the version seems better to me. That would also allow us to upgrade signatures with sig cat, etc. etc.

separate thought - if we have two different commands, dnasketch and aasketch, where does the translate-this-dna-sequence command go? in dnasketch because we're taking in DNA sequences, or in aasketch because we're producing protein signatures?

@olgabot
Copy link
Collaborator

olgabot commented Jul 6, 2020

To be consistent with previous versions, can the ksize variable be renamed, e.g. protein_ksize or pksize ? Where protein_ksize = ksize/3?

@ctb
Copy link
Contributor Author

ctb commented Jul 7, 2020

To be consistent with previous versions, can the ksize variable be renamed, e.g. protein_ksize or pksize ? Where protein_ksize = ksize/3?

hot takes --

  • sounds like a lot of confusion, special casing, and support burden to add to sourmash internals!
  • I think only you and tessa are using protein k-mers...?
  • so IMO it's much, much easier to keep backwards compatibility in the signature loading code, and move forward with changing the meaning of ksize for protein/dayhoff/hp in sourmash 4.0.

@ctb
Copy link
Contributor Author

ctb commented Aug 10, 2020

@olgabot @bluegenes @luizirber ok, in #1159 I've added (what appears to be) a fully functional sourmash sketch set of commands.

sourmash sketch dna|rna <sketch params> [ files ]

sourmash sketch aa|prot|protein <sketch params> [ files]

sourmash sketch translate <sketch params> [ files ]

See the description there for (lots) more details! I think it's a much less confusing user experience, but I welcome your feedback, pro or con!

Note that I did implement better ksize handling (for protein and translate, it's now the "true" amino acid k-mer size) but I didn't do anything magic with ksize in the internals, so if you construct a signature with sketch protein -k 11 you end up with a signature file that has a ksize of 33. That can be fixed later - my main question is whether this set of changes makes sense.

@ctb
Copy link
Contributor Author

ctb commented Aug 14, 2020

here are some notes I took while working through #1159.

Changes to signature computation

Suggested changes to signature computation:

Signature JSON format changes

What signature format changes should we do in tandem? see #268 for rollup issue

Changes to MinHash API

general issue here, #338
more here, #720
and more here, #885, although that is mostly about docs and tests now.

@ctb
Copy link
Contributor Author

ctb commented Aug 14, 2020

also a question to ask re #1159, how easy/hard does this make snakemake recipes?

@bluegenes
Copy link
Contributor

I've been using the new sketching a fair bit now - going well so far.

Re: snakemake: If you're generating a single type of signature, it's not hard. It gets more complex if you want to enable flexibility for sketching with dna, translate, or protein (just requires multiple rules or a function to generate the params). Generating a single sketch file can save on downstream snakemake complexity, so when I want both nucleotide and translated signatures, I've taken to sketching each and then using sig cat to combine them into a single signature.

On ksizes: The different ksizes for sketching vs internal params is a bit frustrating (still need to pass k*3 when selecting sigs) - consider this my vote in for changing that for 4.0. Thoughts on ksize, @olgabot @ctb ? Or should I move this discussion to #1169 ?

@ctb
Copy link
Contributor Author

ctb commented Aug 31, 2020

I've been using the new sketching a fair bit now - going well so far.

great!

Re: snakemake: If you're generating a single type of signature, it's not hard. It gets more complex if you want to enable flexibility for sketching with dna, translate, or protein (just requires multiple rules or a function to generate the params). Generating a single sketch file can save on downstream snakemake complexity, so when I want both nucleotide and translated signatures, I've taken to sketching each and then using sig cat to combine them into a single signature.

hmm, interesting. I wonder if there's a way to make this easy in a single go... will think on't.

On ksizes: The different ksizes for sketching vs internal params is a bit frustrating (still need to pass k*3 when selecting sigs) - consider this my vote in for changing that for 4.0. Thoughts on ksize, @olgabot @ctb ? Or should I move this discussion to #1169 ?

just thinking out loud - it'd be easy to fix the CLI and the Python API, I think, without changing the saving/loading. Maybe I'll try doing that as a first step.

@ctb ctb added the revisit_me An issue that needs attention and clarification label Mar 4, 2021
@ctb
Copy link
Contributor Author

ctb commented Mar 4, 2021

The CLI and Python API were updated to use new protein ksizes in #1315.

This issue needs a serious revisit to glean the remaining ideas, perhaps to be put into new issues!

@ctb
Copy link
Contributor Author

ctb commented May 15, 2021

replaced by #1525

@ctb ctb closed this as completed May 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
revisit_me An issue that needs attention and clarification
Projects
None yet
Development

No branches or pull requests

4 participants