Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Dayhoff encoding of amino acids #689

Merged
merged 53 commits into from
Aug 26, 2019
Merged

Add Dayhoff encoding of amino acids #689

merged 53 commits into from
Aug 26, 2019

Conversation

olgabot
Copy link
Collaborator

@olgabot olgabot commented Jun 29, 2019

The Dayhoff encoding of amino acids was created by Margaret Oakley Dayhoff, arguably the first bioinformatician, and uses a lossy encoding of amino acid sequences. I'm interested in adding this feature to compare sequences across large evolutionary distances without penalizing minor residue changes.

Amino acid Property Dayhoff
C Sulfur polymerization a
A, G, P, S, T Small b
D, E, N, Q Acid and amide c
H, K, R Basic d
I, L, M, V Hydrophobic e
F, W, Y Aromatic f

Dayhoff table from
Peris, P., López, D., & Campos, M. (2008).
IgTM: An algorithm to predict transmembrane domains and topology in
proteins. BMC Bioinformatics, 9(1), 1029–11.
http://doi.org/10.1186/1471-2105-9-367

Original source:
Dayhoff M. O., Schwartz R. M., Orcutt B. C. (1978).
A model of evolutionary change in proteins,
in Atlas of Protein Sequence and Structure,
ed Dayhoff M. O., editor.
(Washington, DC: National Biomedical Research Foundation; ), 345–352.

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

Yes, I'll remove the irrelevant 10x barcode commits.. -- rebased and force-pushed

@ctb
Copy link
Contributor

ctb commented Jun 29, 2019 via email

@olgabot
Copy link
Collaborator Author

olgabot commented Jun 29, 2019

This PR is proving complicated because it adds a new molecule type which touches basically everything...

@olgabot
Copy link
Collaborator Author

olgabot commented Jul 1, 2019

Tests are passing on my machine (details below)

python -m pytest
=========================================================== test session starts ============================================================
platform darwin -- Python 3.6.4, pytest-4.4.1, py-1.8.0, pluggy-0.9.0
rootdir: /Users/olgabot/code/sourmash, inifile: pytest.ini
plugins: cov-2.7.1
collected 486 items


Finished saving nodes, now saving SBT json file.
.                                                                                                                 [  0%]
sourmash/logging.py .......                                                                                                          [  1%]
tests/test__minhash.py ............................................................................................................. [ 24%]
..................                                                                                                                   [ 27%]
tests/test_api.py .                                                                                                                  [ 27%]
tests/test_cmd_signature.py ...........................................                                                              [ 36%]
tests/test_deprecated.py ..                                                                                                          [ 37%]
tests/test_jaccard.py .......................                                                                                        [ 41%]
tests/test_lca.py ...................................................                                                                [ 52%]
tests/test_sbt.py ..................ss.....                                                                                          [ 57%]
tests/test_signature.py ..............................................                                                               [ 67%]
tests/test_signature_json.py .....                                                                                                   [ 68%]
tests/test_sourmash.py ............................................................................................................. [ 90%]
..............................................                                                                                       [100%]

================================================== 484 passed, 2 skipped in 28.96 seconds ==================================================

On Travis, I'm seeing a Segmentation fault in the function test_do_sourmash_compute_protein_bad_sequences:

python -m pytest --cov=.
============================= test session starts ==============================
platform linux -- Python 3.7.1, pytest-5.0.0, py-1.8.0, pluggy-0.12.0
cachedir: .tox/py37/.pytest_cache
rootdir: /home/travis/build/dib-lab/sourmash, inifile: pytest.ini
plugins: cov-2.7.1
collected 486 items                                                            
doc/api-example.md .                                                     [  0%]
sourmash/logging.py .......                                              [  1%]
tests/test__minhash.py .........................FF..FF.................. [ 11%]
........................................................................ [ 26%]
......                                                                   [ 27%]
tests/test_api.py .                                                      [ 27%]
tests/test_cmd_signature.py ...........................................  [ 36%]
tests/test_deprecated.py ..                                              [ 37%]
tests/test_jaccard.py .......................                            [ 41%]
tests/test_lca.py .......F...........................................    [ 52%]
tests/test_sbt.py ..................xx.....                              [ 57%]
tests/test_signature.py ..............................................   [ 67%]
tests/test_signature_json.py .....                                       [ 68%]
tests/test_sourmash.py ...................................Fatal Python error: Segmentation fault
Thread 0x00007f97f27e5700 (most recent call first):
  File "/home/travis/build/dib-lab/sourmash/.tox/py37/lib/python3.7/site-packages/multiprocess/connection.py", line 382 in _recv
  File "/home/travis/build/dib-lab/sourmash/.tox/py37/lib/python3.7/site-packages/multiprocess/connection.py", line 410 in _recv_bytes
  File "/home/travis/build/dib-lab/sourmash/.tox/py37/lib/python3.7/site-packages/multiprocess/connection.py", line 253 in recv
  File "/home/travis/build/dib-lab/sourmash/.tox/py37/lib/python3.7/site-packages/multiprocess/pool.py", line 470 in _handle_results
  File "/opt/python/3.7.1/lib/python3.7/threading.py", line 865 in run
  File "/opt/python/3.7.1/lib/python3.7/threading.py", line 917 in _bootstrap_inner
  File "/opt/python/3.7.1/lib/python3.7/threading.py", line 885 in _bootstrap

Should I be worried about this?

@luizirber
Copy link
Member

@olgabot I'm also seeing this segfault under Linux, I think there is something awry in the _dna_to_aa function. I'll take a closer look.

(BTW, awesome PR!)

@olgabot
Copy link
Collaborator Author

olgabot commented Jul 7, 2019

@olgabot I'm also seeing this segfault under Linux, I think there is something awry in the _dna_to_aa function. I'll take a closer look.

The logic of getting an amino acid reside had to get rewritten with the addition of dayhoff and is definitely not something I'm super attached to. So feel free to rewrite it in a way that makes more sense. I focused on just getting it working...

(BTW, awesome PR!)

aww thanks! 😊

@luizirber
Copy link
Member

luizirber commented Jul 7, 2019

@olgabot I sent a PR to your branch, https://github.com/czbiohub/sourmash/pull/3

It keeps the logic from the previous translation unchanged, and avoid adding the xxN codons to the original table. Dayhoff encoding should stay the same (but can you check again the change from residue to codon?)

UPDATE: I'm wrong, sorry. you need to check the residue, oops. I'll fix it.
UPDATE 2: Fixed.

@olgabot
Copy link
Collaborator Author

olgabot commented Jul 9, 2019

Cool now I'm getting a fun error because I added dayhoff conversion to add_protein. I thought that aa_to_dayhoff is in the right place in kmer_min_hash.hh but apparently it's not getting detected? Not 100% sure what's happening here

(sourmash-py364)
 ✘  Tue  9 Jul - 08:39  ~/code/sourmash   origin ☊ olgabot/dayhoff ↑1 12☀ 3● 
  make install
python setup.py build_ext -i
running build_ext
cythoning sourmash/_minhash.pyx to sourmash/_minhash.cpp

Error compiling Cython file:
------------------------------------------------------------
...
                deref(self._this).add_word(to_bytes(aa_kmer))
        else:
            for aa_kmer in aa_kmers:
                dayhoff_kmer = ''
                for aa in aa_kmer:
                    dayhoff_letter = deref(self._this).aa_to_dayhoff(aa)
                                                     ^
------------------------------------------------------------

sourmash/_minhash.pyx:448:54: Object of type 'KmerMinHash &' has no attribute 'aa_to_dayhoff'

Error compiling Cython file:
------------------------------------------------------------
...
                deref(self._this).add_word(to_bytes(aa_kmer))
        else:
            for aa_kmer in aa_kmers:
                dayhoff_kmer = ''
                for aa in aa_kmer:
                    dayhoff_letter = deref(self._this).aa_to_dayhoff(aa)
                                                     ^
------------------------------------------------------------

sourmash/_minhash.pyx:448:54: Object of type 'KmerMinHash &' has no attribute 'aa_to_dayhoff'

Error compiling Cython file:
------------------------------------------------------------
...
                deref(self._this).add_word(to_bytes(aa_kmer))
        else:
            for aa_kmer in aa_kmers:
                dayhoff_kmer = ''
                for aa in aa_kmer:
                    dayhoff_letter = deref(self._this).aa_to_dayhoff(aa)
                                                     ^
------------------------------------------------------------

sourmash/_minhash.pyx:448:54: Object of type 'KmerMinHash &' has no attribute 'aa_to_dayhoff'
building 'sourmash._minhash' extension
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/anaconda3/envs/sourmash-py364/include -I/anaconda3/envs/sourmash-py364/include -I./sourmash -I./third-party/smhasher/ -I/anaconda3/envs/sourmash-py364/include/python3.6m -c sourmash/_minhash.cpp -o build/temp.macosx-10.9-x86_64-3.6/sourmash/_minhash.o -std=c++11 -pedantic -arch x86_64 -mmacosx-version-min=10.7 -stdlib=libc++
sourmash/_minhash.cpp:1:2: error: Do not use this file, it is the result of a failed Cython compilation.
#error Do not use this file, it is the result of a failed Cython compilation.
 ^
1 error generated.
error: command 'gcc' failed with exit status 1
make: *** [all] Error 1

@olgabot
Copy link
Collaborator Author

olgabot commented Jul 15, 2019

Whoo tests passing on my machine! aka it's not really working yet because it only works on one computer :)

FYI it's now possible to run a subset of tests that use a particular fixture using --usesfixture. This was cribbed from here

Mon 15 Jul - 05:58  ~/code/sourmash   origin ☊ olgabot/dayhoff ✔ 12☀ 
  pytest --usesfixture dayhoff -v .
============================================================================================ test session starts =============================================================================================
platform darwin -- Python 3.6.4, pytest-4.4.1, py-1.8.0, pluggy-0.9.0 -- /anaconda3/envs/sourmash-py364/bin/python
cachedir: .pytest_cache
rootdir: /Users/olgabot/code/sourmash, inifile: pytest.ini
plugins: cython-0.1.0, cov-2.7.1
collected 493 items / 485 deselected / 8 selected

tests/test__minhash.py::test_bytes_protein[True-True] PASSED                                                                                                                                           [ 12%]
tests/test__minhash.py::test_bytes_protein[True-False] PASSED                                                                                                                                          [ 25%]
tests/test__minhash.py::test_bytes_protein[False-True] PASSED                                                                                                                                          [ 37%]
tests/test__minhash.py::test_bytes_protein[False-False] PASSED                                                                                                                                         [ 50%]
tests/test__minhash.py::test_protein[True-True] PASSED                                                                                                                                                 [ 62%]
tests/test__minhash.py::test_protein[True-False] PASSED                                                                                                                                                [ 75%]
tests/test__minhash.py::test_protein[False-True] PASSED                                                                                                                                                [ 87%]
tests/test__minhash.py::test_protein[False-False] PASSED                                                                                                                                               [100%]

================================================================================== 8 passed, 485 deselected in 1.95 seconds ==================================================================================

@olgabot
Copy link
Collaborator Author

olgabot commented Jul 15, 2019

PR is failing checks because the coverage isn't sufficient. I've hunted through what I can to add coverage. If it's not enough, could you let me know which additional functions/methods/classes I should add tests for?

@olgabot
Copy link
Collaborator Author

olgabot commented Jul 15, 2019

Arggghh I'm getting an error in the docker image which uses Python 3.7:

ERROR ~ Error executing process > 'sourmash_compute_sketch (SRR4050380_molecule-dayhoff_ksize-9_log2sketchsize-2)'

Caused by:
  Process `sourmash_compute_sketch (SRR4050380_molecule-dayhoff_ksize-9_log2sketchsize-2)` terminated with an error exit status (1)

Command executed:

  sourmash compute       --num-hashes $((2**2))       --ksizes 9       --no-dna       --dayhoff       --output SRR4050380_molecule-dayhoff_ksize-9_log2sketchsize-2.sig       --merge 'SRR4050380' SRR4050380_pass_1.fastq.gz SRR4050380_pass_2.fastq.gz

Command exit status:
  1

Command output:
  (empty)

Command error:

  == This is sourmash version 2.0.0a10.dev97+g4c07abb. ==

  == Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==


  computing signatures for files: SRR4050380_pass_1.fastq.gz, SRR4050380_pass_2.fastq.gz

  Computing signature for ksizes: [9]

  Computing only Dayhoff-encoded protein (and not nucleotide) signatures.

  Computing a total of 1 signature(s).

  ... reading sequences from SRR4050380_pass_1.fastq.gz
  Traceback (most recent call last):
    File "/opt/conda/bin/sourmash", line 11, in <module>
      load_entry_point('sourmash==2.0.0a10.dev97+g4c07abb', 'console_scripts', 'sourmash')()
    File "/opt/conda/lib/python3.7/site-packages/sourmash-2.0.0a10.dev97+g4c07abb-py3.7-linux-x86_64.egg/sourmash/__main__.py", line 83, in main
      cmd(sys.argv[2:])
    File "/opt/conda/lib/python3.7/site-packages/sourmash-2.0.0a10.dev97+g4c07abb-py3.7-linux-x86_64.egg/sourmash/commands.py", line 359, in compute
      args.input_is_protein, args.check_sequence)
    File "/opt/conda/lib/python3.7/site-packages/sourmash-2.0.0a10.dev97+g4c07abb-py3.7-linux-x86_64.egg/sourmash/commands.py", line 221, in add_seq
      E.add_sequence(seq, not check_sequence)
    File "sourmash/_minhash.pyx", line 190, in sourmash._minhash.MinHash.add_sequence
  ValueError: std::bad_alloc

Work dir:
  /Users/olgabot/code/nf-kmer-similarity/work/95/9f89181ad7c59788f9afca79a9d487

Tip: when you have fixed the problem you can continue the execution appending to the nextflow command line the option `-resume`

But it runs just fine using the latest build, but on Python 3.6.4:

 Mon 15 Jul - 09:31  ~/code/nf-kmer-similarity/work/95/9f89181ad7c59788f9afca79a9d487   origin ☊ olgabot/dayhoff ↑6 ✔ 12☀ 
  sourmash compute       --num-hashes $((2**2))       --ksizes 9       --no-dna       --dayhoff       --output SRR4050380_molecule-dayhoff_ksize-9_log2sketchsize-2.sig       --merge 'SRR4050380' SRR4050380_pass_1.fastq.gz SRR4050380_pass_2.fastq.gz
== This is sourmash version 2.0.2.dev43+gca38f47.d20190715. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

computing signatures for files: SRR4050380_pass_1.fastq.gz, SRR4050380_pass_2.fastq.gz
Computing signature for ksizes: [9]
Computing only Dayhoff-encoded protein (and not nucleotide) signatures.
Computing a total of 1 signature(s).
... reading sequences from SRR4050380_pass_1.fastq.gz
... SRR4050380_pass_1.fastq.gz 25 sequences
... reading sequences from SRR4050380_pass_2.fastq.gz
... SRR4050380_pass_2.fastq.gz 25 sequences
calculated 1 signatures for 50 sequences taken from 2 files
saved 1 signature(s). Note: signature license is CC0.

Yet somehow the Travis Build on 3.7 runs just fine?

Is there anything clearly wrong in the Dockerfile ?

@olgabot
Copy link
Collaborator Author

olgabot commented Jul 24, 2019

Yay my docker image runs now!! There was some weirdness happening in codon translation so it's now its own method that gets tested separately for sanity.

Currently the build is failing due to integration and wheel building:

Screen Shot 2019-07-24 at 6 31 35 AM

@luizirber luizirber merged commit 5cbcd06 into sourmash-bio:master Aug 26, 2019
@luizirber
Copy link
Member

Thanks @olgabot, this is amazing! I'll cut a new release with all the merged PRs from this weekend.

@olgabot
Copy link
Collaborator Author

olgabot commented Sep 2, 2019 via email

@olgabot olgabot mentioned this pull request Oct 22, 2019
@luizirber luizirber mentioned this pull request Oct 31, 2019
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants