Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Dayhoff encoding of amino acids #689

Merged
merged 53 commits into from
Aug 26, 2019
Merged
Show file tree
Hide file tree
Changes from 52 commits
Commits
Show all changes
53 commits
Select commit Hold shift + click to select a range
df1d637
Initial commit of dayhoff encoding
olgabot May 19, 2019
602a395
Pad end of codon with ambiguous nucleotides
olgabot May 20, 2019
dc379fa
Add ambiguous bases
olgabot May 20, 2019
482f4d2
Add dayhoff encoding
olgabot Jun 29, 2019
9885571
Add more checks of dayhoff type
olgabot Jun 29, 2019
1ca5bfe
hopefully fix tests to get the Cpp to build
olgabot Jun 30, 2019
4c992e6
--no-protein --> --no-dayhoff
olgabot Jul 1, 2019
fc28322
set dayhoff to default to false
olgabot Jul 1, 2019
c4a7a08
Add dayhoff to construct_moltype_args
olgabot Jul 1, 2019
0bbaa67
track_abundance is now 6th parameter instead of 5th
olgabot Jul 1, 2019
2544000
"scaled" is now 7th parameter
olgabot Jul 1, 2019
d1e7328
use 'dayhoff' as molecule name
olgabot Jul 1, 2019
119eed8
Actually compute a dayhoff signature
olgabot Jul 1, 2019
5668449
Add simple test for computing dayhoff
olgabot Jul 1, 2019
7867b26
Set dayhoff as molecule type when writing
olgabot Jul 1, 2019
588f5d1
don't compute dna signature for simplicity
olgabot Jul 1, 2019
7d714df
Check if dayhoff is set if protein is set
olgabot Jul 1, 2019
dbaf9a3
Test for dna + dayhoff
olgabot Jul 1, 2019
6e8b370
Fix logic of obtaining molecule type
olgabot Jul 1, 2019
189ac42
Add test for molecule type
olgabot Jul 1, 2019
559b6a4
Add dayhoff test for `sourmash sig describe`
olgabot Jul 4, 2019
2cd1fcc
Merge branch 'master' into olgabot/dayhoff
olgabot Jul 7, 2019
c485ce0
fix segfault in linux
luizirber Jul 7, 2019
d4ab126
Merge pull request #3 from luizirber/olgabot/dayhoff
olgabot Jul 9, 2019
61136df
Add args.dayhoff to add_sequence
olgabot Jul 9, 2019
6ea480d
Use class's value of dayhoff
olgabot Jul 9, 2019
c38eb64
Try to add dayhoff option for when input is protein
olgabot Jul 9, 2019
e5e6aa9
Don't need to pass dayhoff=True to add_sequence
olgabot Jul 9, 2019
3f8b89a
Add dayhoff fixture for testing
olgabot Jul 9, 2019
44e9d33
Add low level tests of dayhoff hashes
olgabot Jul 9, 2019
463c9ff
Both signatures need to be dayhoff to be comparable
olgabot Jul 9, 2019
bb90f77
Use class's value of dayhoff variable to drive logic
olgabot Jul 9, 2019
0d82ce2
Don't pass dayhoff to add_sequence
olgabot Jul 9, 2019
c0ff22b
Add nonworking test to make sure dayhoff and protein hashes are distinct
olgabot Jul 9, 2019
5b4f16b
Add public method of converting amino acid letter to dayhoff
olgabot Jul 9, 2019
1bcd1ae
Add 'for' to list comprehension
olgabot Jul 9, 2019
3ded344
Remove dayhoff boolean from add_sequence
olgabot Jul 9, 2019
7de2b42
aa_to_dayhoff as const?!?
olgabot Jul 9, 2019
e2bc460
Maybe need to instantiate the return string first??
olgabot Jul 9, 2019
3f6403a
Don't do list comprehension for now for debugging purposes
olgabot Jul 9, 2019
ca38f47
Now KmerMinHash actually has aa_to_dayhoff method
olgabot Jul 11, 2019
e338dc5
Convert python string to bytes to input to aa_to_dayhoff
olgabot Jul 15, 2019
1dc49fa
aa_to_dayhoff returns type 'string'
olgabot Jul 15, 2019
c3e5004
Allow for int as input to to_bytes and convert to bytes properly
olgabot Jul 15, 2019
89e0cf3
Add option to run only tests using a particular fixture
olgabot Jul 15, 2019
aae410b
Fix test output for 'sig describe' so dayhoff md5 != protein md5
olgabot Jul 15, 2019
4c07abb
Add json_iter test for molecule types
olgabot Jul 15, 2019
2d1681b
Add check for dayhoff signature output
olgabot Jul 15, 2019
15d54ab
Reorganized amino acid translation and dayhoff encoding
olgabot Jul 23, 2019
34f81db
check for message in stderr, not stdout
olgabot Jul 23, 2019
a5626a4
Expose translate_codon method for testing
olgabot Jul 24, 2019
4db8a1f
Merge branch 'master' into olgabot/dayhoff
luizirber Aug 24, 2019
438593e
Merge branch 'master' into olgabot/dayhoff
luizirber Aug 26, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 7 additions & 2 deletions sourmash/_minhash.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -25,29 +25,34 @@ cdef extern from "kmer_min_hash.hh":
const unsigned int num;
const unsigned int ksize;
const bool is_protein;
const bool dayhoff;
const HashIntoType max_hash;
CMinHashType mins;

KmerMinHash(unsigned int, unsigned int, bool, uint32_t, HashIntoType)
KmerMinHash(unsigned int, unsigned int, bool, bool, uint32_t, HashIntoType)
void add_hash(HashIntoType) except +ValueError
void remove_hash(HashIntoType) except +ValueError
void add_word(string word) except +ValueError
void add_sequence(const char *, bool) except +ValueError
void merge(const KmerMinHash&) except +ValueError
string aa_to_dayhoff(string aa) except +ValueError
string translate_codon(string codon) except +ValueError
unsigned int count_common(const KmerMinHash&) except +ValueError
unsigned long size()


cdef cppclass KmerMinAbundance(KmerMinHash):
CMinHashType abunds;

KmerMinAbundance(unsigned int, unsigned int, bool, uint32_t, HashIntoType)
KmerMinAbundance(unsigned int, unsigned int, bool, bool, uint32_t, HashIntoType)
void add_hash(HashIntoType) except +ValueError
void remove_hash(HashIntoType) except +ValueError
void add_word(string word) except +ValueError
void add_sequence(const char *, bool) except +ValueError
void merge(const KmerMinAbundance&) except +ValueError
void merge(const KmerMinHash&) except +ValueError
string aa_to_dayhoff(string aa) except +ValueError
string translate_codon(string codon) except +ValueError
unsigned int count_common(const KmerMinAbundance&) except +ValueError
unsigned long size()

Expand Down
63 changes: 49 additions & 14 deletions sourmash/_minhash.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -45,11 +45,15 @@ def get_scaled_for_max_hash(max_hash):


cdef bytes to_bytes(s):
if not isinstance(s, (basestring, bytes)):
# Allow for strings, bytes or int
# Single item of byte string = int
if not isinstance(s, (basestring, bytes, int)):
raise TypeError("Requires a string-like sequence")

if isinstance(s, unicode):
s = s.encode('utf-8')
if isinstance(s, int):
s = bytes([s])
return s


Expand Down Expand Up @@ -88,6 +92,7 @@ cdef class MinHash(object):

def __init__(self, unsigned int n, unsigned int ksize,
bool is_protein=False,
bool dayhoff=False,
bool track_abundance=False,
uint32_t seed=MINHASH_DEFAULT_SEED,
HashIntoType max_hash=0,
Expand All @@ -107,9 +112,9 @@ cdef class MinHash(object):

cdef KmerMinHash *mh = NULL
if track_abundance:
mh = new KmerMinAbundance(n, ksize, is_protein, seed, max_hash)
mh = new KmerMinAbundance(n, ksize, is_protein, dayhoff, seed, max_hash)
else:
mh = new KmerMinHash(n, ksize, is_protein, seed, max_hash)
mh = new KmerMinHash(n, ksize, is_protein, dayhoff, seed, max_hash)

self._this.reset(mh)

Expand All @@ -122,7 +127,8 @@ cdef class MinHash(object):

def __copy__(self):
a = MinHash(deref(self._this).num, deref(self._this).ksize,
deref(self._this).is_protein, self.track_abundance,
deref(self._this).is_protein, deref(self._this).dayhoff,
self.track_abundance,
deref(self._this).seed, deref(self._this).max_hash)
a.merge(self)
return a
Expand All @@ -135,23 +141,24 @@ cdef class MinHash(object):
return (deref(self._this).num,
deref(self._this).ksize,
deref(self._this).is_protein,
deref(self._this).dayhoff,
self.get_mins(with_abundance=with_abundance),
None, self.track_abundance, deref(self._this).max_hash,
deref(self._this).seed)

def __setstate__(self, tup):
(n, ksize, is_protein, mins, _, track_abundance, max_hash, seed) =\
(n, ksize, is_protein, dayhoff, mins, _, track_abundance, max_hash, seed) =\
tup

self.track_abundance = track_abundance

cdef KmerMinHash *mh = NULL
if track_abundance:
mh = new KmerMinAbundance(n, ksize, is_protein, seed, max_hash)
mh = new KmerMinAbundance(n, ksize, is_protein, dayhoff, seed, max_hash)
self._this.reset(mh)
self.set_abundances(mins)
else:
mh = new KmerMinHash(n, ksize, is_protein, seed, max_hash)
mh = new KmerMinHash(n, ksize, is_protein, dayhoff, seed, max_hash)
self._this.reset(mh)
self.add_many(mins)

Expand All @@ -160,6 +167,7 @@ cdef class MinHash(object):
(deref(self._this).num,
deref(self._this).ksize,
deref(self._this).is_protein,
deref(self._this).dayhoff,
self.track_abundance,
deref(self._this).seed,
deref(self._this).max_hash,
Expand All @@ -173,7 +181,8 @@ cdef class MinHash(object):

def copy_and_clear(self):
a = MinHash(deref(self._this).num, deref(self._this).ksize,
deref(self._this).is_protein, self.track_abundance,
deref(self._this).is_protein, deref(self._this).dayhoff,
self.track_abundance,
deref(self._this).seed, deref(self._this).max_hash)
return a

Expand Down Expand Up @@ -234,6 +243,10 @@ cdef class MinHash(object):
def is_protein(self):
return deref(self._this).is_protein

@property
def dayhoff(self):
return deref(self._this).dayhoff

@property
def ksize(self):
return deref(self._this).ksize
Expand All @@ -247,6 +260,9 @@ cdef class MinHash(object):
def add_hash(self, uint64_t h):
deref(self._this).add_hash(h)

def translate_codon(self, codon):
return deref(self._this).translate_codon(to_bytes(codon))

def count_common(self, MinHash other):
return deref(self._this).count_common(deref(other._this))

Expand All @@ -255,7 +271,8 @@ cdef class MinHash(object):
raise ValueError('new sample n is higher than current sample n')

a = MinHash(new_num, deref(self._this).ksize,
deref(self._this).is_protein, self.track_abundance,
deref(self._this).is_protein, deref(self._this).dayhoff,
self.track_abundance,
deref(self._this).seed, 0)
if self.track_abundance:
a.set_abundances(self.get_mins(with_abundance=True))
Expand Down Expand Up @@ -286,7 +303,8 @@ cdef class MinHash(object):
new_max_hash = get_max_hash_for_scaled(new_num)

a = MinHash(0, deref(self._this).ksize,
deref(self._this).is_protein, self.track_abundance,
deref(self._this).is_protein, deref(self._this).dayhoff,
self.track_abundance,
deref(self._this).seed, new_max_hash)
if self.track_abundance:
a.set_abundances(self.get_mins(with_abundance=True))
Expand All @@ -307,13 +325,15 @@ cdef class MinHash(object):
combined_mh = new KmerMinAbundance(num,
deref(self._this).ksize,
deref(self._this).is_protein,
deref(self._this).dayhoff,
deref(self._this).seed,
deref(self._this).max_hash)

else:
combined_mh = new KmerMinHash(num,
deref(self._this).ksize,
deref(self._this).is_protein,
deref(self._this).dayhoff,
deref(self._this).seed,
deref(self._this).max_hash)

Expand Down Expand Up @@ -424,12 +444,27 @@ cdef class MinHash(object):
if not deref(self._this).is_protein:
raise ValueError("cannot add amino acid sequence to DNA MinHash!")

for i in range(0, len(sequence) - ksize + 1):
deref(self._this).add_word(to_bytes(sequence[i:i + ksize]))
aa_kmers = (sequence[i:i + ksize] for i in range(0, len(sequence) - ksize + 1))
if not self.dayhoff:
for aa_kmer in aa_kmers:
deref(self._this).add_word(to_bytes(aa_kmer))
else:
for aa_kmer in aa_kmers:
dayhoff_kmer = ''
for aa in aa_kmer:
dayhoff_letter = deref(self._this).aa_to_dayhoff(to_bytes(aa))
dayhoff_kmer += dayhoff_letter
# dayhoff_kmer = ''.join( for aa in aa_kmer)
deref(self._this).add_word(to_bytes(dayhoff_kmer))

def is_molecule_type(self, molecule):
if molecule.upper() == 'DNA' and not self.is_protein:
return True
if molecule == 'protein' and self.is_protein:
return True
if self.is_protein:
if self.dayhoff:
if molecule == 'dayhoff':
return True
else:
if molecule == 'protein':
return True
return False
30 changes: 27 additions & 3 deletions sourmash/commands.py
Original file line number Diff line number Diff line change
Expand Up @@ -149,14 +149,22 @@ def compute(args):
if args.dna and args.protein:
notify('Computing both nucleotide and protein signatures.')
num_sigs = 2*len(ksizes)
elif args.dna and args.dayhoff:
notify('Computing both nucleotide and Dayhoff-encoded protein '
'signatures.')
num_sigs = 2*len(ksizes)
elif args.dna:
notify('Computing only nucleotide (and not protein) signatures.')
num_sigs = len(ksizes)
elif args.protein:
notify('Computing only protein (and not nucleotide) signatures.')
num_sigs = len(ksizes)
elif args.dayhoff:
notify('Computing only Dayhoff-encoded protein (and not nucleotide) '
'signatures.')
num_sigs = len(ksizes)

if args.protein and not args.input_is_protein:
if (args.protein or args.dayhoff) and not args.input_is_protein:
bad_ksizes = [ str(k) for k in ksizes if k % 3 != 0 ]
if bad_ksizes:
error('protein ksizes must be divisible by 3, sorry!')
Expand All @@ -182,13 +190,23 @@ def make_minhashes():
if args.protein:
E = MinHash(ksize=k, n=args.num_hashes,
is_protein=True,
dayhoff=False,
track_abundance=args.track_abundance,
scaled=args.scaled,
seed=seed)
Elist.append(E)
if args.dayhoff:
E = MinHash(ksize=k, n=args.num_hashes,
is_protein=True,
dayhoff=True,
track_abundance=args.track_abundance,
scaled=args.scaled,
seed=seed)
Elist.append(E)
if args.dna:
E = MinHash(ksize=k, n=args.num_hashes,
is_protein=False,
dayhoff=False,
track_abundance=args.track_abundance,
scaled=args.scaled,
seed=seed)
Expand Down Expand Up @@ -1310,9 +1328,15 @@ def watch(args):
if args.dna:
moltype = 'DNA'
is_protein = False
else:
dayhoff = False
elif args.protein:
moltype = 'protein'
is_protein = True
dayhoff = False
else:
moltype = 'dayhoff'
is_protein = True
dayhoff = True

tree = load_sbt_index(args.sbt_name)

Expand All @@ -1323,7 +1347,7 @@ def watch(args):
tree_mh = leaf.data.minhash
ksize = tree_mh.ksize

E = MinHash(ksize=ksize, n=args.num_hashes, is_protein=is_protein)
E = MinHash(ksize=ksize, n=args.num_hashes, is_protein=is_protein, dayhoff=dayhoff)
streamsig = sig.SourmashSignature(E, filename='stdin', name=args.name)

notify('Computing signature for k={}, {} from stdin', ksize, moltype)
Expand Down
Loading