Skip to content
This repository has been archived by the owner on Mar 17, 2023. It is now read-only.

Partition reads to coding, noncoding and low complexity #8

Merged
merged 208 commits into from
Dec 5, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
208 commits
Select commit Hold shift + click to select a range
7a4ae09
Start writing read partitioner and tests
olgabot Oct 11, 2019
f046e3f
Keep writing out read partitioner
olgabot Oct 11, 2019
f48a1fe
Fix string formattingˆ
olgabot Oct 11, 2019
b1f4a6e
Remove polo from requirements
olgabot Oct 11, 2019
2f649eb
Fix more string formatting
olgabot Oct 11, 2019
cdb73d6
Adjust indentation
olgabot Oct 11, 2019
0445295
more indentation adjustments
olgabot Oct 11, 2019
e301200
Fix more indents
olgabot Oct 11, 2019
63ee111
No continue
olgabot Oct 11, 2019
4c0c10f
Fix peptide bloom filter creation
olgabot Oct 11, 2019
370e373
description --> name for screed
olgabot Oct 11, 2019
d5a6169
Add tqdm for iterating over readcs
olgabot Oct 11, 2019
6ffb4fc
Use exracted description
olgabot Oct 11, 2019
fda0ab2
Missed another low complexity line
olgabot Oct 11, 2019
eb75359
Reorder low complexity check to be before conversion to biopython seq…
olgabot Oct 11, 2019
f69c41f
Add n_kmers column
olgabot Oct 11, 2019
5bbf52b
Missing comma :facepalm:
olgabot Oct 11, 2019
bcb40cc
Add scoring for single sequence
olgabot Oct 11, 2019
2693754
Fix string formatting
olgabot Oct 11, 2019
c928403
Use fraction_in_peptide_db instead of 'score'
olgabot Oct 11, 2019
d1efc3e
Add default values
olgabot Oct 11, 2019
7563b3f
Actually update the max_fraction_in_peptide_db
olgabot Oct 11, 2019
4816051
Return the fraction in peptide db
olgabot Oct 11, 2019
c6640fa
Add more verbose pritning for kmers in peptide db
olgabot Oct 11, 2019
bfafe72
Reorder comment
olgabot Oct 11, 2019
380f35d
Print keys (kmers) only
olgabot Oct 11, 2019
5d99f5f
Add comment
olgabot Oct 11, 2019
c02e97a
Missing paren
olgabot Oct 11, 2019
91777ba
Separate out kmer string making
olgabot Oct 11, 2019
6a1f80a
Print the raw kmer to hash dictionary
olgabot Oct 11, 2019
2710f7c
Add more verbose printing
olgabot Oct 11, 2019
86d8ace
New translation
olgabot Oct 11, 2019
54c5124
Return the actual ppetide sequence
olgabot Oct 11, 2019
822f278
Fix some printing statements
olgabot Oct 11, 2019
f4e493c
Remove print statements
olgabot Oct 11, 2019
7c91683
Fix some string formattin
olgabot Oct 11, 2019
e9c4e4f
Make separate encode_peptide function and test
olgabot Oct 11, 2019
4bb838b
no kmer_string because it's a tuple
olgabot Oct 11, 2019
4d44f42
Start writing outputting the peptide seq
olgabot Oct 12, 2019
a7c5341
No trailing comma
olgabot Oct 12, 2019
160243a
Finish if statement
olgabot Oct 12, 2019
7443753
Separate out kmer entropy check
olgabot Oct 12, 2019
fff01cd
Fix logic order
olgabot Oct 12, 2019
11374d5
Add more verbosity levels
olgabot Oct 12, 2019
8f67ebb
Add tab to the protein sequence
olgabot Oct 12, 2019
92f301d
Move low complexity checking
olgabot Oct 12, 2019
f7058d9
Add warning to ignore biopython warning
olgabot Oct 12, 2019
d902191
warnings.catch_warnings is the one you use 'with' with
olgabot Oct 12, 2019
36689aa
Add prefix option to score_reads
olgabot Oct 12, 2019
9675c17
Pass description to score_single_sequence
olgabot Oct 12, 2019
39d1475
Pass jaccard_threshold to score_single_sequence
olgabot Oct 12, 2019
ba8ef36
Translation needs to be at least peptide_ksize+1
olgabot Oct 12, 2019
9ead2d4
Add separate function to open and announce it to the user
olgabot Oct 12, 2019
9d4acf8
Add option to provide csv
olgabot Oct 12, 2019
8494fe3
options --> option
olgabot Oct 12, 2019
282f4f3
Write newline AFTER first entry
olgabot Oct 12, 2019
92547e7
Properly close file handles
olgabot Oct 12, 2019
d67038d
String formatting!!
olgabot Oct 12, 2019
8e4f976
Actually write the noncoding reads to the noncoding file
olgabot Oct 12, 2019
94f338e
Maybe start writing tests
olgabot Oct 12, 2019
19438f6
Use 'with' for screed.open to close file handle
olgabot Oct 13, 2019
1129dd5
Add option to input already-created bloom filter
olgabot Oct 13, 2019
05199cc
Add 1 to sign so it is obvious which coding frame was used
olgabot Oct 13, 2019
369ee45
Check if actual file handle is None
olgabot Oct 13, 2019
dc9f8d6
Docstring fixes
olgabot Oct 13, 2019
8365f50
Start writing tests for bloom filters
olgabot Oct 14, 2019
a7a0f9d
Swap logic in maybe_make_peptide_bloom_filter
olgabot Oct 14, 2019
cfd827d
translation_frame --> frame
olgabot Oct 14, 2019
08ed466
Fix logic of saving peptide bloom filter
olgabot Oct 14, 2019
4e078d4
Add __name__ == "__main__" to run single file as command lineˆ
olgabot Oct 14, 2019
d8896e6
Add arguments
olgabot Oct 14, 2019
5bba5b1
Update naming for test_make_peptide_bloom_filter
olgabot Oct 14, 2019
d508350
start writing tests for partitioning reads
olgabot Oct 14, 2019
b8b1fba
Add dot to separate extension
olgabot Oct 14, 2019
4ce7d13
Add some usage docs!
olgabot Oct 14, 2019
7409be6
Get read scoring to work!!
olgabot Oct 14, 2019
dddbb9a
Infinite number of input read files, so put them as last
olgabot Oct 15, 2019
708600f
Specify peptides as single argument
olgabot Oct 19, 2019
a15056f
Ignore largeish nodegraph files
olgabot Oct 19, 2019
4e3a242
Whitespace
olgabot Oct 19, 2019
f335e5d
Add separate test data as fixtures for test_score_reads
olgabot Oct 20, 2019
f2a6373
Force translation to string for re-encoding peptide
olgabot Oct 20, 2019
92408b8
Don't test for exact value of bloom filter for now
olgabot Oct 20, 2019
c61ad1b
Add biopython to requirements
olgabot Oct 20, 2019
dbc3d1f
Add default jaccard threshold, change all print statements to click.e…
olgabot Oct 22, 2019
2c656bf
Omgee went crazy with extracting and modularizaing
olgabot Oct 22, 2019
367017c
Capture stderr for score_reads test
olgabot Oct 22, 2019
3b7366b
Parameterize molecule and peptide ksize to be paired
olgabot Oct 23, 2019
05392ea
Refactor getting of peptide ksize to bloom filter
olgabot Oct 23, 2019
a31a37f
Add data files and can't get pytest.mark.xfail to work
olgabot Oct 23, 2019
6a9342b
Actually use the true_scores fixture
olgabot Oct 23, 2019
9be85d2
peptide_graph --> peptide_bloom_filter
olgabot Oct 23, 2019
8674b69
Add fixture for peptide bloom filter because the object is pretty large
olgabot Oct 23, 2019
d32a495
Use a smaller table size for testing
olgabot Oct 23, 2019
b6d4643
Add peptide fasta subset and bloom filters
olgabot Oct 23, 2019
4c73111
Actually, use the larger bloom filter so the tests are actually true
olgabot Oct 23, 2019
a4a75ea
Go back to larger tablesize for bloom filter
olgabot Oct 23, 2019
cff80d1
Move peptide bloom filter fixture to conftest
olgabot Oct 23, 2019
1d3e3ef
Add separate protein coding fasta
olgabot Oct 23, 2019
67a5ec4
Remove pdb call
olgabot Oct 23, 2019
0137712
make separate peptide bloom filter and graph
olgabot Oct 23, 2019
3870f14
Separate out peptide_bloom_filter_path vs peptide_bloom_filter
olgabot Oct 23, 2019
492377e
Add default protein sizes and stuff
olgabot Oct 23, 2019
e7f4c1b
Increase default table size
olgabot Oct 23, 2019
0287b45
No need for "suffix"
olgabot Oct 23, 2019
571a75f
Reformat for readability and pep8
olgabot Oct 23, 2019
65422c5
Skip the ensembl lookup test for now
olgabot Oct 23, 2019
4c668ab
Remove unused import
olgabot Oct 23, 2019
bc786fd
Remove molecule fixture from sequence encodings
olgabot Oct 23, 2019
81fbf4a
rename: partition --> extract_coding
olgabot Oct 23, 2019
54dbbba
Remove all versions from requirements so that Python 3.6 version mayb…
olgabot Oct 23, 2019
196ba1d
Move tests out of main package
olgabot Oct 23, 2019
648322f
Make peptide bloom fixture more forgiving and create the bloom filter…
olgabot Oct 23, 2019
be37660
Refactor to fix n_kmers when all translations have stop codons
olgabot Oct 23, 2019
b6b093e
Update default encoding ksizes to work with tests
olgabot Oct 23, 2019
729555b
Add error message if k-mer is shorter than sequence
olgabot Oct 23, 2019
dc6d2fe
rename "score_single_sequence" --> "score_single_read"
olgabot Oct 23, 2019
cdee037
Add csvs of true coding scores
olgabot Oct 23, 2019
30121de
Add low complexity tests
olgabot Oct 23, 2019
dd2a984
Add adversarial low complexity peptide (but not low-complexity nucleo…
olgabot Oct 23, 2019
372cfeb
Use fastp's style of low complexity sequence assessment
olgabot Oct 23, 2019
01137c3
Return None if ksize is larger than sequence length
olgabot Oct 23, 2019
9ce86a2
Flip sign of complexity threshold
olgabot Oct 23, 2019
fd7ba49
Update k-mer low complexity in case the sequence is shorter than the …
olgabot Oct 23, 2019
045fed1
Finally got read scoring to work!
olgabot Oct 23, 2019
a1872a6
Fix datasets for extract coding
olgabot Oct 23, 2019
e8d8162
Remove k-mer complexity, just do sequential complexity
olgabot Oct 23, 2019
5da265c
Add dayhoff with ksize=11 for tests
olgabot Oct 23, 2019
f075200
Whitespace/readability
olgabot Oct 23, 2019
266a582
Add separate peptide fasta string
olgabot Oct 23, 2019
82f9067
Set jaccard_threshold to None by default, then infer from molecule type
olgabot Oct 23, 2019
7e0faeb
Flake8, lint fixes
olgabot Oct 24, 2019
ae2bc70
Test for csv creation
olgabot Oct 24, 2019
5fd75b9
Add all scoring csvs
olgabot Oct 24, 2019
92022c6
Fix codecov and travis badge names
olgabot Oct 24, 2019
79b5294
Skip test_get_stdout_stderr_from_command if not on mac
olgabot Oct 24, 2019
b767a31
Use with screed open for tqdm iterations
olgabot Oct 24, 2019
5b768fa
Merge branch 'master' into olgabot/partition-coding-reads
olgabot Oct 24, 2019
0f64aa7
import tqdm from tqdm
olgabot Oct 24, 2019
2848d9c
Add nodegraph bloom filters for test data
olgabot Oct 24, 2019
1c537f9
Remove git cruft
olgabot Oct 24, 2019
0c7af66
Lint tests as well
olgabot Oct 24, 2019
0de6819
autopep8, yapf application
olgabot Oct 24, 2019
c572326
lint fixes
olgabot Oct 24, 2019
63d828f
Don't check for tqdm iterations because they get flushed (?)
olgabot Oct 24, 2019
fd39adf
Update badges
olgabot Oct 24, 2019
44f6b82
Remove whitespace between badges
olgabot Oct 24, 2019
28279cf
Add exception for if sequence length is smaller than kmer size for bl…
olgabot Oct 24, 2019
b5d24e1
Add adversarial test for short peptides
olgabot Oct 25, 2019
7b09d27
Clarify reading frames
olgabot Oct 25, 2019
990316a
Add more text to explain different options
olgabot Oct 25, 2019
2f9adde
Add explanation for why low complexity sequences are ignored.
olgabot Oct 25, 2019
36082f2
Pin some potentially problematic versions
olgabot Oct 25, 2019
99acc1f
Give more explanation for fastp's definition of complexity and an exa…
olgabot Oct 25, 2019
dc4fb65
Use k-mer definition of complexity for peptide sequence
olgabot Oct 25, 2019
c6975f0
Specify k-mer complexity for peptides
olgabot Oct 25, 2019
cdf6441
Implement @pranathivemuri's suggestion for creating a boolean variable
olgabot Oct 25, 2019
4f22185
Actually write the noncoding nucleotides
olgabot Oct 25, 2019
89d413e
add docstring to score_single_read
olgabot Oct 25, 2019
e1bc110
Just in case, get jaccard threshold for the molecule type
olgabot Oct 25, 2019
8581537
Alphabetize the amino acids in the polar section of the HP encoding
olgabot Oct 25, 2019
4402c09
Remove redundant if statement
olgabot Oct 25, 2019
d3e8660
Expand docstring for cli
olgabot Oct 25, 2019
f4851c9
Don't write test to path
olgabot Oct 25, 2019
94f263b
Fix k-mer complexity test logic
olgabot Oct 25, 2019
e37e2b5
Output number of k-mers when testing for low complexity peptides
olgabot Oct 25, 2019
619c3fd
Decrease default tablesize
olgabot Oct 28, 2019
876e895
Remove hello.py
olgabot Oct 28, 2019
7ea9f9a
Add tests for outputting all the fastas
olgabot Oct 28, 2019
99fd41b
Add ability to easily specify powers of ten for bloom filter size
olgabot Oct 30, 2019
5757e7d
add test for maybe make bloom filter
olgabot Oct 30, 2019
660e4b8
White space
olgabot Oct 30, 2019
19fd996
Reduce default tablesize to 8
olgabot Oct 30, 2019
beec65a
Fix bloom filter making tests
olgabot Oct 30, 2019
66df131
Add tests for getting default peptide ksize per molecule
olgabot Oct 30, 2019
c117a5f
Fix getting default peptise ksize
olgabot Oct 30, 2019
6019659
Pep8/whitespace
olgabot Oct 30, 2019
5f92f12
Write non-fasta output to stderr
olgabot Oct 31, 2019
50dcd15
Add load_nodegraph function for compatibility with multiple khmer ver…
olgabot Nov 2, 2019
60f72eb
Add message for if provided peptide ksize doesn't match one found in …
olgabot Nov 2, 2019
0780aa8
Let extract_coding figure out what the peptide ksize is without promp…
olgabot Nov 2, 2019
167729e
Specify jaccard as only being a float between 0 and 1
olgabot Nov 3, 2019
2d31a85
Add test for bad jaccard threshold
olgabot Nov 3, 2019
abb378e
Add more text to error message
olgabot Nov 3, 2019
01867b6
Add more tests for bad jaccard values
olgabot Nov 3, 2019
0e1ecce
Write summary of coding/noncoding to file
olgabot Nov 3, 2019
c405b82
Refactor to separate function for writing csv
olgabot Nov 3, 2019
cbe3f4b
Use pandas' DataFrame.describe functionality
olgabot Nov 3, 2019
db0c7a2
Rename: jaccard --> jaccard_in_peptide_db
olgabot Nov 4, 2019
6e291cb
Return value if jaccard is None
olgabot Nov 4, 2019
a733f50
Make writing json summary actually work
olgabot Nov 4, 2019
2a2c4dd
Separate out writing json for all input files and per-file
olgabot Nov 4, 2019
97369ab
Add classification percentages
olgabot Nov 4, 2019
8acacc0
Add false positive rate calculation
olgabot Nov 5, 2019
d83a968
Adjust the outputs of json summary
olgabot Nov 5, 2019
cd0d503
Make sure to get peptide ksize within maybe_make_bloom_filter
olgabot Nov 5, 2019
101d4fc
Simplify json summary and test
olgabot Nov 5, 2019
fc0f322
Add jaccard of sequence to fasta name
olgabot Nov 5, 2019
2093fdf
Don't need all molecule/ksize combinations for these tests
olgabot Nov 5, 2019
26bd46e
Add code for peptide set
olgabot Nov 5, 2019
6e2bcbd
Add make peptide set docstring
olgabot Nov 6, 2019
2a31127
use load_nodegraph in tests, too
olgabot Nov 7, 2019
28047fc
Add jaccard to fasta output
olgabot Nov 7, 2019
7019f67
Don't test for equivalence, just run the make peptide bloom filter
olgabot Nov 7, 2019
d55d977
Lint fixes
olgabot Nov 7, 2019
89bc0f7
Merge branch 'master' into olgabot/partition-coding-reads
olgabot Nov 8, 2019
9322973
Comment out assertion in test_bloom_filter for now
olgabot Nov 8, 2019
bd40b37
No plus sign
olgabot Nov 8, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -276,3 +276,4 @@ dmypy.json

# Pyre type checker
.pyre/
*.nodegraph
91 changes: 76 additions & 15 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,13 +1,8 @@
Kmer-hashing tools
================================

[![image](https://img.shields.io/travis/%7B%7B%20cookiecutter.github_organization%20%7D%7D/%7B%7B%20cookiecutter.repo_name%20%7D%7D.svg)](https://travis-ci.org/%7B%7B%20cookiecutter.github_organization%20%7D%7D/%7B%7B%20cookiecutter.repo_name%20%7D%7D)


[![codecov](https://codecov.io/gh/%7B%7B%20cookiecutter.github_organization%20%7D%7D/%7B%7B%20cookiecutter.repo_name%20%7D%7D/branch/master/graph/badge.svg)](https://codecov.io/gh/%7B%7B%20cookiecutter.github_organization%20%7D%7D/%7B%7B%20cookiecutter.repo_name%20%7D%7D)

[![image](https://img.shields.io/pypi/v/%7B%7B%20cookiecutter.repo_name%20%7D%7D.svg)](https://pypi.python.org/pypi/%7B%7B%20cookiecutter.repo_name%20%7D%7D)

[![image](https://img.shields.io/travis/czbiohub/kh-tools.svg)](https://travis-ci.com/czbiohub/kh-tools)
[![codecov](https://codecov.io/gh/czbiohub/kh-tools/branch/master/graph/badge.svg)](https://codecov.io/gh/czbiohub/kh-tools)

What is khtools?
-------------------------------------
Expand All @@ -23,25 +18,91 @@ Installation
To install this code, clone this github repository and use pip to install

```
git clone <https://github.com/>czbiohub/khtools.git
cd khtools
git clone <https://github.com/>czbiohub/khtools.git
cd khtools

# The "." means "install *this*, the folder where I am now"
pip install .
pip install .
```

Usage
-----

Greet a name multiple times!
### Extract likely protein-coding reads from sequencing data


```
khtools extract_coding peptides.fa.gz *.fastq.gz > coding_peptides.fasta
```

#### Save the "coding scores" to a csv

The "coding score" of each read is calculated by translating each read in six
frames, then is calculatating the
[Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index) between any of the
six translated frames of the read and the peptide database. The final coding
score is the maximum Jaccard index across all reading frames. If you'd like to
see the coding scores for all reads, use the `--csv` flag.

```
$ Kmer-hashing tools hello --name "Rosalind Franklin" --count 10
khtools extract_coding --csv coding_scores.csv peptides.fa.gz *.fastq.gz > coding_peptides.fasta
```


Features
--------
#### Save the coding nucleotides to a fasta

By default, only the coding *peptides* are output. If you'd like to also output
the underlying *nucleotide* sequence, then use the flag `--coding-nucleotide-fasta`

```
khtools extract_coding --coding-nucleotide-fasta coding_nucleotides.fasta peptides.fa.gz *.fastq.gz > coding_peptides.fasta
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not khtools extract_coding peptides.fa.gz *.fastq.gz > coding_peptides.fasta does it save the coding nucleotides to fasta by default. if so can you mention in the comments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes good point. By default, only the coding peptides are output to stdout and the user needs to ask for everything else. Updated the README, let me know if it's still unclear.

```

- TODO
#### Save the *non*-coding nucleotides to a fasta

To see the sequence of reads which were deemed non-coding, use the flag
`--noncoding-nucleotide-fasta`.

```
khtools extract_coding --noncoding-nucleotide-fasta noncoding_nucleotides.fasta peptides.fa.gz *.fastq.gz > coding_peptides.fasta
```

#### Save the low complexity nucleotides to a fasta

To see the sequence of reads found to have too low complexity of nucleotide
sequence to evaluate, use the flag `--low-complexity-nucleotide-fasta`. Low
complexity is determined by the same method as the read trimmer
[fastp](https://github.com/OpenGene/fastp) in which we calculate what
percentage of the sequence has consecutive runs of the same base,
or mathematically, how often `seq[i] = seq[i+1]`. The default threshold is
`0.3`. As an example, the sequence `CCCCCCCCCACCACCACCCCCCCCACCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCCACACACCCCCAACACCC`
would be considered low complexity. While this sequence has many nucleotide
k-mers, it is likely a result of a sequencing error and we ignore it.

```
khtools extract_coding --low-complexity-nucleotide-fasta low_complexity_nucleotides.fasta peptides.fa.gz *.fastq.gz > coding_peptides.fasta
```

#### Save the low complexity peptides to a fasta

Even if the nucleotide sequence may pass the complexity filter, the peptide
sequence may still be low complexity. As an example, all translated frames of
the sequence
`CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG`
would be considered low complexity, as it translates to either
`QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ` (5'3' Frame 1),
`SSSSSSSSSSSSSSSSSSSSSSSSSSSSS` (5'3' Frame 2),
`AAAAAAAAAAAAAAAAAAAAAAAAAAAAA` (5'3' Frame 3 and 3'5' Frame 3),
`LLLLLLLLLLLLLLLLLLLLLLLLLLLLLL` (3'5' Frame 1),
or `CCCCCCCCCCCCCCCCCCCCCCCCCCCCC` (3'5' Frame 2). As these sequences have few
k-mers and are difficult to assess for how "coding" they are, we ignore them.
Unlike for nucleotides where we look at runs of consecutive bases, we require
the translated peptide to contain greater than `(L - k + 1)/2` k-mers, where
`L` is the length of the sequence and `k` is the k-mer size. To save the
sequence of low-complexity peptides to a fasta, use the flag
`--low-complexity-peptides-fasta`.

```
khtools extract_coding --low-complexity-peptides-fasta low_complexity_peptides.fasta peptides.fa.gz *.fastq.gz > coding_peptides.fasta
```

12 changes: 12 additions & 0 deletions docs/usage.rst
Original file line number Diff line number Diff line change
Expand Up @@ -5,3 +5,15 @@ Usage
To use Kmer-hashing tools in a project::

import khtools

To create a bloom filter of sequences::

khtools bloom-filter --molecule protein --peptide-ksize 7 --save-as Homo_sapiens.GRCh38.pep.subset.molecule-protein_ksize-7.bloomfilter.nodegraph Homo_sapiens.GRCh38.pep.subset.fa.gz

To partition reads into coding/noncoding bins using the bloom filter::

khtools partition -- SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled.fq.gz Homo_sapiens.GRCh38.pep.all.fa.gz

To create the bloom filter and partition the reads in one step::

khtools partition ~/code/kmer-hashing/extract_kmers/test-data/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled.fq.gz ~/Downloads/Homo_sapiens.GRCh38.pep.all.fa.gz
253 changes: 253 additions & 0 deletions khtools/bloom_filter.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,253 @@
import math
import os

import click
import khmer
import screed
from sourmash._minhash import hash_murmur
from tqdm import tqdm

from khtools.compare_kmer_content import kmerize
from khtools.sequence_encodings import encode_peptide, VALID_PEPTIDE_MOLECULES

# khmer Nodegraph features
DEFAULT_N_TABLES = 4
DEFAULT_MAX_TABLESIZE = int(1e8)

# Default k-mer sizes for different alphabets
DEFAULT_PROTEIN_KSIZE = 7
DEFAULT_DAYHOFF_KSIZE = 11
DEFAULT_HP_KSIZE = 21
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of hardcoding these values maybe add them to a config file instead, that way it will be easier for a user to track down defaults for debugging.



def per_read_false_positive_coding_rate(n_kmers_in_read, n_total_kmers=1e7,
n_hash_functions=DEFAULT_N_TABLES,
tablesize=DEFAULT_MAX_TABLESIZE):
exponent = - n_hash_functions * n_total_kmers / tablesize
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the variable name exponent is a little too vague.

print(f"exponent: {exponent}")

# Probability that a single k-mer is randomly in the data
# per_kmer_fpr = math.pow(1 - math.exp(exponent), n_hash_functions)

# Use built-in `exp1m` = exp - 1
# - (exp - 1) = 1 - exp
per_kmer_fpr = math.pow(- math.expm1(exponent), n_hash_functions)
print(f"per kmer false positive rate: {per_kmer_fpr}")

# Probability that the number of k-mers in the read are all false positives
per_read_fpr = math.pow(per_kmer_fpr, n_kmers_in_read)
return per_read_fpr


def load_nodegraph(*args, **kwargs):
try:
# khmer 2.1.1
return khmer.load_nodegraph(*args, **kwargs)
except AttributeError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this error is encountered it would be a good thing to log to user as warning.

Copy link
Contributor Author

@olgabot olgabot Nov 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what @luizirber told me to do to load the Nodegraph in a way that's compatible with both versions of khmer, so I'm not sure exactly what would be warned here. Here's how they do it in sourmash: https://github.com/dib-lab/sourmash/blob/master/sourmash/sbt.py#L62

# khmer 3+/master branch
return khmer.Nodegraph.load(*args, **kwargs)


# Cribbed from https://click.palletsprojects.com/en/7.x/parameters/
class BasedIntParamType(click.ParamType):
name = "integer"

def convert(self, value, param, ctx):
try:
if isinstance(value, int):
return value
if 'e' in value:
sigfig, exponent = value.split('e')
sigfig = float(sigfig)
exponent = int(exponent)
return int(sigfig * 10 ** exponent)
return int(value, 10)
except TypeError:
self.fail(
"expected string for int() conversion, got "
f"{value!r} of type {type(value).__name__}",
param,
ctx,
)
except ValueError:
self.fail(f"{value!r} is not a valid integer", param, ctx)


BASED_INT = BasedIntParamType()


def make_peptide_bloom_filter(peptide_fasta,
peptide_ksize,
molecule,
n_tables=DEFAULT_N_TABLES,
tablesize=DEFAULT_MAX_TABLESIZE):
"""Create a bloom filter out of peptide sequences"""
peptide_bloom_filter = khmer.Nodegraph(peptide_ksize,
tablesize,
n_tables=n_tables)

with screed.open(peptide_fasta) as records:
for record in tqdm(records):
olgabot marked this conversation as resolved.
Show resolved Hide resolved
if '*' in record['sequence']:
continue
sequence = encode_peptide(record['sequence'], molecule)
try:
kmers = kmerize(sequence, peptide_ksize)
for kmer in kmers:
# Convert the k-mer into an integer
hashed = hash_murmur(kmer)

# .add can take the hashed integer so we can hash the
# peptide kmer and add it directly
peptide_bloom_filter.add(hashed)
except ValueError:
# Sequence length is smaller than k-mer size
continue
return peptide_bloom_filter


def make_peptide_set(peptide_fasta, peptide_ksize, molecule):
"""Create a python set out of peptide sequence k-mers

For comparing to the bloom filter in storage and performance
"""
peptide_set = set([])

with screed.open(peptide_fasta) as records:
for record in tqdm(records):
if '*' in record['sequence']:
continue
sequence = encode_peptide(record['sequence'], molecule)
try:
kmers = kmerize(sequence, peptide_ksize)
peptide_set.update(kmers)
except ValueError:
# Sequence length is smaller than k-mer size
continue
return peptide_set


def maybe_make_peptide_bloom_filter(peptides, peptide_ksize, molecule,
peptides_are_bloom_filter,
n_tables=DEFAULT_N_TABLES,
tablesize=DEFAULT_MAX_TABLESIZE):
if peptides_are_bloom_filter:
click.echo(
f"Loading existing bloom filter from {peptides} and "
f"making sure the ksizes match",
err=True)
peptide_bloom_filter = load_nodegraph(peptides)
if peptide_ksize is not None:
try:
assert peptide_ksize == peptide_bloom_filter.ksize()
except AssertionError:
raise ValueError(f"Given peptide ksize ({peptide_ksize}) and "
f"ksize found in bloom filter "
f"({peptide_bloom_filter.ksize()}) are not"
f"equal")
else:
peptide_ksize = get_peptide_ksize(molecule, peptide_ksize)
click.echo(
f"Creating peptide bloom filter with file: {peptides}\n"
f"Using ksize: {peptide_ksize} and molecule: {molecule} "
f"...",
err=True)
peptide_bloom_filter = make_peptide_bloom_filter(
peptides, peptide_ksize, molecule=molecule,
n_tables=n_tables, tablesize=tablesize)
return peptide_bloom_filter


def maybe_save_peptide_bloom_filter(peptides, peptide_bloom_filter, molecule,
save_peptide_bloom_filter):
if save_peptide_bloom_filter:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the else condition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no else, it's just a small function to abstract away the saving of the bloom filter so that the body of the cli() is clean. I've been inspired by Ruby to keep my methods short, around 10 lines, before breaking them apart

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused why you would need this if statement. I think it would be more fluid if you created a bloom filter class and had a save_bloom_filter method. otherwise forget the if and just save the bloom filter? Are there any cases where you would call this function and not save the bloom filter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making the bloom filter takes ~30s which is small compared to the filtering step (minutes-hours) so in case the user doesn't want to save the bloom filter, this maybe function is here. I wanted the only required argument of the CLI to be the peptides and the reads, so that the filtered coding sequences could be output to stdout. I'm not a big fan of 'secretly' creating files and saving them without the use opting in to them, unless you think for this case it makes sense.

ksize = peptide_bloom_filter.ksize()

if isinstance(save_peptide_bloom_filter, str):
filename = save_peptide_bloom_filter
peptide_bloom_filter.save(save_peptide_bloom_filter)
else:
suffix = f'.molecule-{molecule}_ksize-{ksize}.bloomfilter.' \
f'nodegraph'
filename = os.path.splitext(peptides)[0] + suffix

click.echo(f"Writing peptide bloom filter to {filename}", err=True)
peptide_bloom_filter.save(filename)
click.echo("\tDone!", err=True)


@click.command()
@click.argument('peptides')
@click.option('--peptide-ksize',
default=None, type=int,
help="K-mer size of the peptide sequence to use. Defaults for"
" different molecules are, "
f"protein: {DEFAULT_PROTEIN_KSIZE}"
f", dayhoff: {DEFAULT_DAYHOFF_KSIZE},"
f" hydrophobic-polar: {DEFAULT_HP_KSIZE}")
@click.option('--molecule',
default='protein',
help="The type of amino acid encoding to use. Default is "
"'protein', but 'dayhoff' or 'hydrophobic-polar' can be "
"used")
@click.option('--save-as',
default=None,
help='If provided, save peptide bloom filter as this filename. '
'Otherwise, add ksize and molecule name to input filename.')
@click.option('--tablesize', type=BASED_INT,
default="1e8",
help='Size of the bloom filter table to use')
@click.option('--n-tables', type=int,
default=DEFAULT_N_TABLES,
help='Size of the bloom filter table to use')
def cli(peptides, peptide_ksize=None, molecule='protein', save_as=None,
tablesize=DEFAULT_MAX_TABLESIZE, n_tables=DEFAULT_N_TABLES):
"""Make a peptide bloom filter for your peptides

\b
Parameters
----------
reads : str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no reads

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's right. Thanks for catching!

Sequence file of reads to filter
peptides : str
Sequence file of peptides
peptide_ksize : int
Number of characters in amino acid words
long_reads
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

molecule save as are missing args and remove long_reads. I think for this function here it seems like redundant to have docstring since its help strings are right above, but better have 2 docs than none!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching that!

verbose

\b
Returns
-------

"""
# \b above prevents rewrapping of paragraph
peptide_ksize = get_peptide_ksize(molecule, peptide_ksize)
peptide_bloom_filter = make_peptide_bloom_filter(peptides, peptide_ksize,
molecule,
n_tables=n_tables,
tablesize=tablesize)
click.echo("\tDone!", err=True)

save_peptide_bloom_filter = save_as if save_as is not None else True
maybe_save_peptide_bloom_filter(
peptides,
peptide_bloom_filter,
molecule,
save_peptide_bloom_filter=save_peptide_bloom_filter)


def get_peptide_ksize(molecule, peptide_ksize):
if molecule not in VALID_PEPTIDE_MOLECULES:
raise ValueError(f"{molecule} is not a valid protein encoding! "
f"Only one of 'protein', 'hydrophobic-polar', or"
f" 'dayhoff' can be specified")

if peptide_ksize is None:
if molecule == 'protein':
peptide_ksize = DEFAULT_PROTEIN_KSIZE
elif molecule == 'dayhoff':
peptide_ksize = DEFAULT_DAYHOFF_KSIZE
elif molecule == 'hydrophobic-polar' or molecule == 'hp':
peptide_ksize = DEFAULT_HP_KSIZE
return peptide_ksize
Loading