czbiohub-sf · olgabot · Dec 5, 2019 · Oct 11, 2019 · Oct 11, 2019 · Oct 11, 2019
diff --git a/.gitignore b/.gitignore
@@ -276,3 +276,4 @@ dmypy.json
 
 # Pyre type checker
 .pyre/
+*.nodegraph
diff --git a/README.md b/README.md
@@ -1,13 +1,8 @@
 Kmer-hashing tools
 ================================
 
-[![image](https://img.shields.io/travis/%7B%7B%20cookiecutter.github_organization%20%7D%7D/%7B%7B%20cookiecutter.repo_name%20%7D%7D.svg)](https://travis-ci.org/%7B%7B%20cookiecutter.github_organization%20%7D%7D/%7B%7B%20cookiecutter.repo_name%20%7D%7D)
-
-
-[![codecov](https://codecov.io/gh/%7B%7B%20cookiecutter.github_organization%20%7D%7D/%7B%7B%20cookiecutter.repo_name%20%7D%7D/branch/master/graph/badge.svg)](https://codecov.io/gh/%7B%7B%20cookiecutter.github_organization%20%7D%7D/%7B%7B%20cookiecutter.repo_name%20%7D%7D)
-
-[![image](https://img.shields.io/pypi/v/%7B%7B%20cookiecutter.repo_name%20%7D%7D.svg)](https://pypi.python.org/pypi/%7B%7B%20cookiecutter.repo_name%20%7D%7D)
-
+[![image](https://img.shields.io/travis/czbiohub/kh-tools.svg)](https://travis-ci.com/czbiohub/kh-tools)
+[![codecov](https://codecov.io/gh/czbiohub/kh-tools/branch/master/graph/badge.svg)](https://codecov.io/gh/czbiohub/kh-tools)
 
 What is khtools?
 -------------------------------------
@@ -23,25 +18,91 @@ Installation
 To install this code, clone this github repository and use pip to install
 
 ```
-git clone https://github.com/czbiohub/khtools.git 
-cd khtools 
+git clone https://github.com/czbiohub/khtools.git
+cd khtools
 
 # The "." means "install *this*, the folder where I am now"
-pip install . 
+pip install .
 ```
 
 Usage
 -----
 
-Greet a name multiple times!
+### Extract likely protein-coding reads from sequencing data
+
+
+```
+khtools extract_coding peptides.fa.gz *.fastq.gz > coding_peptides.fasta
+```
+
+#### Save the "coding scores" to a csv
+
+The "coding score" of each read is calculated by translating the read in all
+six reading frames, then computing the
+[Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index) between each
+translated frame of the read and the peptide database. The final coding
+score is the maximum Jaccard index across all reading frames. If you'd like to
+see the coding scores for all reads, use the `--csv` flag.
 
 ```
-$ Kmer-hashing tools hello --name "Rosalind Franklin" --count 10 
+khtools extract_coding --csv coding_scores.csv peptides.fa.gz *.fastq.gz > coding_peptides.fasta
 ```
 
 
-Features
---------
+#### Save the coding nucleotides to a fasta
+
+By default, only the coding *peptides* are output. If you'd like to also output
+the underlying *nucleotide* sequence, use the flag `--coding-nucleotide-fasta`.
+
+```
+khtools extract_coding --coding-nucleotide-fasta coding_nucleotides.fasta peptides.fa.gz *.fastq.gz > coding_peptides.fasta
+```
 
--   TODO
+#### Save the *non*-coding nucleotides to a fasta
+
+To see the sequence of reads which were deemed non-coding, use the flag
+`--noncoding-nucleotide-fasta`.
+
+```
+khtools extract_coding --noncoding-nucleotide-fasta noncoding_nucleotides.fasta peptides.fa.gz *.fastq.gz > coding_peptides.fasta
+```
+
+#### Save the low complexity nucleotides to a fasta
+
+To see the sequence of reads whose nucleotide sequence has too low complexity
+to evaluate, use the flag `--low-complexity-nucleotide-fasta`. Low
+complexity is determined by the same method as the read trimmer
+[fastp](https://github.com/OpenGene/fastp), which defines complexity as the
+percentage of bases that differ from the next base,
+or mathematically, how often `seq[i] != seq[i+1]`. The default threshold is
+`0.3`. As an example, the sequence `CCCCCCCCCACCACCACCCCCCCCACCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCCACACACCCCCAACACCC`
+would be considered low complexity. While this sequence has many nucleotide
+k-mers, it is likely a result of a sequencing error and we ignore it.
+
+```
+khtools extract_coding --low-complexity-nucleotide-fasta low_complexity_nucleotides.fasta peptides.fa.gz *.fastq.gz > coding_peptides.fasta
+```
+
+#### Save the low complexity peptides to a fasta
+
+Even if the nucleotide sequence passes the complexity filter, the peptide
+sequence may still be low complexity. As an example, all translated frames of
+the sequence
+`CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG`
+would be considered low complexity, as it translates to either
+`QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ` (5'3' Frame 1),
+`SSSSSSSSSSSSSSSSSSSSSSSSSSSSS` (5'3' Frame 2),
+`AAAAAAAAAAAAAAAAAAAAAAAAAAAAA` (5'3' Frame 3 and 3'5' Frame 3),
+`LLLLLLLLLLLLLLLLLLLLLLLLLLLLLL` (3'5' Frame 1),
+or `CCCCCCCCCCCCCCCCCCCCCCCCCCCCC` (3'5' Frame 2). As these sequences have few
+k-mers and are difficult to assess for how "coding" they are, we ignore them.
+Unlike for nucleotides where we look at runs of consecutive bases, we require
+the translated peptide to contain greater than `(L - k + 1)/2` k-mers, where
+`L` is the length of the sequence and `k` is the k-mer size. To save the
+sequence of low-complexity peptides to a fasta, use the flag
+`--low-complexity-peptides-fasta`.
+
+```
+khtools extract_coding --low-complexity-peptides-fasta low_complexity_peptides.fasta peptides.fa.gz *.fastq.gz > coding_peptides.fasta
+```
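Reviewer note: the scoring and filtering rules described in the README additions above can be sketched in plain Python. This is a minimal illustration, not the khtools implementation. The function names (`kmers`, `jaccard`, `coding_score`, `nucleotide_complexity`, `is_low_complexity_peptide`) are hypothetical, exact Python sets stand in for the peptide database, and the nucleotide complexity follows fastp's documented definition (fraction of bases that differ from the next base), normalized here by `len(seq) - 1` as an assumption.

```python
def kmers(sequence, ksize):
    """Return the set of distinct k-mers in a sequence (empty if too short)."""
    return {sequence[i:i + ksize] for i in range(len(sequence) - ksize + 1)}


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coding_score(translated_frames, peptide_kmers, ksize):
    """Maximum Jaccard index between any translated frame and the peptides."""
    return max(
        (jaccard(kmers(frame, ksize), peptide_kmers)
         for frame in translated_frames),
        default=0.0,
    )


def nucleotide_complexity(seq):
    """Fraction of positions where a base differs from the next base."""
    if len(seq) < 2:
        return 0.0
    n_different = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    return n_different / (len(seq) - 1)


def is_low_complexity_peptide(peptide, ksize):
    """True unless the peptide has more than (L - k + 1) / 2 distinct k-mers."""
    n_possible = len(peptide) - ksize + 1
    if n_possible <= 0:
        # Too short to contain even one k-mer
        return True
    return len(kmers(peptide, ksize)) <= n_possible / 2
```

For instance, a run of 30 glutamines contains a single distinct 7-mer out of 24 possible positions, so `is_low_complexity_peptide("Q" * 30, 7)` is true, matching the poly-Q example in the README text.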
 
diff --git a/docs/usage.rst b/docs/usage.rst
@@ -5,3 +5,15 @@ Usage
 To use Kmer-hashing tools in a project::
 
     import khtools
+
+To create a bloom filter of sequences::
+
+    khtools bloom-filter --molecule protein --peptide-ksize 7 --save-as Homo_sapiens.GRCh38.pep.subset.molecule-protein_ksize-7.bloomfilter.nodegraph Homo_sapiens.GRCh38.pep.subset.fa.gz
+
+To partition reads into coding/noncoding bins using the bloom filter::
+
+    khtools partition -- SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled.fq.gz Homo_sapiens.GRCh38.pep.all.fa.gz
+
+To create the bloom filter and partition the reads in one step::
+
+    khtools partition  ~/code/kmer-hashing/extract_kmers/test-data/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled.fq.gz ~/Downloads/Homo_sapiens.GRCh38.pep.all.fa.gz
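Reviewer note: for readers unfamiliar with the data structure behind `khtools bloom-filter`, the toy class below sketches how a Bloom filter with several tables answers membership queries. It is a pure-Python illustration under stated assumptions: `ToyBloomFilter` and its methods are hypothetical names, `hashlib.blake2b` with a per-table salt stands in for the hashing that khmer and sourmash perform, and nothing here reflects the `.nodegraph` file format.

```python
import hashlib


class ToyBloomFilter:
    """Toy Bloom filter: one slot per table is set for every added k-mer."""

    def __init__(self, tablesize, n_tables):
        self.tablesize = tablesize
        # One byte per slot keeps the sketch simple; real filters use bits
        self.tables = [bytearray(tablesize) for _ in range(n_tables)]

    def _positions(self, kmer):
        """Yield (table index, slot) pairs, one per table."""
        for i in range(len(self.tables)):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([i])).digest()
            yield i, int.from_bytes(digest, "big") % self.tablesize

    def add(self, kmer):
        for i, pos in self._positions(kmer):
            self.tables[i][pos] = 1

    def __contains__(self, kmer):
        # True only if every table has the slot set. False positives are
        # possible; false negatives are not.
        return all(self.tables[i][pos] for i, pos in self._positions(kmer))


bloom = ToyBloomFilter(tablesize=1000, n_tables=4)
for kmer in ("MKVLAAG", "KVLAAGI", "VLAAGIV"):
    bloom.add(kmer)
print("MKVLAAG" in bloom)
```

A k-mer that was added is always reported as present. A k-mer that was never added is usually reported as absent, but can collide in every table, which is why the table size and number of tables matter.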
diff --git a/khtools/bloom_filter.py b/khtools/bloom_filter.py
@@ -0,0 +1,253 @@
+import math
+import os
+
+import click
+import khmer
+import screed
+from sourmash._minhash import hash_murmur
+from tqdm import tqdm
+
+from khtools.compare_kmer_content import kmerize
+from khtools.sequence_encodings import encode_peptide, VALID_PEPTIDE_MOLECULES
+
+# khmer Nodegraph features
+DEFAULT_N_TABLES = 4
+DEFAULT_MAX_TABLESIZE = int(1e8)
+
+# Default k-mer sizes for different alphabets
+DEFAULT_PROTEIN_KSIZE = 7
+DEFAULT_DAYHOFF_KSIZE = 11
+DEFAULT_HP_KSIZE = 21
+
+
+def per_read_false_positive_coding_rate(n_kmers_in_read, n_total_kmers=1e7,
+                                        n_hash_functions=DEFAULT_N_TABLES,
+                                        tablesize=DEFAULT_MAX_TABLESIZE):
+    exponent = - n_hash_functions * n_total_kmers / tablesize
+    print(f"exponent: {exponent}")
+
+    # Probability that a single k-mer is randomly in the data
+    # per_kmer_fpr = math.pow(1 - math.exp(exponent), n_hash_functions)
+
+    # Use built-in `expm1`, which computes exp(x) - 1 accurately for small x
+    # -(exp(x) - 1) = 1 - exp(x)
+    per_kmer_fpr = math.pow(- math.expm1(exponent), n_hash_functions)
+    print(f"per kmer false positive rate: {per_kmer_fpr}")
+
+    # Probability that the number of k-mers in the read are all false positives
+    per_read_fpr = math.pow(per_kmer_fpr, n_kmers_in_read)
+    return per_read_fpr
+
+
+def load_nodegraph(*args, **kwargs):
+    try:
+        # khmer 2.1.1
+        return khmer.load_nodegraph(*args, **kwargs)
+    except AttributeError:
+        # khmer 3+/master branch
+        return khmer.Nodegraph.load(*args, **kwargs)
+
+
+# Cribbed from https://click.palletsprojects.com/en/7.x/parameters/
+class BasedIntParamType(click.ParamType):
+    name = "integer"
+
+    def convert(self, value, param, ctx):
+        try:
+            if isinstance(value, int):
+                return value
+            if 'e' in value:
+                sigfig, exponent = value.split('e')
+                sigfig = float(sigfig)
+                exponent = int(exponent)
+                return int(sigfig * 10 ** exponent)
+            return int(value, 10)
+        except TypeError:
+            self.fail(
+                "expected string for int() conversion, got "
+                f"{value!r} of type {type(value).__name__}",
+                param,
+                ctx,
+            )
+        except ValueError:
+            self.fail(f"{value!r} is not a valid integer", param, ctx)
+
+
+BASED_INT = BasedIntParamType()
+
+
+def make_peptide_bloom_filter(peptide_fasta,
+                              peptide_ksize,
+                              molecule,
+                              n_tables=DEFAULT_N_TABLES,
+                              tablesize=DEFAULT_MAX_TABLESIZE):
+    """Create a bloom filter out of peptide sequences"""
+    peptide_bloom_filter = khmer.Nodegraph(peptide_ksize,
+                                           tablesize,
+                                           n_tables=n_tables)
+
+    with screed.open(peptide_fasta) as records:
+        for record in tqdm(records):
+            if '*' in record['sequence']:
+                continue
+            sequence = encode_peptide(record['sequence'], molecule)
+            try:
+                kmers = kmerize(sequence, peptide_ksize)
+                for kmer in kmers:
+                    # Convert the k-mer into an integer
+                    hashed = hash_murmur(kmer)
+
+                    # .add can take the hashed integer so we can hash the
+                    #  peptide kmer and add it directly
+                    peptide_bloom_filter.add(hashed)
+            except ValueError:
+                # Sequence length is smaller than k-mer size
+                continue
+    return peptide_bloom_filter
+
+
+def make_peptide_set(peptide_fasta, peptide_ksize, molecule):
+    """Create a python set out of peptide sequence k-mers
+
+    For comparing to the bloom filter in storage and performance
+    """
+    peptide_set = set()
+
+    with screed.open(peptide_fasta) as records:
+        for record in tqdm(records):
+            if '*' in record['sequence']:
+                continue
+            sequence = encode_peptide(record['sequence'], molecule)
+            try:
+                kmers = kmerize(sequence, peptide_ksize)
+                peptide_set.update(kmers)
+            except ValueError:
+                # Sequence length is smaller than k-mer size
+                continue
+    return peptide_set
+
+
+def maybe_make_peptide_bloom_filter(peptides, peptide_ksize, molecule,
+                                    peptides_are_bloom_filter,
+                                    n_tables=DEFAULT_N_TABLES,
+                                    tablesize=DEFAULT_MAX_TABLESIZE):
+    if peptides_are_bloom_filter:
+        click.echo(
+            f"Loading existing bloom filter from {peptides} and "
+            f"making sure the ksizes match",
+            err=True)
+        peptide_bloom_filter = load_nodegraph(peptides)
+        if peptide_ksize is not None:
+            # Explicit check rather than `assert`, which `python -O` strips
+            bloom_filter_ksize = peptide_bloom_filter.ksize()
+            if peptide_ksize != bloom_filter_ksize:
+                raise ValueError(f"Given peptide ksize ({peptide_ksize}) and "
+                                 f"ksize found in bloom filter "
+                                 f"({bloom_filter_ksize}) are not "
+                                 f"equal")
+    else:
+        peptide_ksize = get_peptide_ksize(molecule, peptide_ksize)
+        click.echo(
+            f"Creating peptide bloom filter with file: {peptides}\n"
+            f"Using ksize: {peptide_ksize} and molecule: {molecule} "
+            f"...",
+            err=True)
+        peptide_bloom_filter = make_peptide_bloom_filter(
+            peptides, peptide_ksize, molecule=molecule,
+            n_tables=n_tables, tablesize=tablesize)
+    return peptide_bloom_filter
+
+
+def maybe_save_peptide_bloom_filter(peptides, peptide_bloom_filter, molecule,
+                                    save_peptide_bloom_filter):
+    if save_peptide_bloom_filter:
+        ksize = peptide_bloom_filter.ksize()
+
+        if isinstance(save_peptide_bloom_filter, str):
+            # An explicit filename was given; it is saved once, below
+            filename = save_peptide_bloom_filter
+        else:
+            suffix = f'.molecule-{molecule}_ksize-{ksize}.bloomfilter.' \
+                     f'nodegraph'
+            filename = os.path.splitext(peptides)[0] + suffix
+
+        click.echo(f"Writing peptide bloom filter to {filename}", err=True)
+        peptide_bloom_filter.save(filename)
+        click.echo("\tDone!", err=True)
+
+
+@click.command()
+@click.argument('peptides')
+@click.option('--peptide-ksize',
+              default=None, type=int,
+              help="K-mer size of the peptide sequence to use. Defaults for"
+              " different molecules are, "
+              f"protein: {DEFAULT_PROTEIN_KSIZE}"
+              f", dayhoff: {DEFAULT_DAYHOFF_KSIZE},"
+              f" hydrophobic-polar: {DEFAULT_HP_KSIZE}")
+@click.option('--molecule',
+              default='protein',
+              help="The type of amino acid encoding to use. Default is "
+              "'protein', but 'dayhoff' or 'hydrophobic-polar' can be "
+              "used")
+@click.option('--save-as',
+              default=None,
+              help='If provided, save peptide bloom filter as this filename. '
+              'Otherwise, add ksize and molecule name to input filename.')
+@click.option('--tablesize', type=BASED_INT,
+              default="1e8",
+              help='Size of the bloom filter table to use')
+@click.option('--n-tables', type=int,
+              default=DEFAULT_N_TABLES,
+              help='Number of tables (hash functions) in the bloom filter')
+def cli(peptides, peptide_ksize=None, molecule='protein', save_as=None,
+        tablesize=DEFAULT_MAX_TABLESIZE, n_tables=DEFAULT_N_TABLES):
+    """Make a peptide bloom filter for your peptides
+
+    \b
+    Parameters
+    ----------
+    peptides : str
+        Sequence file of peptides
+    peptide_ksize : int
+        Number of characters in amino acid words
+    molecule : str
+        Amino acid encoding: 'protein', 'dayhoff' or 'hydrophobic-polar'
+    save_as : str
+        Filename to save the bloom filter to
+    tablesize : int
+        Size of the bloom filter table
+    n_tables : int
+        Number of bloom filter tables
+
+    """
+    # \b above prevents rewrapping of paragraph
+    peptide_ksize = get_peptide_ksize(molecule, peptide_ksize)
+    peptide_bloom_filter = make_peptide_bloom_filter(peptides, peptide_ksize,
+                                                     molecule,
+                                                     n_tables=n_tables,
+                                                     tablesize=tablesize)
+    click.echo("\tDone!", err=True)
+
+    save_peptide_bloom_filter = save_as if save_as is not None else True
+    maybe_save_peptide_bloom_filter(
+        peptides,
+        peptide_bloom_filter,
+        molecule,
+        save_peptide_bloom_filter=save_peptide_bloom_filter)
+
+
+def get_peptide_ksize(molecule, peptide_ksize):
+    if molecule not in VALID_PEPTIDE_MOLECULES:
+        raise ValueError(f"{molecule} is not a valid protein encoding! "
+                         f"Only one of 'protein', 'hydrophobic-polar', or"
+                         f" 'dayhoff' can be specified")
+
+    if peptide_ksize is None:
+        if molecule == 'protein':
+            peptide_ksize = DEFAULT_PROTEIN_KSIZE
+        elif molecule == 'dayhoff':
+            peptide_ksize = DEFAULT_DAYHOFF_KSIZE
+        elif molecule == 'hydrophobic-polar' or molecule == 'hp':
+            peptide_ksize = DEFAULT_HP_KSIZE
+    return peptide_ksize
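Reviewer note: the estimate in `per_read_false_positive_coding_rate` follows the textbook Bloom filter approximation, `(1 - exp(-k * n / m)) ** k` for `k` hash functions, `n` inserted k-mers and `m` table slots, raised to the number of k-mers in a read. The sketch below recomputes it for the defaults in this file (`n_tables=4`, `tablesize=1e8`, `n_total_kmers=1e7`) without the debug `print` calls. The helper names are hypothetical, and whether khmer's `tablesize` corresponds to the total `m` or to a per-table size is an assumption that is not checked here.

```python
import math


def per_kmer_fpr(n_total_kmers, n_hash_functions, tablesize):
    """Probability that one absent k-mer is reported as present."""
    exponent = -n_hash_functions * n_total_kmers / tablesize
    # -expm1(x) == 1 - exp(x), computed accurately for small |x|
    return (-math.expm1(exponent)) ** n_hash_functions


def per_read_fpr(n_kmers_in_read, **bloom_parameters):
    """Probability that every k-mer in a read is a false positive."""
    return per_kmer_fpr(**bloom_parameters) ** n_kmers_in_read


defaults = dict(n_total_kmers=1e7, n_hash_functions=4, tablesize=int(1e8))
print(f"per k-mer: {per_kmer_fpr(**defaults):.4f}")
print(f"per read (5 k-mers): {per_read_fpr(5, **defaults):.2e}")
```

With these defaults the per-k-mer rate is a little over 1%, and it shrinks geometrically with the number of k-mers in the read, which is why longer reads are much less likely to be called coding by chance.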