Faster continuous k-mer decomposition #475

Merged: 5 commits, May 28, 2023

@padix-key commented Apr 26, 2023

This PR introduces a faster method for create_kmers() that computes the k-mer at each position from the k-mer at the previous position. Unlike the current method, it no longer needs to loop over k at every position.
The performance increase is clear (orange: naive, green: new method):

[Figure: benchmark_1, duration (ms) versus k for the naive and the new method]

The following script was used to create the benchmarks:

import time
import numpy as np
import biotite
import biotite.sequence as seq
import biotite.sequence.align as align
import matplotlib.pyplot as plt


LENGTH = 1_000_000
MAX_K = 12
N = 100


orig_durations = []
new_durations = []

sequence = seq.NucleotideSequence(ambiguous=False)
np.random.seed(0)
sequence.code = np.random.randint(len(sequence.alphabet), size=LENGTH)

k_values = np.arange(2, MAX_K+1)
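for k in k_values:
    kmer_alph = align.KmerAlphabet(sequence.alphabet, k)

    # Time the default code path, which is the new method in this PR
    now = time.time_ns()
    for _ in range(N):
        kmer_codes = kmer_alph.create_kmers(sequence.code)
    # Average duration per call, converted from ns to ms
    new_durations.append((time.time_ns() - now) / N * 1e-6)

    # Swap in the original implementation via the private hook
    kmer_alph._create_kmers_func = kmer_alph._create_continuous_kmers

    # Time the original method in the same way
    now = time.time_ns()
    for _ in range(N):
        kmer_codes = kmer_alph.create_kmers(sequence.code)
    orig_durations.append((time.time_ns() - now) / N * 1e-6)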

fig, ax = plt.subplots()
ax.plot(
    k_values, orig_durations, color=biotite.colors["orange"],
    marker="o", linestyle="None"
)
ax.plot(
    k_values, new_durations, color=biotite.colors["green"],
    marker="o", linestyle="None"
)
ax.set_xlabel("k")
ax.set_ylabel("Duration (ms)")
plt.show()

To simplify the implementation, the k-mer ordering in the alphabet is reversed.
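
To illustrate the idea of computing each k-mer from the previous one, here is a minimal sketch in plain Python/NumPy. It is not the code from this PR: the function names `naive_kmer_codes` and `rolling_kmer_codes` are hypothetical, and the symbol ordering within a k-mer code (first symbol most significant) is an assumption that may differ from the ordering Biotite actually uses. The naive version loops over k at every position, while the rolling version drops the leading symbol, shifts, and appends the next symbol.

```python
import numpy as np


def naive_kmer_codes(codes, k, n_symbols):
    """Compute every k-mer code from scratch: O(n * k)."""
    n_kmers = max(len(codes) - k + 1, 0)
    kmers = np.zeros(n_kmers, dtype=np.int64)
    for i in range(n_kmers):
        value = 0
        # Inner loop over k is repeated at every position
        for j in range(k):
            value = value * n_symbols + int(codes[i + j])
        kmers[i] = value
    return kmers


def rolling_kmer_codes(codes, k, n_symbols):
    """Derive each k-mer code from its predecessor: O(n)."""
    n_kmers = max(len(codes) - k + 1, 0)
    kmers = np.zeros(n_kmers, dtype=np.int64)
    if n_kmers == 0:
        return kmers

    # Only the first k-mer is computed directly
    value = 0
    for j in range(k):
        value = value * n_symbols + int(codes[j])
    kmers[0] = value

    # Weight of the leading symbol in a k-mer code (assumed ordering)
    high = n_symbols ** (k - 1)
    for i in range(1, n_kmers):
        # Remove the symbol that leaves the window, shift the rest
        # by one place and append the symbol that enters the window
        value = (value - int(codes[i - 1]) * high) * n_symbols
        value += int(codes[i + k - 1])
        kmers[i] = value
    return kmers


example = np.array([0, 1, 2, 3])
print(rolling_kmer_codes(example, k=2, n_symbols=4).tolist())  # [1, 6, 11]
```

Both functions return the same codes; the rolling variant just avoids the inner loop, which is why its runtime would be expected to stay roughly flat as k grows.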

@padix-key changed the title from "Faster k-mer decomposition for some alphabets" to "Faster continuous k-mer decomposition" on Apr 27, 2023
@padix-key merged commit a2df958 into biotite-dev:master on May 28, 2023
@padix-key deleted the kmer branch on September 24, 2023 08:00