Strange results #15
Comments
Just to add as a test case, I have provided a fastq sequence file with three reads
And a kmer database of:
Neither the 21-mer nor the 32-mer was found in the sequences, despite both being present. 21-mer example:
I must be misunderstanding something important about how fastv operates, or there is an error. I would be grateful for any advice.
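A test case like this can also be cross-checked independently of fastv. The following is a minimal Python sketch, not fastv's algorithm: it counts how many reads contain each k-mer on either strand, for k-mers of mixed length. The three reads are hypothetical stand-ins built inside the script (the reads attached to this issue are not reproduced here), and the function names are illustrative only.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def parse_fastq(text):
    """Yield (name, sequence) pairs from 4-line FASTQ records."""
    lines = text.splitlines()
    for i in range(0, len(lines) - 3, 4):
        yield lines[i][1:], lines[i + 1]


def count_kmer_hits(reads, kmers):
    """Count the reads that contain each k-mer on either strand."""
    hits = {kmer: 0 for kmer in kmers}
    for _name, seq in reads:
        seq = seq.upper()
        for kmer in kmers:
            if kmer in seq or reverse_complement(kmer) in seq:
                hits[kmer] += 1
    return hits


# The 32-mer quoted in this issue, plus its first 21 bases as a 21-mer.
kmer32 = "ATGAAATTCCATGGAATGGAATGGAATGGAAA"
kmer21 = kmer32[:21]

# Hypothetical reads: forward hit, reverse-strand hit, and no hit.
read_seqs = [
    "CC" + kmer32 + "GT",
    "TT" + reverse_complement(kmer32) + "AC",
    "ACGT" * 9,
]
fastq_text = "".join(
    f"@read{i}\n{seq}\n+\n{'I' * len(seq)}\n"
    for i, seq in enumerate(read_seqs, start=1)
)

hits = count_kmer_hits(parse_fastq(fastq_text), [kmer21, kmer32])
for kmer, n in hits.items():
    print(len(kmer), kmer, n)
```

If a check like this reports hits that the tool does not, the difference is more likely to lie in how the k-mer database is read or in filtering than in the reads themselves. That is only a way to narrow the question down, not a diagnosis.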
Hello,
Thanks for this amazing tool. I am using fastv in perhaps an unusual way: I want to detect the presence or absence of homologous gene clusters in metagenomic data. I started with ~17 million ORFs from our contigs and clustered them (mmseqs2, 95% identity) into almost 4 million clusters. I want to detect each cluster by finding one or more unique kmers from its representative sequence.

I initially used unique_kmer to identify unique 24-mers, but this took a long time, generated millions of files, and could not find unique 24-mers for the majority of sequences. I fell back on jellyfish with 32-mers and a convoluted pipeline: aligning the kmers against the cluster representatives, then filtering the sorted SAM file to keep only non-overlapping kmers. This way the vast majority of cluster representatives had one or more unique kmers (mean of 3, and up to hundreds).

I applied fastv with minimal filtering and the lowest thresholds (-A -G -Q -L -p 0.001 -d 0.001), but only ~100k of ~3.4 million cluster representatives are ever identified across all >200 samples. I tested further by extracting only unique kmers from one sample and testing them against the sample reads: no hits!

Yet when I search with seqkit, I find that the sequence file does indeed contain this kmer three times: "ATGAAATTCCATGGAATGGAATGGAATGGAAA"

Can you advise why fastv seems not to be detecting it?

Thanks!