Utilizing FM-index in "locate" and "grep" #14

Closed
shenwei356 opened this issue May 10, 2017 · 5 comments

Comments

@shenwei356
Owner

shenwei356 commented May 10, 2017

This would make searching faster and allow mismatches.

I've written a package, bwt (Burrows-Wheeler Transform and FM-index in Go), so it shouldn't take too long.

update: there are some bugs...

@ctava
Contributor

ctava commented Sep 15, 2018

@shenwei356 I looked at the bwt repo and have the following questions:

From a design perspective, what's the difference between the locate and grep commands?
(Notice that seqkit grep only searches the positive strand, but seqkit locate can recognize both strands.)

Could this be done for the locate command first?
https://github.com/shenwei356/bwt/blob/master/fmi/fmi.go#L62
https://github.com/shenwei356/bwt/blob/master/fmi/fmi.go#L102

How do you envision this struct fitting in?
https://godoc.org/github.com/shenwei356/bwt/fmi#FMIndex

I noticed that Burrows-Wheeler is a compression algorithm:
https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform

What's the goal, to make search faster? How does compression help?
I assume benchmarking before and after is needed?

To what degree should mismatches be allowed?
locate already supports mismatches:
https://github.com/shenwei356/seqkit/blob/master/seqkit/cmd/locate.go#L72

@shenwei356
Owner Author

`grep` searches mainly by FASTA ID and header, besides sequence; it focuses on searching/filtering, while `locate` focuses on finding the locations of subsequences. `grep` should consider the backward strand; I'll fix this.
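A minimal sketch of what "recognizing both strands" can mean for a plain DNA pattern: search for the pattern itself, then for its reverse complement. The names `reverseComplement`, `findAll`, `locateBothStrands`, and `hit` are hypothetical and are not taken from seqkit; the sketch assumes upper-case A/C/G/T input and exact matching only.

```go
package main

import (
	"fmt"
	"strings"
)

// complement maps each upper-case DNA base to its Watson-Crick partner.
var complement = map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

// reverseComplement reverses seq and complements each base.
// Bytes that are not A/C/G/T are copied unchanged.
func reverseComplement(seq string) string {
	out := make([]byte, len(seq))
	for i := 0; i < len(seq); i++ {
		b := seq[len(seq)-1-i]
		if c, ok := complement[b]; ok {
			b = c
		}
		out[i] = b
	}
	return string(out)
}

// hit is a 0-based start on the given sequence plus the strand that matched.
type hit struct {
	Start  int
	Strand byte // '+' or '-'
}

// findAll returns every (possibly overlapping) start of pattern in seq.
func findAll(seq, pattern string) []int {
	var starts []int
	if pattern == "" {
		return starts
	}
	for from := 0; ; {
		i := strings.Index(seq[from:], pattern)
		if i < 0 {
			return starts
		}
		starts = append(starts, from+i)
		from += i + 1
	}
}

// locateBothStrands reports matches of pattern ('+') and of its reverse
// complement ('-'), both as coordinates on the given sequence.
func locateBothStrands(seq, pattern string) []hit {
	var hits []hit
	for _, s := range findAll(seq, pattern) {
		hits = append(hits, hit{s, '+'})
	}
	rc := reverseComplement(pattern)
	if rc != pattern { // skip the second pass for palindromic patterns
		for _, s := range findAll(seq, rc) {
			hits = append(hits, hit{s, '-'})
		}
	}
	return hits
}

func main() {
	for _, h := range locateBothStrands("ACGTTTAAAC", "AAAC") {
		fmt.Printf("start=%d strand=%c\n", h.Start, h.Strand)
	}
}
```

Skipping the second pass for palindromic patterns is a design choice of this sketch, made to avoid reporting the same location twice; a real tool may prefer to report such sites on both strands.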

`fmi` is a general-purpose package; it is not responsible for handling other forms of the text to be searched.
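On the earlier question of how a Burrows-Wheeler based index speeds up searching: the gain comes from backward search over the index rather than from compression as such, since each query character narrows a range of suffix-array rows and the work is proportional to the pattern length. The following is a toy, standard-library-only sketch of that idea. It is not the `fmi` package's implementation, `toyIndex` and its methods are hypothetical names, and the naive suffix sort and full occurrence table are only suitable for small inputs.

```go
package main

import (
	"fmt"
	"sort"
)

// toyIndex is a hypothetical, uncompressed FM-index for demonstration only.
type toyIndex struct {
	sa  []int          // suffix array of text+"$"
	c   map[byte]int   // c[ch]: number of characters in the text smaller than ch
	occ map[byte][]int // occ[ch][i]: occurrences of ch in bwt[:i]
}

// newToyIndex builds the index naively by sorting all suffixes.
// Assumes s does not contain '$' and that '$' sorts before every byte of s.
func newToyIndex(s string) *toyIndex {
	text := s + "$"
	n := len(text)
	sa := make([]int, n)
	for i := range sa {
		sa[i] = i
	}
	sort.Slice(sa, func(a, b int) bool { return text[sa[a]:] < text[sa[b]:] })

	// The BWT is the character preceding each sorted suffix.
	bwt := make([]byte, n)
	counts := map[byte]int{}
	for i, p := range sa {
		bwt[i] = text[(p+n-1)%n]
		counts[bwt[i]]++
	}

	chars := make([]byte, 0, len(counts))
	for ch := range counts {
		chars = append(chars, ch)
	}
	sort.Slice(chars, func(a, b int) bool { return chars[a] < chars[b] })

	idx := &toyIndex{sa: sa, c: map[byte]int{}, occ: map[byte][]int{}}
	total := 0
	for _, ch := range chars {
		idx.c[ch] = total
		total += counts[ch]
		row := make([]int, n+1)
		for i, b := range bwt {
			row[i+1] = row[i]
			if b == ch {
				row[i+1]++
			}
		}
		idx.occ[ch] = row
	}
	return idx
}

// locate returns the sorted 0-based start positions of exact matches of query.
func (x *toyIndex) locate(query string) []int {
	if query == "" {
		return nil
	}
	lo, hi := 0, len(x.sa) // half-open range of suffix-array rows still matching
	for i := len(query) - 1; i >= 0; i-- {
		row, ok := x.occ[query[i]]
		if !ok {
			return nil // character never occurs in the text
		}
		lo = x.c[query[i]] + row[lo]
		hi = x.c[query[i]] + row[hi]
		if lo >= hi {
			return nil
		}
	}
	pos := append([]int(nil), x.sa[lo:hi]...)
	sort.Ints(pos)
	return pos
}

func main() {
	idx := newToyIndex("mississippi")
	fmt.Println(idx.locate("ssi")) // prints [2 5]
}
```

The index is built once per text; after that, each query touches the occurrence table twice per pattern character, regardless of how long the text is.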

The optimization of bwt is on the todo list.

The `mismatches` parameter of `fmi.Locate(query, mismatches)` is already the maximum number of mismatches allowed in a search. The `-M/--max-mismatch` flag of `seqkit locate` tries to make that easy to understand.
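To illustrate what a "maximum allowed mismatches" bound means, here is a small sketch that counts substitutions at every alignment and keeps those within the bound. It is a plain sliding-window scan, not the FM-index search used by `fmi`, and `locateWithMismatches` is a hypothetical name. The sketch assumes that mismatches are substitutions only, with no insertions or deletions.

```go
package main

import "fmt"

// locateWithMismatches returns the 0-based starts at which query aligns to
// text with at most maxMismatches substituted characters.
// This is a naive O(len(text) * len(query)) scan for illustration.
func locateWithMismatches(text, query string, maxMismatches int) []int {
	var starts []int
	if len(query) == 0 || len(query) > len(text) {
		return starts
	}
	for i := 0; i+len(query) <= len(text); i++ {
		mm := 0
		// Stop comparing this window as soon as the bound is exceeded.
		for j := 0; j < len(query) && mm <= maxMismatches; j++ {
			if text[i+j] != query[j] {
				mm++
			}
		}
		if mm <= maxMismatches {
			starts = append(starts, i)
		}
	}
	return starts
}

func main() {
	text := "ACGTACCT"
	fmt.Println(locateWithMismatches(text, "ACGT", 0)) // prints [0]
	fmt.Println(locateWithMismatches(text, "ACGT", 1)) // prints [0 4]
}
```

With a bound of 0 only the exact occurrence at position 0 is reported; raising the bound to 1 also admits `ACCT` at position 4, which differs from the query in one base.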

@shiva1387

shiva1387 commented May 14, 2019

Hi, a question about memory and speed. I would like to use locate on an NCBI database (~50 to 100 million sequences) to find sequences that match specific patterns. Currently locate does not make use of the FASTA index; can I use that to increase speed? Any other suggestions? Thanks

@shenwei356
Owner Author

Try BLAST?

@shiva1387

Sorry, to be clear: I am searching for either degenerate sequences or regular expressions (for protein motifs), so BLAST won't help in either case.
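For the degenerate-sequence case, one common approach is to expand each IUPAC nucleotide code into a regular-expression character class and scan with the resulting pattern. The sketch below shows that expansion using Go's standard `regexp` package. `degenerateToRegexp` is a hypothetical helper, not a seqkit function, and it assumes upper-case DNA input. It does not address the memory and speed concerns raised for very large databases.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// iupac maps each IUPAC nucleotide code to the bases it stands for.
var iupac = map[byte]string{
	'A': "A", 'C': "C", 'G': "G", 'T': "T",
	'R': "[AG]", 'Y': "[CT]", 'S': "[CG]", 'W': "[AT]",
	'K': "[GT]", 'M': "[AC]",
	'B': "[CGT]", 'D': "[AGT]", 'H': "[ACT]", 'V': "[ACG]",
	'N': "[ACGT]",
}

// degenerateToRegexp expands a degenerate DNA pattern into a compiled regexp.
func degenerateToRegexp(pattern string) (*regexp.Regexp, error) {
	var sb strings.Builder
	for i := 0; i < len(pattern); i++ {
		class, ok := iupac[pattern[i]]
		if !ok {
			return nil, fmt.Errorf("unsupported base %q at position %d", pattern[i], i)
		}
		sb.WriteString(class)
	}
	return regexp.Compile(sb.String())
}

func main() {
	re, err := degenerateToRegexp("ACNRT")
	if err != nil {
		panic(err)
	}
	fmt.Println(re.String())
	// FindAllStringIndex reports non-overlapping matches as [start, end) pairs.
	fmt.Println(re.FindAllStringIndex("TTACGGTAACTAT", -1))
}
```

Note that `FindAllStringIndex` returns non-overlapping matches only, so overlapping motif occurrences would need a manual scan that restarts one position after each hit.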
