seqkit locate misses embedded patterns #368
Comments
For some reason I can't recall, I previously chose to remove these "embedded" regions when searching with regular expressions. It looks unnecessary now, because we already have the
Thanks for reporting :)
Thank you for the quick response again @shenwei356! I have tested it and see just one issue, which concerns the example you give for finding ORFs rather than seqkit locate itself,
e.g.
These few cases can be filtered out easily afterwards, so this is just a minor comment in case you know how to improve the regex so it excludes these non-ORF cases.
I'm afraid it's not.
@shenwei356 my statement wasn't fully clear, but this regex includes a stop within an "ORF" if it is immediately after the start codon:
I see. The regular expression is not perfect for finding ORFs.
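One possible way to exclude the "stop right after the start" cases is a negative lookahead that refuses to consume a stop codon as an internal codon. The sketch below is an illustration in Python only: the stop codons TAA/TAG/TGA and the names `NAIVE`, `STRICT` and `first_match` are assumptions made for this example, not part of seqkit. It would probably not work as-is with seqkit locate, since Go's standard regexp package does not support lookarounds.

```python
import re

# Assumption: standard stop codons; start codon [AT]TG as in the issue's regex.
STOP = "TAA|TAG|TGA"

# Lazy pattern without any check on internal codons: the first codon after
# the start may itself be a stop codon.
NAIVE = re.compile(rf"[AT]TG(?:[ACGT]{{3}})+?(?:{STOP})", re.IGNORECASE)

# Same pattern, but every internal codon must not be a stop codon.
STRICT = re.compile(rf"[AT]TG(?:(?!{STOP})[ACGT]{{3}})+?(?:{STOP})", re.IGNORECASE)


def first_match(pattern, seq):
    """Return the first matched substring, or None if nothing matches."""
    m = pattern.search(seq)
    return m.group(0) if m else None


if __name__ == "__main__":
    seq = "ATGTAAGGGCCCTGA"  # start codon immediately followed by a stop
    print(first_match(NAIVE, seq))
    print(first_match(STRICT, seq))
```

With the lazy quantifier, the only internal stop the naive pattern can swallow is the one directly after the start codon, which matches the cases described above; filtering those hits afterwards remains a reasonable alternative.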
Describe your issue
Seqkit locate does not seem to find fully overlapping patterns on the same strand (whether in greedy or non-greedy mode)
For instance, the regex below for searching for ORFs, adapted from that provided, does not find fully overlapping (embedded) ORFs on the same strand.
# with genome file "test.txt":
cat test.txt | seqkit locate -i -p "[AT]TG(?:.{3})+?T" -r | awk '{ if (($6 - $5) >= 60 && NR >= 2) print }' > ORFs.txt
e.g. this predicts an ORF from positions 123-260 on the plus strand, but misses the embedded ORF from 128-247 (found e.g. with prodigal, in test_all.txt)
prodigal -i test.txt -g 1 -f gff -s test_all.txt > test.gff ;
faidx test.txt LVMU01000403.1:123-260 ;
faidx test.txt LVMU01000403.1:128-247 ;
test.txt
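To make the difference between non-overlapping and embedded matches concrete, here is a minimal Python sketch on a toy sequence (not the attached genome). It wraps the pattern in a capturing lookahead so that a match is attempted at every start position. The complete stop-codon alternation, the toy sequence and the function names are assumptions for illustration; this is not how seqkit locate is implemented, and it only scans the plus strand.

```python
import re

# Assumption: an ORF-like pattern with start [AT]TG and stop TAA/TAG/TGA.
ORF = r"[AT]TG(?:[ACGT]{3})+?(?:TAA|TAG|TGA)"

# A zero-width lookahead with a capture group lets the scan advance one base
# at a time, so matches that start inside an earlier match are still reported.
OVERLAPPING = re.compile(rf"(?=({ORF}))", re.IGNORECASE)


def find_non_overlapping(seq):
    """Ordinary scan: the search resumes after the end of each match."""
    return [(m.start() + 1, m.end()) for m in re.finditer(ORF, seq, re.IGNORECASE)]


def find_overlapping(seq):
    """Shortest match at every start position, as 1-based inclusive coordinates."""
    return [(m.start(1) + 1, m.end(1)) for m in OVERLAPPING.finditer(seq)]


if __name__ == "__main__":
    # Outer ORF spans 1-18; a second ORF in another frame spans 5-13.
    toy = "ATGCATGCCCTGACCTAA"
    print(find_non_overlapping(toy))  # [(1, 18)]
    print(find_overlapping(toy))      # [(1, 18), (5, 13)]
```

The ordinary scan reports only the outer region because the embedded start lies inside text that was already consumed, which is the same kind of miss as the 128-247 ORF inside 123-260.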