Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

seqkit subseq with bed file does properly handle uppercase and lowercase accession (not case-sensitive?) #400

Closed
4 tasks done
xinehc opened this issue Jul 23, 2023 · 4 comments

Comments

@xinehc
Copy link

xinehc commented Jul 23, 2023

Hi Wei,

I am using seqkit v2.4.0 subseq to extract some sequences from NBCI NR, but I notice that some accessions are consistently missed (tried with --id-ncbi). A more detailed look shows that if two records share an accession (in lower-case) then one of them will be ignored (upper-case). Is it a bug?

Update: I suspect this is caused by indexing the fasta, and the result is somehow random (with 50% ignoring the uppercase and 50% the lower, not always the uppercase). seqtk subseq does not indexing the fasta and seems not having this issue.

bed:

4ZER_1F 6       201     l4/l4e-tcov:0.96-acc:0.92-nr-bacteria   0.9605911330049262      +
4ZER_1f 0       95      s6-tcov:0.95-acc:0.93-nr-bacteria       0.95    +

output:

>4ZER_1f_7-201:+ l4/l4e-tcov:0.96-acc:0.92-nr-bacteria
NIVLNPNLDQSQLALEKEIIQRALENYGARVEKVEELGLRRLAYPIAKDPQGYFLWYQVE
MPEDRVNDLARELRIRDNVRRVMVVKSQEPFLAN
>4ZER_1f_1-95:+ s6-tcov:0.95-acc:0.93-nr-bacteria
MRRYEVNIVLNPNLDQSQLALEKEIIQRALENYGARVEKVEELGLRRLAYPIAKDPQGYF
LWYQVEMPEDRVNDLARELRIRDNVRRVMVVKSQE

input:

>4ZER_1f Chain 1f, 30S ribosomal protein S6 [Thermus thermophilus HB8] >4ZER_2f Chain 2f, 30S ribosomal protein S6 [Thermus thermophilus HB8] >5F8K_1f Chain 1f, 30S ribosomal protein S6 [Thermus thermophilus HB8] >5F8K_2f Chain 2f, 30S ribosomal protein S6 [Thermus thermophilus HB8] >5FDU_1f Chain 1f, 30S ribosomal protein S6 [Thermus thermophilus HB8] >5FDU_2f Chain 2f, 30S ribosomal protein S6 [Thermus thermophilus HB8] >5FDV_1f Chain 1f, 30S ribosomal protein S6 [Thermus thermophilus HB8] >5FDV_2f Chain 2f, 30S ribosomal protein S6 [Thermus thermophilus HB8] >5UQ7_f Chain f, 30S ribosomal protein S6 [Thermus thermophilus HB8] >5UQ8_f Chain f, 30S ribosomal protein S6 [Thermus thermophilus HB8] >6CZR_1f Chain 1f, 30S ribosomal protein S6 [Thermus thermophilus HB8] >6CZR_2f Chain 2f, 30S ribosomal protein S6 [Thermus thermophilus HB8] >6FKR_1f Chain 1f, 30S ribosomal protein S6 [Thermus thermophilus HB8] >6FKR_2f Chain 2f, 30S ribosomal protein S6 [Thermus thermophilus HB8]
MRRYEVNIVLNPNLDQSQLALEKEIIQRALENYGARVEKVEELGLRRLAYPIAKDPQGYF
LWYQVEMPEDRVNDLARELRIRDNVRRVMVVKSQEPFLAN
>4ZER_1F Chain 1F, 50S ribosomal protein L4 [Thermus thermophilus HB8] >4ZER_2F Chain 2F, 50S ribosomal protein L4 [Thermus thermophilus HB8] >5F8K_1F Chain 1F, 50S ribosomal protein L4 [Thermus thermophilus HB8] >5F8K_2F Chain 2F, 50S ribosomal protein L4 [Thermus thermophilus HB8] >5FDU_1F Chain 1F, 50S ribosomal protein L4 [Thermus thermophilus HB8] >5FDU_2F Chain 2F, 50S ribosomal protein L4 [Thermus thermophilus HB8] >5FDV_1F Chain 1F, 50S ribosomal protein L4 [Thermus thermophilus HB8] >5FDV_2F Chain 2F, 50S ribosomal protein L4 [Thermus thermophilus HB8] >5UQ7_F Chain F, 50S ribosomal protein L4 [Thermus thermophilus HB8] >5UQ8_F Chain F, 50S ribosomal protein L4 [Thermus thermophilus HB8] >6CZR_1F Chain 1F, 50S ribosomal protein L4 [Thermus thermophilus HB8] >6CZR_2F Chain 2F, 50S ribosomal protein L4 [Thermus thermophilus HB8] >6FKR_1F Chain 1F, 50S ribosomal protein L4 [Thermus thermophilus HB8] >6FKR_2F Chain 2F, 50S ribosomal protein L4 [Thermus thermophilus HB8]
VYQIPVLSPSGRRELAADLPAEINPHLLWEVVRWQLAKRRRGTASTKTRGEVAYSGRKIW
PQKHTGRARHGDIGAPIFVGGGVVFGPKPRDYSYTLPKKVRKKGLAMAVADRAREGKLLL
VEAFAGVNGKTKEFLAWAKEAGLDGSESVLLVTGNELVRRAARNLPWVVTLAPEGLNVYD
IVRTERLVMDLDAWEVFQNRIGG

Prerequisites

  • make sure you're are using the latest version by seqkit version
  • read the usage

Describe your issue

  • describe the problem
  • provide a reproducible example
@shenwei356
Copy link
Owner

Yes, seqkit subseq ignores cases, so it did not work appropriately for your sequences.

@xinehc
Copy link
Author

xinehc commented Jul 24, 2023

Thanks for confirming this.

I wonder do you have a plan of fixing this by adding a flag like --case-sensitive? Using bedtools/seqtk can solve the problem but adding another dependency only for this task seems not necessary.

@shenwei356
Copy link
Owner

Sure, sequence/chromosome IDs are case-sensitive by default now.

Result checked.

$ seq 1000 | rush 'bedtools getfasta -bed t.bed -fi t.fasta | seqkit sum' | csvtk freq -Ht
seqkit.v0.1_PLS_k0_34cb4f6fd383160b28dc0eaedbf57cae     1000

$ seq 1000 | rush 'seqkit subseq --quiet --bed t.bed t.fasta | seqkit sum' | csvtk freq -Ht
seqkit.v0.1_PLS_k0_34cb4f6fd383160b28dc0eaedbf57cae     1000

@xinehc
Copy link
Author

xinehc commented Jul 24, 2023

It is working now, thanks for the quick fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants