Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

seqkit split with regexp does not respect letter case overwriting file output #462

Closed
ammaraziz opened this issue Apr 29, 2024 · 11 comments
Closed

Comments

@ammaraziz
Copy link

Dear ShenWei,

Thank you again for creating and maintaining seqkit. Congrats on seqkit2 publication!

I need to split a fasta file from gisaid. An example fasta looks like this

>hRSV/a/test/123/2021
ATGC
>hRSV/b/test/234/2022
ATGC
>hRSV/A/test/345/2023
ATGC
>hRSV/B/test/567/2024
ATGC

The goal is to split the fasta files into 2 files. The pattern is essentially hRSV/A/ and hRSV/B/. However the fasta file contains capitals and lowercase A/a and B/b in the designation name.

The command I am using:

seqkit split -i --id-regexp "hRSV/(\w)/.+" test.fasta -d

Terminal output looks correct:

[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 1 sequences to file: test.fasta.split/test.part_B.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_b.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_A.fasta

Folder output:

test.part_B.fasta 
test.part_a.fasta

Note there are only 2 files when there should be 4. The contents are also incorrect, they contain the upper case designations.

I suspect the code which writes out the fasta files is ignoring the letter case, resulting in the overwriting of files.

Using seqkit v2.8.1 on macos (x86 rosetta) installed via conda.

@ammaraziz
Copy link
Author

Not related, I installed seqkit via mamba which reports the version as 2.8.1 but seqkit version reports 2.8.0.

shenwei356 added a commit that referenced this issue Apr 29, 2024
@shenwei356
Copy link
Owner

Supported now. Added a flag --ignore-case.

$ seqkit split  -i --id-regexp "hRSV/(\w)/.+" test.fasta -d 
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 1 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_b.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_A.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_B.fasta

$ seqkit split  -i --id-regexp "hRSV/(\w)/.+" test.fasta -d  --ignore-case
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 2 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 2 sequences to file: test.fasta.split/test.part_b.fasta

$ seqkit split  -i --id-regexp "hRSV/(\w)/.+" test.fasta -d  --ignore-case -2
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] create or read FASTA index ...
[INFO] create FASTA index for test.fasta
[INFO]   4 records loaded from test.fasta.seqkit.fai
[INFO] write 2 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 2 sequences to file: test.fasta.split/test.part_b.fasta

seqkit_darwin_amd64.tar.gz
seqkit_darwin_arm64.tar.gz
seqkit_linux_amd64.tar.gz

Not related, I installed seqkit via mamba which reports the version as 2.8.1 but seqkit version reports 2.8.0.

Yes, I forgot to bump the version number in the tool.

@ammaraziz
Copy link
Author

Thank you for the super quick response. For my usecase this is solves the issue.

But I think the bug still exists. Seqkit sends a message that 4 files are created, but creates 2 due to case conflict. Seqkit split message needs to reflect the output files.

@shenwei356
Copy link
Owner

But I think the bug still exists. Seqkit sends a message that 4 files are created

Where's it?

@ammaraziz
Copy link
Author

Seqkit prints this:

[INFO] write 1 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_b.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_A.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_B.fasta

But it writes out only 2 files.

It needs to print either this:

[INFO] write 1 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_b.fasta

or alternatively actually write case sensitive files (the preferred option in my opinion).

@shenwei356
Copy link
Owner

I just read this issue once again.

seqkit split with regexp does not respect letter case overwriting file output
Note there are only 2 files when there should be 4

And find that SeqKit does so in a case-sensitive way.

seqkit 2.8.1 works as you expect.

$ ./seqkit-2.8.1 split -i --id-regexp "hRSV/(\w)/.+" test.fasta --force
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 1 sequences to file: test.fasta.split/test.part_b.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_A.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_B.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_a.fasta

$ seqkit stats test.fasta.split/*
processed files:  4 / 4 [======================================] ETA: 0s. done
file                                format  type  num_seqs  sum_len  min_len  avg_len  max_len
test.fasta.split/test.part_a.fasta  FASTA   DNA          1        4        4        4        4
test.fasta.split/test.part_A.fasta  FASTA   DNA          1        4        4        4        4
test.fasta.split/test.part_b.fasta  FASTA   DNA          1        4        4        4        4
test.fasta.split/test.part_B.fasta  FASTA   DNA          1        4        4        4        4

$ for f in test.fasta.split/*; do echo -e "$f\t$(seqkit head -n 1 $f | seqkit seq -n)"; done
test.fasta.split/test.part_a.fasta      hRSV/a/test/123/2021
test.fasta.split/test.part_A.fasta      hRSV/A/test/345/2023
test.fasta.split/test.part_b.fasta      hRSV/b/test/234/2022
test.fasta.split/test.part_B.fasta      hRSV/B/test/567/2024

Only adding --ignore-case (added yesterday) would ignore the case.

$ seqkit split -i --id-regexp "hRSV/(\w)/.+" test.fasta --force --ignore-case
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 2 sequences to file: test.fasta.split/test.part_b.fasta
[INFO] write 2 sequences to file: test.fasta.split/test.part_a.fasta

$ for f in test.fasta.split/*; do echo -e "$f\t$(seqkit seq -ni $f | paste -sd ,)"; done
test.fasta.split/test.part_a.fasta      hRSV/a/test/123/2021,hRSV/A/test/345/2023
test.fasta.split/test.part_b.fasta      hRSV/b/test/234/2022,hRSV/B/test/567/2024

@botond-sipos
Copy link
Contributor

This is likely to be an issue related to the case (in)sensitivity of MacOS file system.

@shenwei356
Copy link
Owner

MacOS is not a case sensitive file system by default. So you can't have two files named File.txt and file.txt. You can choose to configure the OS as case sensitive if you want to.

OMG, I just learned this.

@ammaraziz
Copy link
Author

That's exactly what's happening. Thanks @botond-sipos

@shenwei356 do you think adding a disclaimer to recommend macos users use the ignore case flag? This will hopefully stop future issues.

shenwei356 added a commit that referenced this issue May 17, 2024
@shenwei356
Copy link
Owner

Added.

  1. For splitting by sequence IDs in Windows/MacOS, where the file systems might be case-insensitive,
    output files might be overwritten if they are only different in cases, like Abc and ABC.

shenwei356 added a commit that referenced this issue May 17, 2024
@shenwei356
Copy link
Owner

Updated:

  1. For splitting by sequence IDs in Windows/MacOS, where the file systems might be case-insensitive,
    output files might be overwritten if they are only different in cases, like Abc and ABC.
    To avoid this, please switch one -I/--ignore-case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants