seqkit split with regexp does not respect letter case overwriting file output #462

ammaraziz · 2024-04-29T08:37:07Z

Dear ShenWei,

Thank you again for creating and maintaining seqkit. Congrats on seqkit2 publication!

I need to split a FASTA file from GISAID. An example FASTA looks like this:

>hRSV/a/test/123/2021
ATGC
>hRSV/b/test/234/2022
ATGC
>hRSV/A/test/345/2023
ATGC
>hRSV/B/test/567/2024
ATGC

The goal is to split the FASTA file into 2 files. The pattern is essentially hRSV/A/ and hRSV/B/. However, the FASTA file contains both uppercase and lowercase letters (A/a and B/b) in the designation name.

The command I am using:

seqkit split -i --id-regexp "hRSV/(\w)/.+" test.fasta -d

Terminal output looks correct:

[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 1 sequences to file: test.fasta.split/test.part_B.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_b.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_A.fasta

Folder output:

test.part_B.fasta 
test.part_a.fasta

Note there are only 2 files when there should be 4. The contents are also incorrect: they contain the upper case designations.

I suspect the code which writes out the fasta files is ignoring the letter case, resulting in the overwriting of files.
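
The suspected overwrite can be reproduced without seqkit. The following is a minimal Python sketch, not seqkit's code: the helper name write_parts and the file contents are made up for illustration, and the keys are written in the same order as in the log above. On a case-sensitive file system it should leave four files. On a case-insensitive one, a later write should land in the file created first, so only two would remain.

```python
import os
import tempfile


def write_parts(out_dir, keys):
    """Write one tiny file per key and return the file names actually on disk.

    Hypothetical helper for illustration only; it is not part of seqkit.
    """
    for key in keys:
        path = os.path.join(out_dir, f"test.part_{key}.fasta")
        # Mode "w" truncates an existing file, so on a case-insensitive
        # file system "part_b" silently replaces the contents of "part_B".
        with open(path, "w") as handle:
            handle.write(f">hRSV/{key}/example\nATGC\n")
    return sorted(os.listdir(out_dir))


with tempfile.TemporaryDirectory() as tmp:
    # Same order as the [INFO] lines in the report: B, a, b, A.
    files = write_parts(tmp, ["B", "a", "b", "A"])
    print(len(files), files)
```

The count printed depends on the file system that holds the temporary directory, so no fixed output is given here.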

Using seqkit v2.8.1 on macOS (x86, Rosetta), installed via conda.

ammaraziz · 2024-04-29T08:38:04Z

Not related, I installed seqkit via mamba which reports the version as 2.8.1 but seqkit version reports 2.8.0.

shenwei356 · 2024-04-29T09:17:32Z

Supported now. Added a flag --ignore-case.

$ seqkit split  -i --id-regexp "hRSV/(\w)/.+" test.fasta -d 
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 1 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_b.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_A.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_B.fasta

$ seqkit split  -i --id-regexp "hRSV/(\w)/.+" test.fasta -d  --ignore-case
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 2 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 2 sequences to file: test.fasta.split/test.part_b.fasta

$ seqkit split  -i --id-regexp "hRSV/(\w)/.+" test.fasta -d  --ignore-case -2
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] create or read FASTA index ...
[INFO] create FASTA index for test.fasta
[INFO]   4 records loaded from test.fasta.seqkit.fai
[INFO] write 2 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 2 sequences to file: test.fasta.split/test.part_b.fasta

seqkit_darwin_amd64.tar.gz
seqkit_darwin_arm64.tar.gz
seqkit_linux_amd64.tar.gz

> Not related, I installed seqkit via mamba which reports the version as 2.8.1 but seqkit version reports 2.8.0.

Yes, I forgot to bump the version number in the tool.

ammaraziz · 2024-04-29T23:15:51Z

Thank you for the super quick response. For my use case this solves the issue.

But I think the bug still exists. Seqkit sends a message that 4 files are created, but creates 2 due to case conflict. Seqkit split message needs to reflect the output files.

shenwei356 · 2024-04-30T06:16:26Z

> But I think the bug still exists. Seqkit sends a message that 4 files are created

Where is that?

ammaraziz · 2024-04-30T06:26:35Z

Seqkit prints this:

[INFO] write 1 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_b.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_A.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_B.fasta

But it writes out only 2 files.

It needs to print either this:

[INFO] write 1 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_b.fasta

or alternatively actually write case sensitive files (the preferred option in my opinion).

shenwei356 · 2024-04-30T06:54:51Z

I just read this issue once again.

> seqkit split with regexp does not respect letter case overwriting file output
> Note there are only 2 files when there should be 4

And I find that SeqKit already splits in a case-sensitive way.

seqkit 2.8.1 works as you expect.

$ ./seqkit-2.8.1 split -i --id-regexp "hRSV/(\w)/.+" test.fasta --force
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 1 sequences to file: test.fasta.split/test.part_b.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_A.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_B.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_a.fasta

$ seqkit stats test.fasta.split/*
processed files:  4 / 4 [======================================] ETA: 0s. done
file                                format  type  num_seqs  sum_len  min_len  avg_len  max_len
test.fasta.split/test.part_a.fasta  FASTA   DNA          1        4        4        4        4
test.fasta.split/test.part_A.fasta  FASTA   DNA          1        4        4        4        4
test.fasta.split/test.part_b.fasta  FASTA   DNA          1        4        4        4        4
test.fasta.split/test.part_B.fasta  FASTA   DNA          1        4        4        4        4

$ for f in test.fasta.split/*; do echo -e "$f\t$(seqkit head -n 1 $f | seqkit seq -n)"; done
test.fasta.split/test.part_a.fasta      hRSV/a/test/123/2021
test.fasta.split/test.part_A.fasta      hRSV/A/test/345/2023
test.fasta.split/test.part_b.fasta      hRSV/b/test/234/2022
test.fasta.split/test.part_B.fasta      hRSV/B/test/567/2024

Case is ignored only when --ignore-case (added yesterday) is given.

$ seqkit split -i --id-regexp "hRSV/(\w)/.+" test.fasta --force --ignore-case
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 2 sequences to file: test.fasta.split/test.part_b.fasta
[INFO] write 2 sequences to file: test.fasta.split/test.part_a.fasta

$ for f in test.fasta.split/*; do echo -e "$f\t$(seqkit seq -ni $f | paste -sd ,)"; done
test.fasta.split/test.part_a.fasta      hRSV/a/test/123/2021,hRSV/A/test/345/2023
test.fasta.split/test.part_b.fasta      hRSV/b/test/234/2022,hRSV/B/test/567/2024

botond-sipos · 2024-04-30T08:46:58Z

This is likely to be an issue related to the case (in)sensitivity of the macOS file system.

shenwei356 · 2024-04-30T12:39:04Z

> MacOS is not a case sensitive file system by default. So you can't have two files named File.txt and file.txt. You can choose to configure the OS as case sensitive if you want to.

OMG, I just learned this.
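
Whether a given output directory sits on a case-insensitive file system can be probed directly. This is a rough Python sketch under the assumption that creating a temporary file in the directory is acceptable; is_case_insensitive is a hypothetical helper name, not something seqkit provides.

```python
import os
import tempfile


def is_case_insensitive(directory):
    """Return True if `directory` appears to be on a case-insensitive file system.

    Hypothetical helper: creates a temporary file, then asks whether the
    same name with the letter case swapped resolves to an existing file.
    """
    with tempfile.NamedTemporaryFile(prefix="casecheck_", dir=directory) as tmp:
        head, name = os.path.split(tmp.name)
        swapped = os.path.join(head, name.swapcase())
        # On a case-sensitive file system the swapped name does not exist.
        return os.path.exists(swapped)
```

Calling is_case_insensitive(".") before splitting would indicate whether outputs such as test.part_a.fasta and test.part_A.fasta could collide in the current directory.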

ammaraziz · 2024-05-02T00:06:58Z

That's exactly what's happening. Thanks @botond-sipos

@shenwei356 do you think it would be worth adding a disclaimer recommending that macOS users use the --ignore-case flag? This will hopefully stop future issues.

shenwei356 · 2024-05-17T15:14:21Z

Added.

For splitting by sequence IDs in Windows/MacOS, where the file systems might be case-insensitive,
output files might be overwritten if they are only different in cases, like Abc and ABC.

shenwei356 · 2024-05-17T15:16:55Z

Updated:

For splitting by sequence IDs in Windows/MacOS, where the file systems might be case-insensitive,
output files might be overwritten if they are only different in cases, like Abc and ABC.
To avoid this, please switch on -I/--ignore-case.
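
To see in advance whether a split would produce names that differ only in case, the captured keys can be grouped by their lowercase form. This is a minimal Python sketch, not seqkit's implementation: case_collisions is a hypothetical helper, the headers are the four from the example above, and it assumes Python's re module treats this particular pattern the same way as the regexp engine seqkit uses.

```python
import re
from collections import defaultdict


def case_collisions(headers, id_regexp):
    """Return captured keys that differ only by letter case.

    Hypothetical helper: maps each lowercase key to the sorted list of
    original spellings, keeping only keys with more than one spelling.
    """
    pattern = re.compile(id_regexp)
    groups = defaultdict(set)
    for header in headers:
        match = pattern.search(header)
        if match:
            key = match.group(1)  # the part that would end up in the file name
            groups[key.lower()].add(key)
    return {low: sorted(keys) for low, keys in groups.items() if len(keys) > 1}


headers = [
    "hRSV/a/test/123/2021",
    "hRSV/b/test/234/2022",
    "hRSV/A/test/345/2023",
    "hRSV/B/test/567/2024",
]
print(case_collisions(headers, r"hRSV/(\w)/.+"))
# {'a': ['A', 'a'], 'b': ['B', 'b']}
```

A non-empty result suggests that -I/--ignore-case is worth using on Windows/MacOS.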

shenwei356 added a commit that referenced this issue Apr 29, 2024

split: add flag --ignore-case. #462

f8ab09c

shenwei356 added the add some doc label May 2, 2024

shenwei356 added a commit that referenced this issue May 17, 2024

add doc. #462

f56b39c

shenwei356 closed this as completed May 17, 2024

shenwei356 added a commit that referenced this issue May 17, 2024

add doc again. #462

79f556b

shenwei356 mentioned this issue May 17, 2024

Update SeqKit to v2.8.2 bioconda/bioconda-recipes#47933

Merged

BrewTestBot mentioned this issue May 17, 2024

seqkit 2.8.2 Homebrew/homebrew-core#172001

Merged
