Seqkit rmdup by ID does not find duplicates #486

ThDef · 2024-09-05T14:28:24Z

Hi,

Background :

I have some ONT data that I basecalled with Dorado and I wanted to process it with the Filtlong tool.
The problem is that Filtlong does not allow duplicated read IDs and apparently, some were found in my data.
I have searched for ways to remove duplicates in a big fastq file and I have found seqkit and the rmdup command.
I first used the following command :

zcat ./fastq/all_fastq.trimmed.fastq.gz | seqkit rmdup -s -o ./fastq/all_fastq.trimmed.rmduped.fastq.gz
[INFO] 8 duplicated records removed

But it was not enough as Filtlong still could not be used.
I then tried to remove duplicates by ID :

zcat ./fastq/all_fastq.trimmed.rmduped.fastq.gz | seqkit rmdup -o ./fastq/all_fastq.trimmed.rmduped2.fastq.gz
[INFO] 0 duplicated records removed

But it did not find any duplicates. The problem is that when I zcat | grep the file with an ID from a duplicate given by Filtlong, it results in matches. It seems that there are indeed duplicates by ID as grep and Filtlong find them but when I use seqkit, nothing is found.

Do you have an idea why this happens ?

Thanks for your help,

PS : I am using seqkit v.2.8.2 installed with conda/mamba on a linux cluster.

shenwei356 · 2024-09-05T14:37:02Z

Please paste the result of zcat | grep -i ID.

ThDef · 2024-09-05T14:41:07Z

zcat ./fastq/all_fastq.trimmed.rmduped.fastq.gz | grep -i "d0668d59-0124-4376-8df9-2a6d5cffee82"

@d0668d59-0124-4376-8df9-2a6d5cffee82 st:Z:2024-07-23T23:14:10.058+00:00 RG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0 DS:Z:gpu:NVIDIA_1 GeForce RTX 2080 Ti
@d0668d59-0124-4376-8df9-2a6d5cffee82 st:Z:2024-07-23T23:14:10.058+00:00 RG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0 DS:Z:gpu:NVIDIA_2 GeForce RTX 2080 Ti

shenwei356 · 2024-09-05T14:51:03Z

Weird, it works when I create a fastq file with headers above.

Can you please run the command below, I need to make sure the IDs are extracted correctly.

zcat ./fastq/all_fastq.trimmed.rmduped.fastq.gz | grep -i "d0668d59-0124-4376-8df9-2a6d5cffee82" | cat -A

and

 seqkit head -n 5  ./fastq/all_fastq.trimmed.rmduped.fastq.gz | seqkit seq -ni

# hope to see results like those:
d0668d59-0124-4376-8df9-2a6d5cffee82
d0668d59-0124-4376-8df9-2a6d5cffee82

ThDef · 2024-09-05T14:57:05Z

zcat ./fastq/all_fastq.trimmed.rmduped.fastq.gz | grep -i "d0668d59-0124-4376-8df9-2a6d5cffee82" | cat -A

@d0668d59-0124-4376-8df9-2a6d5cffee82^Ist:Z:2024-07-23T23:14:10.058+00:00^IRG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0^IDS:Z:gpu:NVIDIA_1 GeForce RTX 2080 Ti$
@d0668d59-0124-4376-8df9-2a6d5cffee82^Ist:Z:2024-07-23T23:14:10.058+00:00^IRG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0^IDS:Z:gpu:NVIDIA_2 GeForce RTX 2080 Ti$

seqkit head -n 5 ./fastq/all_fastq.trimmed.rmduped.fastq.gz | seqkit seq -ni

0988965e-cf7f-4a13-8c36-83f811fe130b st:Z:2024-07-23T23:18:23.387+00:00 RG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0 DS:Z:gpu:NVIDIA
473353d0-5aeb-4ed2-a366-93b5063ccb64 st:Z:2024-07-23T23:18:30.690+00:00 RG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0 DS:Z:gpu:NVIDIA
a476a761-c858-4221-9fb1-4582e8ac89a9 st:Z:2024-07-23T23:18:14.307+00:00 RG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0 DS:Z:gpu:NVIDIA
7d013023-9e0e-4bda-98c5-f04a0fe899f5 st:Z:2024-07-23T23:18:28.109+00:00 RG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0 DS:Z:gpu:NVIDIA
f4b14ed9-c92b-4462-9daf-0040727fbb53 st:Z:2024-07-23T23:18:36.487+00:00 RG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0 DS:Z:gpu:NVIDIA

shenwei356 · 2024-09-05T15:02:31Z

OK, it's due to the tab between the ID and description. However, seqkit is able to handle this.

Please confirm the seqkit version again:

seqkit version

ThDef · 2024-09-05T15:03:13Z

seqkit version

seqkit v2.8.2

shenwei356 · 2024-09-05T15:29:36Z

I've found the reason. I used a trick to speed up id parsing. It works for most cases, but not for yours, where you have a tab between ID and the description, and then some spaces in the description.

Currently, please add this option to seqkit rmdup, which will produce the right result.

--id-regexp "^([^\s]+)\s?"

shenwei356 · 2024-09-05T16:01:25Z

Fixed.

~~seqkit_linux_amd64.tar.gz~~
~~seqkit_linux_arm64.tar.gz~~

Details:

fix sequence ID parsing with the default regular expression for a rare case: "xxx\tyyy zzz" was wrongly parsed as "xxx\tyyy". shenwei356/bio@da8442f
When the default regular expression is used, we actually use bytes.Index() instead as it's faster. However, I didn't consider this case.

ThDef · 2024-09-06T08:10:14Z

It works !

Thanks for your help !

shenwei356 · 2024-09-12T10:44:27Z

The previous bugfix failed to recognize regular header formats with space as the delimiter between header and description... I just fixed it.

shenwei356 added a commit that referenced this issue Sep 5, 2024

fix sequence ID parsing of a rare case. #486

d6b1faf

ThDef closed this as completed Sep 6, 2024

shenwei356 mentioned this issue Nov 1, 2024

Update SeqKit to v2.9.0 bioconda/bioconda-recipes#51860

Merged

BrewTestBot mentioned this issue Nov 1, 2024

seqkit 2.9.0 Homebrew/homebrew-core#196354

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Seqkit rmdup by ID does not find duplicates #486

Seqkit rmdup by ID does not find duplicates #486

ThDef commented Sep 5, 2024 •

edited

Loading

shenwei356 commented Sep 5, 2024 •

edited

Loading

ThDef commented Sep 5, 2024 •

edited

Loading

shenwei356 commented Sep 5, 2024

ThDef commented Sep 5, 2024

shenwei356 commented Sep 5, 2024

ThDef commented Sep 5, 2024

shenwei356 commented Sep 5, 2024 •

edited

Loading

shenwei356 commented Sep 5, 2024 •

edited

Loading

ThDef commented Sep 6, 2024

shenwei356 commented Sep 12, 2024

Seqkit rmdup by ID does not find duplicates #486

Seqkit rmdup by ID does not find duplicates #486

Comments

ThDef commented Sep 5, 2024 • edited Loading

shenwei356 commented Sep 5, 2024 • edited Loading

ThDef commented Sep 5, 2024 • edited Loading

shenwei356 commented Sep 5, 2024

ThDef commented Sep 5, 2024

shenwei356 commented Sep 5, 2024

ThDef commented Sep 5, 2024

shenwei356 commented Sep 5, 2024 • edited Loading

shenwei356 commented Sep 5, 2024 • edited Loading

ThDef commented Sep 6, 2024

shenwei356 commented Sep 12, 2024

ThDef commented Sep 5, 2024 •

edited

Loading

shenwei356 commented Sep 5, 2024 •

edited

Loading

ThDef commented Sep 5, 2024 •

edited

Loading

shenwei356 commented Sep 5, 2024 •

edited

Loading

shenwei356 commented Sep 5, 2024 •

edited

Loading