Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seqkit rmdup by ID does not find duplicates #486

Closed
ThDef opened this issue Sep 5, 2024 · 10 comments
Closed

Seqkit rmdup by ID does not find duplicates #486

ThDef opened this issue Sep 5, 2024 · 10 comments

Comments

@ThDef
Copy link

ThDef commented Sep 5, 2024

Hi,

Background :

I have some ONT data that I basecalled with Dorado and I wanted to process it with the Filtlong tool.
The problem is that Filtlong does not allow duplicated read IDs and apparently, some were found in my data.
I have searched for ways to remove duplicates in a big fastq file and I have found seqkit and the rmdup command.
I first used the following command :

zcat ./fastq/all_fastq.trimmed.fastq.gz | seqkit rmdup -s -o ./fastq/all_fastq.trimmed.rmduped.fastq.gz
[INFO] 8 duplicated records removed

But it was not enough as Filtlong still could not be used.
I then tried to remove duplicates by ID :

zcat ./fastq/all_fastq.trimmed.rmduped.fastq.gz | seqkit rmdup -o ./fastq/all_fastq.trimmed.rmduped2.fastq.gz
[INFO] 0 duplicated records removed

But it did not find any duplicates. The problem is that when I zcat | grep the file with an ID from a duplicate given by Filtlong, it results in matches. It seems that there are indeed duplicates by ID as grep and Filtlong find them but when I use seqkit, nothing is found.

Do you have an idea why this happens ?

Thanks for your help,

PS : I am using seqkit v.2.8.2 installed with conda/mamba on a linux cluster.

@shenwei356
Copy link
Owner

shenwei356 commented Sep 5, 2024

Please paste the result of zcat | grep -i ID.

@ThDef
Copy link
Author

ThDef commented Sep 5, 2024

zcat ./fastq/all_fastq.trimmed.rmduped.fastq.gz | grep -i "d0668d59-0124-4376-8df9-2a6d5cffee82"

@d0668d59-0124-4376-8df9-2a6d5cffee82 st:Z:2024-07-23T23:14:10.058+00:00 RG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0 DS:Z:gpu:NVIDIA_1 GeForce RTX 2080 Ti
@d0668d59-0124-4376-8df9-2a6d5cffee82 st:Z:2024-07-23T23:14:10.058+00:00 RG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0 DS:Z:gpu:NVIDIA_2 GeForce RTX 2080 Ti

@shenwei356
Copy link
Owner

Weird, it works when I create a fastq file with headers above.

Can you please run the command below, I need to make sure the IDs are extracted correctly.

zcat ./fastq/all_fastq.trimmed.rmduped.fastq.gz | grep -i "d0668d59-0124-4376-8df9-2a6d5cffee82" | cat -A

and

 seqkit head -n 5  ./fastq/all_fastq.trimmed.rmduped.fastq.gz | seqkit seq -ni

# hope to see results like those:
d0668d59-0124-4376-8df9-2a6d5cffee82
d0668d59-0124-4376-8df9-2a6d5cffee82

@ThDef
Copy link
Author

ThDef commented Sep 5, 2024

zcat ./fastq/all_fastq.trimmed.rmduped.fastq.gz | grep -i "d0668d59-0124-4376-8df9-2a6d5cffee82" | cat -A

@d0668d59-0124-4376-8df9-2a6d5cffee82^Ist:Z:2024-07-23T23:14:10.058+00:00^IRG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0^IDS:Z:gpu:NVIDIA_1 GeForce RTX 2080 Ti$
@d0668d59-0124-4376-8df9-2a6d5cffee82^Ist:Z:2024-07-23T23:14:10.058+00:00^IRG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0^IDS:Z:gpu:NVIDIA_2 GeForce RTX 2080 Ti$


seqkit head -n 5 ./fastq/all_fastq.trimmed.rmduped.fastq.gz | seqkit seq -ni

0988965e-cf7f-4a13-8c36-83f811fe130b st:Z:2024-07-23T23:18:23.387+00:00 RG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0 DS:Z:gpu:NVIDIA
473353d0-5aeb-4ed2-a366-93b5063ccb64 st:Z:2024-07-23T23:18:30.690+00:00 RG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0 DS:Z:gpu:NVIDIA
a476a761-c858-4221-9fb1-4582e8ac89a9 st:Z:2024-07-23T23:18:14.307+00:00 RG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0 DS:Z:gpu:NVIDIA
7d013023-9e0e-4bda-98c5-f04a0fe899f5 st:Z:2024-07-23T23:18:28.109+00:00 RG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0 DS:Z:gpu:NVIDIA
f4b14ed9-c92b-4462-9daf-0040727fbb53 st:Z:2024-07-23T23:18:36.487+00:00 RG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0 DS:Z:gpu:NVIDIA

@shenwei356
Copy link
Owner

OK, it's due to the tab between the ID and description. However, seqkit is able to handle this.

Please confirm the seqkit version again:

seqkit version

@ThDef
Copy link
Author

ThDef commented Sep 5, 2024

seqkit version

seqkit v2.8.2

@shenwei356
Copy link
Owner

shenwei356 commented Sep 5, 2024

I've found the reason. I used a trick to speed up id parsing. It works for most cases, but not for yours, where you have a tab between ID and the description, and then some spaces in the description.

Currently, please add this option to seqkit rmdup, which will produce the right result.

--id-regexp "^([^\s]+)\s?"

@shenwei356
Copy link
Owner

shenwei356 commented Sep 5, 2024

Fixed.

Details:

  • fix sequence ID parsing with the default regular expression for a rare case: "xxx\tyyy zzz" was wrongly parsed as "xxx\tyyy". shenwei356/bio@da8442f
  • When the default regular expression is used, we actually use bytes.Index() instead as it's faster. However, I didn't consider this case.

@ThDef
Copy link
Author

ThDef commented Sep 6, 2024

It works !

Thanks for your help !

@ThDef ThDef closed this as completed Sep 6, 2024
@shenwei356
Copy link
Owner

The previous bugfix failed to recognize regular header formats with space as the delimiter between header and description... I just fixed it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants