Seqkit rmdup by ID does not find duplicates #486
Please paste the result of |
zcat ./fastq/all_fastq.trimmed.rmduped.fastq.gz | grep -i "d0668d59-0124-4376-8df9-2a6d5cffee82"
@d0668d59-0124-4376-8df9-2a6d5cffee82 st:Z:2024-07-23T23:14:10.058+00:00 RG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0 DS:Z:gpu:NVIDIA_1 GeForce RTX 2080 Ti
Weird, it works when I create a FASTQ file with the headers above. Could you please run the commands below? I need to make sure the IDs are extracted correctly.
and
|
zcat ./fastq/all_fastq.trimmed.rmduped.fastq.gz | grep -i "d0668d59-0124-4376-8df9-2a6d5cffee82" | cat -A
@d0668d59-0124-4376-8df9-2a6d5cffee82^Ist:Z:2024-07-23T23:14:10.058+00:00^IRG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0^IDS:Z:gpu:NVIDIA_1 GeForce RTX 2080 Ti$
seqkit head -n 5 ./fastq/all_fastq.trimmed.rmduped.fastq.gz | seqkit seq -ni
0988965e-cf7f-4a13-8c36-83f811fe130b st:Z:2024-07-23T23:18:23.387+00:00 RG:Z:7f9661239176c399dd58386be589feac999e28be_dna_r10.4.1_e8.2_400bps_sup@v5.0.0 DS:Z:gpu:NVIDIA
OK, it's due to the tab between the ID and description. However, seqkit is able to handle this. Please confirm the seqkit version again:
|
seqkit version
seqkit v2.8.2
I've found the reason. I used a trick to speed up ID parsing. It works for most cases, but not for yours, where there is a tab between the ID and the description, and then some spaces in the description. For now, please add this option to
|
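The kind of mismatch described above can be illustrated outside seqkit. The sketch below is not seqkit's actual code, and the thread does not say exactly what the speed-up trick was. It only assumes a hypothetical shortcut that cuts the header at the first space, and compares it with cutting at the first whitespace character (space or tab). The function names `id_by_first_space` and `id_by_first_whitespace` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical shortcut (an assumption, not seqkit's code):
# treat everything before the first space as the ID.
id_by_first_space() {
    printf '%s\n' "${1%% *}"
}

# Whitespace-aware variant: cut at the first space or tab.
id_by_first_whitespace() {
    printf '%s\n' "$1" | awk '{ sub(/[ \t].*/, ""); print }'
}

# Header shaped like the one in this thread (shortened):
# a tab after the ID, then spaces further into the description.
header=$(printf 'd0668d59-0124-4376-8df9-2a6d5cffee82\tst:Z:2024-07-23T23:14:10.058+00:00\tDS:Z:gpu:NVIDIA GeForce RTX 2080 Ti')

id_by_first_space "$header"        # ID plus the tab-separated fields up to the first space
id_by_first_whitespace "$header"   # the bare read ID
```

With a header that uses a plain space after the ID, both functions return the same string, which is why such a shortcut can go unnoticed on most data.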
Fixed. Details:
|
It works! Thanks for your help!
The previous bugfix failed to recognize regular header formats that use a space as the delimiter between the ID and the description. I just fixed it.
Hi,
Background:
I have some ONT data that I basecalled with Dorado and I wanted to process it with the Filtlong tool.
The problem is that Filtlong does not allow duplicated read IDs and apparently, some were found in my data.
I have searched for ways to remove duplicates in a big fastq file and I have found seqkit and the rmdup command.
I first used the following command:
zcat ./fastq/all_fastq.trimmed.fastq.gz | seqkit rmdup -s -o ./fastq/all_fastq.trimmed.rmduped.fastq.gz
[INFO] 8 duplicated records removed
But that was not enough, as Filtlong still refused to run.
I then tried to remove duplicates by ID:
zcat ./fastq/all_fastq.trimmed.rmduped.fastq.gz | seqkit rmdup -o ./fastq/all_fastq.trimmed.rmduped2.fastq.gz
[INFO] 0 duplicated records removed
But it did not find any duplicates. However, when I run zcat | grep on the file with an ID that Filtlong reports as duplicated, I get several matches. So there do seem to be duplicated IDs, since both grep and Filtlong find them, yet seqkit finds nothing.
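As an independent cross-check of the grep result, duplicated IDs can be listed with plain awk. This is a minimal sketch, not part of seqkit or Filtlong: the helper name `dup_ids` is hypothetical, and it assumes a strict four-lines-per-record FASTQ (no wrapped sequences) where the ID is the header text up to the first space or tab.

```shell
#!/bin/sh
# Print each read ID that occurs more than once in a FASTQ stream on stdin.
# Assumption: 4 lines per record, so lines 1, 5, 9, ... are headers.
dup_ids() {
    awk 'NR % 4 == 1 {
        id = substr($0, 2)          # drop the leading "@"
        sub(/[ \t].*/, "", id)      # keep the text before the first space or tab
        if (++seen[id] == 2) print id   # report an ID once, on its second occurrence
    }'
}

# Tiny made-up example: read "r1" appears twice with a tab-delimited description.
printf '@r1\tst:Z:x y\nACGT\n+\nIIII\n@r2 desc\nAC\n+\nII\n@r1\tst:Z:x y\nACGT\n+\nIIII\n' | dup_ids
```

On the data from this post it could be run as `zcat ./fastq/all_fastq.trimmed.rmduped.fastq.gz | dup_ids`. Memory use grows with the number of distinct IDs, which may matter for very large files.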
Do you have any idea why this happens?
Thanks for your help,
PS: I am using seqkit v2.8.2 installed with conda/mamba on a Linux cluster.