Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

seqkit common/seqkit grep #416

Closed
4 tasks done
kakuk9 opened this issue Oct 3, 2023 · 3 comments
Closed
4 tasks done

seqkit common/seqkit grep #416

kakuk9 opened this issue Oct 3, 2023 · 3 comments

Comments

@kakuk9
Copy link

kakuk9 commented Oct 3, 2023

Prerequisites

  • make sure you're are using the latest version by seqkit version
  • read the usage

Describe your issue

  • describe the problem
  • provide a reproducible example

Hi,

I have been trying to find a tool to compare reads from fastq files of different files to look for identical reads to see if there is any indication of cross-contaminations. I was trying to use seqkit common and seqkit grep.

This is the output of seqkit common,

seqkit common -s S1_R1_uniq.fq.gz S2_R1_uniq.fq.gz > S1_S2_uniq_common.fastq
[INFO] read file: S1_R1_uniq.fq.gz
[INFO] read file: S2_R1_uniq.fq.gz
[INFO] find common seqs ...
[INFO] 7830 unique sequences found in 2 files, which belong to 3915 records in the first file: S1_R1_uniq.fq.gz
[INFO] retrieve seqs from the first file: S2_R1_uniq.fq.gz

I am a bit confused with this line - "[INFO] 7830 unique sequences found in 2 files, which belong to 3915 records in the first file: S1_R1_uniq.fq.gz". Does it mean that there are 3915 common sequences shared by two fastq files?

I have also tried to use seqkit grep like this -
seqkit grep -s -f <(seqkit seq -s S2_R1_uniq.fq.gz) S1_R1_uniq.fq.gz > S2_S1_seqkit_grep.fastq
This process seems to take longer than seqkit common in my case. (number of reads in fastq files ~250-380k).

@shenwei356
Copy link
Owner

This process seems to take longer than seqkit common in my case

Thanks for reporting this. The help message below needs to be updated as the search mechanism of seqkit grep -s changed.

  3. For 2 files, 'seqkit grep' is much faster and consumes lesser memory:
     seqkit grep -f <(seqkit seq -n -i small.fq.gz) big.fq.gz # by seq ID
     seqkit grep -s -f <(seqkit seq -s small.fq.gz) big.fq.gz # by seq

Updated:

  3. For 2 files, 'seqkit grep' is much faster and consumes lesser memory:
       seqkit grep -f <(seqkit seq -n -i small.fq.gz) big.fq.gz # by seq ID

     But note that searching by sequence would be much slower, as it's
     partly string matching.
       seqkit grep -s -f <(seqkit seq -s small.fq.gz) big.fq.gz # much slower!!!!

For the information below, the first number indicates the number of signatures. In the case of searching by sequences, they are hash values of both positive and negative strands. I shall make it clearer.

[INFO] 7830 unique sequences found in 2 files, which belong to 3915 records in the first file: S1_R1_uniq.fq.gz

@kakuk9
Copy link
Author

kakuk9 commented Oct 3, 2023

This is really quick and helpful response. Thanks a lot for your clarification. Your tool has been amazing and very useful!

shenwei356 added a commit that referenced this issue Oct 4, 2023
@shenwei356
Copy link
Owner

The number is fixed.

seqkit_linux_amd64.tar.gz

$ seqkit common -s hairpin.fa hairpin.fa | seqkit stats 
[INFO] read file: hairpin.fa
[INFO] read file: hairpin.fa
[INFO] find common seqs ...
[INFO] 26379 unique sequences found in 2 files, which belong to 28645 records in the first file: hairpin.fa
[INFO] retrieve 28645 seqs from the first file: hairpin.fa
file  format  type  num_seqs    sum_len  min_len  avg_len  max_len
-     FASTA   RNA     28,645  2,949,871       39      103    2,354

$ seqkit rmdup -s hairpin.fa | seqkit stats 
[INFO] 2266 duplicated records removed
file  format  type  num_seqs    sum_len  min_len  avg_len  max_len
-     FASTA   RNA     26,379  2,748,673       39    104.2    2,354

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants