seqkit common/seqkit grep #416

kakuk9 · 2023-10-03T14:36:30Z

Hi,

I have been looking for a tool that compares reads between different FASTQ files and finds identical reads, to check for any indication of cross-contamination. I tried seqkit common and seqkit grep.

This is the output of seqkit common:

seqkit common -s S1_R1_uniq.fq.gz S2_R1_uniq.fq.gz > S1_S2_uniq_common.fastq
[INFO] read file: S1_R1_uniq.fq.gz
[INFO] read file: S2_R1_uniq.fq.gz
[INFO] find common seqs ...
[INFO] 7830 unique sequences found in 2 files, which belong to 3915 records in the first file: S1_R1_uniq.fq.gz
[INFO] retrieve seqs from the first file: S2_R1_uniq.fq.gz

I am a bit confused by this line: "[INFO] 7830 unique sequences found in 2 files, which belong to 3915 records in the first file: S1_R1_uniq.fq.gz". Does it mean that the two FASTQ files share 3915 common sequences?

I have also tried seqkit grep, like this:
seqkit grep -s -f <(seqkit seq -s S2_R1_uniq.fq.gz) S1_R1_uniq.fq.gz > S2_S1_seqkit_grep.fastq
This seems to take longer than seqkit common in my case (each FASTQ file has about 250-380k reads).


shenwei356 · 2023-10-03T15:54:39Z

> This process seems to take longer than seqkit common in my case

Thanks for reporting this. The help message below needs to be updated as the search mechanism of seqkit grep -s changed.

  3. For 2 files, 'seqkit grep' is much faster and consumes lesser memory:
     seqkit grep -f <(seqkit seq -n -i small.fq.gz) big.fq.gz # by seq ID
     seqkit grep -s -f <(seqkit seq -s small.fq.gz) big.fq.gz # by seq

Updated:

  3. For 2 files, 'seqkit grep' is much faster and consumes lesser memory:
       seqkit grep -f <(seqkit seq -n -i small.fq.gz) big.fq.gz # by seq ID

     But note that searching by sequence would be much slower, as it's
     partly string matching.
       seqkit grep -s -f <(seqkit seq -s small.fq.gz) big.fq.gz # much slower!!!!
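
To illustrate why searching by sequence can be slower than an exact lookup, here is a minimal Python sketch. It is not seqkit's implementation. It assumes that "partly string matching" means a pattern may match anywhere inside a read (substring search), and the function names and toy reads are made up for illustration.

```python
def exact_matches(patterns, reads):
    """Keep reads whose full sequence equals a pattern (one set lookup per read)."""
    wanted = set(patterns)
    return [r for r in reads if r in wanted]


def substring_matches(patterns, reads):
    """Keep reads containing any pattern as a substring (scans the patterns for every read)."""
    return [r for r in reads if any(p in r for p in patterns)]


# Toy data, invented for this example.
patterns = ["ACGTAC", "GGGTTT"]
reads = ["ACGTAC", "TTACGTACAA", "CCCCCC"]

exact = exact_matches(patterns, reads)
partial = substring_matches(patterns, reads)

print(exact)    # ['ACGTAC']
print(partial)  # ['ACGTAC', 'TTACGTACAA']
```

With hundreds of thousands of patterns and reads, the substring variant does far more work per read than a single hash lookup. Under this assumption it can also return reads that merely contain a pattern, so the two approaches need not give identical results.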

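In the log line below, the first number is the number of signatures, not the number of sequences. When searching by sequence, the signatures are hash values of both the positive and negative strands, which is why 7830 is exactly twice 3915. The second number, 3915, is the number of matching records in the first file. I shall make it clearer.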

[INFO] 7830 unique sequences found in 2 files, which belong to 3915 records in the first file: S1_R1_uniq.fq.gz
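
The doubling can be reproduced with a small Python sketch. This is only an illustration of the idea of storing one hash per strand. The hash function (MD5), the helper names, and the toy sequences are assumptions, not what seqkit uses internally.

```python
import hashlib

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def signatures(seq):
    """Return the set of hashes for a sequence and its reverse complement."""
    return {hashlib.md5(s.encode()).hexdigest() for s in (seq, revcomp(seq))}


# Toy records, invented for this example; none is its own reverse complement.
records = ["AAAC", "ACGG", "TTGA"]

all_sigs = set()
for seq in records:
    all_sigs |= signatures(seq)

print(len(records), len(all_sigs))  # 3 6
```

A sequence that equals its own reverse complement, such as ACGT, yields a single signature, so the ratio is exactly two only when no such sequences are present.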

kakuk9 · 2023-10-03T16:04:58Z

This is a really quick and helpful response. Thanks a lot for the clarification. Your tool has been amazing and very useful!

shenwei356 · 2023-10-04T08:15:24Z

The number is fixed.

seqkit_linux_amd64.tar.gz

$ seqkit common -s hairpin.fa hairpin.fa | seqkit stats 
[INFO] read file: hairpin.fa
[INFO] read file: hairpin.fa
[INFO] find common seqs ...
[INFO] 26379 unique sequences found in 2 files, which belong to 28645 records in the first file: hairpin.fa
[INFO] retrieve 28645 seqs from the first file: hairpin.fa
file  format  type  num_seqs    sum_len  min_len  avg_len  max_len
-     FASTA   RNA     28,645  2,949,871       39      103    2,354

$ seqkit rmdup -s hairpin.fa | seqkit stats 
[INFO] 2266 duplicated records removed
file  format  type  num_seqs    sum_len  min_len  avg_len  max_len
-     FASTA   RNA     26,379  2,748,673       39    104.2    2,354
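
The two runs above agree with each other: the number of records minus the number of duplicates removed equals the number of unique sequences. A minimal Python sketch of that relation follows, using made-up toy sequences and a hypothetical helper name. It ignores strand handling and is not how seqkit computes these values.

```python
def dedup_summary(seqs):
    """Return (number of unique sequences, number of duplicate records removed)."""
    unique = len(set(seqs))
    removed = len(seqs) - unique
    return unique, removed


# Toy data, invented for this example: 6 records, 3 distinct sequences.
toy = ["AAAC", "ACGG", "AAAC", "TTGA", "ACGG", "AAAC"]
unique, removed = dedup_summary(toy)
print(unique, removed)  # 3 3

# The same relation holds for the numbers reported in the logs above.
records_in_file = 28645
duplicates_removed = 2266
print(records_in_file - duplicates_removed)  # 26379
```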

shenwei356 added a commit that referenced this issue Oct 4, 2023

common: for matching by sequences: reduced the memory occupation and corrected numbers in the log. #416 (8de960b)

shenwei356 closed this as completed Nov 7, 2023

shenwei356 mentioned this issue Nov 9, 2023

Update SeqKit to 2.6.0 bioconda/bioconda-recipes#44191

Merged

BrewTestBot mentioned this issue Nov 9, 2023

seqkit 2.6.0 Homebrew/homebrew-core#153813

Merged
