Include pairing metadata in fastq-dump output #56

standage · 2017-01-17T00:41:15Z

When I run fastq-dump --split-files --gzip --A SRR3938279 I get FASTQ headers like this...

@SRR3938279.1 1 length=101
@SRR3938279.2 2 length=101
@SRR3938279.3 3 length=101

...in both the _1.fastq.gz file and the _2.fastq.gz file. Each read is labeled with two copies of its serial number, but the pairing information isn't provided anywhere. Some common conventions for pairing metadata are as follows.

@ACCESSION/1 OtherMetadata and @ACCESSION/2 OtherMetadata for left and right reads, respectively
@ACCESSION 1:OtherMetadata and @ACCESSION 2:OtherMetadata
@ACCESSION OtherMetadata/1 and @ACCESSION OtherMetadata/2

Having some way to distinguish left and right reads is important for quality control, especially once reads are interleaved.
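To make the first convention concrete, here is a minimal sketch of tagging headers after the fact with awk. The helper name `tag_mate` and the output file names are hypothetical, and the sketch assumes strict 4-line FASTQ records (no wrapped sequence lines), so the header is always line 1, 5, 9, and so on.

```shell
# Hypothetical helper: append /<mate> to the read ID of every FASTQ record.
# Assumes strict 4-line records, so only lines 1, 5, 9, ... are headers.
tag_mate() {
    awk -v mate="$1" 'NR % 4 == 1 { $1 = $1 "/" mate } { print }'
}

# Possible usage on the files from --split-files (output names are made up):
#   gzip -dc SRR3938279_1.fastq.gz | tag_mate 1 | gzip > tagged_1.fastq.gz
#   gzip -dc SRR3938279_2.fastq.gz | tag_mate 2 | gzip > tagged_2.fastq.gz
```

Only the `@` header line is rewritten; the `+` line and the sequence and quality lines pass through unchanged.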


buddej · 2017-01-23T17:05:38Z

Try the --defline-seq option. It isn't documented at all on the NCBI fastq-dump page, but it does appear in the --help output. $ri is the variable you want for /1 and /2.

fastq-dump --help | grep -A 13 defline-seq

  --defline-seq <fmt>              Defline format specification for sequence.
  --defline-qual <fmt>             Defline format specification for quality.
                                   <fmt> is string of characters and/or
                                   variables. The variables can be one of: $ac
                                   - accession, $si spot id, $sn spot
                                   name, $sg spot group (barcode), $sl spot
                                   length in bases, $ri read number, $rn
                                   read name, $rl read length in bases. '[]'
                                   could be used for an optional output: if
                                   all vars in [] yield empty values whole
                                   group is not printed. Empty value is empty
                                   string or for numeric variables. Ex:
                                   @$sn[_$rn]/$ri '_$rn' is omitted if name
                                   is empty
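Putting that together for the accession in this issue, a command along these lines should work. This is a sketch rather than a tested recipe: the particular format string `@$ac.$si/$ri` is just one choice built from the variables listed in the help, and the `command -v` guard is only there so the snippet does nothing where the SRA Toolkit isn't installed.

```shell
# Single quotes keep the shell from expanding $ac, $si and $ri itself;
# fastq-dump needs to receive them literally.
defline='@$ac.$si/$ri'

# Guard (an addition for this sketch): skip the call if fastq-dump is absent.
if command -v fastq-dump >/dev/null 2>&1; then
    # --defline-qual '+' keeps the quality separator line short.
    fastq-dump --split-files --gzip \
        --defline-seq "$defline" --defline-qual '+' \
        SRR3938279
fi
```

With this format I'd expect headers of the form @SRR3938279.1/1 in the _1 file and @SRR3938279.1/2 in the _2 file, but check the first few lines of each output to confirm.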

standage · 2017-01-23T17:13:09Z

Excellent. Thank you!

standage closed this as completed Jan 23, 2017

johnsolk mentioned this issue May 5, 2018

add --defline-seq to fastq-dump command in get_data.py johnsolk/MMETSP#18

Open
