Include pairing metadata in fastq-dump output #56

Closed
standage opened this issue Jan 17, 2017 · 2 comments

@standage

When I run fastq-dump --split-files --gzip --A SRR3938279 I get FASTQ headers like this...

@SRR3938279.1 1 length=101
@SRR3938279.2 2 length=101
@SRR3938279.3 3 length=101

...in both the _1.fastq.gz file and the _2.fastq.gz file. Each read is labeled with two copies of its serial number, and the pairing information isn't provided anywhere. Some common conventions for pairing metadata are as follows.

  • @ACCESSION/1 OtherMetadata and @ACCESSION/2 OtherMetadata for left and right reads, respectively
  • @ACCESSION 1:OtherMetadata and @ACCESSION 2:OtherMetadata
  • @ACCESSION OtherMetadata/1 and @ACCESSION OtherMetadata/2

Having some way to distinguish left and right reads is important for quality control, especially once reads are interleaved.
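
For illustration, here is a minimal sketch of the first convention applied after the fact to a file that has already been dumped. The helper name tag_mate is hypothetical, and the sketch assumes strict four-line FASTQ records (no wrapped sequence lines).

```shell
# tag_mate MATE: read FASTQ on stdin and append "/MATE" to the first
# field of every header line. Headers are located by position (every
# 4th line, starting at line 1) because quality lines can also begin
# with "@". Assumes unwrapped, four-line records.
tag_mate() {
  awk -v mate="$1" 'NR % 4 == 1 { $1 = $1 "/" mate } { print }'
}

# Small self-contained demo on one record.
printf '@SRR3938279.1 1 length=101\nACGT\n+\nIIII\n' | tag_mate 1
# first line printed: @SRR3938279.1/1 1 length=101
```

On real output this might be used as gzip -dc SRR3938279_1.fastq.gz | tag_mate 1, and likewise with tag_mate 2 for the _2 file.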

@buddej

buddej commented Jan 23, 2017

Try the --defline-seq option. It's not documented (at all) on the NCBI fastq-dump page, but it is in the help. $ri is the variable you want for /1 and /2.

fastq-dump --help | grep -A 13 defline-seq

  --defline-seq <fmt>              Defline format specification for sequence.
  --defline-qual <fmt>             Defline format specification for quality.
                                   <fmt> is string of characters and/or
                                   variables. The variables can be one of: $ac
                                   - accession, $si spot id, $sn spot
                                   name, $sg spot group (barcode), $sl spot
                                   length in bases, $ri read number, $rn
                                   read name, $rl read length in bases. '[]'
                                   could be used for an optional output: if
                                   all vars in [] yield empty values whole
                                   group is not printed. Empty value is empty
                                   string or for numeric variables. Ex:
                                   @$sn[_$rn]/$ri '_$rn' is omitted if name
                                   is empty
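
Putting the help text together with the original command, an invocation might look like the sketch below. The format string and the dump_paired wrapper are assumptions for illustration, not something confirmed in this thread. The format uses only variables listed in the help above, and the exact headers it produces should be checked against real output.

```shell
# Defline format built from the variables in the help text:
# accession ($ac), spot id ($si), then "/" and the read number ($ri).
# Single quotes stop the shell from expanding $ac, $si and $ri itself.
defline='@$ac.$si/$ri'

# Hypothetical wrapper: dump one accession as split, gzipped FASTQ
# with the custom sequence defline. --defline-qual '+' is optional and
# keeps the quality header line to a bare "+".
dump_paired() {
  fastq-dump --split-files --gzip \
    --defline-seq "$defline" --defline-qual '+' "$1"
}

# Usage (needs sra-tools on PATH and access to the run):
#   dump_paired SRR3938279
```

If $ri behaves as described, the _1 and _2 files should carry headers ending in /1 and /2 respectively.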

@standage

Excellent. Thank you!
