phredsort is a command-line tool for sorting sequences in FASTQ files by their quality scores.
Basic usage:
# Read from `input.fastq.gz` and write to `output.fastq.gz`
phredsort -i input.fastq.gz -o output.fastq.gz
# Read from stdin and write to stdout
zcat input.fastq.gz | phredsort --in - --out - | less -S
# Download the binary from the GitHub release page and make it executable
wget https://github.com/vmikk/phredsort/releases/download/1.3.0/phredsort
chmod +x phredsort
./phredsort --help
# Alternatively, build from source (requires Go)
git clone --depth 1 https://github.com/vmikk/phredsort
cd phredsort
go build -ldflags="-s -w" phredsort.go
./phredsort --help
phredsort supports several metrics (`--metric` parameter) to assess sequence quality:
- Average Phred score: a mean quality score that accounts for the logarithmic nature of Phred scores
- Converts Phred scores to error probabilities, calculates their arithmetic mean, then converts the result back to the Phred scale
- Formula: `-10 * log10(mean(10^(-Q/10)))`
- More accurate than the simple arithmetic mean of Phred scores, which overestimates quality
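The averaging described above can be sketched in a few lines of Python. This is an illustration of the formula only, not phredsort's actual implementation; the function names and the example scores are hypothetical.

```python
import math

def phred_to_prob(q):
    """Convert a Phred score to an error probability."""
    return 10 ** (-q / 10)

def avg_phred(quals):
    """Average the error probabilities, then convert the mean back to the Phred scale."""
    mean_prob = sum(phred_to_prob(q) for q in quals) / len(quals)
    return -10 * math.log10(mean_prob)

# Hypothetical read: two high-quality bases and two low-quality bases
quals = [40, 40, 10, 10]

naive = sum(quals) / len(quals)  # simple arithmetic mean of the Phred scores
proper = avg_phred(quals)        # probability-based mean

print(f"arithmetic mean of Phred scores: {naive:.2f}")   # 25.00
print(f"probability-based mean:          {proper:.2f}")  # 13.01
```

The two Q10 bases dominate the error probability, so the probability-based mean (about Q13) is far lower than the arithmetic mean (Q25).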
- Maximum expected error (MaxEE): the sum of error probabilities for all bases in a sequence
- Formula: `sum(10^(-Q/10))`
- Higher values indicate lower quality
- Depends on sequence length (longer sequences tend to have higher MaxEE)
- MaxEE standardized by sequence length
- Represents expected number of errors per 100 bases
- Formula: `(MaxEE * 100) / sequence_length`
- Higher values indicate lower quality
- Allows fair comparison between sequences of different lengths
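To see why the length-standardized value is easier to compare across reads, the two expected-error formulas can be sketched in Python as follows. The function names and the example reads are hypothetical, and the code only mirrors the formulas above rather than phredsort's internals.

```python
def max_ee(quals):
    """Sum of per-base error probabilities (MaxEE)."""
    return sum(10 ** (-q / 10) for q in quals)

def ee_per_100_bases(quals):
    """MaxEE standardized by sequence length (expected errors per 100 bases)."""
    return max_ee(quals) * 100 / len(quals)

# Two hypothetical reads with identical per-base quality (Q20) but different lengths
short_read = [20] * 100
long_read = [20] * 250

print(max_ee(short_read), max_ee(long_read))                      # grows with length
print(ee_per_100_bases(short_read), ee_per_100_bases(long_read))  # about the same
```

Both reads have the same per-base error rate, yet MaxEE is roughly 1.0 for the short read and 2.5 for the long one. The per-100-bases value is about 1.0 in both cases.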
- Low-quality base count (lqcount): the number of bases below a specified quality threshold
- Useful for binned quality scores (e.g., data from Illumina NovaSeq platform)
- Counts bases with Phred score < threshold (default: 15)
- Higher values indicate lower quality
- Percentage of bases below quality threshold
- Formula: `(lqcount * 100) / sequence_length`
- Higher values indicate lower quality
- Normalizes low-quality base count by sequence length
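The two threshold-based metrics can be illustrated with a short Python sketch. The function names are hypothetical, and the example read uses binned scores (2, 12, 23, 37) chosen for illustration. The default threshold of 15 follows the description above.

```python
def lq_count(quals, threshold=15):
    """Count bases whose Phred score is strictly below the threshold."""
    return sum(1 for q in quals if q < threshold)

def lq_percent(quals, threshold=15):
    """Low-quality base count expressed as a percentage of sequence length."""
    return lq_count(quals, threshold) * 100 / len(quals)

# Hypothetical 10-base read with binned quality scores
quals = [37, 37, 23, 12, 2, 37, 37, 23, 37, 37]

print(lq_count(quals))    # 2 (the bases with Q12 and Q2)
print(lq_percent(quals))  # 20.0
```

With binned scores only a few distinct quality values occur, so counting the bases that fall below a cutoff is a simple way to rank such reads.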