Benchmark xz and zstd performance #298

heuermh · 2022-03-31T21:57:08Z

FYI

I wanted to benchmark xz and zstd performance for dsh-bio, which uses BioJava FASTQ parsing circa 2003 and compression provided by Apache Commons Compress with default settings. dsh-bio fully parses and validates all FASTQ records, so I used similar settings for seqkit.

Shell script in this Gist.

Compression results:

| Command | Real time | Time (sec) | Disk usage |
| --- | --- | --- | --- |
| `xz --compress --stdout -0 dataset_C.fq > dataset_C.0.fq.xz` | 1m24.844s | 84 | 528M |
| `xz --compress --stdout dataset_C.fq > dataset_C.default.fq.xz` | 20m36.807s | 1236 | 416M |
| `xz --compress --stdout -9 dataset_C.fq > dataset_C.9.fq.xz` | 31m19.497s | 1879 | 384M |
| `xz --compress --stdout --extreme dataset_C.fq > dataset_C.extreme.fq.xz` | 26m40.244s | 1600 | 400M |
| `zstd -1 -k dataset_C.fq -o dataset_C.1.fq.zst` | 0m5.379s | 5 | 565M |
| `zstd -k dataset_C.fq -o dataset_C.default.fq.zst` | 0m6.863s | 6 | 541M |
| `zstd -6 -k dataset_C.fq -o dataset_C.6.fq.zst` | 0m28.487s | 28 | 512M |
| `zstd -19 -k dataset_C.fq -o dataset_C.19.fq.zst` | 20m7.238s | 1207 | 416M |
| `dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.xz` | 31m23.348s | 1883 | 400M |
| `dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.gz` | 3m14.372s | 194 | 512M |
| `dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.bgz` | 1m22.520s | 82 | 544M |
| `dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.bzip2` | 0m15.845s | 15 | 2.1G |
| `dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.zst` | 0m23.295s | 23 | 528M |
| `seqkit seq --validate-seq --line-width 0 dataset_C.fq --out-file dataset_C.seqkit.fq.xz` | 2m26.924s | 146 | 512M |
| `seqkit seq --validate-seq --line-width 0 dataset_C.fq --out-file dataset_C.seqkit.fq.zst` | 0m8.624s | 8 | 528M |
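
The Time (sec) column appears to be the `real` value truncated to whole seconds. A minimal sketch of that conversion, plus a rough throughput figure, assuming the 2,212.5 MiB uncompressed size that `xz --list` reports for the xz-compressed copies of `dataset_C.fq`. The function names `real_to_seconds` and `throughput_mib_s` are illustrative and are not taken from the Gist:

```shell
# Convert a shell `time` "real" value (e.g. 1m24.844s) to whole seconds,
# truncating the fractional part as the Time (sec) column does.
real_to_seconds() {
  printf '%s\n' "$1" | LC_ALL=C awk -F'[ms]' '{ print int($1 * 60 + $2) }'
}

# Throughput in MiB/s, given an uncompressed size in MiB and a "real" value.
# The size is an assumption supplied by the caller, not measured here.
throughput_mib_s() {
  printf '%s\n' "$2" | LC_ALL=C awk -F'[ms]' -v size="$1" \
    '{ printf "%.1f\n", size / ($1 * 60 + $2) }'
}

real_to_seconds 20m36.807s          # prints 1236 (xz default row)
throughput_mib_s 2212.5 0m5.379s    # prints 411.3 (zstd -1 row)
```

Throughput computed this way is wall-clock only; it includes disk I/O and, for dsh-bio and seqkit, FASTQ parsing and validation.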

File info:

$ xz --list *.xz
Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
    1       1    524.4 MiB  2,212.5 MiB  0.237  CRC64   dataset_C.0.fq.xz
    1       1    377.2 MiB  2,212.5 MiB  0.171  CRC64   dataset_C.9.fq.xz
    1       1    407.1 MiB  2,212.5 MiB  0.184  CRC64   dataset_C.default.fq.xz
    1       1    392.2 MiB  2,195.0 MiB  0.179  CRC64   dataset_C.dsh-bio.fq.xz
    1       1    396.9 MiB  2,212.5 MiB  0.179  CRC64   dataset_C.extreme.fq.xz
    1       1    500.9 MiB  2,195.0 MiB  0.228  CRC64   dataset_C.seqkit.fq.xz
-------------------------------------------------------------------------------
    6       6  2,598.7 MiB     12.9 GiB  0.196  CRC64   6 files

$ zstd -l *.zst
Frames  Skips  Compressed  Uncompressed  Ratio  Check  Filename
     1      0     561 MiB      2.16 GiB  3.943  XXH64  dataset_C.1.fq.zst
     1      0     415 MiB      2.16 GiB  5.335  XXH64  dataset_C.19.fq.zst
     1      0     502 MiB      2.16 GiB  4.406  XXH64  dataset_C.6.fq.zst
     1      0     539 MiB      2.16 GiB  4.108  XXH64  dataset_C.default.fq.zst
     1      0     521 MiB                        None  dataset_C.dsh-bio.fq.zst
     1      0     527 MiB                       XXH64  dataset_C.seqkit.fq.zst
-----------------------------------------------------------------
     6      0    2.99 GiB                              6 files
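
Note that the two listings report Ratio in opposite directions: `xz --list` prints compressed/uncompressed (smaller is better), while `zstd -l` prints uncompressed/compressed (larger is better). A small sketch for putting an xz ratio on the zstd scale; `invert_ratio` is a hypothetical helper name used only for illustration:

```shell
# Invert a compression ratio, e.g. turn xz's compressed/uncompressed value
# into the uncompressed/compressed form that zstd prints.
invert_ratio() {
  LC_ALL=C awk -v r="$1" 'BEGIN { printf "%.2f\n", 1 / r }'
}

invert_ratio 0.237   # dataset_C.0.fq.xz, prints 4.22
invert_ratio 0.171   # dataset_C.9.fq.xz, prints 5.85
```

On that common scale, `xz -9` (about 5.85) compresses somewhat better than `zstd -19` (5.335) on this dataset.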

Decompression results:

| Command | Real time | Time (sec) |
| --- | --- | --- |
| `dsh-bio compress-fastq -i dataset_C.dsh-bio.fq.xz` | 0m54.403s | 54 |
| `dsh-bio compress-fastq -i dataset_C.dsh-bio.fq.zst` | 0m14.891s | 14 |
| `dsh-bio compress-fastq -i dataset_C.seqkit.fq.xz` | 0m55.885s | 55 |
| `dsh-bio compress-fastq -i dataset_C.seqkit.fq.zst` | 0m16.153s | 16 |
| `seqkit seq --validate-seq --line-width 0 dataset_C.dsh-bio.fq.xz` | 0m55.961s | 55 |
| `seqkit seq --validate-seq --line-width 0 dataset_C.dsh-bio.fq.zst` | 0m2.951s | 2 |
| `seqkit seq --validate-seq --line-width 0 dataset_C.seqkit.fq.xz` | 1m6.193s | 66 |
| `seqkit seq --validate-seq --line-width 0 dataset_C.seqkit.fq.zst` | 0m3.097s | 3 |

TL;DR

seqkit is fast, and prefer zstd 😉

shenwei356 · 2022-04-01T06:55:01Z

That looks good! The fast speed is owed to the great packages:

akhst7 · 2022-07-21T20:27:30Z

@heuermh

Can you try `xz --best -T0` and see how fast it is?

shenwei356 · 2023-03-14T14:39:44Z

I just noticed that the pgzip package uses a default gzip compression level of 5 rather than 6 (gzip's own default); level 5 should be faster.

I've added a global flag (`--compress-level`) to set the compression level for gzip, zstd, and bzip2. #320

heuermh · 2024-04-10T17:39:20Z

Thank you! Closing this issue as complete.

heuermh closed this as completed Apr 10, 2024
