Benchmark xz and zstd performance #298

Closed
heuermh opened this issue Mar 31, 2022 · 4 comments

@heuermh

heuermh commented Mar 31, 2022

FYI

I wanted to benchmark xz and zstd performance for dsh-bio, which uses BioJava FASTQ parsing (circa 2003) and compression provided by Apache Commons Compress with default settings. dsh-bio fully parses and validates all FASTQ records, so I used similar settings for seqkit.

Shell script in this Gist.

Compression results:

| Command | Real time | Time (sec) | Disk usage |
|---|---|---|---|
| xz --compress --stdout -0 dataset_C.fq > dataset_C.0.fq.xz | 1m24.844s | 84 | 528M |
| xz --compress --stdout dataset_C.fq > dataset_C.default.fq.xz | 20m36.807s | 1236 | 416M |
| xz --compress --stdout -9 dataset_C.fq > dataset_C.9.fq.xz | 31m19.497s | 1879 | 384M |
| xz --compress --stdout --extreme dataset_C.fq > dataset_C.extreme.fq.xz | 26m40.244s | 1600 | 400M |
| zstd -1 -k dataset_C.fq -o dataset_C.1.fq.zst | 0m5.379s | 5 | 565M |
| zstd -k dataset_C.fq -o dataset_C.default.fq.zst | 0m6.863s | 6 | 541M |
| zstd -6 -k dataset_C.fq -o dataset_C.6.fq.zst | 0m28.487s | 28 | 512M |
| zstd -19 -k dataset_C.fq -o dataset_C.19.fq.zst | 20m7.238s | 1207 | 416M |
| dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.xz | 31m23.348s | 1883 | 400M |
| dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.gz | 3m14.372s | 194 | 512M |
| dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.bgz | 1m22.520s | 82 | 544M |
| dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.bzip2 | 0m15.845s | 15 | 2.1G |
| dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.zst | 0m23.295s | 23 | 528M |
| seqkit seq --validate-seq --line-width 0 dataset_C.fq --out-file dataset_C.seqkit.fq.xz | 2m26.924s | 146 | 512M |
| seqkit seq --validate-seq --line-width 0 dataset_C.fq --out-file dataset_C.seqkit.fq.zst | 0m8.624s | 8 | 528M |
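
For anyone who wants to reproduce the shape of the compression results without the original dataset, here is a minimal sketch using only the Python standard library. It is not the Gist script. The `make_fastq` and `bench` helpers and the synthetic reads are assumptions made for illustration, and the `lzma` presets 0, 6 and 9 are used as stand-ins for `xz -0`, the xz default, and `xz -9`. zstd is left out because it is not in the standard library of most Python versions.

```python
import gzip
import lzma
import random
import time


def make_fastq(n_records, read_len=100, seed=0):
    """Build synthetic FASTQ bytes (hypothetical stand-in for dataset_C.fq)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_records):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("FGHI") for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def bench(label, compress, data):
    """Time one compression call; return (label, seconds, size, ratio)."""
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    # Ratio here is compressed / uncompressed, as in `xz --list`.
    return label, elapsed, len(out), len(out) / len(data)


data = make_fastq(2000)
results = [
    bench("lzma preset 0", lambda d: lzma.compress(d, preset=0), data),
    bench("lzma preset 6", lambda d: lzma.compress(d, preset=6), data),
    bench("lzma preset 9", lambda d: lzma.compress(d, preset=9), data),
    bench("gzip level 6", lambda d: gzip.compress(d, compresslevel=6), data),
]
for label, seconds, size, ratio in results:
    print(f"{label:<14} {seconds:8.3f}s {size:>9} bytes  ratio {ratio:.3f}")
```

Synthetic random reads compress differently from real sequencing data, so the absolute numbers will not match the table; only the relative cost of the presets is meaningful.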

File info:

$ xz --list *.xz
Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
    1       1    524.4 MiB  2,212.5 MiB  0.237  CRC64   dataset_C.0.fq.xz
    1       1    377.2 MiB  2,212.5 MiB  0.171  CRC64   dataset_C.9.fq.xz
    1       1    407.1 MiB  2,212.5 MiB  0.184  CRC64   dataset_C.default.fq.xz
    1       1    392.2 MiB  2,195.0 MiB  0.179  CRC64   dataset_C.dsh-bio.fq.xz
    1       1    396.9 MiB  2,212.5 MiB  0.179  CRC64   dataset_C.extreme.fq.xz
    1       1    500.9 MiB  2,195.0 MiB  0.228  CRC64   dataset_C.seqkit.fq.xz
-------------------------------------------------------------------------------
    6       6  2,598.7 MiB     12.9 GiB  0.196  CRC64   6 files

$ zstd -l *.zst
Frames  Skips  Compressed  Uncompressed  Ratio  Check  Filename
     1      0     561 MiB      2.16 GiB  3.943  XXH64  dataset_C.1.fq.zst
     1      0     415 MiB      2.16 GiB  5.335  XXH64  dataset_C.19.fq.zst
     1      0     502 MiB      2.16 GiB  4.406  XXH64  dataset_C.6.fq.zst
     1      0     539 MiB      2.16 GiB  4.108  XXH64  dataset_C.default.fq.zst
     1      0     521 MiB                        None  dataset_C.dsh-bio.fq.zst
     1      0     527 MiB                       XXH64  dataset_C.seqkit.fq.zst
-----------------------------------------------------------------
     6      0    2.99 GiB                              6 files
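
Note that the two listings report the ratio in opposite directions: the `xz --list` values match compressed / uncompressed (smaller is better), while the `zstd -l` values match uncompressed / compressed (larger is better). A small sketch for putting them on the same scale, using sizes in MiB copied from the listings above (the function names are made up for illustration):

```python
def xz_style_ratio(compressed, uncompressed):
    """Compressed / uncompressed, as printed by `xz --list`."""
    return compressed / uncompressed


def zstd_style_ratio(compressed, uncompressed):
    """Uncompressed / compressed, as printed by `zstd -l`."""
    return uncompressed / compressed


# Sizes in MiB from the xz listing for dataset_C.0.fq.xz.
xz0 = xz_style_ratio(524.4, 2212.5)
print(round(xz0, 3))      # 0.237
print(round(1 / xz0, 3))  # 4.219, the same file on the zstd scale

# xz -9 expressed on the zstd scale, for comparison with zstd -19 (5.335).
xz9_on_zstd_scale = zstd_style_ratio(377.2, 2212.5)
print(round(xz9_on_zstd_scale, 2))
```

Because the listings print rounded sizes, values recomputed this way can differ from the printed ratios in the last digit.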

Decompression results:

| Command | Real time | Time (sec) |
|---|---|---|
| dsh-bio compress-fastq -i dataset_C.dsh-bio.fq.xz | 0m54.403s | 54 |
| dsh-bio compress-fastq -i dataset_C.dsh-bio.fq.zst | 0m14.891s | 14 |
| dsh-bio compress-fastq -i dataset_C.seqkit.fq.xz | 0m55.885s | 55 |
| dsh-bio compress-fastq -i dataset_C.seqkit.fq.zst | 0m16.153s | 16 |
| seqkit seq --validate-seq --line-width 0 dataset_C.dsh-bio.fq.xz | 0m55.961s | 55 |
| seqkit seq --validate-seq --line-width 0 dataset_C.dsh-bio.fq.zst | 0m2.951s | 2 |
| seqkit seq --validate-seq --line-width 0 dataset_C.seqkit.fq.xz | 1m6.193s | 66 |
| seqkit seq --validate-seq --line-width 0 dataset_C.seqkit.fq.zst | 0m3.097s | 3 |

TL;DR

seqkit is fast, and prefer zstd 😉

@shenwei356
Owner

That looks good! The fast speed is owed to the great packages:

@akhst7

akhst7 commented Jul 21, 2022

@heuermh

Can you try `xz --best -T0` and see how fast it is?

@shenwei356
Owner

I just noticed that the pgzip package sets its default gzip compression level to 5 rather than 6 (the default of gzip); level 5 should be faster.

I've added a global flag (--compress-level) to set the compression level for gzip, zstd, and bzip2. #320

@heuermh
Author

heuermh commented Apr 10, 2024

Thank you! Closing this issue as complete.

heuermh closed this as completed Apr 10, 2024