Benchmark xz and zstd performance #298
Comments
That looks good! The fast speed is owed to the great packages:
Can you try xz --best -T0 and see how fast it is?
Thank you! Closing this issue as complete.
FYI
I wanted to benchmark xz and zstd performance for dsh-bio, which uses BioJava FASTQ parsing circa 2003 and compression provided by Apache Commons Compress with default settings. dsh-bio fully parses and validates all FASTQ records, so I used similar settings for seqkit.
Shell script in this Gist.
Compression results:
xz --compress --stdout -0 dataset_C.fq > dataset_C.0.fq.xz
xz --compress --stdout dataset_C.fq > dataset_C.default.fq.xz
xz --compress --stdout -9 dataset_C.fq > dataset_C.9.fq.xz
xz --compress --stdout --extreme dataset_C.fq > dataset_C.extreme.fq.xz
zstd -1 -k dataset_C.fq -o dataset_C.1.fq.zst
zstd -k dataset_C.fq -o dataset_C.default.fq.zst
zstd -6 -k dataset_C.fq -o dataset_C.6.fq.zst
zstd -19 -k dataset_C.fq -o dataset_C.19.fq.zst
dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.xz
dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.gz
dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.bgz
dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.bzip2
dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.zst
seqkit seq --validate-seq --line-width 0 dataset_C.fq --out-file dataset_C.seqkit.fq.xz
seqkit seq --validate-seq --line-width 0 dataset_C.fq --out-file dataset_C.seqkit.fq.zst
File info:
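One way to collect this kind of file info, sketched under the assumption that the output names from the compression commands above are used. The `size_ratio` helper is a hypothetical name for illustration and is not taken from the Gist; it prints each compressed file's size in bytes and its size as a percentage of the uncompressed input.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the Gist): compare a compressed file to its source.
# Usage: size_ratio ORIGINAL COMPRESSED
# Prints: COMPRESSED<TAB>BYTES<TAB>PERCENT_OF_ORIGINAL
size_ratio() {
  # wc -c may pad its output with spaces; $(( )) normalizes it to a plain integer.
  orig_bytes=$(( $(wc -c < "$1") ))
  comp_bytes=$(( $(wc -c < "$2") ))
  awk -v f="$2" -v o="$orig_bytes" -v c="$comp_bytes" \
    'BEGIN { printf "%s\t%d\t%.1f%%\n", f, c, (o > 0 ? 100 * c / o : 0) }'
}

# Example calls, assuming the files from the compression step exist:
#   size_ratio dataset_C.fq dataset_C.default.fq.xz
#   size_ratio dataset_C.fq dataset_C.default.fq.zst
```

Reporting the ratio next to the raw byte count makes it easier to weigh compression level against the time each command took.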
Decompression results:
dsh-bio compress-fastq -i dataset_C.dsh-bio.fq.xz
dsh-bio compress-fastq -i dataset_C.dsh-bio.fq.zst
dsh-bio compress-fastq -i dataset_C.seqkit.fq.xz
dsh-bio compress-fastq -i dataset_C.seqkit.fq.zst
seqkit seq --validate-seq --line-width 0 dataset_C.dsh-bio.fq.xz
seqkit seq --validate-seq --line-width 0 dataset_C.dsh-bio.fq.zst
seqkit seq --validate-seq --line-width 0 dataset_C.seqkit.fq.xz
seqkit seq --validate-seq --line-width 0 dataset_C.seqkit.fq.zst
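For anyone who wants to repeat the timings, here is a minimal sketch of a timing wrapper. It is not the script from the Gist: the `bench` function name and the tab-separated output format are assumptions made for illustration. It relies on bash's built-in `time` keyword and the `TIMEFORMAT` variable to report wall-clock seconds for a single run.

```shell
#!/usr/bin/env bash
# Hypothetical timing wrapper (not the Gist script).
# Usage: bench LABEL COMMAND [ARGS...]
# Runs COMMAND once, discards its stdout and stderr, and prints
# "LABEL<TAB>ELAPSED_SECONDS" using bash's `time` keyword.
bench() {
  local label=$1
  shift
  local TIMEFORMAT='%R'   # %R = elapsed wall-clock time in seconds
  local elapsed
  # `time` writes its report to the shell's stderr, so capture the group's stderr.
  elapsed=$( { time "$@" > /dev/null 2>&1; } 2>&1 )
  printf '%s\t%s\n' "$label" "$elapsed"
}

# Example calls, assuming dataset_C.fq and the tools are available.
# Commands that redirect to a file are wrapped in `bash -c`:
#   bench xz-default   bash -c 'xz --compress --stdout dataset_C.fq > dataset_C.default.fq.xz'
#   bench zstd-default zstd -k dataset_C.fq -o dataset_C.default.fq.zst
#   bench seqkit-zst   seqkit seq --validate-seq --line-width 0 dataset_C.seqkit.fq.zst
```

A single run per command is noisy; repeating each call a few times and taking the median would give steadier numbers.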
TL;DR
seqkit is fast, and prefer zstd 😉