io_uring based implementation of b3sum #328

Open
1f604 opened this issue Jul 27, 2023 · 2 comments

@1f604
Contributor

1f604 commented Jul 27, 2023

Hi all! I wrote an io_uring-based implementation of b3sum here: https://github.com/1f604/liburing_b3sum

I wrote two versions: a single-threaded version in C and a multi-threaded version in C++. On my system, the single-threaded version is around 25% faster than the official Rust b3sum and slightly faster than both cat to /dev/null and fio. It hashes a 10GiB file in 2.899s, which works out to around 3532MiB/s (about 3704MB/s), roughly the read speed advertised for my NVMe drive ("3500MB/s"). The multi-threaded implementation is around 1% slower than the single-threaded one.
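As a quick sanity check on the throughput figure, the arithmetic can be reproduced with awk. This is only a sketch: the helper name `throughput` is made up for illustration, and the inputs are the 10 GiB file size and the 2.899s median quoted above.

```shell
# throughput <bytes> <seconds> <unit-in-bytes>
# Prints bytes / seconds / unit, rounded to a whole number.
# (Hypothetical helper, not part of liburing_b3sum.)
throughput() {
    awk -v bytes="$1" -v secs="$2" -v unit="$3" \
        'BEGIN { printf "%.0f\n", bytes / secs / unit }'
}

# 10 GiB = 10 * 1024^3 = 10737418240 bytes, hashed in 2.899 s
throughput 10737418240 2.899 1048576   # MiB/s (binary units) -> 3532
throughput 10737418240 2.899 1000000   # MB/s (decimal units) -> 3704
```

Drive vendors normally quote decimal MB/s, so the second figure is the one to compare against the advertised "3500MB/s".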

Benchmarks

For these tests, I used the same 1 GiB (or 10 GiB) input file and always flushed the page cache before each test, thus ensuring that the programs are always reading from disk. Each command was run 10 times and I used the "real" result from time to calculate the statistics. I ran these commands on a Debian 12 system (uname -r returns "6.1.0-9-amd64") using ext4 without disk encryption and without LVM.
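The procedure described above (drop the page cache, run the command, record the "real" time, repeat 10 times, then take min/median/max) could be scripted roughly as follows. This is a sketch under assumptions, not the script that produced the numbers in this issue: the function names `bench` and `stats` are hypothetical, `bench` relies on bash's `time` keyword and `TIMEFORMAT`, and it has to run as root to write to drop_caches.

```shell
#!/usr/bin/env bash

# stats: read one time (in seconds) per line on stdin,
# print the minimum, median and maximum.
stats() {
    sort -n | awk '
        { a[NR] = $1 }
        END {
            if (NR == 0) exit 1
            if (NR % 2) med = a[(NR + 1) / 2]
            else        med = (a[NR / 2] + a[NR / 2 + 1]) / 2
            printf "min=%g median=%g max=%g\n", a[1], med, a[NR]
        }'
}

# bench <runs> <command...>: print the wall-clock time of each cold-cache run.
# Needs bash (time keyword, TIMEFORMAT) and root (drop_caches).
bench() {
    runs=$1; shift
    TIMEFORMAT=%R                          # make `time` print only the real time
    i=0
    while [ "$i" -lt "$runs" ]; do
        echo 1 > /proc/sys/vm/drop_caches  # flush the page cache
        sleep 1
        { time "$@" > /dev/null 2>&1; } 2>&1
        i=$((i + 1))
    done
}
```

A run would then look like `bench 10 b3sum 1GB.txt --no-mmap | stats`. Commands that contain a pipe, such as `cat 1GB.txt | ./example`, would need to be wrapped, for example as `bench 10 sh -c 'cat 1GB.txt | ./example'`.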

| Command | Min | Median | Max |
| --- | --- | --- | --- |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 1` | 0.404s | 0.4105s | 0.416s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 2` | 0.474s | 0.4755s | 0.481s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 3` | 0.44s | 0.4415s | 0.451s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 4` | 0.443s | 0.4475s | 0.452s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 5` | 0.454s | 0.4585s | 0.462s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 6` | 0.456s | 0.4605s | 0.463s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 7` | 0.461s | 0.4635s | 0.468s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 8` | 0.461s | 0.464s | 0.47s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --no-mmap` | 0.381s | 0.386s | 0.394s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./b3sum_linux 1GB.txt --no-mmap` | 0.379s | 0.39s | 0.404s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time cat 1GB.txt \| ./example` | 0.364s | 0.3745s | 0.381s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time cat 1GB.txt > /dev/null` | 0.302s | 0.302s | 0.303s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=64K \| ./example` | 0.338s | 0.341s | 0.348s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=64K of=/dev/null` | 0.303s | 0.306s | 0.308s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=2M \| ./example` | 0.538s | 0.5415s | 0.544s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=2M of=/dev/null` | 0.302s | 0.303s | 0.304s |
| `fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=1g --blocksize=512k --ioengine=io_uring --fsync=10000 --iodepth=2 --direct=1 --numjobs=1 --runtime=60 --group_reporting` | 0.302s | 0.3025s | 0.303s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_singlethread 1GB.txt 512 2 1 0 2 0 0` | 0.301s | 0.301s | 0.302s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_multithread 1GB.txt 512 2 1 0 2 0 0` | 0.303s | 0.304s | 0.305s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_singlethread 1GB.txt 128 20 0 0 8 0 0` | 0.375s | 0.378s | 0.384s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_multithread 1GB.txt 128 20 0 0 8 0 0` | 0.304s | 0.305s | 0.307s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time xxhsum 1GB.txt` | 0.318s | 0.3205s | 0.325s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time cat 10GB.txt > /dev/null` | 2.903s | 2.904s | 2.908s |
| `echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_singlethread 10GB.txt 512 4 1 0 4 0 0` | 2.898s | 2.899s | 2.903s |

In the table above, liburing_b3sum_singlethread and liburing_b3sum_multithread are my own io_uring-based implementations of b3sum (more details below), and I verified that my b3sum implementations always produced the same BLAKE3 hash output as the official b3sum implementation. The 1GB.txt file was generated using this command:

dd if=/dev/urandom of=1GB.txt bs=1G count=1
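The hash-equality check mentioned above could be automated along these lines. This is a sketch only: `same_digest` and `first_token` are hypothetical helpers, and the comparison assumes each program prints the hex digest as the first whitespace-separated token of its output, which holds for b3sum but is an assumption for the io_uring programs.

```shell
# first_token: print the first whitespace-separated token of stdin.
first_token() { awk '{ print $1; exit }'; }

# same_digest <cmd1> <cmd2>: succeed only if both commands print the same
# non-empty first token (assumed to be the hex digest).
same_digest() {
    a=$(sh -c "$1" | first_token)
    b=$(sh -c "$2" | first_token)
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

For example: `same_digest 'b3sum 1GB.txt' './liburing_b3sum_singlethread 1GB.txt 512 2 1 0 2 0 0' && echo match`.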

I installed b3sum using this command:

cargo install b3sum
$ b3sum --version
b3sum 1.4.1

I downloaded the b3sum_linux program from the BLAKE3 GitHub Releases page (it was the latest Linux binary):

$ ./b3sum_linux --version
b3sum 1.4.1

I compiled the example program from the example.c file in the BLAKE3 C repository as per the instructions in the BLAKE3 C repository:

gcc -O3 -o example example.c blake3.c blake3_dispatch.c blake3_portable.c \
    blake3_sse2_x86-64_unix.S blake3_sse41_x86-64_unix.S blake3_avx2_x86-64_unix.S \
    blake3_avx512_x86-64_unix.S

I installed xxhsum using this command:

apt install xxhash
$ xxhsum --version
xxhsum 0.8.1 by Yann Collet
compiled as 64-bit x86_64 autoVec little endian with GCC 11.2.0

Note

As the table above shows, the single-threaded version needs O_DIRECT in order to be fast: it hashes the 1GiB file in 0.301s with O_DIRECT and 0.378s without (median times). The flag that controls whether or not to use O_DIRECT is the third number after the filename in the command line arguments. The multi-threaded version is fast even without O_DIRECT (0.304s with O_DIRECT and 0.305s without). For more details, see article.md in the repository; the same article is also posted in a few other places, one of them with somewhat nicer formatting than GitHub.
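To look at the effect of O_DIRECT on its own, independent of any hashing, a plain read can be done both ways with dd. This is a sketch, not a measurement from this issue: `read_file` is a hypothetical wrapper, `iflag=direct` asks GNU dd to open the input with O_DIRECT, and the 512K block size is chosen only to mirror the `--blocksize=512k` used in the fio row of the table.

```shell
# read_file <path> <buffered|direct>: read the whole file and discard the data.
# "direct" bypasses the page cache via O_DIRECT; "buffered" goes through it.
read_file() {
    case "$2" in
        direct)   dd if="$1" of=/dev/null bs=512K iflag=direct status=none ;;
        buffered) dd if="$1" of=/dev/null bs=512K status=none ;;
        *)        echo "usage: read_file <path> <buffered|direct>" >&2; return 2 ;;
    esac
}
```

Timing `read_file 1GB.txt direct` and `read_file 1GB.txt buffered` after dropping the page cache each time gives a baseline for the two read paths. Some filesystems reject O_DIRECT, in which case the direct variant fails with an error instead of falling back.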

I should also mention that my implementation does sequential reads from disk and uses the BLAKE3 C library, so it isn't capable of hashing on multiple cores.

I would very much appreciate any feedback!

@elichai
Contributor

elichai commented Jul 27, 2023

FWIW Tokio has an io-uring Rust crate, not sure how production-ready it is though: https://github.com/tokio-rs/tokio-uring

@oconnor663-travel

I'm out of the country and mostly off the grid until August 20. Apologies for not reviewing this sooner.
