Improve NEON rot16/rot8 #319

sdlyyxy · 2023-06-24T07:27:15Z

On MacBook Air M1:

$ time dd if=/dev/zero bs=1048576 count=1024|./target/release/b3sum
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 0.713978 secs (1503886428 bytes/sec)
94b4ec39d8d42ebda685fbb5429e8ab0086e65245e750142c1eea36a26abc24d  -
dd if=/dev/zero bs=1048576 count=1024  0.00s user 0.08s system 11% cpu 0.718 total
./target/release/b3sum  0.66s user 0.04s system 98% cpu 0.717 total

before:

$ time dd if=/dev/zero bs=1048576 count=1024|./target/release/b3sum
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 0.793620 secs (1352967193 bytes/sec)
94b4ec39d8d42ebda685fbb5429e8ab0086e65245e750142c1eea36a26abc24d  -
dd if=/dev/zero bs=1048576 count=1024  0.00s user 0.08s system 9% cpu 0.798 total
./target/release/b3sum  0.76s user 0.03s system 99% cpu 0.798 total

oconnor663 · 2023-06-24T18:37:28Z

I'm seeing an error on some other ARM targets in CI:

  running: "arm-linux-gnueabihf-gcc" "-O0" "-ffunction-sections" "-fdata-sections" "-fPIC" "-gdwarf-4" "-fno-omit-frame-pointer" "-march=armv7-a" "-mfpu=vfpv3-d16" "-Wall" "-Wextra" "-std=c11" "-mfpu=neon-vfpv4" "-mfloat-abi=hard" "-o" "/target/armv7-unknown-linux-gnueabihf/debug/build/blake3-f22b00965a848420/out/c/blake3_neon.o" "-c" "c/blake3_neon.c"
  cargo:warning=c/blake3_neon.c: In function 'rot8_128':
  cargo:warning=c/blake3_neon.c:47:31: warning: implicit declaration of function '__builtin_shufflevector' [-Wimplicit-function-declaration]
  cargo:warning=   return vreinterpretq_u32_u8(__builtin_shufflevector(vreinterpretq_u8_u32(x), vreinterpretq_u8_u32(x), 1,2,3,0,5,6,7,4,9,10,11,8,13,14,15,12));
  cargo:warning=                               ^
  cargo:warning=c/blake3_neon.c:47:31: error: incompatible type for argument 1 of 'vreinterpretq_u32_u8'
  cargo:warning=In file included from c/blake3_neon.c:3:0:
  cargo:warning=/usr/lib/gcc-cross/arm-linux-gnueabihf/5/include/arm_neon.h:14673:1: note: expected 'uint8x16_t' but argument is of type 'int'
  cargo:warning= vreinterpretq_u32_u8 (uint8x16_t __a)
  cargo:warning= ^
  cargo:warning=c/blake3_neon.c:48:1: warning: control reaches end of non-void function [-Wreturn-type]
  cargo:warning= }
  cargo:warning= ^
  exit status: 1

sneves · 2023-06-24T20:11:14Z

The ~~correct~~more portable way to write this is

INLINE uint32x4_t rot8_128(uint32x4_t x) {
  static const uint8x16_t r8 = {1,2,3,0,5,6,7,4,9,10,11,8,13,14,15,12};
  return vreinterpretq_u32_u8(vqtbl1q_u8(vreinterpretq_u8_u32(x), r8));
}

The downside is, of course, that this (vqtbl1q_u8) is AArch64-specific. For 32-bit ARM we need to split the vector into two halves and do two separate vtbl1_u8 and recombine into a 128-bit vector.

All that being said, while it seems clear that the vrev32q_u16 is a win in general, the table lookup is less clear. The M1 is an outlier in that it can do 4 vqtbl1q_u8 per cycle, which is much higher than most, so the above numbers for it may or may not reflect an improvement on the rest of the ARM landscape...

sneves · 2023-06-24T20:24:03Z

On that note, implementing the generic rotations as follows:

INLINE uint32x4_t rot12_128(uint32x4_t x) {
  // return vorrq_u32(vshrq_n_u32(x, 12), vshlq_n_u32(x, 32 - 12));
  return vsriq_n_u32(vshlq_n_u32(x, 32-12), x, 12);
}

might also be an improvement on the M1, but once again in other chips it varies.

sdlyyxy · 2023-06-25T03:54:10Z

Thanks for the advice!

I change the code to use __builtin_shuffle on GCC. On Godbolt it seems this can generate two vtbl1_u8 instructions for 32-bit ARM.

Additionally using shl+sri for rot7+rot12, user time is reduced to 0.64s. On M1 icestorm core this effect is more significant:

Original:

$ time dd if=/dev/zero bs=1048576 count=1024|taskpolicy -c background ./target/release/b3sum
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 3.648963 secs (294259444 bytes/sec)
94b4ec39d8d42ebda685fbb5429e8ab0086e65245e750142c1eea36a26abc24d  -
dd if=/dev/zero bs=1048576 count=1024  0.00s user 0.17s system 4% cpu 3.654 total
taskpolicy -c background ./target/release/b3sum  2.92s user 0.32s system 88% cpu 3.654 total

rot8/rot16:

$ time dd if=/dev/zero bs=1048576 count=1024|taskpolicy -c background ./target/release/b3sum
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 3.011343 secs (356565766 bytes/sec)
94b4ec39d8d42ebda685fbb5429e8ab0086e65245e750142c1eea36a26abc24d  -
dd if=/dev/zero bs=1048576 count=1024  0.00s user 0.12s system 4% cpu 3.015 total
taskpolicy -c background ./target/release/b3sum  2.67s user 0.32s system 99% cpu 3.015 total

rot8/rot16+rot7/rot12:

$ time dd if=/dev/zero bs=1M count=1K |taskpolicy -c background ./target/release/b3sum
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 2.691741 secs (398902355 bytes/sec)
94b4ec39d8d42ebda685fbb5429e8ab0086e65245e750142c1eea36a26abc24d  -
dd if=/dev/zero bs=1M count=1K  0.00s user 0.12s system 4% cpu 2.697 total
taskpolicy -c background ./target/release/b3sum  2.22s user 0.32s system 94% cpu 2.697 total

oconnor663 · 2023-06-28T17:29:12Z

Wow, that's a huge improvement. I'll try to test these changes on one of my Raspberry Pis tonight.

sdlyyxy · 2023-06-29T08:34:02Z

Here are some test results on a Android Xiaomi Mi Note Qualcomm Snapdragon 801 'Krait 400' ARMv7 core:

User time for dd if=/dev/zero bs=1M count=1K|b3sum

commit 3f396d: 4.28s
commit 7a9a32: 4.38s
commit 6040c7: 3.96s

Based on commit 6040c7, if the two table lookup instructions in rot8_128 are replaced with shl+sri, the time will be reduced to 3.49s.

So according to the tests on M1 Firestorm, Icestorm and this 'Krait 400' core, for rot7/rot12/rot16 vrev32q_u16 and shl+sri is a win in general. As for rot8, on Firestorm and Icestorm vqtbl1q_u8 is better, but on 'Krait 400' ARMv7 core shl+sri wins. So the rot8_128 perhaps should be implemented as:

INLINE uint32x4_t rot8_128(uint32x4_t x) {
#if defined(IS_AARCH64)
  static const uint8x16_t r8 = {1,2,3,0,5,6,7,4,9,10,11,8,13,14,15,12};
  return vreinterpretq_u32_u8(vqtbl1q_u8(vreinterpretq_u8_u32(x), r8));
#else
  return vsriq_n_u32(vshlq_n_u32(x, 32-8), x, 8);
#endif
}

Let's see what happens on Raspberry Pis' ARM cores……

oconnor663 · 2023-07-03T04:58:13Z

Here's my RPi3 (Cortex-A53) on master, all measurements median of 5 runs:

$ cargo +nightly bench --features=neon bench_many_chunks_neon
...
test bench_many_chunks_neon            ... bench:      20,101 ns/iter (+/- 31) = 203 MB/s

And the RPi3 on sdlyyxy/neon (6040c73):

$ cargo +nightly bench --features=neon bench_many_chunks_neon
...
test bench_many_chunks_neon            ... bench:      15,286 ns/iter (+/- 40) = 267 MB/s

Here's my RPi4 (Cortex-A72, in 32-bit/armv7l mode) on master:

$ cargo +nightly bench --features=neon bench_many_chunks_neon
...
test bench_many_chunks_neon            ... bench:      13,844 ns/iter (+/- 53) = 295 MB/s

And the RPi4 on sdlyyxy/neon (6040c73):

$ cargo +nightly bench --features=neon bench_many_chunks_neon                                          
...
test bench_many_chunks_neon            ... bench:      14,093 ns/iter (+/- 40) = 290 MB/s

@sneves mentioned some AArch64-specific instructions above, so I'm curious whether getting a proper 64-bit Linux kernel onto my RPi4 might lead to a better result. The improvement on the RPi3 is quite dramatic (~1.3x), in line with what you reported for the M1.

oconnor663 · 2023-07-03T08:26:34Z

Wow, yes. I reimaged my RPi4 with ArchLinuxARM-rpi-aarch64-latest.tar.gz, and now I see the improvement. On master:

$ cargo +nightly bench --features=neon bench_many_chunks_neon
...
test bench_many_chunks_neon            ... bench:      11,744 ns/iter (+/- 28) = 348 MB/s

And on sdlyyxy-neon (6040c73):

$ cargo +nightly bench --features=neon bench_many_chunks_neon                    
...
test bench_many_chunks_neon            ... bench:      10,676 ns/iter (+/- 16) = 383 MB/s

A 10% speedup there.

oconnor663 · 2023-07-03T08:30:11Z

@sneves if you approve, I'll merge.

sneves · 2023-07-05T15:41:56Z

Sure, why not. I probably wouldn't remove the old rotations but comment them out or something, seeing that ARM is a tower of babel of microarchitectures and you never know what's fast or not there.

oconnor663 · 2023-07-05T17:30:01Z

Landed and added the old rotations in comments in f7e1a74. Thanks @sdlyyxy!

sdlyyxy · 2023-07-06T11:59:40Z

Thank you for reviewing and testing :)

@sdlyyxy

Changes since 1.4.0: - Improved performance in the ARM NEON implementation for both C and Rust callers. This affects AArch64 targets by default and ARMv7 targets that explicitly enable (and support) NEON. The size of the improvement depends on the microarchitecture, but I've benchmarked ~1.3x on a Cortex-A53 and ~1.2x on an Apple M1. Contributed by @sdlyyxy in #319. - The MSRV is now 1.66.1 for both the `blake3` crate and `b3sum`.

@sdlyyxy

Changes since 1.4.0: - Improved performance in the ARM NEON implementation for both C and Rust callers. This affects AArch64 targets by default and ARMv7 targets that explicitly enable (and support) NEON. The size of the improvement depends on the microarchitecture, but I've benchmarked ~1.3x on a Cortex-A53 and ~1.2x on an Apple M1. Contributed by @sdlyyxy in BLAKE3-team#319. - The MSRV is now 1.66.1 for both the `blake3` crate and `b3sum`.

Improve NEON rot16/rot8

7a9a32a

sdlyyxy added 2 commits June 25, 2023 11:14

NEON: only use __builtin_shufflevector on clang

60e4583

NEON rot7/rot12 use shl+sri

6040c73

oconnor663 merged commit 7038dad into BLAKE3-team:master Jul 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve NEON rot16/rot8 #319

Improve NEON rot16/rot8 #319

sdlyyxy commented Jun 24, 2023

oconnor663 commented Jun 24, 2023

sneves commented Jun 24, 2023 •

edited

Loading

sneves commented Jun 24, 2023

sdlyyxy commented Jun 25, 2023

oconnor663 commented Jun 28, 2023

sdlyyxy commented Jun 29, 2023

oconnor663 commented Jul 3, 2023

oconnor663 commented Jul 3, 2023

oconnor663 commented Jul 3, 2023

sneves commented Jul 5, 2023

oconnor663 commented Jul 5, 2023

sdlyyxy commented Jul 6, 2023

Improve NEON rot16/rot8 #319

Improve NEON rot16/rot8 #319

Conversation

sdlyyxy commented Jun 24, 2023

oconnor663 commented Jun 24, 2023

sneves commented Jun 24, 2023 • edited Loading

sneves commented Jun 24, 2023

sdlyyxy commented Jun 25, 2023

oconnor663 commented Jun 28, 2023

sdlyyxy commented Jun 29, 2023

oconnor663 commented Jul 3, 2023

oconnor663 commented Jul 3, 2023

oconnor663 commented Jul 3, 2023

sneves commented Jul 5, 2023

oconnor663 commented Jul 5, 2023

sdlyyxy commented Jul 6, 2023

sneves commented Jun 24, 2023 •

edited

Loading