Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve NEON rot16/rot8 #319

Merged
merged 3 commits into from
Jul 5, 2023
Merged

Improve NEON rot16/rot8 #319

merged 3 commits into from
Jul 5, 2023

Conversation

sdlyyxy
Copy link
Contributor

@sdlyyxy sdlyyxy commented Jun 24, 2023

On MacBook Air M1:

$ time dd if=/dev/zero bs=1048576 count=1024|./target/release/b3sum
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 0.713978 secs (1503886428 bytes/sec)
94b4ec39d8d42ebda685fbb5429e8ab0086e65245e750142c1eea36a26abc24d  -
dd if=/dev/zero bs=1048576 count=1024  0.00s user 0.08s system 11% cpu 0.718 total
./target/release/b3sum  0.66s user 0.04s system 98% cpu 0.717 total

before:

$ time dd if=/dev/zero bs=1048576 count=1024|./target/release/b3sum
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 0.793620 secs (1352967193 bytes/sec)
94b4ec39d8d42ebda685fbb5429e8ab0086e65245e750142c1eea36a26abc24d  -
dd if=/dev/zero bs=1048576 count=1024  0.00s user 0.08s system 9% cpu 0.798 total
./target/release/b3sum  0.76s user 0.03s system 99% cpu 0.798 total

@oconnor663
Copy link
Member

I'm seeing an error on some other ARM targets in CI:

  running: "arm-linux-gnueabihf-gcc" "-O0" "-ffunction-sections" "-fdata-sections" "-fPIC" "-gdwarf-4" "-fno-omit-frame-pointer" "-march=armv7-a" "-mfpu=vfpv3-d16" "-Wall" "-Wextra" "-std=c11" "-mfpu=neon-vfpv4" "-mfloat-abi=hard" "-o" "/target/armv7-unknown-linux-gnueabihf/debug/build/blake3-f22b00965a848420/out/c/blake3_neon.o" "-c" "c/blake3_neon.c"
  cargo:warning=c/blake3_neon.c: In function 'rot8_128':
  cargo:warning=c/blake3_neon.c:47:31: warning: implicit declaration of function '__builtin_shufflevector' [-Wimplicit-function-declaration]
  cargo:warning=   return vreinterpretq_u32_u8(__builtin_shufflevector(vreinterpretq_u8_u32(x), vreinterpretq_u8_u32(x), 1,2,3,0,5,6,7,4,9,10,11,8,13,14,15,12));
  cargo:warning=                               ^
  cargo:warning=c/blake3_neon.c:47:31: error: incompatible type for argument 1 of 'vreinterpretq_u32_u8'
  cargo:warning=In file included from c/blake3_neon.c:3:0:
  cargo:warning=/usr/lib/gcc-cross/arm-linux-gnueabihf/5/include/arm_neon.h:14673:1: note: expected 'uint8x16_t' but argument is of type 'int'
  cargo:warning= vreinterpretq_u32_u8 (uint8x16_t __a)
  cargo:warning= ^
  cargo:warning=c/blake3_neon.c:48:1: warning: control reaches end of non-void function [-Wreturn-type]
  cargo:warning= }
  cargo:warning= ^
  exit status: 1

@sneves
Copy link
Collaborator

sneves commented Jun 24, 2023

The correctmore portable way to write this is

INLINE uint32x4_t rot8_128(uint32x4_t x) {
  static const uint8x16_t r8 = {1,2,3,0,5,6,7,4,9,10,11,8,13,14,15,12};
  return vreinterpretq_u32_u8(vqtbl1q_u8(vreinterpretq_u8_u32(x), r8));
}

The downside is, of course, that this (vqtbl1q_u8) is AArch64-specific. For 32-bit ARM we need to split the vector into two halves and do two separate vtbl1_u8 and recombine into a 128-bit vector.

All that being said, while it seems clear that the vrev32q_u16 is a win in general, the table lookup is less clear. The M1 is an outlier in that it can do 4 vqtbl1q_u8 per cycle, which is much higher than most, so the above numbers for it may or may not reflect an improvement on the rest of the ARM landscape...

@sneves
Copy link
Collaborator

sneves commented Jun 24, 2023

On that note, implementing the generic rotations as follows:

INLINE uint32x4_t rot12_128(uint32x4_t x) {
  // return vorrq_u32(vshrq_n_u32(x, 12), vshlq_n_u32(x, 32 - 12));
  return vsriq_n_u32(vshlq_n_u32(x, 32-12), x, 12);
}

might also be an improvement on the M1, but once again in other chips it varies.

@sdlyyxy
Copy link
Contributor Author

sdlyyxy commented Jun 25, 2023

Thanks for the advice!

I change the code to use __builtin_shuffle on GCC. On Godbolt it seems this can generate two vtbl1_u8 instructions for 32-bit ARM.

Additionally using shl+sri for rot7+rot12, user time is reduced to 0.64s. On M1 icestorm core this effect is more significant:

Original:

$ time dd if=/dev/zero bs=1048576 count=1024|taskpolicy -c background ./target/release/b3sum
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 3.648963 secs (294259444 bytes/sec)
94b4ec39d8d42ebda685fbb5429e8ab0086e65245e750142c1eea36a26abc24d  -
dd if=/dev/zero bs=1048576 count=1024  0.00s user 0.17s system 4% cpu 3.654 total
taskpolicy -c background ./target/release/b3sum  2.92s user 0.32s system 88% cpu 3.654 total

rot8/rot16:

$ time dd if=/dev/zero bs=1048576 count=1024|taskpolicy -c background ./target/release/b3sum
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 3.011343 secs (356565766 bytes/sec)
94b4ec39d8d42ebda685fbb5429e8ab0086e65245e750142c1eea36a26abc24d  -
dd if=/dev/zero bs=1048576 count=1024  0.00s user 0.12s system 4% cpu 3.015 total
taskpolicy -c background ./target/release/b3sum  2.67s user 0.32s system 99% cpu 3.015 total

rot8/rot16+rot7/rot12:

$ time dd if=/dev/zero bs=1M count=1K |taskpolicy -c background ./target/release/b3sum
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 2.691741 secs (398902355 bytes/sec)
94b4ec39d8d42ebda685fbb5429e8ab0086e65245e750142c1eea36a26abc24d  -
dd if=/dev/zero bs=1M count=1K  0.00s user 0.12s system 4% cpu 2.697 total
taskpolicy -c background ./target/release/b3sum  2.22s user 0.32s system 94% cpu 2.697 total

@oconnor663
Copy link
Member

Wow, that's a huge improvement. I'll try to test these changes on one of my Raspberry Pis tonight.

@sdlyyxy
Copy link
Contributor Author

sdlyyxy commented Jun 29, 2023

Here are some test results on a Android Xiaomi Mi Note Qualcomm Snapdragon 801 'Krait 400' ARMv7 core:

User time for dd if=/dev/zero bs=1M count=1K|b3sum

commit 3f396d: 4.28s
commit 7a9a32: 4.38s
commit 6040c7: 3.96s

Based on commit 6040c7, if the two table lookup instructions in rot8_128 are replaced with shl+sri, the time will be reduced to 3.49s.

So according to the tests on M1 Firestorm, Icestorm and this 'Krait 400' core, for rot7/rot12/rot16 vrev32q_u16 and shl+sri is a win in general. As for rot8, on Firestorm and Icestorm vqtbl1q_u8 is better, but on 'Krait 400' ARMv7 core shl+sri wins. So the rot8_128 perhaps should be implemented as:

INLINE uint32x4_t rot8_128(uint32x4_t x) {
#if defined(IS_AARCH64)
  static const uint8x16_t r8 = {1,2,3,0,5,6,7,4,9,10,11,8,13,14,15,12};
  return vreinterpretq_u32_u8(vqtbl1q_u8(vreinterpretq_u8_u32(x), r8));
#else
  return vsriq_n_u32(vshlq_n_u32(x, 32-8), x, 8);
#endif
}

Let's see what happens on Raspberry Pis' ARM cores……

@oconnor663
Copy link
Member

Here's my RPi3 (Cortex-A53) on master, all measurements median of 5 runs:

$ cargo +nightly bench --features=neon bench_many_chunks_neon
...
test bench_many_chunks_neon            ... bench:      20,101 ns/iter (+/- 31) = 203 MB/s

And the RPi3 on sdlyyxy/neon (6040c73):

$ cargo +nightly bench --features=neon bench_many_chunks_neon
...
test bench_many_chunks_neon            ... bench:      15,286 ns/iter (+/- 40) = 267 MB/s

Here's my RPi4 (Cortex-A72, in 32-bit/armv7l mode) on master:

$ cargo +nightly bench --features=neon bench_many_chunks_neon
...
test bench_many_chunks_neon            ... bench:      13,844 ns/iter (+/- 53) = 295 MB/s

And the RPi4 on sdlyyxy/neon (6040c73):

$ cargo +nightly bench --features=neon bench_many_chunks_neon                                          
...
test bench_many_chunks_neon            ... bench:      14,093 ns/iter (+/- 40) = 290 MB/s

@sneves mentioned some AArch64-specific instructions above, so I'm curious whether getting a proper 64-bit Linux kernel onto my RPi4 might lead to a better result. The improvement on the RPi3 is quite dramatic (~1.3x), in line with what you reported for the M1.

@oconnor663
Copy link
Member

Wow, yes. I reimaged my RPi4 with ArchLinuxARM-rpi-aarch64-latest.tar.gz, and now I see the improvement. On master:

$ cargo +nightly bench --features=neon bench_many_chunks_neon
...
test bench_many_chunks_neon            ... bench:      11,744 ns/iter (+/- 28) = 348 MB/s

And on sdlyyxy-neon (6040c73):

$ cargo +nightly bench --features=neon bench_many_chunks_neon                    
...
test bench_many_chunks_neon            ... bench:      10,676 ns/iter (+/- 16) = 383 MB/s

A 10% speedup there.

@oconnor663
Copy link
Member

@sneves if you approve, I'll merge.

@sneves
Copy link
Collaborator

sneves commented Jul 5, 2023

Sure, why not. I probably wouldn't remove the old rotations but comment them out or something, seeing that ARM is a tower of babel of microarchitectures and you never know what's fast or not there.

@oconnor663 oconnor663 merged commit 7038dad into BLAKE3-team:master Jul 5, 2023
@oconnor663
Copy link
Member

Landed and added the old rotations in comments in f7e1a74. Thanks @sdlyyxy!

@sdlyyxy
Copy link
Contributor Author

sdlyyxy commented Jul 6, 2023

Thank you for reviewing and testing :)

oconnor663 added a commit that referenced this pull request Jul 6, 2023
Changes since 1.4.0:
- Improved performance in the ARM NEON implementation for both C and
  Rust callers. This affects AArch64 targets by default and ARMv7
  targets that explicitly enable (and support) NEON. The size of the
  improvement depends on the microarchitecture, but I've benchmarked
  ~1.3x on a Cortex-A53 and ~1.2x on an Apple M1. Contributed by
  @sdlyyxy in #319.
- The MSRV is now 1.66.1 for both the `blake3` crate and `b3sum`.
kevingoh pushed a commit to ITS-AT-dev/BLAKE3 that referenced this pull request Oct 23, 2023
Changes since 1.4.0:
- Improved performance in the ARM NEON implementation for both C and
  Rust callers. This affects AArch64 targets by default and ARMv7
  targets that explicitly enable (and support) NEON. The size of the
  improvement depends on the microarchitecture, but I've benchmarked
  ~1.3x on a Cortex-A53 and ~1.2x on an Apple M1. Contributed by
  @sdlyyxy in BLAKE3-team#319.
- The MSRV is now 1.66.1 for both the `blake3` crate and `b3sum`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants