-
Notifications
You must be signed in to change notification settings - Fork 360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve NEON rot16/rot8 #319
Conversation
I'm seeing an error on some other ARM targets in CI:
|
The
The downside is, of course, that this ( All that being said, while it seems clear that the |
On that note, implementing the generic rotations as follows:
might also be an improvement on the M1, but once again in other chips it varies. |
Thanks for the advice! I change the code to use __builtin_shuffle on GCC. On Godbolt it seems this can generate two vtbl1_u8 instructions for 32-bit ARM. Additionally using shl+sri for rot7+rot12, user time is reduced to 0.64s. On M1 icestorm core this effect is more significant: Original:
rot8/rot16:
rot8/rot16+rot7/rot12:
|
Wow, that's a huge improvement. I'll try to test these changes on one of my Raspberry Pis tonight. |
Here are some test results on a Android Xiaomi Mi Note Qualcomm Snapdragon 801 'Krait 400' ARMv7 core: User time for commit 3f396d: 4.28s Based on commit 6040c7, if the two table lookup instructions in rot8_128 are replaced with shl+sri, the time will be reduced to 3.49s. So according to the tests on M1 Firestorm, Icestorm and this 'Krait 400' core, for rot7/rot12/rot16 vrev32q_u16 and shl+sri is a win in general. As for rot8, on Firestorm and Icestorm vqtbl1q_u8 is better, but on 'Krait 400' ARMv7 core shl+sri wins. So the rot8_128 perhaps should be implemented as: INLINE uint32x4_t rot8_128(uint32x4_t x) {
#if defined(IS_AARCH64)
static const uint8x16_t r8 = {1,2,3,0,5,6,7,4,9,10,11,8,13,14,15,12};
return vreinterpretq_u32_u8(vqtbl1q_u8(vreinterpretq_u8_u32(x), r8));
#else
return vsriq_n_u32(vshlq_n_u32(x, 32-8), x, 8);
#endif
} Let's see what happens on Raspberry Pis' ARM cores…… |
Here's my RPi3 (Cortex-A53) on
And the RPi3 on
Here's my RPi4 (Cortex-A72, in 32-bit/armv7l mode) on
And the RPi4 on
@sneves mentioned some AArch64-specific instructions above, so I'm curious whether getting a proper 64-bit Linux kernel onto my RPi4 might lead to a better result. The improvement on the RPi3 is quite dramatic (~1.3x), in line with what you reported for the M1. |
Wow, yes. I reimaged my RPi4 with
And on
A 10% speedup there. |
@sneves if you approve, I'll merge. |
Sure, why not. I probably wouldn't remove the old rotations but comment them out or something, seeing that ARM is a tower of babel of microarchitectures and you never know what's fast or not there. |
Thank you for reviewing and testing :) |
Changes since 1.4.0: - Improved performance in the ARM NEON implementation for both C and Rust callers. This affects AArch64 targets by default and ARMv7 targets that explicitly enable (and support) NEON. The size of the improvement depends on the microarchitecture, but I've benchmarked ~1.3x on a Cortex-A53 and ~1.2x on an Apple M1. Contributed by @sdlyyxy in #319. - The MSRV is now 1.66.1 for both the `blake3` crate and `b3sum`.
Changes since 1.4.0: - Improved performance in the ARM NEON implementation for both C and Rust callers. This affects AArch64 targets by default and ARMv7 targets that explicitly enable (and support) NEON. The size of the improvement depends on the microarchitecture, but I've benchmarked ~1.3x on a Cortex-A53 and ~1.2x on an Apple M1. Contributed by @sdlyyxy in BLAKE3-team#319. - The MSRV is now 1.66.1 for both the `blake3` crate and `b3sum`.
On MacBook Air M1:
before: