-
Notifications
You must be signed in to change notification settings - Fork 362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
incorrect output from AVX-512 intrinsics in debug mode under GCC 5.4 and 6.1 #271
Comments
I'm adding the i32::MAX test case here because I personally screwed it up while I was working on #271. The correct implementation of the carry bit is the ANDNOT of old high bit (1) and the new high bit (0). Using XOR instead of ANDNOT gives the correct answer in the overflow case, but it also reports an incorrect "extra" overflow when the high bit goes from 0 to 1.
CI testing against GCC 5.4 added in 62772b2. Unfortunately it seems about 50/50 whether GitHub runs the test on an AVX-512-supporting server. (If not, it trivially passes because it doesn't execute the buggy AVX-512 code.) But sometimes is better than never, and |
Here's an example of that test failing before I push the fix: https://github.com/BLAKE3-team/BLAKE3/actions/runs/3530069576/jobs/5921672644 |
For the fix, @sneves has proposed: __m512i carry = _mm512_srli_epi32(
_mm512_xor_epi32(l, _mm512_ternarylogic_epi32(
l, add1, _mm512_set1_epi32((int32_t)counter),
(0xf0 ^ 0xcc) | (0xaa ^ 0xcc))),
31); I'd love to simplify that down to something like: // The carry bit is 1 if the high bit of the word was 1 before addition and is 0 after.
__m512i carry = _mm512_srli_epi32(
_mm512_andnot_si512(
low_words, // 0 after (gets inverted by andnot)
_mm512_set1_epi32((int32_t)counter)), // and 1 before
31); However when I measure the latter, it's (drumroll...) ~1% slower. Samuel do you have any idea why that is? And could you help me understand the magic constants in |
That is a specialization of the known-good generic less-than formula -- There should be no difference in speed between the two; either you're seeing noise or the different versions nudged the compiler to emit slightly different codegen resulting in the difference. Either way, keep your version -- it's cleaner. |
Changes since 1.3.2: - Fix incorrect output from AVX-512 intrinsics under GCC 5.4 and 6.1 in debug mode. This bug was found in unit tests and probably doesn't affect the public API in practice. See #271.
I'm adding the i32::MAX test case here because I personally screwed it up while I was working on BLAKE3-team#271. The correct implementation of the carry bit is the ANDNOT of old high bit (1) and the new high bit (0). Using XOR instead of ANDNOT gives the correct answer in the overflow case, but it also reports an incorrect "extra" overflow when the high bit goes from 0 to 1.
…5.4 and 6.1 Fixes BLAKE3-team#271. The `_mm512_cmp_epu32_mask` intrinsic is broken under GCC 5.4 and 6.1. This led to incorrect output in the AVX-512 implementation when building with intrinsics instead of assembly. This fix is a simplified version of Samuel's proposed fix here: BLAKE3-team@f10816e#commitcomment-90742995
Changes since 1.3.2: - Fix incorrect output from AVX-512 intrinsics under GCC 5.4 and 6.1 in debug mode. This bug was found in unit tests and probably doesn't affect the public API in practice. See BLAKE3-team#271.
The easiest way to repro this failure is with Docker:
This affects both Rust and C when building the intrinsics implementations rather than the assembly implementations. This repros with
gcc:6.1
andgcc:5.4
, but not withgcc:6.2
orgcc:5.5
. It also does not repro in--release
mode, where GCC gets invoked with-O3
rather than-O0
.The underlying cause of this failure is incorrect arithmetic in the chunk counter, specifically when the lower 32-bit word of the 64-bit counter overflows within a group of 16 chunks. This is a weird situation to be in (16 divides 232, so normally this wrapping happens between groups), and I don't think it's actually possible to trigger it from our public API, but one of our tests triggers it deliberately with an initial counter value of
(1<<32) - 1
. Getting rid of that spicy counter value makes the failure go away, for example like this:The bad arithmetic seems to be happening in this
_mm512_cmp_epu32_mask
call. Printing all that out in place is a little hairy, but we can minimize it down to the following C program:That computes
0 < 1
sixteen times and returns the result as a 16-bit int, which should have all its bits set to one (0xffff = 65535). If we save that C code as/tmp/test.c
and run it under different GCC versions, here's what we see:The lower 8 bits of output are correct in all cases, but the higher 8 bits are zero in the buggy versions. This corresponds to the fact that the original Rust test failure above happened after logging
n = 8
. That is, it failed when checking the 9th output out of 16.The assembly output difference for the minimal
test.c
program between GCC 6.1 and GCC 6.2 seems to be these instructions:All of this seems like an exact match for this GCC ticket: Bug 72805 - AVX512: invalid code generation involving masks
This is extremely unlikely to affect real users, for several reasons:
update
with 1 byte and then again with 4 GiB.-O0
), which runs almost 10x slower.All that said, I found this bug by running into it myself (in tests, which circumvents the first two requirements above) on an Ubuntu 16.04 machine, which will continue to enjoy official support until 2026. So this is probably worth fixing. It's also possible that there are other ways to trigger this that I haven't thought of, or that there are other intrinsics getting miscompiled but somehow evading our test cases.
The text was updated successfully, but these errors were encountered: