incorrect output from AVX-512 intrinsics in debug mode under GCC 5.4 and 6.1 #271

oconnor663 · 2022-11-22T01:03:42Z

The easiest way to repro this failure is with Docker:

# start a shell inside a GCC 6.1 Docker container
$ docker run --tty --interactive --rm gcc:6.1

# install Rust
$ curl https://sh.rustup.rs -sSf | sh -s -- -y --profile minimal
...

# clone the BLAKE3 repo
$ git clone https://github.com/BLAKE3-team/BLAKE3
...

# run the tests with `prefer_intrinsics`
$ cd BLAKE3 && ~/.cargo/bin/cargo test --features=prefer_intrinsics
...
---- avx512::test::test_hash_many stdout ----
[src/test.rs:153] n = 0
[src/test.rs:153] n = 1
[src/test.rs:153] n = 2
[src/test.rs:153] n = 3
[src/test.rs:153] n = 4
[src/test.rs:153] n = 5
[src/test.rs:153] n = 6
[src/test.rs:153] n = 7
[src/test.rs:153] n = 8
thread 'avx512::test::test_hash_many' panicked at 'assertion failed: `(left == right)`
  left: `[216, 31, 242, 1, 242, 65, 11, 63, 21, 230, 5, 173, 239, 46, 134, 22, 143, 255, 18, 188, 246, 127, 225, 23, 118, 122, 84, 221, 84, 97, 56, 68]`,
 right: `[222, 74, 39, 53, 195, 239, 139, 145, 159, 123, 236, 149, 169, 8, 118, 49, 22, 98, 109, 92, 148, 40, 108, 160, 58, 121, 170, 165, 30, 227, 192, 36]`', src/test.rs:154:9

This affects both Rust and C when building the intrinsics implementations rather than the assembly implementations. This repros with gcc:6.1 and gcc:5.4, but not with gcc:6.2 or gcc:5.5. It also does not repro in --release mode, where GCC gets invoked with -O3 rather than -O0.

The underlying cause of this failure is incorrect arithmetic in the chunk counter, specifically when the lower 32-bit word of the 64-bit counter overflows within a group of 16 chunks. This is a weird situation to be in (16 divides 2³², so normally this wrapping happens between groups), and I don't think it's actually possible to trigger it from our public API, but one of our tests triggers it deliberately with an initial counter value of (1<<32) - 1. Getting rid of that spicy counter value makes the failure go away, for example like this:

diff --git a/src/test.rs b/src/test.rs
index c2bea95..b2ac070 100644
--- a/src/test.rs
+++ b/src/test.rs
@@ -116,7 +116,7 @@ pub fn test_hash_many_fn(
     let mut input_buf = [0; CHUNK_LEN * NUM_INPUTS];
     crate::test::paint_test_input(&mut input_buf);
     // A counter just prior to u32::MAX.
-    let counter = (1u64 << 32) - 1;
+    let counter = 0;
 
     // First hash chunks.
     let mut chunks = ArrayVec::<&[u8; CHUNK_LEN], NUM_INPUTS>::new();

The bad arithmetic seems to be happening in this _mm512_cmp_epu32_mask call. Printing all that out in place is a little hairy, but we can minimize it down to the following C program:

#include <immintrin.h>
#include <stdio.h>

int main() {
  __m512i zeros = _mm512_set1_epi32(0);
  __m512i ones = _mm512_set1_epi32(1);
  __mmask16 mask = _mm512_cmp_epu32_mask(zeros, ones, _MM_CMPINT_LT);
  printf("%d\n", (int)mask);
  return 0;
}

That computes 0 < 1 sixteen times and returns the result as a 16-bit int, which should have all its bits set to one (0xffff = 65535). If we save that C code as /tmp/test.c and run it under different GCC versions, here's what we see:

$ docker run --tty --interactive --rm --volume /tmp:/mnt:ro gcc:5.4 bash -c "gcc -mavx512f /mnt/test.c && ./a.out"
255
$ docker run --tty --interactive --rm --volume /tmp:/mnt:ro gcc:5.5 bash -c "gcc -mavx512f /mnt/test.c && ./a.out"
65535
$ docker run --tty --interactive --rm --volume /tmp:/mnt:ro gcc:6.1 bash -c "gcc -mavx512f /mnt/test.c && ./a.out"
255
$ docker run --tty --interactive --rm --volume /tmp:/mnt:ro gcc:6.2 bash -c "gcc -mavx512f /mnt/test.c && ./a.out"
65535

The lower 8 bits of output are correct in all cases, but the higher 8 bits are zero in the buggy versions. This corresponds to the fact that the original Rust test failure above happened after logging n = 8. That is, it failed when checking the 9th output out of 16.

The assembly output difference for the minimal test.c program between GCC 6.1 and GCC 6.2 seems to be these instructions:

38,40c38
< 	kmovw	%k1, %eax
< 	movzbl	%al, %eax
< 	movw	%ax, -178(%rbp)
---
> 	kmovw	%k1, -178(%rbp)
57c55
< 	.ident	"GCC: (GNU) 6.1.0"
---
> 	.ident	"GCC: (GNU) 6.2.0"

All of this seems like an exact match for this GCC ticket: Bug 72805 - AVX512: invalid code generation involving masks

This is extremely unlikely to affect real users, for several reasons:

I'm pretty sure our public API will never actually group chunks in the "unaligned" arrangement that's necessary to trigger this.
Even if I'm wrong or if you just hack it to do this, you then have to hash a large input unevenly, i.e. calling update with 1 byte and then again with 4 GiB.
You have to be on AVX-512-supporting hardware.
You have to build the intrinsics implementations (not the default in Rust, not the recommended configuration in C), in debug mode (-O0), which runs almost 10x slower.
You have to use a GCC version from 2016.

All that said, I found this bug by running into it myself (in tests, which circumvents the first two requirements above) on an Ubuntu 16.04 machine, which will continue to enjoy official support until 2026. So this is probably worth fixing. It's also possible that there are other ways to trigger this that I haven't thought of, or that there are other intrinsics getting miscompiled but somehow evading our test cases.

The text was updated successfully, but these errors were encountered:

I'm adding the i32::MAX test case here because I personally screwed it up while I was working on #271. The correct implementation of the carry bit is the ANDNOT of old high bit (1) and the new high bit (0). Using XOR instead of ANDNOT gives the correct answer in the overflow case, but it also reports an incorrect "extra" overflow when the high bit goes from 0 to 1.

oconnor663 · 2022-11-23T07:35:15Z

CI testing against GCC 5.4 added in 62772b2. Unfortunately it seems about 50/50 whether GitHub runs the test on an AVX-512-supporting server. (If not, it trivially passes because it doesn't execute the buggy AVX-512 code.) But sometimes is better than never, and cat /proc/cpu is in there so we can at least see what ran after the fact.

oconnor663 · 2022-11-23T07:43:14Z

Here's an example of that test failing before I push the fix: https://github.com/BLAKE3-team/BLAKE3/actions/runs/3530069576/jobs/5921672644

oconnor663 · 2022-11-23T07:50:43Z

For the fix, @sneves has proposed:

__m512i carry = _mm512_srli_epi32(
    _mm512_xor_epi32(l, _mm512_ternarylogic_epi32(
                            l, add1, _mm512_set1_epi32((int32_t)counter),
                            (0xf0 ^ 0xcc) | (0xaa ^ 0xcc))),
    31);

I'd love to simplify that down to something like:

// The carry bit is 1 if the high bit of the word was 1 before addition and is 0 after.
__m512i carry = _mm512_srli_epi32(
  _mm512_andnot_si512(
      low_words, // 0 after (gets inverted by andnot)
      _mm512_set1_epi32((int32_t)counter)), // and 1 before
  31);

However when I measure the latter, it's (drumroll...) ~1% slower. Samuel do you have any idea why that is? And could you help me understand the magic constants in _mm512_ternarylogic_epi32? :)

sneves · 2022-11-24T22:00:51Z

That is a specialization of the known-good generic less-than formula -- (x^((x^y)|((x-y)^y)))>>31; -- to the current setting: x = l, y = add1, x-y = counter. The VPTERNLOG operation then corresponds to (l^add1)|(counter^add1). The constants correspond to each variable in F(x, y, z): x = 0xf0, y = 0xcc, z = 0xaa, letting you encode the boolean expression in the constant in a somewhat readable way (cf. this).

There should be no difference in speed between the two; either you're seeing noise or the different versions nudged the compiler to emit slightly different codegen resulting in the difference.

Either way, keep your version -- it's cleaner.

Changes since 1.3.2: - Fix incorrect output from AVX-512 intrinsics under GCC 5.4 and 6.1 in debug mode. This bug was found in unit tests and probably doesn't affect the public API in practice. See #271.

I'm adding the i32::MAX test case here because I personally screwed it up while I was working on BLAKE3-team#271. The correct implementation of the carry bit is the ANDNOT of old high bit (1) and the new high bit (0). Using XOR instead of ANDNOT gives the correct answer in the overflow case, but it also reports an incorrect "extra" overflow when the high bit goes from 0 to 1.

…5.4 and 6.1 Fixes BLAKE3-team#271. The `_mm512_cmp_epu32_mask` intrinsic is broken under GCC 5.4 and 6.1. This led to incorrect output in the AVX-512 implementation when building with intrinsics instead of assembly. This fix is a simplified version of Samuel's proposed fix here: BLAKE3-team@f10816e#commitcomment-90742995

Changes since 1.3.2: - Fix incorrect output from AVX-512 intrinsics under GCC 5.4 and 6.1 in debug mode. This bug was found in unit tests and probably doesn't affect the public API in practice. See BLAKE3-team#271.

oconnor663 closed this as completed in 342f9f8 Nov 25, 2022

oconnor663 added a commit that referenced this issue Nov 26, 2022

version 1.3.3

67e4d04

Changes since 1.3.2: - Fix incorrect output from AVX-512 intrinsics under GCC 5.4 and 6.1 in debug mode. This bug was found in unit tests and probably doesn't affect the public API in practice. See #271.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

incorrect output from AVX-512 intrinsics in debug mode under GCC 5.4 and 6.1 #271

incorrect output from AVX-512 intrinsics in debug mode under GCC 5.4 and 6.1 #271

oconnor663 commented Nov 22, 2022

oconnor663 commented Nov 23, 2022 •

edited

Loading

oconnor663 commented Nov 23, 2022

oconnor663 commented Nov 23, 2022 •

edited

Loading

sneves commented Nov 24, 2022

incorrect output from AVX-512 intrinsics in debug mode under GCC 5.4 and 6.1 #271

incorrect output from AVX-512 intrinsics in debug mode under GCC 5.4 and 6.1 #271

Comments

oconnor663 commented Nov 22, 2022

oconnor663 commented Nov 23, 2022 • edited Loading

oconnor663 commented Nov 23, 2022

oconnor663 commented Nov 23, 2022 • edited Loading

sneves commented Nov 24, 2022

oconnor663 commented Nov 23, 2022 •

edited

Loading

oconnor663 commented Nov 23, 2022 •

edited

Loading