Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorrect output from AVX-512 intrinsics in debug mode under GCC 5.4 and 6.1 #271

Closed
oconnor663 opened this issue Nov 22, 2022 · 4 comments

Comments

@oconnor663
Copy link
Member

The easiest way to repro this failure is with Docker:

# start a shell inside a GCC 6.1 Docker container
$ docker run --tty --interactive --rm gcc:6.1

# install Rust
$ curl https://sh.rustup.rs -sSf | sh -s -- -y --profile minimal
...

# clone the BLAKE3 repo
$ git clone https://github.com/BLAKE3-team/BLAKE3
...

# run the tests with `prefer_intrinsics`
$ cd BLAKE3 && ~/.cargo/bin/cargo test --features=prefer_intrinsics
...
---- avx512::test::test_hash_many stdout ----
[src/test.rs:153] n = 0
[src/test.rs:153] n = 1
[src/test.rs:153] n = 2
[src/test.rs:153] n = 3
[src/test.rs:153] n = 4
[src/test.rs:153] n = 5
[src/test.rs:153] n = 6
[src/test.rs:153] n = 7
[src/test.rs:153] n = 8
thread 'avx512::test::test_hash_many' panicked at 'assertion failed: `(left == right)`
  left: `[216, 31, 242, 1, 242, 65, 11, 63, 21, 230, 5, 173, 239, 46, 134, 22, 143, 255, 18, 188, 246, 127, 225, 23, 118, 122, 84, 221, 84, 97, 56, 68]`,
 right: `[222, 74, 39, 53, 195, 239, 139, 145, 159, 123, 236, 149, 169, 8, 118, 49, 22, 98, 109, 92, 148, 40, 108, 160, 58, 121, 170, 165, 30, 227, 192, 36]`', src/test.rs:154:9

This affects both Rust and C when building the intrinsics implementations rather than the assembly implementations. This repros with gcc:6.1 and gcc:5.4, but not with gcc:6.2 or gcc:5.5. It also does not repro in --release mode, where GCC gets invoked with -O3 rather than -O0.

The underlying cause of this failure is incorrect arithmetic in the chunk counter, specifically when the lower 32-bit word of the 64-bit counter overflows within a group of 16 chunks. This is a weird situation to be in (16 divides 232, so normally this wrapping happens between groups), and I don't think it's actually possible to trigger it from our public API, but one of our tests triggers it deliberately with an initial counter value of (1<<32) - 1. Getting rid of that spicy counter value makes the failure go away, for example like this:

diff --git a/src/test.rs b/src/test.rs
index c2bea95..b2ac070 100644
--- a/src/test.rs
+++ b/src/test.rs
@@ -116,7 +116,7 @@ pub fn test_hash_many_fn(
     let mut input_buf = [0; CHUNK_LEN * NUM_INPUTS];
     crate::test::paint_test_input(&mut input_buf);
     // A counter just prior to u32::MAX.
-    let counter = (1u64 << 32) - 1;
+    let counter = 0;
 
     // First hash chunks.
     let mut chunks = ArrayVec::<&[u8; CHUNK_LEN], NUM_INPUTS>::new();

The bad arithmetic seems to be happening in this _mm512_cmp_epu32_mask call. Printing all that out in place is a little hairy, but we can minimize it down to the following C program:

#include <immintrin.h>
#include <stdio.h>

int main() {
  __m512i zeros = _mm512_set1_epi32(0);
  __m512i ones = _mm512_set1_epi32(1);
  __mmask16 mask = _mm512_cmp_epu32_mask(zeros, ones, _MM_CMPINT_LT);
  printf("%d\n", (int)mask);
  return 0;
}

That computes 0 < 1 sixteen times and returns the result as a 16-bit int, which should have all its bits set to one (0xffff = 65535). If we save that C code as /tmp/test.c and run it under different GCC versions, here's what we see:

$ docker run --tty --interactive --rm --volume /tmp:/mnt:ro gcc:5.4 bash -c "gcc -mavx512f /mnt/test.c && ./a.out"
255
$ docker run --tty --interactive --rm --volume /tmp:/mnt:ro gcc:5.5 bash -c "gcc -mavx512f /mnt/test.c && ./a.out"
65535
$ docker run --tty --interactive --rm --volume /tmp:/mnt:ro gcc:6.1 bash -c "gcc -mavx512f /mnt/test.c && ./a.out"
255
$ docker run --tty --interactive --rm --volume /tmp:/mnt:ro gcc:6.2 bash -c "gcc -mavx512f /mnt/test.c && ./a.out"
65535

The lower 8 bits of output are correct in all cases, but the higher 8 bits are zero in the buggy versions. This corresponds to the fact that the original Rust test failure above happened after logging n = 8. That is, it failed when checking the 9th output out of 16.

The assembly output difference for the minimal test.c program between GCC 6.1 and GCC 6.2 seems to be these instructions:

38,40c38
< 	kmovw	%k1, %eax
< 	movzbl	%al, %eax
< 	movw	%ax, -178(%rbp)
---
> 	kmovw	%k1, -178(%rbp)
57c55
< 	.ident	"GCC: (GNU) 6.1.0"
---
> 	.ident	"GCC: (GNU) 6.2.0"

All of this seems like an exact match for this GCC ticket: Bug 72805 - AVX512: invalid code generation involving masks

This is extremely unlikely to affect real users, for several reasons:

  • I'm pretty sure our public API will never actually group chunks in the "unaligned" arrangement that's necessary to trigger this.
  • Even if I'm wrong or if you just hack it to do this, you then have to hash a large input unevenly, i.e. calling update with 1 byte and then again with 4 GiB.
  • You have to be on AVX-512-supporting hardware.
  • You have to build the intrinsics implementations (not the default in Rust, not the recommended configuration in C), in debug mode (-O0), which runs almost 10x slower.
  • You have to use a GCC version from 2016.

All that said, I found this bug by running into it myself (in tests, which circumvents the first two requirements above) on an Ubuntu 16.04 machine, which will continue to enjoy official support until 2026. So this is probably worth fixing. It's also possible that there are other ways to trigger this that I haven't thought of, or that there are other intrinsics getting miscompiled but somehow evading our test cases.

oconnor663 added a commit that referenced this issue Nov 23, 2022
I'm adding the i32::MAX test case here because I personally screwed it
up while I was working on
#271. The correct
implementation of the carry bit is the ANDNOT of old high bit (1) and
the new high bit (0). Using XOR instead of ANDNOT gives the correct
answer in the overflow case, but it also reports an incorrect "extra"
overflow when the high bit goes from 0 to 1.
@oconnor663
Copy link
Member Author

oconnor663 commented Nov 23, 2022

CI testing against GCC 5.4 added in 62772b2. Unfortunately it seems about 50/50 whether GitHub runs the test on an AVX-512-supporting server. (If not, it trivially passes because it doesn't execute the buggy AVX-512 code.) But sometimes is better than never, and cat /proc/cpu is in there so we can at least see what ran after the fact.

@oconnor663
Copy link
Member Author

Here's an example of that test failing before I push the fix: https://github.com/BLAKE3-team/BLAKE3/actions/runs/3530069576/jobs/5921672644

@oconnor663
Copy link
Member Author

oconnor663 commented Nov 23, 2022

For the fix, @sneves has proposed:

__m512i carry = _mm512_srli_epi32(
    _mm512_xor_epi32(l, _mm512_ternarylogic_epi32(
                            l, add1, _mm512_set1_epi32((int32_t)counter),
                            (0xf0 ^ 0xcc) | (0xaa ^ 0xcc))),
    31);

I'd love to simplify that down to something like:

// The carry bit is 1 if the high bit of the word was 1 before addition and is 0 after.
__m512i carry = _mm512_srli_epi32(
  _mm512_andnot_si512(
      low_words, // 0 after (gets inverted by andnot)
      _mm512_set1_epi32((int32_t)counter)), // and 1 before
  31);

However when I measure the latter, it's (drumroll...) ~1% slower. Samuel do you have any idea why that is? And could you help me understand the magic constants in _mm512_ternarylogic_epi32? :)

@sneves
Copy link
Collaborator

sneves commented Nov 24, 2022

That is a specialization of the known-good generic less-than formula -- (x^((x^y)|((x-y)^y)))>>31; -- to the current setting: x = l, y = add1, x-y = counter. The VPTERNLOG operation then corresponds to (l^add1)|(counter^add1). The constants correspond to each variable in F(x, y, z): x = 0xf0, y = 0xcc, z = 0xaa, letting you encode the boolean expression in the constant in a somewhat readable way (cf. this).

There should be no difference in speed between the two; either you're seeing noise or the different versions nudged the compiler to emit slightly different codegen resulting in the difference.

Either way, keep your version -- it's cleaner.

oconnor663 added a commit that referenced this issue Nov 26, 2022
Changes since 1.3.2:
- Fix incorrect output from AVX-512 intrinsics under GCC 5.4 and 6.1 in
  debug mode. This bug was found in unit tests and probably doesn't
  affect the public API in practice. See
  #271.
kevingoh pushed a commit to ITS-AT-dev/BLAKE3 that referenced this issue Oct 23, 2023
I'm adding the i32::MAX test case here because I personally screwed it
up while I was working on
BLAKE3-team#271. The correct
implementation of the carry bit is the ANDNOT of old high bit (1) and
the new high bit (0). Using XOR instead of ANDNOT gives the correct
answer in the overflow case, but it also reports an incorrect "extra"
overflow when the high bit goes from 0 to 1.
kevingoh pushed a commit to ITS-AT-dev/BLAKE3 that referenced this issue Oct 23, 2023
…5.4 and 6.1

Fixes BLAKE3-team#271.

The `_mm512_cmp_epu32_mask` intrinsic is broken under GCC 5.4 and 6.1.
This led to incorrect output in the AVX-512 implementation when building
with intrinsics instead of assembly. This fix is a simplified version of
Samuel's proposed fix here:
BLAKE3-team@f10816e#commitcomment-90742995
kevingoh pushed a commit to ITS-AT-dev/BLAKE3 that referenced this issue Oct 23, 2023
Changes since 1.3.2:
- Fix incorrect output from AVX-512 intrinsics under GCC 5.4 and 6.1 in
  debug mode. This bug was found in unit tests and probably doesn't
  affect the public API in practice. See
  BLAKE3-team#271.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants