Fix bfloat16 detection on aarch64 #55417

Closed

Conversation

@giordano (Contributor) commented Aug 8, 2024

As discussed offline with @gbaraldi, I adapted the test used in the have_fp16 function. This partially fixes #54389 for me, in the sense that have_bf16 now returns true on A64FX and Nvidia Grace (neoverse-v2), but BFloat16 code generation still doesn't look right to me; on both systems I get:

julia> using BFloat16s

julia> code_llvm(+, NTuple{2,BFloat16}; debuginfo=:none)
; Function Signature: +(Core.BFloat16, Core.BFloat16)
define bfloat @"julia_+_1885"(bfloat %"x::BFloat16", bfloat %"y::BFloat16") #0 {
top:
  %0 = insertelement <2 x bfloat> poison, bfloat %"x::BFloat16", i64 0
  %1 = insertelement <2 x bfloat> %0, bfloat %"y::BFloat16", i64 1
  %2 = bitcast <2 x bfloat> %1 to <2 x i16>
  %3 = zext <2 x i16> %2 to <2 x i32>
  %4 = shl nuw <2 x i32> %3, <i32 16, i32 16>
  %5 = bitcast <2 x i32> %4 to <2 x float>
  %shift = shufflevector <2 x float> %5, <2 x float> poison, <2 x i32> <i32 1, i32 poison>
  %6 = fadd <2 x float> %shift, %5
  %7 = extractelement <2 x float> %6, i64 0
  %8 = fcmp ord float %7, 0.000000e+00
  br i1 %8, label %L13, label %L30

L13:                                              ; preds = %top
  %bitcast_coercion11 = bitcast float %7 to i32
  %9 = lshr i32 %bitcast_coercion11, 16
  %10 = and i32 %9, 1
  %narrow = add nuw nsw i32 %10, 32767
  %11 = zext nneg i32 %narrow to i64
  %12 = zext i32 %bitcast_coercion11 to i64
  %13 = add nuw nsw i64 %11, %12
  %14 = lshr i64 %13, 16
  %15 = trunc i64 %14 to i16
  %bitcast_coercion20 = bitcast i16 %15 to bfloat
  br label %L30

L30:                                              ; preds = %L13, %top
  %value_phi = phi bfloat [ %bitcast_coercion20, %L13 ], [ 0xR7FC0, %top ]
  ret bfloat %value_phi
}

This still demotes bfloat to float, which isn't much different from #54025 (comment).
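For reference, the `L13` block in the IR above is a float32-to-bfloat16 truncation with round-to-nearest-even: add 0x7FFF plus the kept low bit, then shift right by 16. A minimal Python sketch of the same bit trick (helper names are illustrative, not Julia internals):

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Truncate a float32 to bfloat16 bits with round-to-nearest-even,
    mirroring the add-then-shift trick in the IR's L13 block."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    lsb = (bits >> 16) & 1          # lowest bit that survives truncation
    rounded = bits + 0x7FFF + lsb   # bias so that exact ties round to even
    return (rounded >> 16) & 0xFFFF

def bf16_bits_to_f32(b: int) -> float:
    """Widen bfloat16 bits back to float32 (the zext + shl in the IR)."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x
```

The `fcmp ord` branch in the IR exists because this add trick doesn't preserve NaN payloads, so the unordered path returns the canonical quiet NaN 0xR7FC0 instead.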

@giordano giordano added the system:arm ARMv7 and AArch64 label Aug 8, 2024
@giordano giordano requested review from maleadt and gbaraldi August 8, 2024 13:39
@giordano giordano added the compiler:llvm For issues that relate to LLVM label Aug 8, 2024
@maleadt (Member) commented Aug 8, 2024

If I'm interpreting llvm/llvm-project#97975 (comment) correctly, LLVM's problematic excess-precision handling might not be in use for bfloat. If we can verify that, maybe we can get rid of both the BFloat16 demote pass and have_bf16.

The other condition is that all our targets that do not support bfloat (i.e., most of them) properly legalize bfloat instructions to something that they do support. IIRC I encountered some instruction selection failures when first working on this, but those might have been fixed now that we're on a more recent version of LLVM.

@giordano (Contributor, Author) commented Aug 8, 2024

Removing the entire bf16 demotion business would probably be the better option. @gbaraldi pointed out that the bf16 extension on aarch64 doesn't support much, probably only conversions to other float types and dot-product operations: a `fadd bfloat` in LLVM IR is turned into an fadd of floats in native code anyway (https://godbolt.org/z/93YaG19oc). So while this PR correctly detects the bf16 feature on aarch64, what that feature provides isn't quite enough for our demotion pass.

Probably relevant: https://community.arm.com/arm-community-blogs/b/ai-and-ml-blog/posts/bfloat16-processing-for-neural-networks-on-armv8_2d00_a
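Since this PR is about detecting the bf16 feature on aarch64: on Linux the kernel advertises the extension through the auxiliary vector, as `HWCAP2_BF16` in `getauxval(AT_HWCAP2)`. A hedged userspace sketch of such a check (the function name `cpu_has_bf16` is hypothetical and not Julia's actual have_bf16, which tests codegen rather than hwcaps):

```python
import ctypes
import platform

AT_HWCAP2 = 26          # from <sys/auxv.h>
HWCAP2_BF16 = 1 << 14   # from <asm/hwcap.h>, aarch64-specific bit

def cpu_has_bf16() -> bool:
    """Return True if the Linux kernel reports the aarch64 BF16 extension.

    Conservatively returns False on any other OS/architecture, where the
    HWCAP2 bit assignments differ or getauxval is unavailable.
    """
    if platform.system() != "Linux" or platform.machine() != "aarch64":
        return False
    libc = ctypes.CDLL(None, use_errno=True)
    libc.getauxval.restype = ctypes.c_ulong
    libc.getauxval.argtypes = [ctypes.c_ulong]
    return bool(libc.getauxval(AT_HWCAP2) & HWCAP2_BF16)
```

As the comment above notes, though, the hardware bit being set only guarantees conversions and dot-product instructions, not full bfloat arithmetic.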

@giordano (Contributor, Author) commented
Closing in favour of #55479

Successfully merging this pull request may close these issues.

have_bf16 always returns false on Nvidia Grace CPU (neoverse-v2) on Linux