WIP: try integrate BFloat16 #124
base: master
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@            Coverage Diff             @@
##           master     #124      +/-   ##
==========================================
- Coverage   88.77%   88.61%   -0.16%
==========================================
  Files           5        5
  Lines         561      562       +1
==========================================
  Hits          498      498
- Misses         63       64       +1
```
Ah nice, that simplifies my example:

```
julia> v = Vec(ntuple(i->Float32(rand()), Val(16)))
<16 x Float32>[0.35481927, 0.14949146, 0.33511126, 0.23023836, 0.16776331, 0.9152977, 0.19988814, 0.22910726, 0.10502812, 0.54989743, 0.14419909, 0.19571519, 0.21844539, 0.84552854, 0.03142407, 0.9895877]

julia> @code_native convert(Vec{16,BFloat16}, v)
	push	rbp
	mov	rbp, rsp
	mov	rax, rdi
; │ @ /home/sdp/SIMD/src/simdvec.jl:57 within `convert`
; │┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:716 within `fptrunc`
; ││┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:730 within `macro expansion`
	vcvtneps2bf16	ymm0, zmmword ptr [rsi]
; │└└
	vmovups	ymmword ptr [rdi], ymm0
	pop	rbp
	vzeroupper
	ret
```

Is it possible to express a dot product using SIMD.jl? That should ideally result in the other AVX512BF16 instruction being emitted. Other operations, like the …
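In SIMD.jl a dot product would presumably just be `sum(a * b)` on two `Vec`s (whether that actually matches `vdpbf16ps` is a separate question). The scalar semantics that such an expression should vectorize can be sketched in plain Julia; `dot_ref` is an illustrative name with no SIMD.jl dependency:

```julia
# Sketch of the dot-product semantics that a SIMD.jl expression like
# sum(a * b) on two Vec values is expected to compile down to.
# Plain-Julia reference only; not the SIMD.jl implementation.
function dot_ref(a::AbstractVector{T}, b::AbstractVector{T}) where {T<:AbstractFloat}
    @assert length(a) == length(b)
    acc = zero(T)
    @inbounds @simd for i in eachindex(a)  # @simd permits reassociation,
        acc += a[i] * b[i]                 # matching LLVM's `reassoc` fadd
    end
    return acc
end

a = Float32[1, 2, 3, 4]
b = Float32[5, 6, 7, 8]
dot_ref(a, b)  # == 70.0f0
```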
There is …
Hmm, with that the demote to Float32 is likely to get in the way and prevent any potential fusion between the …

```
julia> @code_llvm f(v)
; Function Signature: f(SIMD.Vec{16, Core.BFloat16})
; @ REPL[33]:1 within `f`
define bfloat @julia_f_11886(ptr nocapture noundef nonnull readonly align 16 dereferenceable(32) %"v::Vec") #0 {
top:
; ┌ @ /home/sdp/SIMD/src/simdvec.jl:257 within `*`
; │┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:221 within `fmul` @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:221
; ││┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:230 within `macro expansion`
  %"v::Vec.data_ptr.unbox" = load <16 x bfloat>, ptr %"v::Vec", align 16
  %0 = fpext <16 x bfloat> %"v::Vec.data_ptr.unbox" to <16 x float>
  %1 = fmul <16 x float> %0, %0
  %2 = fptrunc <16 x float> %1 to <16 x bfloat>
; └└└
; ┌ @ /home/sdp/SIMD/src/simdvec.jl:483 within `sum`
; │┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:858 within `reduce_fadd`
; ││┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:874 within `macro expansion`
  %res.i = call reassoc bfloat @llvm.vector.reduce.fadd.v16bf16(bfloat 0xR0000, <16 x bfloat> %2)
  ret bfloat %res.i
; └└└
}
```

I guess that needs some solution at the LLVM level so that we don't need these demotes (i.e., llvm/llvm-project#97975). In the meantime, it would be possible to add explicit calls to e.g. the …
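For reference, the `fptrunc float` → `bfloat` demote that shows up in the IR above is a round-to-nearest-even truncation of the Float32 bit pattern to its top 16 bits. A plain-Julia sketch of that conversion (illustrative helper names; NaN handling omitted for brevity):

```julia
# Sketch of the fptrunc float -> bfloat demote: keep the high 16 bits of
# the Float32 bit pattern, rounding to nearest-even on the discarded bits.
# Illustrative only; real implementations must also special-case NaN.
function f32_to_bf16_bits(x::Float32)::UInt16
    u = reinterpret(UInt32, x)
    # round-to-nearest-even: bias depends on the lowest kept bit
    rounding_bias = UInt32(0x7fff) + ((u >> 16) & 0x1)
    return UInt16((u + rounding_bias) >> 16)
end

# The fpext bfloat -> float promote is exact: just restore the low zeros.
bf16_to_f32(bits::UInt16) = reinterpret(Float32, UInt32(bits) << 16)

bf16_to_f32(f32_to_bf16_bits(1.5f0))  # == 1.5f0, exactly representable
```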
So far, it has worked quite well here to map only to the "generic" LLVM intrinsics and let LLVM choose the actual native instruction. That gives the package a reasonably limited scope compared to supporting n_architectures * n_instructions, and it also means that code written with SIMD.jl is not platform- or vector-size-specific.
The "problem" is that LLVM doesn't have a generic dot-product intrinsic, so any matching of dot-product-like instructions to … Anyway, this isn't terribly important right now. It's a good start that we already match the conversion intrinsics without more invasive changes. And this is only an x86 problem; on some ARM platforms we shouldn't be demoting at all (JuliaLang/julia#55417).
It would be interesting to see how this PR performs on such a system.
Alright, with JuliaLang/julia#55486 the …

```
julia> @code_llvm f(v)
; Function Signature: f(SIMD.Vec{8, Core.BFloat16})
; @ REPL[4]:1 within `f`
define bfloat @julia_f_6882(ptr nocapture noundef nonnull readonly align 16 dereferenceable(16) %"v::Vec") #0 {
top:
; ┌ @ /home/sdp/SIMD/src/simdvec.jl:257 within `*`
; │┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:221 within `fmul` @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:221
; ││┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:230 within `macro expansion`
  %"v::Vec.data_ptr.unbox" = load <8 x bfloat>, ptr %"v::Vec", align 16
  %0 = fmul <8 x bfloat> %"v::Vec.data_ptr.unbox", %"v::Vec.data_ptr.unbox"
; └└└
; ┌ @ /home/sdp/SIMD/src/simdvec.jl:483 within `sum`
; │┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:858 within `reduce_fadd`
; ││┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:874 within `macro expansion`
  %res.i = call reassoc bfloat @llvm.vector.reduce.fadd.v8bf16(bfloat 0xR0000, <8 x bfloat> %0)
  ret bfloat %res.i
; └└└
}
```

On LLVM 18, as used by the current master branch of Julia, that isn't enough to match AVX512BF16 operations other than the conversion to and from single precision, though. LLVM trunk looks better, but still uses single precision: https://godbolt.org/z/obhd33MKc I guess this is as much as we can do in Julia though; …
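Worth noting that the `reassoc` flag on the `llvm.vector.reduce.fadd` call above is what lets LLVM reorder the additions (e.g. into a tree) instead of the strict left-to-right chain. The two orders can be sketched in plain Julia; helper names are illustrative:

```julia
# The reassoc flag on llvm.vector.reduce.fadd permits LLVM to reorder
# the floating-point additions. Two possible orders, sketched in Julia:
seq_sum(v) = foldl(+, v)        # strict left-to-right fadd chain

function tree_sum(v)            # pairwise/tree order, as a vector unit might do
    length(v) == 1 && return v[1]
    mid = length(v) ÷ 2
    return tree_sum(v[1:mid]) + tree_sum(v[mid+1:end])
end

v = fill(0.1f0, 8)
seq_sum(v), tree_sum(v)  # results may differ in the last bits for Float32
```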
Hopefully, yes. |
This is just to see how things behave (#123).

For `v+v` we generate the following LLVM code: … but: …

cc @maleadt
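For context, `v+v` on a `Vec` is a lane-wise addition, which should lower to a single vector `fadd`. A plain-Julia stand-in for that semantics (illustrative names, Float32 lanes in place of BFloat16, no SIMD.jl dependency):

```julia
# Sketch: what v + v means lane-wise for an N-lane vector value.
# Plain-Julia stand-in; a SIMD.jl Vec would lower this to one fadd.
vec_add(a::NTuple{N,T}, b::NTuple{N,T}) where {N,T} = ntuple(i -> a[i] + b[i], N)

v = ntuple(i -> Float32(i), 16)
vec_add(v, v)  # each lane doubled
```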