[TOPI][TIR][TE][x86] Extend x86 SIMD (u)int8 coverage for dense & conv2d #15918
Very interesting PR!
Can you not use
I like this a lot, it would make the TIR much easier to reason about!
Thank you @ekalda !
The good point is that these are lowered to exactly what is needed (even a single instruction, optimal) for the target arch (x86 here).
Hmm, no.
The work here (x86) is a pseudo kind of "scalable vector", having _m128, _m256, _m512, but as "hand-unrolled" ones.
Yes, all of these intrinsics are architecture independent in LLVM, so this is a great addition from the point of view of other backends as well. Regarding using
is not more complex in my opinion :) I appreciate the convenience of printing the index array in place in TIR, though. In general, I won't argue against this change if there is a wider consensus that it is a necessary addition. However, I think we have to think carefully about adding another mechanism into TIR for representing data arrays that is opaque to the memory planning. From what I can see, there are no restrictions on the size of the data it can hold, so it's rather susceptible to misuse.
That's cool! Yes, I hope we can come up with a design that is going to work for all the scalable vector architectures out there. Feel free to chip in with your thoughts there!
Hmm... that's really short! (see `tvm/tests/python/unittest/test_tvmscript_ir_builder_tir.py`, lines 386 to 394 at a13eadb)
I also find it attractive to pass a plain, simple Python
For the misuse part, yes, I also agree, but my understanding of
Well, I reconsider
I added these few lines here showing more contrasted against/pro aspects of
I remain with the idea to use
Sorry for the delay on this, I was in training for two days -
Ah, oops, I had quoted the TVMScript interface there and also didn't realise all of these arguments are required. I suppose these could be made optional if this would make
There's an in-flight patch where @Lunderberg has done a great job integrating the
Also, it seems to me that there are a few core compiler changes in this patch that are needed for the new TOPI schedules but are, in essence, target-independent changes that would warrant separate discussions. How do you feel about breaking this patch up into smaller patches? From eyeballing the patch, e.g.
I see now, thanks for pointing this out!
That was exactly what I was thinking, so I will split this up.
This allows printing the LLVM intrinsic's real name in the TIR printer. Prior to this, a counter-intuitive `T.int32()` value was printed instead of the real name.

Before: `T.call_llvm_pure_intrin("int32x4", T.uint32(62), T.uint32(0))`
After: `T.call_llvm_pure_intrin("int32x4", "llvm.donothing", T.uint32(0))`

This is part of #15918.
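As a hedged sketch of what the printer change amounts to: the call node used to carry only a numeric intrinsic id, which printed as an opaque `T.uint32(...)`; resolving it to the intrinsic's textual name makes the script readable. The lookup table and function below are hypothetical illustrations, not TVM's actual code (TVM asks LLVM for the name); the `62 -> llvm.donothing` pair mirrors the before/after example above.

```python
# Hypothetical id -> name table; TVM's real printer queries LLVM instead.
INTRIN_NAMES = {62: "llvm.donothing"}

def print_call_llvm_pure_intrin(dtype, intrin_id, *args):
    # Resolve the numeric intrinsic id to its textual name when known,
    # otherwise fall back to the old opaque T.uint32(...) rendering.
    name = INTRIN_NAMES.get(intrin_id)
    head = f'"{name}"' if name else f"T.uint32({intrin_id})"
    tail = ", ".join(args)
    return f'T.call_llvm_pure_intrin("{dtype}", {head}, {tail})'

print(print_call_llvm_pure_intrin("int32x4", 62, "T.uint32(0)"))
# -> T.call_llvm_pure_intrin("int32x4", "llvm.donothing", T.uint32(0))
```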
This PR enhances x86 SIMD (u)int8 coverage for the dense and conv2d operators.
It extends the current SIMD support with avx2 & ssse3, and adds a new set of non-overflowing SIMD methods.

Tracker:
- `call_{pure}_llvm_intrin` pretty print
- `ArrayIntImm` node

This PR will hold only the TOPI part.
Changes:

[x86][TOPI]
- New `fast-math` overflowing SIMD set, with `avx2` and `ssse3`.
- Precise SIMD set, with `avx512`, `avx2` and `ssse3`.

[TIR][LLVM]
- `zextend`, `sextend`, `truncate` for type conversions.
- `vectorpermute`, `vectorshuffle` for vector manipulation.
- `call_llvm_pure_intrin` & `call_llvm_intrin` now hold the instruction as a `StringImm` instead of an abstract `IntImm`.
- `atomic_add` mapped to the proper LLVM intrinsic, guaranteed (best-effort) to lower to a single instruction.

[TE]
- `ArrayIntImm` expression for small immediate lists of integer constants.

[Target]
- `-key=cpu,fast-math` to switch from the precise SIMD (default) to the overflowing SIMD set.

Performance
For the new avx2 & ssse3, the `fast` vs. `precise` SIMD sets:

Notes
- The precise SIMD set (not `fast-math`) is the default now.
- `amx` and `vnni` schedules remain unchanged; their specific intrinsics never overflow.
- `zextend`, `sextend`, `truncate` lower on x86 into single specialized instructions, e.g. `punpcklwd` & `punpckhwd`.
- `vectorpermute`, `vectorshuffle` also lower on x86 into the appropriate single specialized instruction.
- `ArrayIntImm` is for the new ops: `tir.vectorpermute("int32x8", whatever_vector, [0, 1, 4, 5, 2, 3, 6, 7])`
- `fast-math` mode will always warn the user: `Using fast-math may overflow, make sure ranges for either data is [0,128] or weight is [-64,+64]`
`{...} T.call_llvm_pure_intrin("int32x4", "llvm.x86.sse2.pmadd.wd", T.uint32(2) {....}`
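The lane-wise semantics of the new conversion and shuffle ops listed above can be sketched in NumPy terms. The function names mirror the TIR ops in this PR; the NumPy bodies are illustrative analogies only, with example lane widths chosen for the int8 use case:

```python
import numpy as np

def zextend(vec_u8):
    """zextend: widen lanes with zero-extension (u8 -> u16 here)."""
    return vec_u8.astype(np.uint16)

def sextend(vec_i8):
    """sextend: widen lanes with sign-extension (i8 -> i16 here)."""
    return vec_i8.astype(np.int16)

def truncate(vec_i16):
    """truncate: narrow lanes, keeping only the low bits (i16 -> i8 here)."""
    return vec_i16.astype(np.int8)

def vectorpermute(vec, idx):
    """vectorpermute: reorder lanes by a compile-time immediate index
    list (the ArrayIntImm use case from the Notes above)."""
    return vec[np.array(idx)]

v = np.arange(8, dtype=np.int32)  # lanes 0..7 of an int32x8 vector
print(vectorpermute(v, [0, 1, 4, 5, 2, 3, 6, 7]))  # [0 1 4 5 2 3 6 7]
print(sextend(np.array([-1, 127], dtype=np.int8)))  # [-1 127] as int16
print(truncate(np.array([300], dtype=np.int16)))    # [44] low 8 bits of 300
```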
Samples

Lowering results for the `ssse3` case.

The `precise` one:

The `fast-math` one:

Credits
There is a compact, full x86 SIMD table guide here.
This work follows some suggestions from Intel's oneDNN int8 compute notes.
Next

- (WiP) This work will be extended to metaschedule auto-tensorization.
- (WiP) Will try to enable `int4` (not native) using the best possible SIMD bit manipulation.

Cc: @masahi, @anijain2305, @jianyuh, @Qianshui-Jiang, @kparzysz-quic, @junrushao, @tqchen, @elvin-n, @vvchernov, @echuraev