Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TOPI][TIR][TE][x86] Extend x86 SIMD (u)int8 coverage for dense & conv2d #15918

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

cbalint13
Copy link
Contributor

@cbalint13 cbalint13 commented Oct 11, 2023

This PR enhances x86 SIMD (u)int8 coverage for dense and conv2d operators.
It extends current SIMD support with avx2 & ssse3, and adds a new set of non-overflowing SIMD method.


Tracker:

  • PR for TIR call_{pure}_llvm_intrin pretty print.
  • PR for new TIR intrinsics
  • PR for ArrayIntImm node

This PR will hold only the TOPI part.


Changes:

[x86][TOPI]

  • Extends current tensorizer set, now called fast-math overflowing one, with avx2 and ssse3.
  • Adds a new tensorizer set, a "precision" non-overflowing one, supporting: avx512, avx2 and ssse3.
  • Enable proper operator data alignment to gurantee any data (even smallest) will fit the SIMD vector width.

[TIR][LLVM]

  • Adds new TIR ops mapped to LLVM instrinsics: zextend, sextend, truncate for type conversions.
  • Adds new TIR ops mapped to LLVM instrinsics: vectorpermute, vectorshuffle for vector manipulation.
  • The call_llvm_pure_intrin & call_llvm_intrin now holds instruction StringImm instead of IntImm abstract.
  • Enables TIR op atomic_add mapped to proper LLVM intrinsic guarnteed (best-effort) to lower to single instruction.

[TE]

  • Introduce new ArrayIntImm expression for small immediate list of integer constants.

[Target]

  • Introduce a flag -key=cpu,fast-math to switch from the precise SIMD (default) to the overflowing SIMD set.

Performance

For the new avx2 & ssse3 the fast vs. precise SIMD sets:

$ python3 tests/python/contrib/test_gemm_acc32_simd.py
Task tensorized: {True } [llvm -mcpu=ivybridge                     ], running time: 3.655 ms, 587.58 Gops/s
Task tensorized: {True } [llvm -mcpu=ivybridge -keys=cpu,fast-math ], running time: 3.678 ms, 583.86 Gops/s
Task tensorized: {True } [llvm -mcpu=haswell                       ], running time: 3.708 ms, 579.09 Gops/s
Task tensorized: {True } [llvm -mcpu=haswell -keys=cpu,fast-math   ], running time: 3.668 ms, 585.52 Gops/s
Task tensorized: {False} [llvm -mcpu=ivybridge                     ], running time: 41.152 ms, 52.18 Gops/s
Task tensorized: {False} [llvm -mcpu=haswell                       ], running time: 41.194 ms, 52.13 Gops/s

Notes

  • Precision (non fast-math) is the default now.
  • x86 amx and vnni schedules remains unchanged, their specific intrinsics never overflows.
  • The zextend, sextend, truncate lowers on x86 into single specialized instruction e.g: punpcklwd & punpckhwd
  • The vectorpermute, vectorshuffle also lowers on x86 into appropriate single specialized instruction.
  • ArrayIntImm is for the new ops: tir.vectorpermute("int32x8", whatever_vector, [0, 1, 4, 5, 2, 3, 6, 7])
  • The fast-math mode will always warn the user:
    Using `fast-math` may overflow, make sure ranges for either data is [0,128] or weight is [-64,+64]
  • TIR printer (for debug purpose) will now list the very name of the instruction, not an abstract IntImm:
    {...} T.call_llvm_pure_intrin("int32x4", "llvm.x86.sse2.pmadd.wd", T.uint32(2) {....}

Samples

Lowering results for the ssse3 case.

The precise one:

@I.ir_module
class Module:
    @T.prim_func
    def tvmgen_default_fused_nn_contrib_dense_pack(
                    p0: T.Buffer((4, 4), "uint8"), 
                    p1: T.Buffer((1, 1, 4, 4), "int8"),
                    compute: T.Buffer((4, 4), "int32")):
        T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)})
        for i_inner in range(4):
            compute_1 = T.Buffer((16,), "int32", data=compute.data)
            compute_1[i_inner * 4:i_inner * 4 + 4] = T.Broadcast(0, 4)
            p0_1 = T.Buffer((16,), "uint8", data=p0.data)
            p1_1 = T.Buffer((16,), "int8", data=p1.data)
            compute_1[i_inner * 4:i_inner * 4 + 4] = 
              T.call_llvm_pure_intrin("int32x4", "llvm.x86.ssse3.phadd.d.128", T.uint32(2), 
                T.call_llvm_pure_intrin("int32x4", "llvm.x86.sse2.pmadd.wd", T.uint32(2), 
                  T.vectorlow("void", T.zextend("int16x16", 
                    T.reinterpret("int8x16", T.Broadcast(
                      T.reinterpret("int32", p0_1[i_inner * 4:i_inner * 4 + 4]), 4)))),
                        T.vectorlow("void", T.sextend("int16x16", p1_1[0:16]))),
                          T.call_llvm_pure_intrin("int32x4", "llvm.x86.sse2.pmadd.wd", T.uint32(2),
                            T.vectorhigh("void", T.zextend("int16x16", T.reinterpret("int8x16", 
                              T.Broadcast(T.reinterpret("int32", p0_1[i_inner * 4:i_inner * 4 + 4]), 4)))),
                                T.vectorhigh("void", T.sextend("int16x16", p1_1[0:16])))) 
                                  + compute_1[i_inner * 4:i_inner * 4 + 4]
000000000001e90 <tvmgen_default_fused_nn_contrib_dense_pack_compute_>:
    1e90:	c4 e2 79 18 16       	vbroadcastss (%rsi),%xmm2
    1e95:	c5 f9 ef c0          	vpxor  %xmm0,%xmm0,%xmm0
    1e99:	c5 e9 68 d8          	vpunpckhbw %xmm0,%xmm2,%xmm3
    1e9d:	c4 e2 79 20 4a 08    	vpmovsxbw 0x8(%rdx),%xmm1
    1ea3:	c4 e2 79 30 e2       	vpmovzxbw %xmm2,%xmm4
    1ea8:	c4 e2 79 20 12       	vpmovsxbw (%rdx),%xmm2
    1ead:	c5 d9 f5 e2          	vpmaddwd %xmm2,%xmm4,%xmm4
    1eb1:	c5 e1 f5 d9          	vpmaddwd %xmm1,%xmm3,%xmm3
    1eb5:	c4 e2 59 02 eb       	vphaddd %xmm3,%xmm4,%xmm5
    {...}

define internal fastcc void @tvmgen_default_fused_nn_contrib_dense_pack_compute {
entry:
  {...}
  %3 = load i32, ptr %1, align 64, !tbaa !310
  %4 = insertelement <4 x i32> undef, i32 %3, i64 0
  %5 = bitcast <4 x i32> %4 to <16 x i8>
  %6 = shufflevector <16 x i8> %5, <16 x i8> poison, <16 x i32> 
           <i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, 
            i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3>
  %7 = zext <16 x i8> %6 to <16 x i16>
  %8 = shufflevector <16 x i16> %7, <16 x i16> poison, <8 x i32>
           <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %9 = load <16 x i8>, ptr %2, align 64, !tbaa !312
  %10 = sext <16 x i8> %9 to <16 x i16>
  %11 = shufflevector <16 x i16> %10, <16 x i16> poison, <8 x i32>
             <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %12 = tail call <4 x i32> @llvm.x86.sse2.pmadd.wd(<8 x i16> %8, <8 x i16> %11)
  %13 = shufflevector <16 x i16> %7, <16 x i16> poison, <8 x i32>
              <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %14 = shufflevector <16 x i16> %10, <16 x i16> poison, <8 x i32>
             <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %15 = tail call <4 x i32> @llvm.x86.sse2.pmadd.wd(<8 x i16> %13, <8 x i16> %14)
  %16 = tail call <4 x i32> @llvm.x86.ssse3.phadd.d.128(<4 x i32> %12, <4 x i32> %15)
  %17 = getelementptr inbounds i8, ptr %1, i64 4
  {...}

The fast-math one:

@I.ir_module
class Module:
    @T.prim_func
    def tvmgen_default_fused_nn_contrib_dense_pack(
                  p0: T.Buffer((4, 4), "uint8"), 
                  p1: T.Buffer((1, 1, 4, 4), "int8"), 
                  compute: T.Buffer((4, 4), "int32")):
        T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)})
        for i_inner in range(4):
            compute_1 = T.Buffer((16,), "int32", data=compute.data)
            compute_1[i_inner * 4:i_inner * 4 + 4] = T.Broadcast(0, 4)
            p0_1 = T.Buffer((16,), "uint8", data=p0.data)
            p1_1 = T.Buffer((16,), "int8", data=p1.data)
            compute_1[i_inner * 4:i_inner * 4 + 4] = 
              T.call_llvm_pure_intrin("int32x4", "llvm.x86.sse2.pmadd.wd", T.uint32(2), 
                T.call_llvm_pure_intrin("int16x8", "llvm.x86.ssse3.pmadd.ub.sw.128", T.uint32(2), 
                  T.reinterpret("int8x16", T.Broadcast(
                    T.reinterpret("int32", p0_1[i_inner * 4:i_inner * 4 + 4]), 4)), p1_1[0:16]), 
                      T.Broadcast(T.int16(1), 8)) + compute_1[i_inner * 4:i_inner * 4 + 4]
0000000000001e90 <tvmgen_default_fused_nn_contrib_dense_pack_compute_>:
    1e90:	c4 e2 79 18 06       	vbroadcastss (%rsi),%xmm0
    1e95:	c5 f9 6f 12          	vmovdqa (%rdx),%xmm2
    1e99:	c5 f9 6f 4a 10       	vmovdqa 0x10(%rdx),%xmm1
    1e9e:	c4 e2 79 04 da       	vpmaddubsw %xmm2,%xmm0,%xmm3
    1ea3:	c4 e2 79 18 05 e4 11 	vbroadcastss 0x11e4(%rip),%xmm0        # 3090 <_fini+0x620>
    1eaa:	00 00 
    1eac:	c5 e1 f5 d8          	vpmaddwd %xmm0,%xmm3,%xmm3

define internal fastcc void @tvmgen_default_fused_nn_contrib_dense_pack_compute_{
entry:
  {...}
  %3 = load i32, ptr %1, align 64, !tbaa !310
  %4 = insertelement <4 x i32> undef, i32 %3, i64 0
  %5 = bitcast <4 x i32> %4 to <16 x i8>
  %6 = shufflevector <16 x i8> %5, <16 x i8> poison, <16 x i32>
            <i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3,
            i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3>
  %7 = load <16 x i8>, ptr %2, align 64, !tbaa !312
  %8 = tail call <8 x i16> @llvm.x86.ssse3.pmadd.ub.sw.128(<16 x i8> %6, <16 x i8> %7)
  %9 = tail call <4 x i32> @llvm.x86.sse2.pmadd.wd(<8 x i16> %8, <8 x i16> 
            <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>)
  %10 = getelementptr inbounds i8, ptr %1, i64 4
  {...}

Credits

There is a compact full x86 SIMD table guide here.
This work here follows some suggestions from intel's onednn int8 compute notes.

Next

(WiP) This work here will be extended to metaschedule auto-tensorization.
(WiP) Will try enable int4 (not native) using best possible SIMD bit manipulation.


Cc: @masahi , @anijain2305, @jianyuh, @Qianshui-Jiang, @kparzysz-quic , @junrushao , @tqchen , @elvin-n , @vvchernov , @echuraev

@cbalint13 cbalint13 force-pushed the x86-simd branch 15 times, most recently from dc1b629 to fe30368 Compare October 13, 2023 10:22
@cbalint13 cbalint13 marked this pull request as ready for review October 13, 2023 21:41
@ekalda
Copy link
Contributor

ekalda commented Oct 16, 2023

Very interesting PR!

Introduce new ArrayIntImm expression for small immediate list of integer constants.

Can you not use AllocateConst? It's a bit awkward to introduce a whole new TIR node to support a small set of intrinsics.

The call_llvm_pure_intrin & call_llvm_intrin now holds instruction StringImm instead of IntImm abstract.

I like this a lot, would make the TIR much easier to reason about!

@cbalint13
Copy link
Contributor Author

cbalint13 commented Oct 16, 2023

Very interesting PR!

Thank you @ekalda !

Introduce new ArrayIntImm expression for small immediate list of integer constants.

  • Main issue is that within LLVM some of x86 (also other arches) instructions are not exposed at all.
  • So, I had to look into adding zextend, sextend, truncate plus the vectorpermute, vectorshuffle instead.

The good point is that these are lowered to exactly what is needed (even single insn, optim) for the target arch (x86 here).

Can you not use AllocateConst? It's a bit awkward to introduce a whole new TIR node to support a small set of intrinsics.

Hmm no, AllocateConst looked bit too complicated to me (not so "immediate").

  • There are already the {Float,Int,String}Imm so why not ArrayIntImm.
  • It is needed for simple arguments of TIR functions, not representing real data (external, buffer allocations).
  • See the simple usage from python: tir.vectorpermute("int32x8", whatever_vector, [0, 1, 4, 5, 2, 3, 6, 7])
  • See the lowered IR (llvm -mcpu=haswell) as {..} T.ArrayIntImm([0, 1, 4, 5, 2, 3, 6, 7]) {...}:
@I.ir_module
class Module:
    @T.prim_func
    def tvmgen_default_fused_nn_contrib_dense_pack(
           p0: T.Buffer((4, 8), "uint8"), 
           p1: T.Buffer((1, 2, 8, 4), "int8"), 
           compute: T.Buffer((4, 8), "int32")):
        T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)})
        for i_inner in range(4):
          compute_1 = T.Buffer((32,), "int32", data=compute.data)
          compute_1[i_inner * 8:i_inner * 8 + 8] = T.Broadcast(0, 8)
          for k_outer in range(2):
            p0_1 = T.Buffer((32,), "uint8", data=p0.data)
            p1_1 = T.Buffer((64,), "int8", data=p1.data)
            compute_1[i_inner * 8:i_inner * 8 + 8] =
            T.vectorpermute("int32x8", 
            T.call_llvm_pure_intrin("int32x8", "llvm.x86.avx2.phadd.d", T.uint32(2),
            T.call_llvm_pure_intrin("int32x8", "llvm.x86.avx2.pmadd.wd",  T.uint32(2),
            T.vectorlow("void", T.zextend("int16x32", T.reinterpret("int8x32", T.Broadcast(
            T.reinterpret("int32", p0_1[i_inner * 8 + k_outer * 4:i_inner * 8 + k_outer * 4 + 4]), 8)))), 
            T.vectorlow("void", T.sextend("int16x32", p1_1[k_outer * 32:k_outer * 32 + 32]))), 
            T.call_llvm_pure_intrin("int32x8", "llvm.x86.avx2.pmadd.wd", T.uint32(2),
            T.vectorhigh("void", 
            T.zextend("int16x32", T.reinterpret("int8x32",  T.Broadcast(
            T.reinterpret("int32", p0_1[i_inner * 8 + k_outer * 4:i_inner * 8 + k_outer * 4 + 4]), 8)))), 
            T.vectorhigh("void", 
            T.sextend("int16x32", p1_1[k_outer * 32:k_outer * 32 + 32])))), 
            T.ArrayIntImm([0, 1, 4, 5, 2, 3, 6, 7])) + compute_1[i_inner * 8:i_inner * 8 + 8]

The call_llvm_pure_intrin & call_llvm_intrin now holds instruction StringImm instead of IntImm abstract.

I like this a lot, would make the TIR much easier to reason about!

@ekalda

The work here (x86) is a pseudo kind of "scalable-vector" having _m128, _m256, _m512 but "hand unrolled" ones.
I also follow your RFC related to scalable vectors, I am interested in similar ideas for the riscv64 "v" extension.

@ekalda
Copy link
Contributor

ekalda commented Oct 16, 2023

  • So, I had to look into adding zextend, sextend, truncate plus the vectorpermute, vectorshuffle instead.

The good point is that these are lowered to exactly what is needed (even single insn, optim) for the target arch (x86 here).

Yes, all of these intrinsics are architecture independent in LLVM, so this is a great addition from the point of other backends as well.

Regarding to using AllocateConst, I agree that it is designed (and named) with the intent of holding larger chunks of runtime data, but it is still essentially a TIR container for constants that you can query in the LLVM codegen.

  • There are already the {Float,Int,String}Imm so why not ArrayIntImm.

{Float,Int,String}Imm are not duplicating functionality :)

  • See the simple usage from python: tir.vectorpermute("int32x8", whatever_vector, [0, 1, 4, 5, 2, 3, 6, 7])
idx_vec = T.allocate_const([0, 1, 4, 5, 2, 3, 6, 7], "int32")
tir.vectorpermute("int32x8", whatever_vector, idx_vec)

is not more complex in my opinion :) I appreciate the convenience of printing the index array inplace in TIR though.

In general, I won't argue against this change if there is a wider consensus that this is a necessary addition. However, I think we have to think carefully about adding another mechanism into TIR for representing data arrays that is opaque to the memory planning. From what I can see, there are no restrictions to the size of the data it can hold, so it's rather susceptible to misuse.

I also follow your RFC related to scalable vectors, I am interested in similar ideas for the riscv64 "v" extension.

That's cool! Yes, I hope we can come up with a design that is going to work for all the scalable vector architectures out there. Feel free to chip in with your thoughts there!

@cbalint13
Copy link
Contributor Author

idx_vec = T.allocate_const([0, 1, 4, 5, 2, 3, 6, 7], "int32")
tir.vectorpermute("int32x8", whatever_vector, idx_vec)

is not more complex in my opinion :) I appreciate the convenience of printing the index array inplace in TIR though.

Hmm ... that's really short !
I don't know why I found AllocateConst (and stayed convinced) that is something complicated for "immediate" use.

buffer_var = tir.Var("v", tvm.ir.PointerType(tvm.ir.PrimType("int32")))
ir_expected = tir.AllocateConst(
buffer_var,
"int32",
[10],
ndarray.array(np.asarray(data, "int32")),
tir.Evaluate(1),
annotations={},
)

I also find attracted to pass a plain simple python list().

In general, I won't argue against this change if there is a wider consensus that this is a necessary addition. However, I think we have to think carefully about adding another mechanism into TIR for representing data arrays that is opaque to the memory planning. From what I can see, there are no restrictions to the size of the data it can hold, so it's rather susceptible to misuse.

For the misuse part, yes, also agree, but my understanding of AllocateConst is that inherits many properties of the buffers, while ArrayIntImm is more restrained kind of object treatable only as plain "constant" within TIR parts.

Well, I reconsider AllocateConst if there is no consensus.
Thanks for pointing this out with a succinct example !

@cbalint13
Copy link
Contributor Author

cbalint13 commented Oct 16, 2023

idx_vec = T.allocate_const([0, 1, 4, 5, 2, 3, 6, 7], "int32")
tir.vectorpermute("int32x8", whatever_vector, idx_vec)

is not more complex in my opinion :) I appreciate the convenience of printing the index array inplace in TIR though.

Hmm ... that's really short ! I don't know why I found AllocateConst (and stayed convinced) that is something complicated for "immediate" use.

I add these few lines here showing more contrasted aspects against/pro ArrayIntImm vs. AllocateConst.

  • Tried to stick with AllocateConst, by default it does not come at a cheap:
        idx_vec = tvm.tir.allocate_const([0, 1, 4, 5, 2, 3, 6, 7], "int32")
                  ^^^^^^^^^^^^^^^^^^^^^^
    AttributeError: module 'tvm.tir' has no attribute 'allocate_const'. Did you mean: 'AllocateConst'?
    
    https://tvm.apache.org/docs/reference/api/python/tir.html#tvm.tir.AllocateConst
      idx_vec = tvm.tir.AllocateConst([0, 1, 4, 5, 2, 3, 6, 7], "int32")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    TypeError: AllocateConst.__init__() missing 3 required 
     positional arguments: 'extents', 'data_or_idx', and 'body'
    
  • Can be "aliased" or something to get the simple form: tir.allocate_const([0, 1, 4, 5, 2, 3, 6, 7], "int32")
  • Also remembered a remark regarding "the not exposed yet status" of tir.allocate_const at this point in the codebase.

I remain with the idea to use AllocateConst if there will be no consensus on ArrayIntImm.

@ekalda
Copy link
Contributor

ekalda commented Oct 19, 2023

Sorry for delay on this, I was on training for two days -

  • Tried to stick with AllocateConst, by default it does not come at a cheap:

Ah oops I had quoted the TVMScript interface there and didn't also realise all of these arguments are required. I suppose these could be made optional if this would make AllocateConst usable for wider range of use cases.

  • Also remembered a remark regarding "the not exposed yet status" of tir.allocate_const at this point in the codebase.

There's an in flight patch where @Lunderberg has done a great job in integrating the AllocateConst into microTVM implementation (it just keeps failing the CI 🙈).

Also, it seems to me that there are a few core compiler changes in this patch that are needed for the new TOPI schedules, but are in essence target independent changes that would warrant separate discussions. How do you feel about breaking this patch up into smaller patches? From eyeballing the patch, e.g.

  1. Changes to call_llvm_intrin and call_llvm_pure_intrin
  2. ArrayIntImm node
  3. New TIR intrinsics
  4. TOPI schedules

@cbalint13
Copy link
Contributor Author

There's an in flight patch where @Lunderberg has done a great job in integrating the AllocateConst into microTVM implementation (it just keeps failing the CI 🙈).

Also, it seems to me that there are a few core compiler changes in this patch that are needed for the new TOPI schedules, but are in essence target independent changes that would warrant separate discussions.

See now, thanks for pointing this out !

How do you feel about breaking this patch up into smaller patches? From eyeballing the patch, e.g.

  1. Changes to call_llvm_intrin and call_llvm_pure_intrin
  2. ArrayIntImm node
  3. New TIR intrinsics
  4. TOPI schedules

That was exactly what I was thinking, so I will split this up.
I created a tracking checker-box in the first comment of this PR.

@Lunderberg
Copy link
Contributor

Thank you for pointing out #15300. It's been on the back-burner for quite a while, as a pre-cursor to (hopefully/finally) landing #14985. If anybody who is more familiar with tvm.contrib.ethosu implementation, any help on resolving the breakages would be appreciated.

@cbalint13 cbalint13 marked this pull request as draft October 24, 2023 00:16
ekalda pushed a commit that referenced this pull request Oct 25, 2023
This allows printing of the LLVM function real name in TIR printer.
Prior to this a counter-intuitive T.int32() value was printed instead of the real name.
Changes

Before: T.call_llvm_pure_intrin("int32x4", T.uint32(62), T.uint32(0))
After: T.call_llvm_pure_intrin("int32x4", "llvm.donothing", T.uint32(0))

This is part of #15918 .
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants