simplify karg in device/grid of split-k op #644

carlushuang · 2023-03-20T08:17:37Z

No description provided.

aosewski · 2023-03-23T13:59:15Z

include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdlops_v2r4r2.hpp

+        // clang-format off
+        str << "GemmXdlSplitKCShuffle_"
+            << getGemmSpecializationString(GemmSpec) << "_"
+            << LStr<ALayout>::Get() << LStr<BLayout>::Get() << LStr<CLayout>::Get() << "_"


We already have an overloaded stream operator for layouts: https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/develop/include/ck/tensor_operation/gpu/device/tensor_layout.hpp#L410-L414. Note that it prints the "full" names, such as "RowMajor", rather than short ones.

zjing14 · 2023-03-28T15:33:08Z

@carlushuang Have you finished all the changes? Is the PR ready for review?

simplify karg in device/grid split-k op

fd26911

zjing14 requested review from asroy and zjing14 March 22, 2023 13:59

carlushuang added 2 commits March 22, 2023 16:42

fix mk_kn_mn instances

e21267f

Merge remote-tracking branch 'origin/develop' into simplified_karg_unify_op

2456e9c

aosewski reviewed Mar 23, 2023

carlushuang mentioned this pull request Mar 23, 2023

Create tensor descriptor inside kernel to improve performance of small/tiny gemm cases #596

Closed

carlushuang added 4 commits March 23, 2023 18:48

add more instances

2863635

Merge remote-tracking branch 'origin/develop' into simplified_karg_unify_op

115f8a4

Merge remote-tracking branch 'origin/develop' into simplified_karg_unify_op

6ef19ca

use name from tensor layout

1e7f00d

asroy approved these changes Mar 30, 2023

asroy merged commit bb5530a into develop Mar 30, 2023

junliume added a commit that referenced this pull request Apr 6, 2023

Revert "simplify karg in device/grid of split-k op (#644)"

abbc239

This reverts commit bb5530a.

junliume added a commit that referenced this pull request Apr 7, 2023

Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665)

3248387

This reverts commit bb5530a.

carlushuang mentioned this pull request Apr 25, 2023

Grouped Gemm + SplitK + simplified Kernel Args #669

Merged

poyenc mentioned this pull request May 4, 2023

Simplify kernel argument of device operator DeviceGemm_Xdl_CShuffle<> #696

Merged
