Support gfx950 layouts #692

zhanglx13 · 2024-12-26T05:46:42Z

No description provided.

- When kWidth is large, use a smaller elemSize horizontally to save space
- Improve the labels, such as
  - change vec to kWidth for operands
  - change opA/opB to inA/inB and include operand dims
  - remove group dims in the operands so that they don't overlap with operand block dims
- Better alignment: dot op and mfma zoomed-in pics are bottom aligned

- Fixed the issue with maxPhase computation. Need to submit a PR to fix it in the triton compiler
- For ds_read_b64 with 64 banks, there are bank conflicts. We need to figure out a different swizzling pattern to avoid bank conflicts.
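To make the bank-conflict remark concrete, here is a minimal sketch of how a byte offset in LDS maps to banks. It assumes each bank is 4 bytes wide and that consecutive 4-byte words go to consecutive banks; `banks_touched` and `BANK_WIDTH_BYTES` are hypothetical names for illustration and are not taken from plot_layout.py. It does not model how the hardware schedules a ds_read_b64, only which banks an access lands in.

```python
BANK_WIDTH_BYTES = 4  # assumption: each LDS bank is 4 bytes wide


def banks_touched(byte_offset: int, access_bytes: int, num_banks: int) -> list[int]:
    """Return the banks covered by one access (hypothetical helper).

    Consecutive 4-byte words are assumed to map to consecutive banks,
    wrapping around after num_banks words.
    """
    first_word = byte_offset // BANK_WIDTH_BYTES
    last_word = (byte_offset + access_bytes - 1) // BANK_WIDTH_BYTES
    return [word % num_banks for word in range(first_word, last_word + 1)]


# An 8-byte (b64) access at byte offset 128:
print(banks_touched(128, 8, num_banks=32))  # wraps around with 32 banks
print(banks_touched(128, 8, num_banks=64))  # no wrap yet with 64 banks
```

Under these assumptions, two threads whose accesses return overlapping bank lists for different addresses are candidates for a conflict, which is the situation a swizzling pattern tries to avoid.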

jtang10 · 2025-01-24T03:44:36Z

Aside from using it a bit to get a feel for whether anything is hard to use, the only comment I have is that it might be worthwhile to break the Python file into several pieces to make it more modular. The tex files are already split into several files, so it makes sense for the Python code to follow suit, for readability and future-proofing.

plot_layout.py itself is already quite modular, with the new dataclass and many small utility functions, so this should be fairly easy to do.

jtang10

LGTM. We can address the modularity and any feedback from usage in the future.

zhanglx13 force-pushed the support_layout_gfx950 branch 2 times, most recently from 7910a91 to fd5c641 Compare December 30, 2024 16:12

zhanglx13 force-pushed the support_layout_gfx950 branch 2 times, most recently from 7370be5 to 6bfb674 Compare January 8, 2025 20:33

zhanglx13 marked this pull request as ready for review January 8, 2025 20:34

zhanglx13 added 17 commits January 15, 2025 21:21

Move preamble code into tikzplot.tex

abf4bde

Rename kpack to kWidth and allow kWidth = 32

1965058

[API change] Take user input to set dim names

8f21efa

API change:
- For blocked layout, use -tensorShape, which only takes two dims as dim0,dim1
- For dot layout, use -dotShape, which takes three dims as M,N,K
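As a rough illustration of the two options, the sketch below parses them with `argparse`. Only the flag names `-tensorShape` and `-dotShape` and the dimension order come from the commit message; the space-separated integer format, the parser itself, and the example values are assumptions, and the real plot_layout.py may accept the dims differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; not the actual plot_layout.py argument handling."""
    parser = argparse.ArgumentParser(description="layout plotting options (sketch)")
    # Blocked layout: exactly two dims, dim0 then dim1.
    parser.add_argument("-tensorShape", nargs=2, type=int,
                        metavar=("DIM0", "DIM1"), default=None)
    # Dot layout: exactly three dims, M then N then K.
    parser.add_argument("-dotShape", nargs=3, type=int,
                        metavar=("M", "N", "K"), default=None)
    return parser


# Example values are illustrative only.
args = build_parser().parse_args(["-dotShape", "128", "128", "64"])
M, N, K = args.dotShape
print(M, N, K)
```

Fixing the arity per option (two versus three dims) lets the parser reject a shape that does not match the chosen layout before any plotting starts.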

Re-structure files

e1d9ab4

Separate each layout's code into their own files

[API change] Add support for kGroup

dfc59fa

kGroup is defined as total elements per thread / kWidth for one mfma instruction. We need kGroup = 2 only for the newly added mfma_f32_16x16x128_f8f6f4 and mfma_f32_32x32x64_f8f6f4 with f8 input type on MI350.
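The definition above can be checked with a small calculation. This is a sketch under stated assumptions: a wavefront of 64 threads, operand elements spread evenly across those threads, and illustrative kWidth values of 16 and 32. The function name `k_group` is hypothetical and not part of plot_layout.py.

```python
WAVE_SIZE = 64  # assumption: one mfma instruction is executed by a 64-thread wavefront


def k_group(non_k_dim: int, k_dim: int, k_width: int,
            wave_size: int = WAVE_SIZE) -> int:
    """kGroup = (operand elements per thread) / kWidth for one mfma instruction."""
    elems_per_thread, rem = divmod(non_k_dim * k_dim, wave_size)
    if rem:
        raise ValueError("operand elements do not divide evenly across threads")
    groups, rem = divmod(elems_per_thread, k_width)
    if rem:
        raise ValueError("elements per thread is not a multiple of kWidth")
    return groups


# 16x128 operand: 2048 elements / 64 threads = 32 elements per thread.
print(k_group(16, 128, k_width=16))  # 32 / 16 -> kGroup = 2
# 32x64 operand: also 32 elements per thread.
print(k_group(32, 64, k_width=16))   # 32 / 16 -> kGroup = 2
# With kWidth = 32 the same operand needs a single group.
print(k_group(16, 128, k_width=32))  # 32 / 32 -> kGroup = 1
```

With kWidth = 16 assumed for the f8 case, both new instructions give kGroup = 2, which matches the commit message; the kWidth values themselves are not stated in the source.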

[API change] Add support for data types of both operands

d4307d0

Also print the mfma instruction name accordingly. Mixed-precision mfma between 8-bit and 4- or 6-bit types is not supported yet.

Support mixed mfma with bf8/fp8 and fp6/bf6/f4

01a515e

[API change] Add support for scale

2a0a161

[NFC] Fix format

5d537a6

[API change] Refactor tensor and LDS layout

70fa502

- Support data types
- Support both 32 and 64 banks
- Still working on LDS accesses

[LDS layout] Add support for ds_write access pattern

84cab9d

Assumed a basic global access pattern

[LDS layout] Support access pattern for MN-contig without using

60342e1

mfma_transpose_load instructions
- Elements along the M/N dim are contiguous in both global memory and LDS. Note that this is not the in-thread transpose case.
- Swizzling is disabled

[LDS layout] Support access pattern for MN-contig with mfma_trans_loa…

d738a34

…d instructions

Clean up the code

e1b9179

[lds layout] support padding

1b331cb

zhanglx13 force-pushed the support_layout_gfx950 branch 2 times, most recently from 80d0c5f to 1b331cb Compare January 16, 2025 03:33

Reduce tex package required

4964d49

zhanglx13 requested a review from jtang10 January 24, 2025 03:39

jtang10 approved these changes Jan 24, 2025

zhanglx13 merged commit 7613c4d into main_perf Jan 24, 2025
4 of 5 checks passed
