Support gfx950 layouts #692

Merged
merged 18 commits into from
Jan 24, 2025

Conversation

zhanglx13

No description provided.

@zhanglx13 zhanglx13 force-pushed the support_layout_gfx950 branch 2 times, most recently from 7910a91 to fd5c641 Compare December 30, 2024 16:12
@zhanglx13 zhanglx13 force-pushed the support_layout_gfx950 branch 2 times, most recently from 7370be5 to 6bfb674 Compare January 8, 2025 20:33
@zhanglx13 zhanglx13 marked this pull request as ready for review January 8, 2025 20:34
API change:
- For blocked layout, use -tensorShape, which only takes two dims as dim0,dim1
- For dot layout, use -dotShape, which takes three dims as M,N,K
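The two shape options above could be parsed along the following lines. This is only a sketch: the flag names `-tensorShape` and `-dotShape` come from the commit message, but the comma-separated value format, the helper `parse_dims`, and `build_parser` are assumptions made for illustration and may not match what `plot_layout.py` actually does.

```python
import argparse


def parse_dims(text, count):
    """Parse a comma-separated dimension string such as "128,64".

    Hypothetical helper: the real tool may accept a different format.
    """
    dims = [int(part) for part in text.split(",")]
    if len(dims) != count or any(d <= 0 for d in dims):
        raise argparse.ArgumentTypeError(
            f"expected {count} positive comma-separated integers, got {text!r}"
        )
    return tuple(dims)


def build_parser():
    """Build a stand-in parser with one shape option per layout kind."""
    parser = argparse.ArgumentParser(description="hypothetical layout plotter CLI")
    # Blocked layout: exactly two dims, read as dim0,dim1.
    parser.add_argument("-tensorShape", type=lambda s: parse_dims(s, 2),
                        default=None, help="dim0,dim1 for a blocked layout")
    # Dot layout: exactly three dims, read as M,N,K.
    parser.add_argument("-dotShape", type=lambda s: parse_dims(s, 3),
                        default=None, help="M,N,K for a dot layout")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-dotShape", "32,128,64"])
    print(args.dotShape)
```

Keeping the two options separate means a wrong number of dims is rejected at parse time instead of being silently misread as a different layout.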
Separate each layout's code into its own file
- When kWidth is large, use a smaller elemSize horizontally to save
space
- Improve the labels, such as
  - change vec to kWidth for operands
  - change opA/opB to inA/inB and include operand dims
  - remove group dims in the operands so that they don't overlap with
  operand block dims
- Better alignment: dot op and mfma zoomed-in pics are bottom aligned
kGroup is defined as total elements per thread / kWidth for one mfma
instruction.
We need kGroup = 2 only for the newly added mfma_f32_16x16x128_f8f6f4
and mfma_f32_32x32x64_f8f6f4 with f8 input type on MI350.
The mfma instruction name is printed accordingly.
Mixed-precision mfma between 8-bit and 4- or 6-bit inputs is not
supported yet.
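The kGroup definition above can be checked with a little arithmetic. The sketch below assumes a 64-wide wavefront and an operand tile of `non_k_dim x k_dim` elements spread evenly over the wave; the function name `k_group` and the kWidth values in the example (16 for f8, 32 for f4, i.e. 128 bits per thread) are illustrative assumptions, not values taken from the PR.

```python
def k_group(non_k_dim, k_dim, k_width, wave_size=64):
    """Return kGroup = (elements per thread) / kWidth for one mfma operand.

    Assumes the operand tile (non_k_dim x k_dim elements) is divided
    evenly among wave_size threads.
    """
    total_elems = non_k_dim * k_dim
    if total_elems % wave_size != 0:
        raise ValueError("operand tile does not divide evenly across the wave")
    elems_per_thread = total_elems // wave_size
    if elems_per_thread % k_width != 0:
        raise ValueError("kWidth does not divide the per-thread element count")
    return elems_per_thread // k_width


if __name__ == "__main__":
    # mfma_f32_16x16x128_f8f6f4 with f8 inputs, assuming kWidth = 16
    print(k_group(16, 128, 16))
    # mfma_f32_32x32x64_f8f6f4 with f8 inputs, assuming kWidth = 16
    print(k_group(32, 64, 16))
    # Same 16x16x128 shape with 4-bit inputs, assuming kWidth = 32
    print(k_group(16, 128, 32))
```

Under these assumptions both f8 cases hold 32 elements per thread, which gives kGroup = 2 with kWidth = 16, while the 4-bit case fits in a single group.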
- Support data types
- Support both 32 and 64 banks
- Still working on LDS accesses
- Fixed the issue with maxPhase computation. Need to submit a PR to
fix it in the triton compiler
- For ds_read_b64 with 64 banks, there are bank conflicts. We need to
figure out a different swizzling pattern to avoid bank conflicts.
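To see why wider reads are more exposed to bank conflicts, a rough conflict counter can help. The sketch below is a simplified model, not the tool's code: it assumes 4-byte-wide banks, maps dword `d` to bank `d % num_banks`, and treats all threads as issuing in one step. Real hardware may service a wide read such as ds_read_b64 in several passes over subsets of lanes, so the result should be read as a pessimistic estimate. The name `max_bank_conflict` is hypothetical.

```python
from collections import defaultdict

BANK_WIDTH_BYTES = 4  # assumption: each LDS bank is one 32-bit dword wide


def max_bank_conflict(byte_addrs, access_bytes, num_banks):
    """Return the largest number of distinct dwords that map to one bank.

    byte_addrs holds one starting byte address per thread. Threads that
    touch the same dword are counted once (treated as a broadcast).
    """
    dwords_per_bank = defaultdict(set)
    for addr in byte_addrs:
        first = addr // BANK_WIDTH_BYTES
        last = (addr + access_bytes - 1) // BANK_WIDTH_BYTES
        for dword in range(first, last + 1):
            dwords_per_bank[dword % num_banks].add(dword)
    return max((len(d) for d in dwords_per_bank.values()), default=0)


if __name__ == "__main__":
    # 64 threads, each reading 8 contiguous bytes (a b64-style access).
    addrs = [thread * 8 for thread in range(64)]
    print(max_bank_conflict(addrs, access_bytes=8, num_banks=64))
    print(max_bank_conflict(addrs, access_bytes=8, num_banks=32))
```

A swizzling pattern can be evaluated in this model by permuting the addresses before the call and checking whether the returned degree drops.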
Assumed a basic global access pattern
mfma_transpose_load instructions

- Elements along the M/N dim are contiguous in both global memory and
LDS. Note that this is not the in-thread transpose case.
- Swizzling is disabled
@zhanglx13 zhanglx13 force-pushed the support_layout_gfx950 branch 2 times, most recently from 80d0c5f to 1b331cb Compare January 16, 2025 03:33
@zhanglx13 zhanglx13 requested a review from jtang10 January 24, 2025 03:39
@jtang10

jtang10 commented Jan 24, 2025

I have only used it a bit to get a feel for whether anything is hard to use. The one thing I can comment on is that it might be worthwhile to break the Python file down into several pieces to make it more modular. The tex files are already split into several files, so it makes sense for the Python code to follow the same structure, for readability and future-proofing.

plot_layout.py itself is already quite modular, with the new dataclass and many small utility functions, so this should be fairly easy to do.

@jtang10 jtang10 left a comment

LGTM. The modularity and any feedback from usage can be addressed in the future.

@zhanglx13 zhanglx13 merged commit 7613c4d into main_perf Jan 24, 2025
4 of 5 checks passed
2 participants