add 1D TMA UBLKCP #3739
Conversation
PR Reviewer Guide 🔍 (review updated until commit 5365076). Here are some key observations to aid the review process:
!test
!test
csrc/index_compute.cpp
} else {
  ids_to_index.push_back(group->front()->as<IterDomain>());
  NVF_ERROR(true, "S2G not implemented yet.");
nit: NVF_THROW
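A minimal sketch of the suggested change (NVF_ERROR(true, ...) never fires because its condition holds, while NVF_THROW raises unconditionally; the surrounding lines are from the diff above):

} else {
  ids_to_index.push_back(group->front()->as<IterDomain>());
  NVF_THROW("S2G not implemented yet.");
}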
!build
Why add this?
There are two types of TMA ops:

The first version copies 1-D data, with lengths that can exceed 256 elements. It does not require a tensor map, which makes it better suited for non-matmul fusions, where 2-D tiling is unnecessary. In these scenarios each block typically loads n elements, with n ranging from 1K to 32K or more, and this version needs only a single TMA instruction.

The second version copies n-D data, with each dimension limited to a maximum size of 256. It requires the use of a tensor map; a hedged host-side sketch of that setup follows.
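For context, a sketch of the host-side work the n-D path needs and the 1-D path avoids. This uses the CUDA driver API (cuTensorMapEncodeTiled), not code from this PR, and the shapes are illustrative:

#include <cuda.h>

// n-D TMA (cp.async.bulk.tensor.*) requires a CUtensorMap built on the
// host and passed into the kernel; each box dimension is capped at 256.
CUtensorMap make2dTensorMap(void* gmem_ptr) {
  CUtensorMap map;
  cuuint64_t dims[2] = {1024, 1024};   // global tensor shape (illustrative)
  cuuint64_t strides[1] = {1024 * 4};  // row stride in bytes (rank - 1 entries)
  cuuint32_t box[2] = {256, 256};      // per-dimension box size, max 256
  cuuint32_t elem_strides[2] = {1, 1};
  cuTensorMapEncodeTiled(
      &map, CU_TENSOR_MAP_DATA_TYPE_FLOAT32, /*tensorRank=*/2, gmem_ptr,
      dims, strides, box, elem_strides,
      CU_TENSOR_MAP_INTERLEAVE_NONE, CU_TENSOR_MAP_SWIZZLE_NONE,
      CU_TENSOR_MAP_L2_PROMOTION_NONE, CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
  return map;
}
// 1-D TMA (cp.async.bulk) needs only a pointer and a byte count at the
// call site, so there is no descriptor to build or plumb through.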
Code changes
(1) Added a new load type, LoadStoreOpType::CpAsyncBulk, lowered to cp.async.bulk; the existing n-D TMA uses LoadStoreOpType::CpAsyncBulkTensorTile, lowered to cp.async.bulk.tensor.nd.
(2) Added a unit test loading 512 elements in one TMA instruction (a hedged scheduling sketch appears after this list).
(3) 1D TMA is lowered to:
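A hedged sketch of that lowering as inline PTX (the actual helper emitted in nvFuser's generated kernel may differ; per the PTX ISA, both addresses must be 16-byte aligned and the size a multiple of 16 bytes):

__device__ inline void cpAsyncBulk1dG2S(
    void* smem_dst, const void* gmem_src, uint32_t bytes, uint64_t* mbarrier) {
  // Convert generic pointers to 32-bit shared-memory addresses.
  uint32_t smem_addr =
      static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
  uint32_t mbar_addr =
      static_cast<uint32_t>(__cvta_generic_to_shared(mbarrier));
  // One instruction copies the whole 1-D chunk, completing on the mbarrier.
  asm volatile(
      "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes "
      "[%0], [%1], %2, [%3];"
      :
      : "r"(smem_addr), "l"(gmem_src), "r"(bytes), "r"(mbar_addr)
      : "memory");
}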
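And the scheduling sketch referenced in item (2): a hedged guess at how a unit test might stage a 512-element load through the new op type, using nvFuser's C++ test API (the PR's actual test may differ):

// Inside a test body with a Fusion `fusion` already created:
TensorView* tv0 = makeContigTensor(1);
fusion.addInput(tv0);
TensorView* tv1 = set(tv0);
fusion.addOutput(tv1);

// Stage the input through shared memory with the new 1-D TMA load type.
TensorView* tv0_smem = tv0->cacheAfter(LoadStoreOpType::CpAsyncBulk);
tv0_smem->setMemoryType(MemoryType::Shared);

// 512 elements per block; ParallelType::Bulk marks the axis copied by TMA,
// so the whole tile moves with a single cp.async.bulk instruction.
tv0_smem->split(0, 512);
tv0_smem->axis(0)->parallelize(ParallelType::BIDx);
tv0_smem->axis(1)->parallelize(ParallelType::Bulk);
tv1->split(0, 512);
tv1->axis(0)->parallelize(ParallelType::BIDx);
tv1->axis(1)->parallelize(ParallelType::TIDx);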