-
Notifications
You must be signed in to change notification settings - Fork 641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AMDGPU ukernels: Bazel build, separate bitcode files, c-embed archives. #19274
Conversation
bjacob
commented
Nov 22, 2024
- Implement Bazel, generate CMake from Bazel.
- Split .bc bitcode files, one .bc file <-> one ukernel function.
- Generate embedded-data archives.
- Update the compiler code to use the embedded-data archives.
- Simplify setAlwaysInline now that we are no longer dealing with HIP symbols.
b0bbed8
to
bd7f1af
Compare
gpu_archs = [ | ||
"gfx90a", | ||
"gfx942", | ||
"gfx1030", | ||
"gfx1100", | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many of these architectures are we going to need [built into the compiler]? How much binary size and compile time does each add?
If less than 10, that seems acceptable. Any more than that and I'd worry about maintenance, binary size, and compile time costs.
Some suggestions:
- Could add a comment here with a link to a list of available architectures and some description of which are worth including here. Docs: https://iree.dev/guides/deployment-configurations/gpu-rocm/#compile-a-program and https://llvm.org/docs/AMDGPUUsage.html#processors could work.
- Would we want to add a CMake option to add other architectures to the list? (and possibly Bazel...)
- Can
iree-compile
generate these on-demand if a user passes a target chip that isn't in the list of packaged ukernels? (Or, if these are embedded in the runtime, can they be configured when building programs likeiree-run-module
?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current argmax
ukernel here is more aggressive in supporting "all" architectures than what I envision adding going forward.
I mostly envision ukernels for iree_gpu.multi_mma
, each painstakingly hand-written in amdgpu assembly (wrapped as inline-assembly into a C form). My whole motivation is to work around defects in the LLVM AMDGPU backend, and assembly is the only thing that doesn't ultimately depend on that. So, the ukernels that I will be adding will by construction be specific to one architecture, and expensive in engineering time.
This argmax ukernel is different: written in C and compiled for "all currently relevant architectures", what it works around is the difficulty of generating efficient code for this kind of kernel in the MLIR programming model, due to the very specific things that argmax needs to do (see wave reductions (wfred
) and ballots in the C sources).
Even then, even for such argmax
-style C ukernels, the number of architectures is still less than 10 in any foreseeable future -- since this is an optimization-only thing (argmax still works without ukernels, which is also why they are not enabled by default) --- we never need to support more architectures than we really care about performance on. At the moment, the number of architectures that we really care about performance on, equals 1 (that is: CDNA3)...
Would we want to add a CMake option to add other architectures to the list? (and possibly Bazel...)
That would make sense if the current argmax-style ukernels (which are C that can simply be recompiled to any target arch) were the generic case going forward (but see above explanation).
Can iree-compile generate these on-demand if a user passes a target chip that isn't in the list of packaged ukernels?
In a future where this genericity were really important (meaning, a future where such retargetable C ukernels are important and we have many target architectures) we could do that by actually shipping ukernels in source C form... and building clang
C-compiling powers into iree-compile
... just a theoretical possibility. See above explanation for why we are not headed in that direction.
The really important high level point here is that our plan A is still to root for LLVM AMDGPU improvements. So that before any such future happens... LLVM AMDGPU will be in a better shape and will remove the need for ukernels. And then for things like argmax where the compiler can't magically invent the right implementation with wave reductions and ballot, it'll be possible to at least implement that as a MLIR lowering, mirroring the current C code in that ukernel. (While for my use case with multi_mma
, it really is just a compiler bug that it isn't currently generating the fastest code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that helps a lot. I don't have any extra concerns with this approach for now then.
Can we get some of that text into a comment somewhere in the source and/or a markdown file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just added an allusion to that in addressing your review comment on BUILD.bazel. I expect that the code is going to start conveying intent better once (1) the next PR moves the code loading the bitcode to LowerToUkernels (see other reply) and (2) I start adding multi_mma ukernels that provide the better model for the generic case going forward. At that point we'll have fewer things requiring comment (because always best when the code speaks for itself) and more stable places to add comments on what still needs commenting.
#include "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_gfx1030.h" | ||
#include "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_gfx1100.h" | ||
#include "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_gfx90a.h" | ||
#include "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_gfx942.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could group these into a single header file, perhaps with the contents of that header file generated by the build system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is by construction one header file per c_embed_data
library. So the question becomes whether to group these 4 c_embed_data
libraries into one. I have a specific reason not to, which will become apparent in the next PR: I want the c_embed_data
library for a given target architecture to contain exactly the ukernels that are supported on that architecture, so that just by looking up the table of contents, LowerToUkernels will be able to tell whether a ukernel lowering should succeed or fail.
static SmallVector<llvm::Expected<std::unique_ptr<llvm::Module>>> | ||
loadUKernelBitcodeFiles(llvm::LLVMContext &context, StringRef targetGpuArch) { | ||
const iree_file_toc_t *toc = nullptr; | ||
int toc_size = 0; | ||
if (targetGpuArch == "gfx90a") { | ||
toc = iree_uk_amdgpu_gfx90a_create(); | ||
toc_size = iree_uk_amdgpu_gfx90a_size(); | ||
} else if (targetGpuArch == "gfx942") { | ||
toc = iree_uk_amdgpu_gfx942_create(); | ||
toc_size = iree_uk_amdgpu_gfx942_size(); | ||
} else if (targetGpuArch == "gfx1030") { | ||
toc = iree_uk_amdgpu_gfx1030_create(); | ||
toc_size = iree_uk_amdgpu_gfx1030_size(); | ||
} else if (targetGpuArch == "gfx1100") { | ||
toc = iree_uk_amdgpu_gfx1100_create(); | ||
toc_size = iree_uk_amdgpu_gfx1100_size(); | ||
} else { | ||
return {}; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment here - could have the build system generate this if the list is expected to grow or be dynamic.
Or have a registration mechanism / dependency injection, so we don't have 5 places to update whenever adding a new architecture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See reply to previous comment. This code, currently in ROCMTargetUtils (ie linking-time) will move up to the earlier LowerToUkernels time where it will be key to how LowerToUkernels decides whether to succeed or fail. Separate arch-specific TOC's are going to be key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. May want at least one other person familiar with AMDGPU codegen to review though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM % nits
Signed-off-by: Benoit Jacob <[email protected]>
Signed-off-by: Benoit Jacob <[email protected]>
43af469
to
a483247
Compare
…s. (iree-org#19274) 1. Implement Bazel, generate CMake from Bazel. 2. Split .bc bitcode files, one .bc file <-> one ukernel function. 3. Generate embedded-data archives. 4. Update the compiler code to use the embedded-data archives. 5. Simplify setAlwaysInline now that we are no longer dealing with HIP symbols. --------- Signed-off-by: Benoit Jacob <[email protected]>
…s. (iree-org#19274) 1. Implement Bazel, generate CMake from Bazel. 2. Split .bc bitcode files, one .bc file <-> one ukernel function. 3. Generate embedded-data archives. 4. Update the compiler code to use the embedded-data archives. 5. Simplify setAlwaysInline now that we are no longer dealing with HIP symbols. --------- Signed-off-by: Benoit Jacob <[email protected]> Signed-off-by: Giacomo Serafini <[email protected]>