Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AMDGPU ukernels: Bazel build, separate bitcode files, c-embed archives. #19274

Merged
merged 2 commits into from
Nov 27, 2024

Conversation

bjacob
Copy link
Contributor

@bjacob bjacob commented Nov 22, 2024

  1. Implement Bazel, generate CMake from Bazel.
  2. Split .bc bitcode files, one .bc file <-> one ukernel function.
  3. Generate embedded-data archives.
  4. Update the compiler code to use the embedded-data archives.
  5. Simplify setAlwaysInline now that we are no longer dealing with HIP symbols.

@bjacob bjacob marked this pull request as ready for review November 22, 2024 21:04
Base automatically changed from users/bjacob/uk-split to main November 24, 2024 02:14
@bjacob bjacob force-pushed the users/bjacob/uk-bazel-embed branch from b0bbed8 to bd7f1af Compare November 25, 2024 16:39
gfx942_rocm Outdated Show resolved Hide resolved
build_tools/cmake/iree_bitcode_library.cmake Show resolved Hide resolved
build_tools/cmake/iree_bitcode_library.cmake Outdated Show resolved Hide resolved
Comment on lines +26 to +36
gpu_archs = [
"gfx90a",
"gfx942",
"gfx1030",
"gfx1100",
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many of these architectures are we going to need [built into the compiler]? How much binary size and compile time does each add?

If less than 10, that seems acceptable. Any more than that and I'd worry about maintenance, binary size, and compile time costs.

Some suggestions:

Copy link
Contributor Author

@bjacob bjacob Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current argmax ukernel here is more aggressive in supporting "all" architectures than what I envision adding going forward.

I mostly envision ukernels for iree_gpu.multi_mma , each painstakingly hand-written in amdgpu assembly (wrapped as inline-assembly into a C form). My whole motivation is to work around defects in the LLVM AMDGPU backend, and assembly is the only thing that doesn't ultimately depend on that. So, the ukernels that I will be adding will by construction be specific to one architecture, and expensive in engineering time.

This argmax ukernel is different: written in C and compiled for "all currently relevant architectures", what it works around is the difficulty of generating efficient code for this kind of kernel in the MLIR programming model, due to the very specific things that argmax needs to do (see wave reductions (wfred) and ballots in the C sources).

Even then, even for such argmax-style C ukernels, the number of architectures is still less than 10 in any foreseeable future -- since this is an optimization-only thing (argmax still works without ukernels, which is also why they are not enabled by default) --- we never need to support more architectures than we really care about performance on. At the moment, the number of architectures that we really care about performance on, equals 1 (that is: CDNA3)...

Would we want to add a CMake option to add other architectures to the list? (and possibly Bazel...)

That would make sense if the current argmax-style ukernels (which are C that can simply be recompiled to any target arch) were the generic case going forward (but see above explanation).

Can iree-compile generate these on-demand if a user passes a target chip that isn't in the list of packaged ukernels?

In a future where this genericity were really important (meaning, a future where such retargetable C ukernels are important and we have many target architectures) we could do that by actually shipping ukernels in source C form... and building clang C-compiling powers into iree-compile... just a theoretical possibility. See above explanation for why we are not headed in that direction.

The really important high level point here is that our plan A is still to root for LLVM AMDGPU improvements. So that before any such future happens... LLVM AMDGPU will be in a better shape and will remove the need for ukernels. And then for things like argmax where the compiler can't magically invent the right implementation with wave reductions and ballot, it'll be possible to at least implement that as a MLIR lowering, mirroring the current C code in that ukernel. (While for my use case with multi_mma, it really is just a compiler bug that it isn't currently generating the fastest code).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that helps a lot. I don't have any extra concerns with this approach for now then.

Can we get some of that text into a comment somewhere in the source and/or a markdown file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just added an allusion to that in addressing your review comment on BUILD.bazel. I expect that the code is going to start conveying intent better once (1) the next PR moves the code loading the bitcode to LowerToUkernels (see other reply) and (2) I start adding multi_mma ukernels that provide the better model for the generic case going forward. At that point we'll have fewer things requiring comment (because always best when the code speaks for itself) and more stable places to add comments on what still needs commenting.

Comment on lines +9 to +12
#include "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_gfx1030.h"
#include "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_gfx1100.h"
#include "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_gfx90a.h"
#include "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_gfx942.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could group these into a single header file, perhaps with the contents of that header file generated by the build system.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is by construction one header file per c_embed_data library. So the question becomes whether to group these 4 c_embed_data libraries into one. I have a specific reason not to, which will become apparent in the next PR: I want the c_embed_data library for a given target architecture to contain exactly the ukernels that are supported on that architecture, so that just by looking up the table of contents, LowerToUkernels will be able to tell whether a ukernel lowering should succeed or fail.

Comment on lines 187 to 205
static SmallVector<llvm::Expected<std::unique_ptr<llvm::Module>>>
loadUKernelBitcodeFiles(llvm::LLVMContext &context, StringRef targetGpuArch) {
const iree_file_toc_t *toc = nullptr;
int toc_size = 0;
if (targetGpuArch == "gfx90a") {
toc = iree_uk_amdgpu_gfx90a_create();
toc_size = iree_uk_amdgpu_gfx90a_size();
} else if (targetGpuArch == "gfx942") {
toc = iree_uk_amdgpu_gfx942_create();
toc_size = iree_uk_amdgpu_gfx942_size();
} else if (targetGpuArch == "gfx1030") {
toc = iree_uk_amdgpu_gfx1030_create();
toc_size = iree_uk_amdgpu_gfx1030_size();
} else if (targetGpuArch == "gfx1100") {
toc = iree_uk_amdgpu_gfx1100_create();
toc_size = iree_uk_amdgpu_gfx1100_size();
} else {
return {};
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment here - could have the build system generate this if the list is expected to grow or be dynamic.

Or have a registration mechanism / dependency injection, so we don't have 5 places to update whenever adding a new architecture.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See reply to previous comment. This code, currently in ROCMTargetUtils (ie linking-time) will move up to the earlier LowerToUkernels time where it will be key to how LowerToUkernels decides whether to succeed or fail. Separate arch-specific TOC's are going to be key.

@bjacob bjacob requested a review from ScottTodd November 25, 2024 18:20
Copy link
Member

@ScottTodd ScottTodd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. May want at least one other person familiar with AMDGPU codegen to review though.

Copy link
Member

@kuhar kuhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % nits

build_tools/bazel/iree_bitcode_library.bzl Outdated Show resolved Hide resolved
build_tools/bazel/iree_bitcode_library.bzl Outdated Show resolved Hide resolved
compiler/plugins/target/ROCM/ROCMTargetUtils.cpp Outdated Show resolved Hide resolved
compiler/plugins/target/ROCM/ROCMTargetUtils.cpp Outdated Show resolved Hide resolved
@bjacob bjacob force-pushed the users/bjacob/uk-bazel-embed branch from 43af469 to a483247 Compare November 27, 2024 19:16
@bjacob bjacob enabled auto-merge (squash) November 27, 2024 19:17
@bjacob bjacob merged commit 8272490 into main Nov 27, 2024
40 checks passed
@bjacob bjacob deleted the users/bjacob/uk-bazel-embed branch November 27, 2024 19:36
Groverkss pushed a commit to Groverkss/iree that referenced this pull request Dec 1, 2024
…s. (iree-org#19274)

1. Implement Bazel, generate CMake from Bazel.
2. Split .bc bitcode files, one .bc file <-> one ukernel function.
3. Generate embedded-data archives.
4. Update the compiler code to use the embedded-data archives.
5. Simplify setAlwaysInline now that we are no longer dealing with HIP
symbols.

---------

Signed-off-by: Benoit Jacob <[email protected]>
giacs-epic pushed a commit to giacs-epic/iree that referenced this pull request Dec 4, 2024
…s. (iree-org#19274)

1. Implement Bazel, generate CMake from Bazel.
2. Split .bc bitcode files, one .bc file <-> one ukernel function.
3. Generate embedded-data archives.
4. Update the compiler code to use the embedded-data archives.
5. Simplify setAlwaysInline now that we are no longer dealing with HIP
symbols.

---------

Signed-off-by: Benoit Jacob <[email protected]>
Signed-off-by: Giacomo Serafini <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants