[compiler-rt] Initial support for builtins on GPU targets #95304
Conversation
# FIXME: This doesn't work with no CUDA installation.
set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} -march=sm_75")
  endif()
endif()
Can this be extracted somehow? I suspect that we need this in base-config-ix.cmake as well.
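A rough sketch of the kind of extraction being suggested, with the flag logic pulled into a helper that both builtin-config-ix.cmake and base-config-ix.cmake could include (the macro name and the guard condition are hypothetical, not from this patch):

```cmake
# Hypothetical shared helper; the patch currently keeps this logic inline.
# Pin an NVPTX -march so that compile checks have a concrete subtarget
# (per the FIXME above, the checks still expect a CUDA installation).
macro(append_nvptx_required_flags)
  if("${CMAKE_C_COMPILER_TARGET}" MATCHES "nvptx64")
    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} -march=sm_75")
  endif()
endmacro()
```

Each config file would then call the macro before running its compiler checks.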
What's the difference? I thought this was the one place that checked the default argument.
I think for a regular build (not cross-compiling directly) it will be fine, since it will just use clang, which should hopefully be supported by the user's system. This is mostly required for when the runtimes build invokes it with CMAKE_C_TARGET=nvptx64-nvidia-cuda.
Specifying CMAKE_C_TARGET doesn't imply cross-compilation though. I could use an x64 compiler, hosted on x64, and still specify the x64 target, and that would not be cross-compilation.
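As a hedged illustration using flags that appear later in this thread (the triple is just an example of an explicit same-as-host target), the configure below is still a native build:

```console
# Explicit target equal to the host: not cross-compilation.
# Abridged; a real standalone configure also needs the flags from the PR summary.
$ cmake ../compiler-rt -GNinja -DCMAKE_C_COMPILER=clang \
    -DCMAKE_C_COMPILER_TARGET=x86_64-unknown-linux-gnu \
    -DCOMPILER_RT_DEFAULT_TARGET_ONLY=ON
```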
True, but I'm just going off of the expected case where the compiler has an implicit --target= version. I'm assuming the base logic just checks whether it can target any of them? The only logic I saw for that was in base-config-ix.cmake, and it went off of the default target, so I'm unsure where else to modify.
It would be desirable to build the compiler-rt builtins for both the default host target and the amdgcn target in one build, with the amdgcn build being optional and controlled by a CMake variable. You could use an approach similar to https://github.com/llvm/llvm-project/pull/71978/files to achieve that.
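A hypothetical sketch of that kind of opt-in variable (the option and list names are illustrative, not taken from #71978 or this patch):

```cmake
# Off by default so ordinary host builds are unaffected.
option(COMPILER_RT_BUILD_AMDGCN_BUILTINS
       "Also build the builtins library for the amdgcn target" OFF)
if(COMPILER_RT_BUILD_AMDGCN_BUILTINS)
  # Hypothetical: add amdgcn alongside the default host architectures.
  list(APPEND BUILTIN_SUPPORTED_ARCH amdgcn)
endif()
```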
It should work with this configuration (though I just remembered I need to add some more flags to the other case). Ideally we'd just build these if the compiler can target them and not need an extra variable. What I did here was mostly testing (direct) targeting, however it should work in various cases. Ultimately what I want is to install these into the clang resource directory.
What's the expected way to target multiple architectures in a single build, by the way? The logic I see seems to just take the first value from the LLVM default.
By default, the CMake files only build for the LLVM default target. However, for HIP we want to also build the target for the device. Since normal users may not want that, we need an LLVM variable to enable it.
My current plan is to emit it as part of a GPU toolchain. Right now I do this with the following:

```
-DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES=compiler-rt;libc \
-DLLVM_RUNTIME_TARGETS=default;amdgcn-amd-amdhsa
```

However, this will require a few tweaks since the runtimes build handles the builtins separately. Going to work on a follow-up for that.

@compnerd is there a canonical way to build for multiple architectures at the same time? I.e. given a standard CMake invocation I'd like to be able to build output for x64, amdgcn, and nvptx64 if they are all supported by the compiler.
Invoke CMake multiple times :) CMake is designed to build for a single host at a time, so you would need to run a build per host.
Okay, that tracks with the runtimes option I'm planning on using. The case that @yxsamliu is describing can probably be done with the LLVM_RUNTIME_TARGETS option as well.
Is it possible to build the compiler-rt runtime using an LLVM built in a different build directory? This is to avoid building LLVM twice if we want to build compiler-rt for x86_64 and amdgcn.
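One hedged way to do that is a standalone compiler-rt configure that reuses the clang and CMake modules from an existing build tree (paths here are illustrative; the remaining flags mirror the invocation in the PR summary):

```console
# Abridged; see the full invocation in the PR summary for the remaining flags.
$ cmake ../compiler-rt -GNinja \
    -DCMAKE_C_COMPILER=/path/to/llvm-build/bin/clang \
    -DCMAKE_CXX_COMPILER=/path/to/llvm-build/bin/clang++ \
    -DLLVM_CMAKE_DIR=/path/to/llvm-project/cmake/Modules \
    -DCMAKE_C_COMPILER_TARGET=amdgcn-amd-amdhsa \
    -DCOMPILER_RT_DEFAULT_TARGET_ONLY=ON \
    -C ../compiler-rt/cmake/caches/GPU.cmake
```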
@compnerd Okay, I figured out my original problem. So, in #95234 the reason I stopped seeing the issue was that I moved to using
There are two ways to build compiler-rt:

1. As part of the monolithic LLVM build, enabled through LLVM_ENABLE_PROJECTS and compiled with the host toolchain.
2. Through the runtimes build, enabled through LLVM_ENABLE_RUNTIMES and compiled with the just-built clang.

I'd like to eventually fully deprecate (1) and make (2) the only way, since (1) is a maintenance nightmare: it doesn't use standard CMake logic, instead it invokes Clang directly.
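Roughly, assuming a standard llvm-project checkout (invocations abridged to the relevant variables):

```console
# (1) compiler-rt as a project of the main LLVM build, built with the host toolchain
$ cmake ../llvm -GNinja -DLLVM_ENABLE_PROJECTS="clang;compiler-rt"

# (2) compiler-rt through the runtimes build, built with the just-built clang
$ cmake ../llvm -GNinja -DLLVM_ENABLE_PROJECTS="clang" -DLLVM_ENABLE_RUNTIMES="compiler-rt"
```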
No, that patch is not valid - compiler-rt is a grab bag of libraries, we cannot design the build system for just the builtins. If you want to build just the builtins, you need to configure the build accordingly to avoid the other libraries.
@yxsamliu - if you are using the runtimes build, the just-built compiler can be used to build the runtimes for the hosts that you would like.
Okay, (2) is basically what I've been testing this against. I've run into some issues getting it to work in the runtimes case, however. This seems to stem from the fact that we have a separate builtins build there.
@petrhosek I'm trying to get it working through the runtimes build.
I removed the version check for now since I have no clue why it's failing and it's not strictly necessary. I updated some logic to make the default configuration work when targeted directly. The patch #95554 fixes the issues I had with enabling it. @yxsamliu with both of these patches, the following build should work.

```console
$ cmake ../llvm -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS=clang;lld \
    -DRUNTIMES_nvptx64-nvidia-cuda_LLVM_ENABLE_RUNTIMES=compiler-rt;libc \
    -DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES=compiler-rt;libc \
    -DLLVM_RUNTIME_TARGETS=default;amdgcn-amd-amdhsa;nvptx64-nvidia-cuda \
```

This will create the builtins for each of the listed targets.
I'd rather not bake in this assumption. We should be able to support .o and .bc builds.
It's true for now at least, so we get the
Nothing is stopping you from building and using the object today. We just aren't going to guarantee a stable ABI for it now.
No stable ABI and incorrect resource usage metadata. The other issue is we'd need to build multiple objects, one per GPU architecture.
Potentially we could put the subarchitecture in the triple (I think Arm and DXIL do this) and then build these builtins multiple times. Unfortunately this would take a stupidly long time to configure, since each subarchitecture would need its own CMake configuration.
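A hedged illustration of the multiplication this implies; the source file and gfx architectures below are placeholders, with one machine-code object (and realistically one configure) per subarchitecture:

```console
# One object per gfx arch; file and arch names are illustrative.
$ clang --target=amdgcn-amd-amdhsa -mcpu=gfx90a  -nogpulib -O2 -c addsf3.c -o addsf3.gfx90a.o
$ clang --target=amdgcn-amd-amdhsa -mcpu=gfx1030 -nogpulib -O2 -c addsf3.c -o addsf3.gfx1030.o
```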
Is the AMDGPU backend able to do ISA-level linking now?
AFAIK, you can do ISA-level linking in the sense that relocations work. However, it's "broken" for the following reasons.

1. Resource usage metadata (e.g. register counts) isn't propagated across the link, so callers just assume defaults for unknown callees.
2. LDS usage in the callee isn't accounted for.
3. There's no stable ABI between separately compiled code objects.

So, currently it "works" if you just so happen to have enough registers, don't use LDS, and expect the ABI not to change in the future.
Summary:
This patch adds initial support to build the `builtins` library for GPU targets. Primarily this requires adding a few new architectures for `amdgcn` and `nvptx64`. I built this using the following invocations.

```console
$ cmake ../compiler-rt -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \
    -DCMAKE_BUILD_TYPE=Release -GNinja \
    -DCMAKE_C_COMPILER_TARGET=<nvptx64-nvidia-cuda|amdgcn-amd-amdhsa> \
    -DCMAKE_CXX_COMPILER_TARGET=<nvptx64-nvidia-cuda|amdgcn-amd-amdhsa> \
    -DCMAKE_C_COMPILER_WORKS=1 -DCMAKE_CXX_COMPILER_WORKS=1 \
    -DLLVM_CMAKE_DIR=../cmake/Modules -DCOMPILER_RT_DEFAULT_TARGET_ONLY=ON \
    -C ../compiler-rt/cmake/caches/GPU.cmake
```

Some pointers would be appreciated for how to test this using a standard (non-default target only) build.

GPU builds are somewhat finicky. We only expect this to be built with a sufficiently new clang, as it's the only compiler that supports the target and output we distribute. Distribution is done as LLVM-IR blobs. GPUs have little backward compatibility, so linking object files is difficult. However, this prevents us from calling these functions post-LTO as they will have been optimized out. Another issue is the CMake flag-querying functions: currently these fail on nvptx if you don't have CUDA installed, because they want to use the `ptxas` and `nvlink` binaries.

More work is necessary to build correctly for all targets and ship into the correct clang resource directory. Additionally we need to use the `libc` project's support for running unit tests.
There are three things here. One is the very complicated logic in the compiler-rt CMake for deciding what to build. Two is the long-standing machine-code linking puzzle for amdgpu. Three is that amdgpu really could do with a compiler-rt builtins library to factor parts of the compiler backend / intrinsics into.

We can postpone 2 by noting that builtins.a is an archive, and archives can contain either object files or bitcode. 1 is just really careful programming.

I'd like there to be a target builtins library for the amdgpu and nvptx targets, provided it serves a similar role to the one on the other targets, and doesn't import a load of GPU-offload-centric design decisions like unexpectedly wrapping everything in x64 ELF files or deciding archive files don't mean what they do everywhere else. At the lowest level, a GPU triple is just a bare-metal embedded target. Compiler-rt builtins is the lowest level; it can and should use that model.
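A hedged sketch of the "archives can hold bitcode" point (file names illustrative): compiling with -flto emits LLVM-IR objects, and llvm-ar packs them like any other archive.

```console
# -flto with -c produces a bitcode object; llvm-ar archives it as usual.
$ clang --target=amdgcn-amd-amdhsa -nogpulib -flto -O2 -c addsf3.c -o addsf3.o
$ llvm-ar crs libclang_rt.builtins-amdgcn.a addsf3.o
```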
Making the backend rely on it is difficult because of (2), but agreed.
The short-term plan here is to make this a bitcode library.
Yep, the compilation method here is pretty much just "I want to build C for the GPU as if it were a bare-metal target".
Not for compiler-rt; we can just assume it's version locked. The resource problem is also simpler for compiler-rt. We can more easily just define a fixed register budget, and then just assume there's no stack usage.
Yeah, and LDS likely isn't going to be an issue unless we start putting more things in here. Would that require some special handling like "if the callee is a compiler-rt function, assume some thread count"? I know for most cases the default for an unknown callee is generous enough to not cause issues.

Also, the issue with machine-code linking is that you'd need to somehow wrangle many different builds. This would only work as part of the build system if you put it into the triple (like the Arm architecture does), but even that would necessitate something like 20 separate builds, which would take ages. (Configuring just one of these adds an extra 15 seconds or so to my build.)

The issue with the IR / LTO route is that the library gets resolved / internalized before the backend runs, so we don't know if we actually needed some function from the static library. I'm wondering if it would be possible to do some kind of "late" LTO that runs after everything else. That way we don't need 50 different builds and we get ISA linking that's correct.
The compatibility window isn't that big; we should be able to start using the new generic targets for this too. But ultimately I don't see this as a big deal. Part of my compiler-rt interest is that it's a restricted support case where we can start with object linking.
I'm just wondering how this would work in practice. We'd need special logic for both the build system and the linker job in clang. And like I said, building it for even a few architectures is going to add at least 10 minutes to a trivial rebuild. The end goal is definitely to have this supported in the backend, since it should simplify a lot of stuff. But I think the initial version of this patch doesn't need to tackle the very wide problem of ISA linking on GPUs just yet.
Are there any issues with this patch overall? I think this is a good first step and we can investigate ABI linking on GPUs as we add features.
LGTM on the HIP side. We could use it like we use the device libraries since it is GPU-arch-neutral. I think it is good to have the bitcode lib built first and then consider adding object libs in a progressive way.
ping
@compnerd Is there anything that needs to be changed? I'd like to merge this before the LLVM 19 fork if possible.
Sure, this seems safe enough to merge as-is, I think. I wonder if we can push the flags into the cache file, but that isn't a requirement here - we can clean that up subsequently.
Summary:
This patch adds initial support to build the `builtins` library for GPU targets. Primarily this requires adding a few new architectures for `amdgcn` and `nvptx64`. I built this using the following invocations.

```console
$ cmake ../compiler-rt -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \
    -DCMAKE_BUILD_TYPE=Release -GNinja \
    -DCMAKE_C_COMPILER_TARGET=<nvptx64-nvidia-cuda|amdgcn-amd-amdhsa> \
    -DCMAKE_CXX_COMPILER_TARGET=<nvptx64-nvidia-cuda|amdgcn-amd-amdhsa> \
    -DCMAKE_C_COMPILER_WORKS=1 -DCMAKE_CXX_COMPILER_WORKS=1 \
    -DLLVM_CMAKE_DIR=../cmake/Modules -DCOMPILER_RT_DEFAULT_TARGET_ONLY=ON \
    -C ../compiler-rt/cmake/caches/GPU.cmake
```

Some pointers would be appreciated for how to test this using a standard (non-default target only) build.

GPU builds are somewhat finicky. We only expect this to be built with a sufficiently new clang, as it's the only compiler that supports the target and output we distribute. Distribution is done as LLVM-IR blobs for now. GPUs have little backward compatibility, so linking object files is left to a future patch.

More work is necessary to build correctly for all targets and ship into the correct clang resource directory. Additionally we need to use the `libc` project's support for running unit tests.