Refactor incremental CUDA PTX compiler #284
Conversation
cuFile << initialCUDADeviceCode;
cuFile.close();
// initialize NVPTX backend
LLVMInitializeNVPTXTargetInfo();
CI says:
error: ‘LLVMInitializeNVPTXTargetInfo’ was not declared in this scope
LLVMInitializeNVPTXTargetInfo();
error: ‘LLVMInitializeNVPTXTarget’ was not declared in this scope
LLVMInitializeNVPTXTarget();
error: ‘LLVMInitializeNVPTXTargetMC’ was not declared in this scope
LLVMInitializeNVPTXTargetMC();
error: ‘LLVMInitializeNVPTXAsmPrinter’ was not declared in this scope
LLVMInitializeNVPTXAsmPrinter();
missing includes or namespace prefix?
There is a problem with the Travis CI configuration. Travis builds cling with the cmake argument -DLLVM_TARGETS_TO_BUILD=host, so only the x86 backend is built.
But that's OK to do. Maybe you want to add #ifdef guards or CMake modifications for the build target/backend?
Also, maybe add a new matrix entry in Travis that installs a supported CUDA version and builds the PTX backend as well?
@Axel-Naumann Do you have a better solution? An #ifdef guard is just a workaround right now. It merely avoids building the tests with the new functionality, and that's not a good solution. I think enabling the NVPTX backend is the better solution. It should not affect the build on systems that don't provide a CUDA runtime.
I solved the problem by adding the NVPTX backend to -DLLVM_TARGETS_TO_BUILD=host. It does not cause any problems on non-CUDA systems; it just increases the build size a little.
CC @Axel-Naumann @vgvassilev ping :) Please feel free to provide feedback on Simeon's new approach.
@Axel-Naumann @vgvassilev
Commit 94075f9 needs a clang patch:
@SimeonEhrig can you quickly solve the merge conflict, please?
I have successfully rebased my branch. The tests pass. The CI doesn't have the clang patch, so it will fail. Unfortunately, redefinition with CUDA doesn't work at the moment. I will try to solve it, because the feature is very nice.
Awesome work, @SimeonEhrig!
Apologies it took me AGES to review. Some smaller comments for this PR, some suggestions for future ones - I'm sure you know best which comments belong into which bucket :-) Nothing severe / critical, but I'd appreciate if you could let me know once you see the comments addressed that you want to address for this PR so I can hit "merge"!
@@ -494,7 +494,7 @@ def compile(arg, build_libcpp):
     build = Build()
     cmake_config_flags = (srcdir + ' -DLLVM_BUILD_TOOLS=Off -DCMAKE_BUILD_TYPE={0} -DCMAKE_INSTALL_PREFIX={1} '
-                          .format(build.buildType, TMP_PREFIX) + ' -DLLVM_TARGETS_TO_BUILD=host ' +
+                          .format(build.buildType, TMP_PREFIX) + ' -DLLVM_TARGETS_TO_BUILD="host;NVPTX" ' +
I'd prefer doing that at the CMakeLists.txt level of cling, given that - if not using cpt.py - people must know to add NVPTX. Can cling add NVPTX to LLVM_TARGETS_TO_BUILD?
Yeah, that's a problem. Unfortunately I cannot solve the problem in the CMakeLists.txt of cling, because the backend has to be activated in the clang build too. I think the best way to activate the NVPTX backend automatically is to add some code to llvm's CMakeLists.txt. Then it will also be enabled in the clang and cling projects. Unfortunately, this would be a commit that will not go upstream.
Any idea why Travis is unhappy: https://travis-ci.org/root-project/cling/jobs/592618226#L2146 ?
Yes, the clang patch is missing in the CI. See: #284 (comment)
Sorry I missed that. I'm extremely hesitant to add clang patches that are not also upstreamed. Could you open a Differential with the clang folks, arguing your case? Or - preferred - think of a different solution (e.g. virtual file) that does not need a clang patch?
I know that clang patches are not the best solution, but I haven't found another one. I've already checked whether I can use the llvm virtual file system. Unfortunately, the relevant clang code path (llvm::MemoryBuffer::getFileOrSTDIN()) does not support a virtual file system.

diff --git a/include/clang/Frontend/CodeGenOptions.h b/include/clang/Frontend/CodeGenOptions.h
index 71730a2..9e2cf68 100644
--- a/include/clang/Frontend/CodeGenOptions.h
+++ b/include/clang/Frontend/CodeGenOptions.h
@@ -209,6 +209,10 @@ public:
/// object file.
std::vector<std::string> CudaGpuBinaryFileNames;
+ /// A buffer that contains the fatbinary code to forward to CUDA runtime
+ /// back-end for incorporating them into host-side object file.
+ std::shared_ptr<llvm::SmallVectorImpl<char>> CudaGpuBinaryBuffer;
+
/// The name of the file to which the backend should save YAML optimization
/// records.
std::string OptRecordFile;
diff --git a/lib/CodeGen/CGCUDANV.cpp b/lib/CodeGen/CGCUDANV.cpp
index d24ef0a..66d6f63 100644
--- a/lib/CodeGen/CGCUDANV.cpp
+++ b/lib/CodeGen/CGCUDANV.cpp
@@ -254,7 +254,8 @@ llvm::Function *CGNVCUDARuntime::makeRegisterGlobalsFn() {
/// \endcode
llvm::Function *CGNVCUDARuntime::makeModuleCtorFunction() {
// No need to generate ctors/dtors if there are no GPU binaries.
- if (CGM.getCodeGenOpts().CudaGpuBinaryFileNames.empty())
+ if (!CGM.getCodeGenOpts().CudaGpuBinaryBuffer &&
+ CGM.getCodeGenOpts().CudaGpuBinaryFileNames.empty())
return nullptr;
// void __cuda_register_globals(void* handle);
@@ -281,16 +282,7 @@ llvm::Function *CGNVCUDARuntime::makeModuleCtorFunction() {
// to be cleaned up in destructor on exit. Then associate all known kernels
// with the GPU binary handle so CUDA runtime can figure out what to call on
// the GPU side.
- for (const std::string &GpuBinaryFileName :
- CGM.getCodeGenOpts().CudaGpuBinaryFileNames) {
- llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> GpuBinaryOrErr =
- llvm::MemoryBuffer::getFileOrSTDIN(GpuBinaryFileName);
- if (std::error_code EC = GpuBinaryOrErr.getError()) {
- CGM.getDiags().Report(diag::err_cannot_open_file) << GpuBinaryFileName
- << EC.message();
- continue;
- }
-
+ auto buildFatbinarySection = [&](const llvm::StringRef FatbinCode) {
const char *FatbinConstantName =
CGM.getTriple().isMacOSX() ? "__NV_CUDA,__nv_fatbin" : ".nv_fatbin";
// NVIDIA's cuobjdump looks for fatbins in this section.
@@ -305,14 +297,12 @@ llvm::Function *CGNVCUDARuntime::makeModuleCtorFunction() {
// Fatbin version.
Values.addInt(IntTy, 1);
// Data.
- Values.add(makeConstantString(GpuBinaryOrErr.get()->getBuffer(),
- "", FatbinConstantName, 8));
+ Values.add(makeConstantString(FatbinCode, "", FatbinConstantName, 8));
// Unused in fatbin v1.
Values.add(llvm::ConstantPointerNull::get(VoidPtrTy));
- llvm::GlobalVariable *FatbinWrapper =
- Values.finishAndCreateGlobal("__cuda_fatbin_wrapper",
- CGM.getPointerAlign(),
- /*constant*/ true);
+ llvm::GlobalVariable *FatbinWrapper = Values.finishAndCreateGlobal(
+ "__cuda_fatbin_wrapper", CGM.getPointerAlign(),
+ /*constant*/ true);
FatbinWrapper->setSection(FatbinSectionName);
// GpuBinaryHandle = __cudaRegisterFatBinary(&FatbinWrapper);
@@ -331,6 +321,27 @@ llvm::Function *CGNVCUDARuntime::makeModuleCtorFunction() {
// Save GpuBinaryHandle so we can unregister it in destructor.
GpuBinaryHandles.push_back(GpuBinaryHandle);
+ };
+
+ // If there is a valid buffer with fatbinary code, embed the buffer.
+ // Otherwise, embed fatbinary code from files.
+ if (CGM.getCodeGenOpts().CudaGpuBinaryBuffer) {
+ const llvm::StringRef GpuBinaryBuffer(
+ CGM.getCodeGenOpts().CudaGpuBinaryBuffer->data(),
+ CGM.getCodeGenOpts().CudaGpuBinaryBuffer->size());
+ buildFatbinarySection(GpuBinaryBuffer);
+ } else {
+ for (const std::string &GpuBinaryFileName :
+ CGM.getCodeGenOpts().CudaGpuBinaryFileNames) {
+ llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> GpuBinaryOrErr =
+ llvm::MemoryBuffer::getFileOrSTDIN(GpuBinaryFileName);
+ if (std::error_code EC = GpuBinaryOrErr.getError()) {
+ CGM.getDiags().Report(diag::err_cannot_open_file)
+ << GpuBinaryFileName << EC.message();
+ continue;
+ }
+ buildFatbinarySection(GpuBinaryOrErr.get()->getBuffer());
+ }
}
CtorBuilder.CreateRetVoid();
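The control flow of the patch - prefer an in-memory fatbin buffer and fall back to reading the named files - can be sketched in plain C++. This is only an illustration of the selection logic; the struct and function names below are hypothetical stand-ins, not the clang API:

```cpp
#include <fstream>
#include <iterator>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for CodeGenOptions::CudaGpuBinaryBuffer and
// CodeGenOptions::CudaGpuBinaryFileNames from the patch above.
struct CodeGenOpts {
  std::optional<std::vector<char>> CudaGpuBinaryBuffer; // in-memory fatbin
  std::vector<std::string> CudaGpuBinaryFileNames;      // on-disk fallback
};

// Mirrors the patched makeModuleCtorFunction(): embed the buffer if it is
// set, otherwise read each named file, skipping files that cannot be opened
// (the real code emits a diagnostic instead of silently skipping).
std::vector<std::string> collectFatbins(const CodeGenOpts &opts) {
  std::vector<std::string> sections;
  if (opts.CudaGpuBinaryBuffer) {
    sections.emplace_back(opts.CudaGpuBinaryBuffer->begin(),
                          opts.CudaGpuBinaryBuffer->end());
  } else {
    for (const std::string &name : opts.CudaGpuBinaryFileNames) {
      std::ifstream in(name, std::ios::binary);
      if (!in)
        continue;
      sections.emplace_back(std::istreambuf_iterator<char>(in),
                            std::istreambuf_iterator<char>());
    }
  }
  return sections;
}
```

When the buffer is set, no file I/O happens at all, which is exactly the property the interpreter needs.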
I have another idea. Maybe we can replace the …
- add Author to CUDA test cases
- optimize DeviceKernelInliner
- improve some comments
Perfect, yes! |
And now the "problem" of this solution: backporting won't be easy, because the clang code base is really old. In the meantime, the nvptx backend has been extended with functions for HIP [1]. So I think we would have to backport a lot of files. I'm not sure; maybe we should stash this feature, because you wrote in #291 that you want to update the clang base at the end of the year. With a new base, the backport should be easy. The feature itself is not that important: it prevents the fatbin code from being written to a file and read back by the same process. The impact on performance is not noticeable. The main advantage of the feature is that it improves independence from the file system. Especially for HPC it is very helpful.

[1] HIP is the AMD version of CUDA. Unlike CUDA, it can address AMD and NVIDIA GPUs. If the target is an NVIDIA GPU, it works like a wrapper framework. Therefore, it made sense to integrate the HIP part for NVIDIA GPUs into the nvptx backend. Often it is just an if-else that changes the function name from
@SimeonEhrig: @ax3l just linked me this somewhere else. This is really excellent work. Hopefully in about a month or so I'll be able to play with Kokkos and this, if we have good CUDA support it could be fun to start playing with rapid development of Kokkos, switching kernels between host and device and the like. Half commenting just to say well done, half to be notified when this merges :)
Thank you for the kind words! Yes, one of our main motivations is tuning and playing with Alpaka as well :) |
Thank you so much. At the moment I am preparing a Jupyter notebook for a lecture with jitify and the redefinition feature. It's really nice to use, but just a "workaround". After this PR, one of the next features will be redefinition in CUDA mode. Then we can use it with the Runtime API, Alpaka, Kokkos, ... :-D
- add Author to CUDA test cases
- optimize DeviceKernelInliner
- improve some comments
- remove deprecated opt level variables
- change interface of IncrementalCUDADeviceCompiler::process() and IncrementalCUDADeviceCompiler::declare()
@Axel-Naumann I have removed the part of the code that requires changes to the clang base. Now the tests pass (except for some timeouts). There are only two comments left, which I can't solve without your feedback. |
…ices
- change CUDA to CUDAHost and add CUDADevice to the InvocationOptions
- in the PTX mode, some ASTTransformers will not be used, which are useful for the x86 mode
This ASTTransformer adds an inline attribute to any CUDA __device__ kernel that does not have the attribute. Inlining solves a problem caused by incremental compilation of PTX code. In a normal compiler, all definitions of __global__ and __device__ kernels are in the same translation unit. In the incremental compiler, each kernel has its own translation unit. In case a __global__ kernel uses a __device__ function, this design caused an error. Instead of generating the PTX code of the __device__ kernel in the same file as the __global__ kernel, there is only an external declaration of the __device__ function. However, normal PTX code does not support an external declaration of functions. The transformer only works if the target device is nvptx.
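The effect of the inlining can be mimicked in plain C++ (an analogy only - real __device__ and __global__ kernels need nvcc or clang's CUDA mode): a bare extern declaration leaves a symbol unresolved in the translation unit, while an inline definition travels with every TU that uses it, which is the property the transformer enforces for the emitted PTX:

```cpp
// Analogy: 'device_helper' plays the role of a __device__ function and
// 'global_kernel' the role of a __global__ kernel. Marking the helper
// 'inline' guarantees its definition is available in every translation
// unit that calls it, so no external declaration ends up in the output --
// the same guarantee DeviceKernelInliner provides for PTX, where external
// declarations are forbidden.
inline int device_helper(int x) { return 2 * x; }

int global_kernel(int x) { return device_helper(x) + 1; }
```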
Replaced the old version of the PTX compiler, which used external tools and llvm::ExecuteAndWait, with an internal implementation. The new incremental PTX compiler uses a modified version of the cling::Interpreter instance. The instance can process the PTX built-ins and generates LLVM IR. The LLVM IR is then compiled to PTX via an additional NVPTX backend implemented in the IncrementalCUDADeviceCompiler. The new implementation has many advantages:
- it's much faster than the old version
- less error-prone, because the ASTPrinter and some unnecessary cling transformations are avoided
- fewer problems in searching for external tools (which can be very complicated on HPC systems)
The IncrementalCUDADeviceCompiler is moved from the cling::IncrementalParser to the cling::Interpreter, because the second interpreter needs the input without wrappers or transformations.
- instead of using the NVIDIA tool fatbin, the fatbin is now generated directly in cling
- clean up the IncrementalCUDADeviceCompiler class according to the new fatbin implementation (e.g. remove findToolChain())
- one last I/O operation is still required: write the fatbin code to a file for the CodeGen
- it is more similar to the interface of cling::Interpreter
- replace function compileDeviceCode() with process()
- add declare() and parse() functions
- the functions take only the input argument, because the rest of the missing arguments (e.g. Transaction) requires modifications to the transaction system
- it also fixes a bug in the I/O system of the xeus-cling kernel
- a really weak solution, which should be replaced by a generic solution
I have applied this to ROOT (we sync cling from ROOT). The PR root-project/root#4616 will also run ROOT's test suite. I do not expect any issues there. I am closing this as all cling-related issues seem to be addressed; the rest happens in ROOT's PR! Open questions:
About
Two major modifications to cling are required to JIT CUDA code. On the one hand, the host side (x86 target) needs some modifications to understand the CUDA host code, distinguish host and device code, and embed the device code (PTX target) into the host code. On the other hand, a second JIT compiler is needed to compile the device code alongside the host code.
Current implementation
Currently the PTX jit compiler (CUDA device code) is implemented in IncrementalCUDADeviceCompiler.h and IncrementalCUDADeviceCompiler.cpp. The basic idea is to use file I/O and external tools via llvm::sys::ExecuteAndWait. For example, if we enter N statements, we compile the PTX code with an external command; each time we add a new statement, we execute the command again.
There is also a necessary step to translate the PTX code to fatbin code. At the moment this step is not refactored, because it requires proprietary software and no API is provided.
After compiling the fatbin code, it is read back and embedded by cling.
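A minimal sketch of that file-based round trip, with a trivial stand-in for the external compiler call (the real code shells out via llvm::sys::ExecuteAndWait; the file names here are hypothetical):

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Read a whole file back into a string.
static std::string readFile(const std::string &path) {
  std::ifstream in(path, std::ios::binary);
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}

// Stand-in for the old pipeline: cling wrote the accumulated CUDA source to
// a file, ran an external compiler on it, and read the resulting PTX/fatbin
// back from disk. Here the "compiler" step just tags the input so the round
// trip stays visible.
std::string compileViaFiles(const std::string &source) {
  const std::string srcPath = "cling_tmp.cu";   // hypothetical temp names
  const std::string outPath = "cling_tmp.ptx";
  std::ofstream(srcPath) << source;             // 1. write source to disk
  // 2. "external tool" run -- placeholder for llvm::sys::ExecuteAndWait
  std::ofstream(outPath) << "// PTX for: " + readFile(srcPath);
  return readFile(outPath);                     // 3. read the result back
}
```

Every statement entered at the prompt repeats all three steps, which is why the approach was slow and fragile.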
Problems
The current implementation has a lot of problems, for example:
- the NullDerefProtectionTransformer, which protects the cling runtime from crashes caused by nullptr dereferences. Some transformations cause problems on the device side and are not necessary.

Idea
The main idea is to use an integrated incremental compiler instance which can process the invalid source code from the prompt and generate PTX code incrementally, without having to compile the entire TU every time. I have already implemented a prototype which implements this idea. The prototype uses some code from cling via libcling.so and some code from the llvm and clang libraries.

Prototype
The prototype is divided into three parts. The first part is a simple prompt, similar to the cling prompt, for testing. The second part is a modified version of the cling interpreter instance, which I use as frontend; details follow below. The changes are visible in the pull request. The last part is a PTX backend which I implemented with an unmodified llvm library.
Frontend
Three modifications were necessary for the frontend:
- the invocation options CUDAHost and CUDADevice
- an ASTTransformer that adds inline to each __device__ kernel. See this commit. This is necessary because the PTX code generating functionality of clang is not designed to handle many TUs. For example, if we have a __device__ kernel [2] and a __global__ kernel, the PTX file generated for the __device__ kernel contains the definition of the __device__ kernel. The PTX file generated for the __global__ kernel contains the PTX code of the __global__ kernel and an external declaration of the __device__ function. But external declarations are forbidden in PTX code. So I needed a solution to copy the __device__ kernel definition into the __global__ kernel PTX file.

[2] __device__ kernels are only visible to __global__ and other __device__ kernels and cannot be called by the host.

In my prototype I use the functionality of the interpreter instance up to the generation of the LLVM IR code (llvm::Module). Then I take the llvm::Module and pass it to an extra backend.

Backend
The backend is based on some llvm tutorials and is really generic. It takes a llvm::Module and translates it to PTX. It works really well, with one exception, which I solved with the ASTTransformer. I have already tested the generated PTX code of some basic functions with a hacked cling.

Questions
1. On the one hand, I could inherit from Interpreter for clean code. On the other hand, I am not sure if the class Interpreter has been designed for inheritance. Maybe it would require some refactoring work and make future development more difficult. Do you have any good reasons or ideas as to what I should do?
Current solution: cling::Interpreter has a pointer to a second cling::Interpreter instance. Some if (CudaDevice) statements within the class provide the correct functionality.
2. I need a way to copy the definition of a __device__ kernel of one TU to a __global__ kernel of another TU. Do you have any ideas? I have already checked the llvm module linker, but I don't think the functionality is right for my problem.
Current solution: Use a self-written ASTTransformer.
3. At the moment, clang uses llvm::MemoryBuffer::getFileOrSTDIN() to load the fatbin code from a file (see source code). The path is set by cling. To get a full compiler stack without file I/O, we must avoid this step. Maybe we can use the virtual file system of clang or llvm. Or we try something with STDIN, but I think that could cause a lot of problems. Otherwise, we have to change the clang source code so that it reads the code directly from a buffer.
Current solution: The clang code has been modified, because llvm::MemoryBuffer::getFileOrSTDIN() does not support a virtual file system.

Tasks before merge
These tasks must be completed so that the pull request is ready for merging:
- modify cling::utils::getWrapPoint() to handle the CUDA function attributes __global__, __device__ and __host__
- fix bug with CUDA __constant__ memory

Optional task
I moved the __constant__ memory task to an optional task, because the solution is more complicated than I thought. __constant__ is a frequently used feature, but it is not required for the basic functionality of the CUDA extension.
- fix bug with CUDA __constant__ memory
Problem description: SimeonEhrig#3
Possible solution 1: Use an ASTTransformer to remove the __constant__ attribute from local variables. This is not easy, because attributes are not stored in the AST. Attributes are stored in a side data structure (clang::Declarator). To check whether a statement has an attribute, we must perform the semantic analysis, which also checks the properties of attributes that cause the __constant__-local-error. The built-in functions of clang for detecting and changing attributes are therefore not usable. This means that we have to implement our own solution for detecting and changing attributes.
Possible solution 2: Change the behavior of the ASTTransformer DeclExtractor. At the moment the DeclExtractor copies the declaration from the wrapper function to the global space and modifies the local declaration a bit. If we could delete the local declaration completely, the error would not occur. But I don't know if the local declaration is necessary for any functionality.

Overview of the most important changes
copies the declaration form the wrapper function to the global space and modify the local declaration a bit. If we could delete the local declaration completely, the error would not occur. But I don't know if the local declaration is necessary for any functionality.Overview of the most important changes
cling::Interpreter
cling::interpreter
instance -> host JIT has a device JITif(CudaDevice)
statements in the functions__device__
kernel__device__
kernel in transaction A and we want to use the__device__
kernel in a__global__
kernel of transaction B -> then we have to copy to whole definition of the__device__
kernel into the PTX file of transaction Bcling::Interpreter
Clang
code, because, the Clang codeGen read the code via a function from a file that does not support a virtual file system#ifdef
guards__global__
,__device__
and__host__
toSourceNormalization
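The "host JIT has a device JIT" design can be sketched as follows. The class and member names are hypothetical; the real cling::Interpreter is far more involved, but the ownership and forwarding structure is the point:

```cpp
#include <memory>
#include <string>
#include <vector>

// Sketch: the host interpreter owns a second interpreter for device code
// and forwards each input to it first, mirroring the if (CudaDevice)
// dispatch described above.
class MiniInterpreter {
public:
  explicit MiniInterpreter(bool cudaDevice = false)
      : m_CudaDevice(cudaDevice) {}

  // Host side only: create the embedded device interpreter.
  void enableCUDA() {
    m_DeviceInterp = std::make_unique<MiniInterpreter>(/*cudaDevice=*/true);
  }

  void process(const std::string &input) {
    if (m_DeviceInterp)               // host side: feed the device JIT first,
      m_DeviceInterp->process(input); // so PTX is ready when host code runs
    m_Inputs.push_back(input);
  }

  bool isDevice() const { return m_CudaDevice; }
  const std::vector<std::string> &inputs() const { return m_Inputs; }
  const MiniInterpreter *device() const { return m_DeviceInterp.get(); }

private:
  bool m_CudaDevice;
  std::unique_ptr<MiniInterpreter> m_DeviceInterp; // only set on the host
  std::vector<std::string> m_Inputs;
};
```

Keeping the device interpreter as a member rather than inheriting from Interpreter matches the "current solution" above: a pointer plus a few dispatch checks, without committing Interpreter to an inheritance-based design.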