
Refactor incremental CUDA PTX compiler #284

Closed · wants to merge 11 commits

Conversation

@SimeonEhrig (Contributor) commented Feb 5, 2019:

About

Two major modifications to cling are required to JIT CUDA code. On the one hand, the host side (x86 target) needs some modifications to understand CUDA host code, distinguish host from device code, and embed the device code (PTX target) into the host code. On the other hand, a second JIT compiler is needed to compile the device code alongside the host code.

Current implementation

Currently the PTX JIT compiler (for CUDA device code) is implemented in IncrementalCUDADeviceCompiler.h and IncrementalCUDADeviceCompiler.cpp. The basic idea is to use file I/O and external tools via llvm::sys::ExecuteAndWait. For example, if we enter N statements, we compile the PTX code with:

clang++ -std=c++xx -Ox -S cling0.cu cling1.cu ... clingN.cu -o cling.ptx --cuda-gpu-arch=sm_xx -pthread --cuda-device-only

When we add a new statement, we execute the following command:

clang++ -std=c++xx -Ox -S cling0.cu cling1.cu ... clingN.cu clingN+1.cu -o cling.ptx --cuda-gpu-arch=sm_xx -pthread --cuda-device-only
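
A rough sketch of how this shelling-out works (file names and flags are illustrative, and the ExecuteAndWait signature varies between LLVM versions):

#include <string>
#include <vector>
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Program.h"

// Recompile *all* statements entered so far into one PTX file by invoking
// the external clang++ and blocking until it finishes.
bool compileToPTX(const std::vector<std::string> &cuFiles /* cling0.cu .. clingN.cu */) {
  std::vector<llvm::StringRef> Args = {
      "clang++", "-std=c++14", "-O1", "-S",
      "--cuda-gpu-arch=sm_35", "-pthread", "--cuda-device-only"};
  for (const std::string &File : cuFiles)
    Args.push_back(File);
  Args.push_back("-o");
  Args.push_back("cling.ptx");
  // Returns the child's exit code; 0 means the PTX was written successfully.
  return llvm::sys::ExecuteAndWait("/usr/bin/clang++", Args) == 0;
}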

There is also a necessary step to translate the PTX code into fatbin code. At the moment this step is not refactored, because it requires proprietary software for which no API is provided.

After the fatbin code has been compiled, it is read back and embedded by cling.

Problems

The current implementation has a lot of problems.

  1. Generating valid CUDA C++ code: The PTX compiler cannot directly compile the source code entered by the user, because cling accepts syntax that is invalid at global scope, e.g. function calls in global space (see the example after this list). The handling of variables has a similar problem. Cling itself transforms the code internally into valid C++ code. The transformation is performed on the AST. At the moment we use the ASTPrinter to generate C++ code from the AST. However, the ASTPrinter is designed for printing diagnostics, not for generating valid C++ code, so the printer has many individual bugs.
  2. Performance: At the moment there is no way to save any progress of the translation, so for each input we have to compile the entire code plus the new statement. The compile time therefore increases constantly.
  3. Handling cling passes: Cling performs some internal transformations to provide extra functionality, e.g. the NullDerefProtectionTransformer, which protects the cling runtime from crashes caused by nullptr dereferences. Some of these transformations cause problems on the device side and are not necessary there.
  4. File I/O: File I/O is slow, especially on HPC systems.
  5. Dependencies on external tools: Sometimes, especially on HPC systems, it is not easy to find all the necessary tools. Integrating the tools at build time avoids some problems at runtime.
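
For illustration of problem 1, input like the following is accepted at the cling prompt even though it is not valid C++ at global scope, so cling must rewrite it before compilation (illustrative example, not taken from this PR):

[cling]$ printf("hello\n");  // a function call, legal only inside a function body
[cling]$ int i = 5;          // a declaration that later inputs must still see
[cling]$ i + 1               // an expression whose value the prompt prints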

Idea

The main idea is to use an integrated incremental compiler instance that can process the invalid source code from the prompt and generate PTX code incrementally, without having to compile the entire TU every time. I have already implemented a prototype of this idea. The prototype uses some code from cling via libcling.so and some code from the llvm and clang libraries.

Prototype

The prototype is divided into three parts. The first part is a simple prompt, similar to the cling prompt, for testing. The second part is a modified version of the cling interpreter instance, which I use as the frontend; details follow below, and the changes are visible in this pull request. The last part is a PTX backend, which I implemented with the unmodified llvm library.

Frontend

Three modifications were necessary for the frontend.

  1. Extending InvocationOptions.h to distinguish between the modes CUDAHost and CUDADevice.
  • CUDAHost: normal cling functionality, e.g. JIT and execute code, with some extensions for CUDA host functionality
  • CUDADevice: incremental compilation of CUDA device code to PTX without code execution
  2. Deactivating some ASTTransformers that are not needed.
  3. Adding a new ASTTransformer that adds the attribute inline to each __device__ kernel (see this commit; a sketch follows after the footnote). This is necessary because clang's PTX code generation is not designed to handle many TUs. For example, if we have a __device__ kernel [2] in one TU and a __global__ kernel that uses it in another, the PTX file generated for the __device__ kernel contains its definition, while the PTX file generated for the __global__ kernel contains only an external declaration of the __device__ function. But external declarations are forbidden in PTX code, so I needed a way to copy the __device__ kernel definition into the __global__ kernel's PTX file.

[2] __device__ kernels are only visible to __global__ and other __device__ kernels and cannot be called from the host.
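
A minimal sketch of such a transformer (this is not the PR's DeviceKernelInliner verbatim; the class name and traversal details are assumptions):

#include "clang/AST/Attr.h"
#include "clang/AST/Decl.h"
#include "clang/AST/RecursiveASTVisitor.h"

// Mark every __device__ function as inline so that its definition is emitted
// into every PTX translation unit that uses it, instead of an external
// declaration that PTX cannot resolve.
class DeviceKernelInlinerSketch
    : public clang::RecursiveASTVisitor<DeviceKernelInlinerSketch> {
public:
  bool VisitFunctionDecl(clang::FunctionDecl *FD) {
    // __device__ functions carry CUDADeviceAttr; __global__ kernels carry
    // CUDAGlobalAttr and must stay out-of-line entry points.
    if (FD->hasAttr<clang::CUDADeviceAttr>() &&
        !FD->hasAttr<clang::CUDAGlobalAttr>() && !FD->isInlineSpecified())
      FD->setInlineSpecified(true);
    return true; // continue traversing the AST
  }
};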

In my prototype I use the functionality of the interpreter instance up to the generation of the LLVM IR (llvm::Module). Then I take the llvm::Module and pass it to an extra backend.

Backend

The backend is based on some llvm tutorials and is really generic. It takes an llvm::Module and translates it to PTX. It works really well, with one exception that I solved with the ASTTransformer. I have already tested the generated PTX code of some basic functions with a hacked cling.
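
In essence, the backend boils down to something like the following sketch (the triple and CPU are assumed example values; the exact addPassesToEmitFile signature, the CodeGenFileType enum name, and the TargetRegistry header location differ between LLVM versions):

#include <memory>
#include <string>
#include "llvm/ADT/SmallString.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"
#include "llvm/MC/TargetRegistry.h" // llvm/Support/TargetRegistry.h in older LLVM
#include "llvm/Support/TargetSelect.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Target/TargetMachine.h"

// Lower an llvm::Module to PTX text entirely in memory via the NVPTX backend.
std::string emitPTX(llvm::Module &M) {
  // Requires an LLVM build that includes the NVPTX target.
  LLVMInitializeNVPTXTargetInfo();
  LLVMInitializeNVPTXTarget();
  LLVMInitializeNVPTXTargetMC();
  LLVMInitializeNVPTXAsmPrinter();

  std::string Error;
  const std::string Triple = "nvptx64-nvidia-cuda";
  const llvm::Target *T = llvm::TargetRegistry::lookupTarget(Triple, Error);

  std::unique_ptr<llvm::TargetMachine> TM(T->createTargetMachine(
      Triple, /*CPU=*/"sm_35", /*Features=*/"", llvm::TargetOptions(),
      llvm::Reloc::PIC_));
  M.setTargetTriple(Triple);
  M.setDataLayout(TM->createDataLayout());

  // PTX is an assembly format, so we request assembly output into an
  // in-memory buffer instead of a file.
  llvm::SmallString<4096> PTX;
  llvm::raw_svector_ostream OS(PTX);
  llvm::legacy::PassManager PM;
  TM->addPassesToEmitFile(PM, OS, /*DwoOut=*/nullptr, llvm::CGFT_AssemblyFile);
  PM.run(M);
  return std::string(PTX.str());
}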

Questions

  1. PTX modification of the interpreter instance: With the generation of PTX code, there are three possible modes for the interpreter instance (C++, CUDA C++ host and PTX). Perhaps it makes sense to use an inheritance hierarchy of the class Interpreter for clean code. On the other hand, I am not sure whether the class Interpreter has been designed for inheritance; it might require some refactoring work and make future development more difficult. Do you have any good reasons or ideas as to what I should do?
    Current solution: cling::Interpreter has a pointer to a second cling::Interpreter instance. Some if (CudaDevice) statements within the class provide the correct functionality (a minimal sketch follows after this list).
  2. Inline ASTTransformer: I think I can replace my custom function pass with a solution from llvm d713ae0. I still have to check it out. Maybe there is another, better solution to link the __device__ kernel of one TU to a __global__ kernel of another TU. Do you have any ideas? I have already checked the llvm module linker, but I don't think its functionality fits my problem.
    Current solution: Use a self-written ASTTransformer.
  3. NVPTX backend without file I/O: The NVPTX backend uses the function llvm::MemoryBuffer::getFileOrSTDIN() to load the fatbin code from a file (see the source code). The path is set by cling. To get a full compiler stack without file I/O, we must avoid this step. Maybe we can use the virtual file system of clang or llvm, or we could try something with STDIN, but I think that could cause a lot of problems. Otherwise, we have to change the clang source code so that it reads the code directly from a buffer.
    Current solution: The clang code has been modified, because llvm::MemoryBuffer::getFileOrSTDIN() does not support a virtual file system.
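
A conceptual sketch of the current solution to question 1 (member names are assumptions, not the PR's exact code):

#include <memory>
#include <string>

// The host cling::Interpreter owns a second instance configured for the
// CUDA device (PTX) target; if (CudaDevice)-style guards prevent recursion.
class Interpreter {
  std::unique_ptr<Interpreter> m_CUDACompiler; // device JIT, host mode only
  bool m_IsCUDADevice = false;                 // true inside the device JIT

public:
  bool process(const std::string &input) {
    // Only the host instance forwards input to the device instance, which
    // compiles to PTX/fatbin but never executes anything.
    if (!m_IsCUDADevice && m_CUDACompiler)
      m_CUDACompiler->process(input);
    // ... host-side wrapping, code generation and execution follow here ...
    return true;
  }
};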

Tasks before merge

These tasks must be completed so that the pull request is ready for merging.

  • Load fatbin code in NVPTX backend without file I/O (see Question 3.)
  • Add functionality to cling::utils::getWrapPoint() to handle the CUDA function attributes __global__, __device__ and __host__
  • fix bug with CUDA __constant__ memory (moved to the optional tasks below)
  • fix bug with xeus-cling I/O

Optional task

I moved the __constant__ memory task to the optional tasks, because the solution is more complicated than I thought. __constant__ is a frequently used feature, but it is not required for the basic functionality of the CUDA extension.

  • fix bug with CUDA __constant__ memory

Problem description: SimeonEhrig#3
Possible solution 1: Use an ASTTransformer to remove the __constant__ attribute from local variables. This is not easy, because attributes are not stored in the AST; they are stored in a side data structure (clang::Declarator). To check whether a statement has an attribute, we must perform the semantic analysis, which also checks properties of attributes and thereby causes the __constant__-local-error. The built-in functions of clang for detecting and changing attributes are therefore probably not usable, which means we would have to implement our own solution for detecting and changing attributes.
Possible solution 2: Change the behavior of the ASTTransformer DeclExtractor. At the moment the DeclExtractor copies the declaration from the wrapper function to the global space and modifies the local declaration a bit. If we could delete the local declaration completely, the error would not occur. But I don't know whether the local declaration is necessary for any functionality.
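
To illustrate what the DeclExtractor does (a simplified, conceptual view, not cling's actual generated code):

// Prompt input: __constant__ float c[4];
// cling first wraps the input into a unique function body:
void __cling_wrapper() {
  __constant__ float c[4]; // local declaration: __constant__ is invalid here
}
// The DeclExtractor then copies the declaration to the global scope, where
// __constant__ would be legal -- but the (modified) local copy remains,
// which triggers the __constant__-local-error in the semantic analysis.
__constant__ float c[4];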

Overview of the most important changes

  • modified cling::Interpreter
    • can be used as an NVPTX JIT
    • has a pointer to a second cling::Interpreter instance -> the host JIT owns a device JIT
    • to avoid recursion and some errors in the device JIT, there are some if (CudaDevice) statements in the functions
  • a new ASTTransformer that inlines every CUDA __device__ kernel
    • required because PTX is a plain assembly format without linking functionality
    • needed, for example, if we define a __device__ kernel in transaction A and want to use it in a __global__ kernel of transaction B -> then we have to copy the whole definition of the __device__ kernel into the PTX file of transaction B
  • PTX and fatbinary are generated entirely in memory
    • for PTX, see the modifications to cling::Interpreter
    • for the fatbinary code we need a modification to the clang code, because clang's CodeGen reads the code from a file via a function that does not support a virtual file system
  • the CPT tool builds the NVPTX backend beside the host target
    • avoids a lot of #ifdef guards
    • the build is completely target independent
    • if the functionality is not used at runtime, it is also completely target independent
  • add the detection of the CUDA attributes __global__, __device__ and __host__ to SourceNormalization (see the sketch after this list)
    • a more general, intelligent approach would be better
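
A minimal sketch of the kind of check this adds (the helper name is an assumption; the real SourceNormalization logic is more involved):

#include "llvm/ADT/StringRef.h"

// If a prompt input begins with a CUDA function attribute, treat it as a
// declaration and do not wrap it into a function body.
static bool startsWithCUDAAttribute(llvm::StringRef Input) {
  Input = Input.ltrim();
  return Input.startswith("__global__") || Input.startswith("__device__") ||
         Input.startswith("__host__"); // starts_with() in newer LLVM
}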

@SimeonEhrig (Contributor Author) commented:

/cc @Axel-Naumann @vgvassilev @ax3l

(Resolved review threads on lib/Interpreter/IncrementalParser.cpp, test/Interfaces/invocationFlags.C and lib/Interpreter/DeviceKernelInliner.h.)
cuFile << initialCUDADeviceCode;
cuFile.close();
// initialize NVPTX backend
LLVMInitializeNVPTXTargetInfo();
@ax3l (Collaborator) commented Feb 12, 2019:

CI says:

error: ‘LLVMInitializeNVPTXTargetInfo’ was not declared in this scope
     LLVMInitializeNVPTXTargetInfo();
error: ‘LLVMInitializeNVPTXTarget’ was not declared in this scope
     LLVMInitializeNVPTXTarget();
error: ‘LLVMInitializeNVPTXTargetMC’ was not declared in this scope
     LLVMInitializeNVPTXTargetMC();
error: ‘LLVMInitializeNVPTXAsmPrinter’ was not declared in this scope
     LLVMInitializeNVPTXAsmPrinter();

missing includes or namespace prefix?

@SimeonEhrig (Contributor Author) replied:

There is a problem with the Travis CI configuration. Travis builds cling with the cmake argument -DLLVM_TARGETS_TO_BUILD=host. So only the x86 backend is built.

@ax3l (Collaborator) commented Feb 13, 2019:

But that's ok to do. Maybe you want to add #ifdef guards or CMake modifications for the build target/backend?
Also, maybe add a new matrix entry in travis that installs a supported CUDA version and builds the PTX backend as well?

@SimeonEhrig (Contributor Author) replied:

@Axel-Naumann Do you have a better solution? An #ifdef guard is just a workaround right now: it merely avoids building the tests with the new functionality, and that's not a good solution. I think enabling the NVPTX backend is the better solution. It should not affect the build on systems that don't provide a CUDA runtime.

@SimeonEhrig (Contributor Author) replied:

I solved the problem by adding the NVPTX backend to -DLLVM_TARGETS_TO_BUILD=host. It does not cause any problems on non-CUDA systems; it just increases the build size a little.
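
For reference, the configure flag then becomes (as it also appears in the cpt.py change further down):

cmake <srcdir> ... -DLLVM_TARGETS_TO_BUILD="host;NVPTX" ...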

@ax3l (Collaborator) commented Mar 24, 2019:

CC @Axel-Naumann @vgvassilev ping :) Please feel free to provide feedback to Simeon's new approach.

@SimeonEhrig force-pushed the refactorCudaPTXJIT branch from ed740c1 to 47ac02a on May 27, 2019
@SimeonEhrig force-pushed the refactorCudaPTXJIT branch 2 times, most recently from 2b96e83 to 9354f6f on June 6, 2019
@ax3l mentioned this pull request on Jun 10, 2019
@SimeonEhrig changed the title from "Refactor incremental CUDA PTX compiler" to "[WIP] Refactor incremental CUDA PTX compiler" on Jun 24, 2019
@SimeonEhrig (Contributor Author) commented:

NVPTX backend without file I/O: The NVPTX backend uses the function llvm::MemoryBuffer::getFileOrSTDIN() to load the fatbin code from a file (see the source code). The path is set by cling. To get a full compiler stack without file I/O, we must avoid this step. Maybe we can use the virtual file system of clang or llvm, or we could try something with STDIN, but I think that could cause a lot of problems. Otherwise, we have to change the clang source code so that it reads the code directly from a buffer.

@Axel-Naumann @vgvassilev
Do you have any idea how I can solve this? I have seen that you have already done something with virtual file systems in cling. There is also a post on the llvm-dev list.

@SimeonEhrig (Contributor Author) commented:

For commit 94075f9, a clang patch is needed:
0001-Add-buffer-for-CUDA-fatbinary-in-CGCUDANV-backend.patch.zip

@SimeonEhrig changed the title from "[WIP] Refactor incremental CUDA PTX compiler" to "Refactor incremental CUDA PTX compiler" on Jul 18, 2019
@ax3l (Collaborator) commented Oct 1, 2019:

@SimeonEhrig can you quickly solve the merge conflict, please?

@SimeonEhrig (Contributor Author) commented:

I have successfully rebased my branch. The tests pass. The CI doesn't have the clang patch, so it will fail.

Unfortunately, redefinition doesn't work with CUDA at the moment. I will try to solve this, because the feature is very nice.

@Axel-Naumann (Member) left a comment:

Awesome work, @SimeonEhrig!

Apologies it took me AGES to review. Some smaller comments for this PR, some suggestions for future ones - I'm sure you know best which comments belong in which bucket :-) Nothing severe / critical, but I'd appreciate it if you could let me know once you have addressed the comments you want to address for this PR, so I can hit "merge"!

(Resolved review threads on lib/Interpreter/DeviceKernelInliner.cpp, lib/Interpreter/DeviceKernelInliner.h, lib/Interpreter/IncrementalCUDADeviceCompiler.cpp, lib/Interpreter/Interpreter.cpp and test/CUDADeviceCode/CUDAInclude.C.)
@@ -494,7 +494,7 @@ def compile(arg, build_libcpp):

 build = Build()
 cmake_config_flags = (srcdir + ' -DLLVM_BUILD_TOOLS=Off -DCMAKE_BUILD_TYPE={0} -DCMAKE_INSTALL_PREFIX={1} '
-    .format(build.buildType, TMP_PREFIX) + ' -DLLVM_TARGETS_TO_BUILD=host ' +
+    .format(build.buildType, TMP_PREFIX) + ' -DLLVM_TARGETS_TO_BUILD="host;NVPTX" ' +
@Axel-Naumann (Member) commented:

I'd prefer doing that at the CMakeLists.txt level of cling, given that - if not using cpt.py - people must know to add NVPTX. Can cling add NVPTX to LLVM_TARGETS_TO_BUILD?

@SimeonEhrig (Contributor Author) replied:

Yeah, that's a problem. Unfortunately I cannot solve it in cling's CMakeLists.txt, because the backend has to be activated in the clang build too. I think the best way to activate the NVPTX backend automatically is to add some code to llvm's CMakeLists.txt; then it will also be enabled in the clang and cling projects. Unfortunately, that would be a commit that will not go upstream.

@Axel-Naumann (Member) commented:

Any idea why Travis is unhappy: https://travis-ci.org/root-project/cling/jobs/592618226#L2146 ?

@SimeonEhrig (Contributor Author) replied:

Yes, the Clang patch is missing in the CI.

See: #284 (comment)

@Axel-Naumann (Member) commented Oct 4, 2019:

Sorry, I missed that. I'm extremely hesitant to add clang patches that are not also upstreamed. Could you open a Differential with the clang folks, arguing your case? Or - preferred - think of a different solution (e.g. a virtual file) that does not need a clang patch?

@SimeonEhrig (Contributor Author) replied:

I know that clang patches are not the best solution, but I haven't found another one. I've already checked whether I can use the llvm virtual file system; unfortunately, the getFileOrSTDIN function cannot use it. But maybe you have another idea when you see the differential:

diff --git a/include/clang/Frontend/CodeGenOptions.h b/include/clang/Frontend/CodeGenOptions.h
index 71730a2..9e2cf68 100644
--- a/include/clang/Frontend/CodeGenOptions.h
+++ b/include/clang/Frontend/CodeGenOptions.h
@@ -209,6 +209,10 @@ public:
   /// object file.
   std::vector<std::string> CudaGpuBinaryFileNames;
 
+  /// A buffer that contains the fatbinary code to forward to CUDA runtime
+  /// back-end for incorporating them into host-side object file.
+  std::shared_ptr<llvm::SmallVectorImpl<char>> CudaGpuBinaryBuffer;
+
   /// The name of the file to which the backend should save YAML optimization
   /// records.
   std::string OptRecordFile;
diff --git a/lib/CodeGen/CGCUDANV.cpp b/lib/CodeGen/CGCUDANV.cpp
index d24ef0a..66d6f63 100644
--- a/lib/CodeGen/CGCUDANV.cpp
+++ b/lib/CodeGen/CGCUDANV.cpp
@@ -254,7 +254,8 @@ llvm::Function *CGNVCUDARuntime::makeRegisterGlobalsFn() {
 /// \endcode
 llvm::Function *CGNVCUDARuntime::makeModuleCtorFunction() {
   // No need to generate ctors/dtors if there are no GPU binaries.
-  if (CGM.getCodeGenOpts().CudaGpuBinaryFileNames.empty())
+  if (!CGM.getCodeGenOpts().CudaGpuBinaryBuffer &&
+      CGM.getCodeGenOpts().CudaGpuBinaryFileNames.empty())
     return nullptr;
 
   // void __cuda_register_globals(void* handle);
@@ -281,16 +282,7 @@ llvm::Function *CGNVCUDARuntime::makeModuleCtorFunction() {
   // to be cleaned up in destructor on exit. Then associate all known kernels
   // with the GPU binary handle so CUDA runtime can figure out what to call on
   // the GPU side.
-  for (const std::string &GpuBinaryFileName :
-       CGM.getCodeGenOpts().CudaGpuBinaryFileNames) {
-    llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> GpuBinaryOrErr =
-        llvm::MemoryBuffer::getFileOrSTDIN(GpuBinaryFileName);
-    if (std::error_code EC = GpuBinaryOrErr.getError()) {
-      CGM.getDiags().Report(diag::err_cannot_open_file) << GpuBinaryFileName
-                                                        << EC.message();
-      continue;
-    }
-
+  auto buildFatbinarySection = [&](const llvm::StringRef FatbinCode) {
     const char *FatbinConstantName =
         CGM.getTriple().isMacOSX() ? "__NV_CUDA,__nv_fatbin" : ".nv_fatbin";
     // NVIDIA's cuobjdump looks for fatbins in this section.
@@ -305,14 +297,12 @@ llvm::Function *CGNVCUDARuntime::makeModuleCtorFunction() {
     // Fatbin version.
     Values.addInt(IntTy, 1);
     // Data.
-    Values.add(makeConstantString(GpuBinaryOrErr.get()->getBuffer(), 
-                                  "", FatbinConstantName, 8));
+    Values.add(makeConstantString(FatbinCode, "", FatbinConstantName, 8));
     // Unused in fatbin v1.
     Values.add(llvm::ConstantPointerNull::get(VoidPtrTy));
-    llvm::GlobalVariable *FatbinWrapper =
-      Values.finishAndCreateGlobal("__cuda_fatbin_wrapper",
-                                   CGM.getPointerAlign(),
-                                   /*constant*/ true);
+    llvm::GlobalVariable *FatbinWrapper = Values.finishAndCreateGlobal(
+        "__cuda_fatbin_wrapper", CGM.getPointerAlign(),
+        /*constant*/ true);
     FatbinWrapper->setSection(FatbinSectionName);
 
     // GpuBinaryHandle = __cudaRegisterFatBinary(&FatbinWrapper);
@@ -331,6 +321,27 @@ llvm::Function *CGNVCUDARuntime::makeModuleCtorFunction() {
 
     // Save GpuBinaryHandle so we can unregister it in destructor.
     GpuBinaryHandles.push_back(GpuBinaryHandle);
+  };
+
+  // If there is a valid buffer with fatbinary code, embed the buffer.
+  // Otherwise, embed fatbinary code from files.
+  if (CGM.getCodeGenOpts().CudaGpuBinaryBuffer) {
+    const llvm::StringRef GpuBinaryBuffer(
+        CGM.getCodeGenOpts().CudaGpuBinaryBuffer->data(),
+        CGM.getCodeGenOpts().CudaGpuBinaryBuffer->size());
+    buildFatbinarySection(GpuBinaryBuffer);
+  } else {
+    for (const std::string &GpuBinaryFileName :
+         CGM.getCodeGenOpts().CudaGpuBinaryFileNames) {
+      llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> GpuBinaryOrErr =
+          llvm::MemoryBuffer::getFileOrSTDIN(GpuBinaryFileName);
+      if (std::error_code EC = GpuBinaryOrErr.getError()) {
+        CGM.getDiags().Report(diag::err_cannot_open_file)
+            << GpuBinaryFileName << EC.message();
+        continue;
+      }
+      buildFatbinarySection(GpuBinaryOrErr.get()->getBuffer());
+    }
   }
 
   CtorBuilder.CreateRetVoid();
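
With this patch, cling could hand the in-memory fatbinary to CodeGen roughly like this (a sketch; only CudaGpuBinaryBuffer comes from the diff above, everything else is assumed):

#include <memory>
#include "clang/Frontend/CodeGenOptions.h" // patched as in the diff above
#include "llvm/ADT/SmallVector.h"

void passFatbinToCodeGen(clang::CodeGenOptions &Opts) {
  // Fatbinary bytes produced in memory by the device JIT (placeholder).
  auto Buffer = std::make_shared<llvm::SmallVector<char, 1024>>();
  // ... fill *Buffer with the fatbin code ...
  // CGCUDANV then embeds the buffer directly; no temporary file is needed.
  Opts.CudaGpuBinaryBuffer = Buffer;
}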

@SimeonEhrig (Contributor Author) commented:

I have another idea. Maybe we can replace the getFileOrSTDIN function with another function that supports the virtual file system and bring it into clang upstream. Then we can port it back to cling's clang repo. What do you think?
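
The direction would be something like the following sketch (an assumption about the eventual shape, not an agreed API; before LLVM 8 the VFS header lives in clang/Basic/VirtualFileSystem.h):

#include <memory>
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/MemoryBuffer.h"
#include "llvm/Support/VirtualFileSystem.h"

// Read the fatbinary through a virtual file system instead of
// llvm::MemoryBuffer::getFileOrSTDIN, so an InMemoryFileSystem registered
// under the path can serve the bytes without real file I/O.
llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>>
readFatbin(llvm::vfs::FileSystem &VFS, llvm::StringRef Path) {
  return VFS.getBufferForFile(Path);
}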

SimeonEhrig added a commit to SimeonEhrig/cling that referenced this pull request Oct 4, 2019
- add Author to CUDA test cases
- optimize DeviceKernelInliner
- improve some comments
@Axel-Naumann (Member) replied:

Maybe we can replace the getFileOrSTDIN function with another function

Perfect, yes!

@SimeonEhrig (Contributor Author) replied:

Maybe we can replace the getFileOrSTDIN function with another function

Perfect, yes!

And now the "problem" with this solution: backporting won't be easy, because our clang code base is really old. In the meantime, the NVPTX backend has been extended with functions for HIP [1], so I think we would have to backport a lot of files. I'm not sure; maybe we should stash this feature, because you wrote in #291 that you want to update the clang base at the end of the year, and with a new base the backport should be easy. The feature itself is not that important: it prevents the fatbin code from being written to a file and read back by the same process. The impact on performance is not noticeable; the main advantage is that it improves independence from the file system, which is very helpful especially on HPC.

[1] HIP is the AMD version of CUDA. Unlike CUDA, it can address AMD and NVIDIA GPUs; if the target is an NVIDIA GPU, it works like a wrapper framework. Therefore it made sense to integrate the HIP part for NVIDIA GPUs into the NVPTX backend. Often it is just an if-else that changes the function name from cudaSomething() to hipSomething().

@DavidPoliakoff commented:

@SimeonEhrig: @ax3l just linked me this from somewhere else. This is really excellent work. Hopefully in about a month or so I'll be able to play with Kokkos and this; if we have good CUDA support, it could be fun to start playing with rapid development of Kokkos, switching kernels between host and device and the like. Half commenting just to say well done, half to be notified when this merges :)

@ax3l (Collaborator) commented Oct 14, 2019:

Thank you for the kind words! Yes, one of our main motivations is tuning and playing with Alpaka as well :)

@SimeonEhrig (Contributor Author) replied:

Thank you so much. At the moment I am preparing a Jupyter notebook for a lecture with jitify and the redefinition feature. It's really nice to use, but just a "workaround". After this PR, one of the next features will be redefinition in CUDA mode. Then we can use it with the runtime API, Alpaka, Kokkos, ... :-D

SimeonEhrig added a commit to SimeonEhrig/cling that referenced this pull request Oct 24, 2019
- add Author to CUDA test cases
- optimize DeviceKernelInliner
- improve some comments
- remove deprecated opt level variables
- change interface of IncrementalCUDADeviceCompiler::process() and IncrementalCUDADeviceCompiler::declare()
@SimeonEhrig (Contributor Author) commented:

@Axel-Naumann I have removed the part of the code that requires changes to the clang base. Now the tests pass (except for some timeouts). There are only two comments left, which I can't solve without your feedback.

…ices

- change CUDA to CUDAHost and add CUDADevice to the InvocationOptions
- in the PTX mode, some ASTTransformers that are useful for the x86 mode will not be used
This ASTTransformer adds an inline attribute to any CUDA __device__ kernel
that does not have the attribute. Inlining solves a problem caused by
incremental compilation of PTX code. In a normal compiler, all definitions
of __global__ and __device__ kernels are in the same translation unit. In
the incremental compiler, each kernel has its own translation unit. If a
__global__ kernel uses a __device__ function, this design causes an error:
instead of generating the PTX code of the __device__ kernel in the same file
as the __global__ kernel, there is only an external declaration of the
__device__ function. However, normal PTX code does not support external
declarations of functions.

The transformer only works if the target device is nvptx.
Replaced the old version of the PTX compiler, which used external tools and llvm::sys::ExecuteAndWait, with an internal implementation. The new incremental PTX compiler uses a modified version of the cling::Interpreter instance. This instance can process the PTX built-ins and generates LLVM IR. The LLVM IR is compiled to PTX via an additional NVPTX backend implemented in the IncrementalCUDADeviceCompiler.

The new implementation has many advantages:
- it's much faster than the old version
- less error-prone, because the ASTPrinter and some unnecessary cling transformations are avoided
- fewer problems in searching for external tools (which can be very complicated on HPC systems)

The IncrementalCUDADeviceCompiler is moved from the cling::IncrementalParser to the cling::Interpreter, because the second interpreter needs the input without wrappers or transformations.
- instead of using the NVIDIA tool fatbin, the fatbinary is now generated directly in cling
- clean up the IncrementalCUDADeviceCompiler class according to the new fatbin implementation (e.g. remove findToolChain())
- one last I/O operation is still required -> write the fatbin code to a file for the CodeGen
- it is more similar to the interface of cling::Interpreter
- replace function compileDeviceCode() with process()
- add declare() and parse() functions
- the functions take only the input as argument, because the remaining arguments (e.g. Transaction) would require modifications to the transaction system
- it also fixes a bug in the I/O system of the xeus-cling kernel
- a really weak solution, which should be replaced by a generic one
- add Author to CUDA test cases
- optimize DeviceKernelInliner
- improve some comments
- remove deprecated opt level variables
- change interface of IncrementalCUDADeviceCompiler::process() and IncrementalCUDADeviceCompiler::declare()
@Axel-Naumann (Member) commented:

I have applied this to ROOT (we sync cling from ROOT). The PR root-project/root#4616 will also run ROOT's test suite. I do not expect any issues there. I am closing this as all cling-related issues seem to be addressed; the rest happens in ROOT's PR!

Open questions:

  • PTX-string as multiple of 8 bytes seems to contradict the code
  • enable / disable nvptx backend

FonsRademakers pushed a commit that referenced this pull request Nov 7, 2019
- add Author to CUDA test cases
- optimize DeviceKernelInliner
- improve some comments
- remove deprecated opt level variables
- change interface of IncrementalCUDADeviceCompiler::process() and IncrementalCUDADeviceCompiler::declare()