Refactor incremental CUDA PTX compiler #284
Conversation
cuFile << initialCUDADeviceCode;
cuFile.close();
// initialize NVPTX backend
LLVMInitializeNVPTXTargetInfo();
CI says:
error: ‘LLVMInitializeNVPTXTargetInfo’ was not declared in this scope
LLVMInitializeNVPTXTargetInfo();
error: ‘LLVMInitializeNVPTXTarget’ was not declared in this scope
LLVMInitializeNVPTXTarget();
error: ‘LLVMInitializeNVPTXTargetMC’ was not declared in this scope
LLVMInitializeNVPTXTargetMC();
error: ‘LLVMInitializeNVPTXAsmPrinter’ was not declared in this scope
LLVMInitializeNVPTXAsmPrinter();
missing includes or namespace prefix?
There is a problem with the Travis CI configuration. Travis builds cling with the cmake argument -DLLVM_TARGETS_TO_BUILD=host, so only the x86 backend is built.
But that's OK to do. Maybe you want to add #ifdef guards or CMake modifications for the build target/backend?
Also, maybe add a new matrix entry in Travis that installs a supported CUDA version and builds the PTX backend as well?
@Axel-Naumann Do you have a better solution? An #ifdef guard is just a workaround right now. It merely avoids building the tests with the new functionality, and that's not a good solution. I think enabling the NVPTX backend is the better solution. It should not affect the build on systems that don't provide a CUDA runtime.
I solved the problem by adding the NVPTX backend to -DLLVM_TARGETS_TO_BUILD=host. It does not cause any problems on non-CUDA systems; it just increases the build size a little.
CC @Axel-Naumann @vgvassilev ping :) Please feel free to provide feedback on Simeon's new approach.
@Axel-Naumann @vgvassilev
Commit 94075f9 needs a clang patch:
@SimeonEhrig can you quickly solve the merge conflict, please?
I have successfully rebased my branch. The tests pass. The CI doesn't have the clang patch, so it will fail. Unfortunately, redefinition with CUDA doesn't work at the moment. I will try to solve it, because the feature is very nice.
Awesome work, @SimeonEhrig!
Apologies it took me AGES to review. Some smaller comments for this PR, some suggestions for future ones - I'm sure you know best which comments belong into which bucket :-) Nothing severe / critical, but I'd appreciate if you could let me know once you see the comments addressed that you want to address for this PR so I can hit "merge"!
@@ -494,7 +494,7 @@ def compile(arg, build_libcpp):
     build = Build()
     cmake_config_flags = (srcdir + ' -DLLVM_BUILD_TOOLS=Off -DCMAKE_BUILD_TYPE={0} -DCMAKE_INSTALL_PREFIX={1} '
-                          .format(build.buildType, TMP_PREFIX) + ' -DLLVM_TARGETS_TO_BUILD=host ' +
+                          .format(build.buildType, TMP_PREFIX) + ' -DLLVM_TARGETS_TO_BUILD="host;NVPTX" ' +
I'd prefer doing that at the CMakeLists.txt level of cling, given that - if not using cpt.py - people must know to add NVPTX. Can cling add NVPTX to LLVM_TARGETS_TO_BUILD?
Yeah, that's a problem. Unfortunately I cannot solve the problem in the CMakeLists.txt of cling, because the backend has to be activated in the clang build too. I think the best way to activate the NVPTX backend automatically is to add some code to llvm's CMakeLists.txt. Then it will also be enabled in the clang and cling projects. Unfortunately, this would be a commit that will not go upstream.
Any idea why Travis is unhappy: https://travis-ci.org/root-project/cling/jobs/592618226#L2146 ?
Yes, the clang patch is missing in the CI. See: #284 (comment)
Sorry I missed that. I'm extremely hesitant to add clang patches that are not also upstreamed. Could you open a Differential with the clang folks, arguing your case? Or - preferred - think of a different solution (e.g. virtual file) that does not need a clang patch?
I know that clang patches are not the best solution, but I haven't found another one. I've already checked whether I can use the llvm virtual file system. Unfortunately, the relevant clang code path (llvm::MemoryBuffer::getFileOrSTDIN()) does not support a virtual file system.

diff --git a/include/clang/Frontend/CodeGenOptions.h b/include/clang/Frontend/CodeGenOptions.h
index 71730a2..9e2cf68 100644
--- a/include/clang/Frontend/CodeGenOptions.h
+++ b/include/clang/Frontend/CodeGenOptions.h
@@ -209,6 +209,10 @@ public:
/// object file.
std::vector<std::string> CudaGpuBinaryFileNames;
+ /// A buffer that contains the fatbinary code to forward to CUDA runtime
+ /// back-end for incorporating them into host-side object file.
+ std::shared_ptr<llvm::SmallVectorImpl<char>> CudaGpuBinaryBuffer;
+
/// The name of the file to which the backend should save YAML optimization
/// records.
std::string OptRecordFile;
diff --git a/lib/CodeGen/CGCUDANV.cpp b/lib/CodeGen/CGCUDANV.cpp
index d24ef0a..66d6f63 100644
--- a/lib/CodeGen/CGCUDANV.cpp
+++ b/lib/CodeGen/CGCUDANV.cpp
@@ -254,7 +254,8 @@ llvm::Function *CGNVCUDARuntime::makeRegisterGlobalsFn() {
/// \endcode
llvm::Function *CGNVCUDARuntime::makeModuleCtorFunction() {
// No need to generate ctors/dtors if there are no GPU binaries.
- if (CGM.getCodeGenOpts().CudaGpuBinaryFileNames.empty())
+ if (!CGM.getCodeGenOpts().CudaGpuBinaryBuffer &&
+ CGM.getCodeGenOpts().CudaGpuBinaryFileNames.empty())
return nullptr;
// void __cuda_register_globals(void* handle);
@@ -281,16 +282,7 @@ llvm::Function *CGNVCUDARuntime::makeModuleCtorFunction() {
// to be cleaned up in destructor on exit. Then associate all known kernels
// with the GPU binary handle so CUDA runtime can figure out what to call on
// the GPU side.
- for (const std::string &GpuBinaryFileName :
- CGM.getCodeGenOpts().CudaGpuBinaryFileNames) {
- llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> GpuBinaryOrErr =
- llvm::MemoryBuffer::getFileOrSTDIN(GpuBinaryFileName);
- if (std::error_code EC = GpuBinaryOrErr.getError()) {
- CGM.getDiags().Report(diag::err_cannot_open_file) << GpuBinaryFileName
- << EC.message();
- continue;
- }
-
+ auto buildFatbinarySection = [&](const llvm::StringRef FatbinCode) {
const char *FatbinConstantName =
CGM.getTriple().isMacOSX() ? "__NV_CUDA,__nv_fatbin" : ".nv_fatbin";
// NVIDIA's cuobjdump looks for fatbins in this section.
@@ -305,14 +297,12 @@ llvm::Function *CGNVCUDARuntime::makeModuleCtorFunction() {
// Fatbin version.
Values.addInt(IntTy, 1);
// Data.
- Values.add(makeConstantString(GpuBinaryOrErr.get()->getBuffer(),
- "", FatbinConstantName, 8));
+ Values.add(makeConstantString(FatbinCode, "", FatbinConstantName, 8));
// Unused in fatbin v1.
Values.add(llvm::ConstantPointerNull::get(VoidPtrTy));
- llvm::GlobalVariable *FatbinWrapper =
- Values.finishAndCreateGlobal("__cuda_fatbin_wrapper",
- CGM.getPointerAlign(),
- /*constant*/ true);
+ llvm::GlobalVariable *FatbinWrapper = Values.finishAndCreateGlobal(
+ "__cuda_fatbin_wrapper", CGM.getPointerAlign(),
+ /*constant*/ true);
FatbinWrapper->setSection(FatbinSectionName);
// GpuBinaryHandle = __cudaRegisterFatBinary(&FatbinWrapper);
@@ -331,6 +321,27 @@ llvm::Function *CGNVCUDARuntime::makeModuleCtorFunction() {
// Save GpuBinaryHandle so we can unregister it in destructor.
GpuBinaryHandles.push_back(GpuBinaryHandle);
+ };
+
+ // If there is a valid buffer with fatbinary code, embed the buffer.
+ // Otherwise, embed fatbinary code from files.
+ if (CGM.getCodeGenOpts().CudaGpuBinaryBuffer) {
+ const llvm::StringRef GpuBinaryBuffer(
+ CGM.getCodeGenOpts().CudaGpuBinaryBuffer->data(),
+ CGM.getCodeGenOpts().CudaGpuBinaryBuffer->size());
+ buildFatbinarySection(GpuBinaryBuffer);
+ } else {
+ for (const std::string &GpuBinaryFileName :
+ CGM.getCodeGenOpts().CudaGpuBinaryFileNames) {
+ llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> GpuBinaryOrErr =
+ llvm::MemoryBuffer::getFileOrSTDIN(GpuBinaryFileName);
+ if (std::error_code EC = GpuBinaryOrErr.getError()) {
+ CGM.getDiags().Report(diag::err_cannot_open_file)
+ << GpuBinaryFileName << EC.message();
+ continue;
+ }
+ buildFatbinarySection(GpuBinaryOrErr.get()->getBuffer());
+ }
}
CtorBuilder.CreateRetVoid();
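The control flow of the patch - prefer an in-memory fatbin buffer and fall back to reading the named files - can be sketched in plain C++. This is only an illustration of the selection logic; the struct and function names below are hypothetical stand-ins, not the clang API:

```cpp
#include <fstream>
#include <iterator>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for CodeGenOptions::CudaGpuBinaryBuffer and
// CodeGenOptions::CudaGpuBinaryFileNames from the patch above.
struct CodeGenOpts {
  std::optional<std::vector<char>> CudaGpuBinaryBuffer; // in-memory fatbin
  std::vector<std::string> CudaGpuBinaryFileNames;      // on-disk fallback
};

// Mirrors the patched makeModuleCtorFunction(): embed the buffer if it is
// set, otherwise read each named file, skipping files that cannot be opened
// (the real code emits a diagnostic instead of silently skipping).
std::vector<std::string> collectFatbins(const CodeGenOpts &opts) {
  std::vector<std::string> sections;
  if (opts.CudaGpuBinaryBuffer) {
    sections.emplace_back(opts.CudaGpuBinaryBuffer->begin(),
                          opts.CudaGpuBinaryBuffer->end());
  } else {
    for (const std::string &name : opts.CudaGpuBinaryFileNames) {
      std::ifstream in(name, std::ios::binary);
      if (!in)
        continue;
      sections.emplace_back(std::istreambuf_iterator<char>(in),
                            std::istreambuf_iterator<char>());
    }
  }
  return sections;
}
```

When the buffer is set, no file I/O happens at all, which is exactly the property the interpreter needs.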
I have another idea. Maybe we can replace the …
- add Author to CUDA test cases
- optimize DeviceKernelInliner
- improve some comments
Perfect, yes! |
And now the "problem" of this solution: backporting won't be easy, because the clang code base is really old. In the meantime, the nvptx backend has been extended with functions for HIP [1]. So I think we would have to backport a lot of files. I'm not sure; maybe we should stash this feature, because you wrote in #291 that you want to update the clang base at the end of the year. With a new base, the backport should be easy. The feature itself is not that important: it prevents the fatbin code from being written to a file and read back by the same process. The impact on performance is not noticeable. The main advantage of the feature is that it improves independence from the file system. Especially for HPC it is very helpful.

[1] HIP is the AMD version of CUDA. Unlike CUDA, it can address AMD and NVIDIA GPUs. If the target is an NVIDIA GPU, it works like a wrapper framework. Therefore, it made sense to integrate the HIP part for NVIDIA GPUs into the nvptx backend. Often it is just an if-else that changes the function name from
@SimeonEhrig: @ax3l just linked me this somewhere else. This is really excellent work. Hopefully in about a month or so I'll be able to play with Kokkos and this, if we have good CUDA support it could be fun to start playing with rapid development of Kokkos, switching kernels between host and device and the like. Half commenting just to say well done, half to be notified when this merges :)
Thank you for the kind words! Yes, one of our main motivations is tuning and playing with Alpaka as well :) |
Thank you so much. At the moment I am preparing a Jupyter notebook for a lecture with jitify and the redefinition feature. It's really nice to use, but just a "workaround". After this PR, one of the next features will be redefinition in CUDA mode. Then we can use it with the Runtime API, Alpaka, Kokkos, ... :-D
- add Author to CUDA test cases
- optimize DeviceKernelInliner
- improve some comments
- remove deprecated opt level variables
- change interface of IncrementalCUDADeviceCompiler::process() and IncrementalCUDADeviceCompiler::declare()
@Axel-Naumann I have removed the part of the code that requires changes to the clang base. Now the tests pass (except for some timeouts). There are only two comments left, which I can't solve without your feedback. |
…ices
- change CUDA to CUDAHost and add CUDADevice to the InvocationOptions
- in the PTX mode, some ASTTransformers will not be used, which are useful for the x86 mode
This ASTTransformer adds an inline attribute to any CUDA __device__ kernel that does not have the attribute. Inlining solves a problem caused by incremental compilation of PTX code. In a normal compiler, all definitions of __global__ and __device__ kernels are in the same translation unit. In the incremental compiler, each kernel has its own translation unit. In case a __global__ kernel uses a __device__ function, this design caused an error. Instead of generating the PTX code of the __device__ kernel in the same file as the __global__ kernel, there is only an external declaration of the __device__ function. However, normal PTX code does not support an external declaration of functions. The transformer only works if the target device is nvptx.
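The effect of the inlining can be mimicked in plain C++ (an analogy only - real __device__ and __global__ kernels need nvcc or clang's CUDA mode): a bare extern declaration leaves a symbol unresolved in the translation unit, while an inline definition travels with every TU that uses it, which is the property the transformer enforces for the emitted PTX:

```cpp
// Analogy: 'device_helper' plays the role of a __device__ function and
// 'global_kernel' the role of a __global__ kernel. Marking the helper
// 'inline' guarantees its definition is available in every translation
// unit that calls it, so no external declaration ends up in the output --
// the same guarantee DeviceKernelInliner provides for PTX, where external
// declarations are forbidden.
inline int device_helper(int x) { return 2 * x; }

int global_kernel(int x) { return device_helper(x) + 1; }
```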
Replaced the old version of the PTX compiler, which used external tools and llvm::ExecuteAndWait, with an internal implementation. The new incremental PTX compiler uses a modified version of the cling::Interpreter instance. The instance can process the PTX built-ins and generates LLVM IR. The LLVM IR is then compiled to PTX via an additional NVPTX backend implemented in the IncrementalCUDADeviceCompiler. The new implementation has many advantages:
- it's much faster than the old version
- less error-prone, because the ASTPrinter and some unnecessary cling transformations are avoided
- fewer problems in searching for external tools (which can be very complicated on HPC systems)
The IncrementalCUDADeviceCompiler is moved from the cling::IncrementalParser to the cling::Interpreter, because the second interpreter needs the input without wrappers or transformations.
- instead of using the NVIDIA tool fatbin, the fatbin is now generated directly in cling
- clean up the IncrementalCUDADeviceCompiler class according to the new fatbin implementation (e.g. remove findToolChain())
- one last I/O operation is still required: write the fatbin code to a file for the CodeGen
- it is more similar to the interface of cling::Interpreter
- replace function compileDeviceCode() with process()
- add declare() and parse() functions
- the functions take only the input argument, because the rest of the missing arguments (e.g. Transaction) requires modifications to the transaction system
- it also fixes a bug in the I/O system of the xeus-cling kernel
- a really weak solution, which should be replaced by a generic solution
I have applied this to ROOT (we sync cling from ROOT). The PR root-project/root#4616 will also run ROOT's test suite. I do not expect any issues there. I am closing this as all cling-related issues seem to be addressed; the rest happens in ROOT's PR! Open questions:
About
Two major modifications to cling are required to JIT CUDA code. On the one hand, the host side (x86 target) needs some modifications to understand the CUDA host code, distinguish host and device code, and embed the device code (PTX target) into the host code. On the other hand, a second JIT compiler is needed to compile the device code alongside the host code.
Current implementation
Currently the PTX jit compiler (CUDA device code) is implemented in IncrementalCUDADeviceCompiler.h and IncrementalCUDADeviceCompiler.cpp. The basic idea is to use file I/O and external tools via llvm::sys::ExecuteAndWait. For example, if we enter N statements, we compile the PTX code with an external command; each time we add a new statement, we execute the command again.
There is also a necessary step to translate the PTX code to fatbin code. At the moment this step is not refactored, because it requires proprietary software and no API is provided.
After compiling the fatbin code, it is read back and embedded by cling.
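A minimal sketch of that file-based round trip, with a trivial stand-in for the external compiler call (the real code shells out via llvm::sys::ExecuteAndWait; the file names here are hypothetical):

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Read a whole file back into a string.
static std::string readFile(const std::string &path) {
  std::ifstream in(path, std::ios::binary);
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}

// Stand-in for the old pipeline: cling wrote the accumulated CUDA source to
// a file, ran an external compiler on it, and read the resulting PTX/fatbin
// back from disk. Here the "compiler" step just tags the input so the round
// trip stays visible.
std::string compileViaFiles(const std::string &source) {
  const std::string srcPath = "cling_tmp.cu";   // hypothetical temp names
  const std::string outPath = "cling_tmp.ptx";
  std::ofstream(srcPath) << source;             // 1. write source to disk
  // 2. "external tool" run -- placeholder for llvm::sys::ExecuteAndWait
  std::ofstream(outPath) << "// PTX for: " + readFile(srcPath);
  return readFile(outPath);                     // 3. read the result back
}
```

Every statement entered at the prompt repeats all three steps, which is why the approach was slow and fragile.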
Problems
The current implementation has a lot of problems, for example:
- the NullDerefProtectionTransformer, which protects the cling runtime from crashes caused by nullptr dereferences. Some transformations cause problems on the device side and are not necessary.

Idea
The main idea is to use an integrated incremental compiler instance which can process the invalid source code from the prompt and generate PTX code incrementally, without having to compile the entire TU every time. I have already implemented a prototype which implements this idea. The prototype uses some code from cling via libcling.so and some code from the llvm and clang libraries.

Prototype
The prototype is divided into three parts. The first part is a simple prompt, similar to the cling prompt, for testing. The second part is a modified version of the cling interpreter instance, which I use as frontend; details follow below. The changes are visible in the pull request. The last part is a PTX backend which I implemented with an unmodified llvm library.
Frontend
Three modifications were necessary for the frontend:
- the invocation options CUDAHost and CUDADevice
- an ASTTransformer that adds inline to each __device__ kernel. See this commit. This is necessary because the PTX code generating functionality of clang is not designed to handle many TUs. For example, if we have a __device__ kernel [2] and a __global__ kernel, the PTX file generated for the __device__ kernel contains the definition of the __device__ kernel. The PTX file generated for the __global__ kernel contains the PTX code of the __global__ kernel and an external declaration of the __device__ function. But external declarations are forbidden in PTX code. So I needed a solution to copy the __device__ kernel definition into the __global__ kernel PTX file.

[2] __device__ kernels are only visible to __global__ and other __device__ kernels and cannot be called by the host.

In my prototype I use the functionality of the interpreter instance up to the generation of the LLVM IR code (llvm::Module). Then I take the llvm::Module and pass it to an extra backend.

Backend
The backend is based on some llvm tutorials and is really generic. It takes a llvm::Module and translates it to PTX. It works really well, with one exception, which I solved with the ASTTransformer. I have already tested the generated PTX code of some basic functions with a hacked cling.

Questions
1. On the one hand, I could inherit from Interpreter for clean code. On the other hand, I am not sure if the class Interpreter has been designed for inheritance. Maybe it would require some refactoring work and make future development more difficult. Do you have any good reasons or ideas as to what I should do?
Current solution: cling::Interpreter has a pointer to a second cling::Interpreter instance. Some if (CudaDevice) statements within the class provide the correct functionality.
2. I need a way to copy the definition of a __device__ kernel of one TU to a __global__ kernel of another TU. Do you have any ideas? I have already checked the llvm module linker, but I don't think the functionality is right for my problem.
Current solution: Use a self-written ASTTransformer.
3. At the moment, clang uses llvm::MemoryBuffer::getFileOrSTDIN() to load the fatbin code from a file (see source code). The path is set by cling. To get a full compiler stack without file I/O, we must avoid this step. Maybe we can use the virtual file system of clang or llvm. Or we try something with STDIN, but I think that could cause a lot of problems. Otherwise, we have to change the clang source code so that it reads the code directly from a buffer.
Current solution: The clang code has been modified, because llvm::MemoryBuffer::getFileOrSTDIN() does not support a virtual file system.

Tasks before merge
These tasks must be completed so that the pull request is ready for merging:
- modify cling::utils::getWrapPoint() to handle the CUDA function attributes __global__, __device__ and __host__
- fix bug with CUDA __constant__ memory

Optional task
I moved the __constant__ memory task to an optional task, because the solution is more complicated than I thought. __constant__ is a frequently used feature, but it is not required for the basic functionality of the CUDA extension.
- fix bug with CUDA __constant__ memory
Problem description: SimeonEhrig#3
Possible solution 1: Use an ASTTransformer to remove the __constant__ attribute from local variables. This is not easy, because attributes are not stored in the AST. Attributes are stored in a side data structure (clang::Declarator). To check whether a statement has an attribute, we must perform the semantic analysis, which also checks the properties of attributes that cause the __constant__-local-error. The built-in functions of clang for detecting and changing attributes are therefore not usable. This means that we have to implement our own solution for detecting and changing attributes.
Possible solution 2: Change the behavior of the ASTTransformer DeclExtractor. At the moment the DeclExtractor copies the declaration from the wrapper function to the global space and modifies the local declaration a bit. If we could delete the local declaration completely, the error would not occur. But I don't know if the local declaration is necessary for any functionality.

Overview of the most important changes
copies the declaration form the wrapper function to the global space and modify the local declaration a bit. If we could delete the local declaration completely, the error would not occur. But I don't know if the local declaration is necessary for any functionality.Overview of the most important changes
cling::Interpreter
cling::interpreter
instance -> host JIT has a device JITif(CudaDevice)
statements in the functions__device__
kernel__device__
kernel in transaction A and we want to use the__device__
kernel in a__global__
kernel of transaction B -> then we have to copy to whole definition of the__device__
kernel into the PTX file of transaction Bcling::Interpreter
Clang
code, because, the Clang codeGen read the code via a function from a file that does not support a virtual file system#ifdef
guards__global__
,__device__
and__host__
toSourceNormalization
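The "host JIT has a device JIT" design can be sketched as follows. The class and member names are hypothetical; the real cling::Interpreter is far more involved, but the ownership and forwarding structure is the point:

```cpp
#include <memory>
#include <string>
#include <vector>

// Sketch: the host interpreter owns a second interpreter for device code
// and forwards each input to it first, mirroring the if (CudaDevice)
// dispatch described above.
class MiniInterpreter {
public:
  explicit MiniInterpreter(bool cudaDevice = false)
      : m_CudaDevice(cudaDevice) {}

  // Host side only: create the embedded device interpreter.
  void enableCUDA() {
    m_DeviceInterp = std::make_unique<MiniInterpreter>(/*cudaDevice=*/true);
  }

  void process(const std::string &input) {
    if (m_DeviceInterp)               // host side: feed the device JIT first,
      m_DeviceInterp->process(input); // so PTX is ready when host code runs
    m_Inputs.push_back(input);
  }

  bool isDevice() const { return m_CudaDevice; }
  const std::vector<std::string> &inputs() const { return m_Inputs; }
  const MiniInterpreter *device() const { return m_DeviceInterp.get(); }

private:
  bool m_CudaDevice;
  std::unique_ptr<MiniInterpreter> m_DeviceInterp; // only set on the host
  std::vector<std::string> m_Inputs;
};
```

Keeping the device interpreter as a member rather than inheriting from Interpreter matches the "current solution" above: a pointer plus a few dispatch checks, without committing Interpreter to an inheritance-based design.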