-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NVPTX backend metabug #38789
Comments
It would be better if instead of having to create a new module / crate, adding the That way we could just write our kernels inline with normal Rust code. #[target_feature(enabled = "ptx_device_fn")] unsafe fn a_device_fn(...) { ... }
#[target_feature(enabled = "ptx_kernel")] unsafe fn a_kernel(...) { ... a_device_fn(...) ... }
unsafe fn bar() {
cuda::driver::launch(x, y, z, w, a_kernel);
cuda::driver::launch(x, y, z, w, #[target_feature(enabled = "ptx_kernel")] |...| a_kernel(...) );
} This way users can use the typical |
@japaric @rkruppe @alexcrichton I'd like to work on this. |
You seem to assume a "single source" model? We don't have that currently. You have to compile your kernels as one crate for a Supporting that would require novel frontend integration (novel for Rust; clang has something like this already). For example rustc would have to decide for each translation item whether it should be compiled for the host, for the device, or both -- and then combining the resulting PTX and host object files. Additionally, even if/once we have "single source", |
Indeed, makes sense.
I'd like to work on enabling this via a mixture of the ABI solution to choose, e.g., the If we could have these multiple ABIs into a single source file, we could have #[inline]
fn baz() { }
#[target_device(host, nvptx(sm = "40"), spirv(version = "1.0"))]
fn bar(...) { ... }
#[target_device_kernel(host, nvptx(sm = "40"), spirv(version = "1.0"))]
fn foo(...) {
#[device] bar(...); // device attribute indicates that this fn is a device fn
baz(); // this function will be used as is
} that expand to: fn baz_host(...) { bar_host(...); baz(); }
#[target_feature(enabled = "sm40")]
extern "ptx" fn baz_nvptx(...) { bar_nvptx(...); baz(); }
extern "spriv" fn baz_spirv(...) { bar_spirv(...); baz(); }
fn foo_host(...) { bar_host(...); baz(); }
#[target_feature(enabled = "sm40")]
extern "ptx-kernel" fn foo_nvptx(...) { bar_nvptx(...); baz(); }
extern "spirv-kernell" fn foo_spirv(...) { bar_device(...); baz(); } And then just launch the kernels using another procedural macro, e.g., One cool feature of clang and nvcc is to allow whoever builds the library to easily choose the devices to target. Procedural macros could allow these via feature flags: This approach leaves the door open to doing something nicer in the language in the future, while allowing libraries to experiment with better APIs. I wonder whether these two building blocks (extern ABIs in a single source file, and |
Frankly, I don't see how any of the tools we have in the language now (target_feature, proc macros, ABIs) can help at all with single source support. Right now, one crate is compiled for one target, period. Subsets of the crate can tweak some parts of the target (e.g., use non-standard ABIs or enable/disable instruction set extensions) but that's a far cry from slicing out a subset of the crate, compiling it for a completely different target, and then stitching the results back together -- and that's precisely what is necessary for single-source offloading (not just CUDA, but also everything else along these lines that I've seen). In fact the assumption that one crate == one target goes as far as Even if one attempts to minimize the amount of compiler changes needed during prototyping for faster iteration (generally a good idea) by e.g. splitting the crate into two crates with an external tool and invoking rustc twice, there is ample room for compiler hacking. Even the bare minimum of single source support requires name resolution information, and being able to use generic library code will require type system integration as well. So in my opinion, this is a rather big feature with at least as much need for experimentation and compiler hacking and design work as SIMD intrinsics. I say this not to discourage you but because your posts so far ignore the technical challenges that are, in my opinion, the biggest obstacle to single source support. I'm also rather puzzled by the priorities here. Before experimenting with the best way to allow users to compile their single-source applications not just for multiple CUDA devices but also for entirely different targets like SPIR-V, basic features like an equivalent to |
Oh no, I think I expressed myself wrong. I meant that once we get single source support using extern ABIs, the combination of tools that we have already available in the language can allow for pretty nice APIs.
I think that |
What does this mean? I am not aware of any plans for any kind of single source support. And what does "using extern ABIs" mean? It seems to presuppose some strategy for single source support but it's not clear to me which one (and it doesn't sound like any of the strategies that I am aware of). Finally, assuming I'm correct that single source support is not on the horizon, I'm puzzled why we're hashing out details of how it could be exposed better to the user if the basic technical prerequsites aren't even on the horizon.
IIUC such an intrinsic would be basically like an |
How so? The kernel does not allocate anything:
I'd like to work on enabling single source support and I'd like to enable it in such a way that it is useful.
In a single source model: extern "ptx-kernel" unsafe fn foo(...) { ... is compiled to a ptx kernel ... }
fn bar(...) { ... is compiled for the host ... } |
It's a storage specifies in C parlance. You declare variables to live in shared memory as opposed to thread-private memory or global memory or constant memory. In pointer types it's just an optional hint that the pointee lives in shared memory, that aspect isn't even needed. What is absolutely necessary is to be able to do declare locals like |
Sure, but what's the point of making it a storage specifier in Rust? You can't have two variables on shared memory, that is, the following is not valid CUDA C: __global__ void foo(float* foo) {
__shared__ a float[];
__shared__ b float[]; // ERROR: you can only have one pointer to shared memory per kernel
foo[0] = a[0] + b[t0];
}
In particular, here you are not allocating a So IMO, independently of what the spec says, we should focus on the actual semantics of This extern __shared__ a float[]; is just: float* a = __get_ptr_to_shared_memory(); |
I want to point out that we're pretty badly derailing this metabug. You should probably open a thread on internals.rlo if you want to take this discussion much further. But first I would invite you to double check your facts. Many things you've said go against everything I've ever heard and seen about CUDA (and other offloading solutions, for that matter).
I'm not saying it should be a storage specifier in Rust. We don't even have such a concept at the moment. I'm just saying, it's fundamentally a variation on variable declaration (like DST locals), not something about pointers.
???? Check the code I linked earlier, it has at least two such variables. Or pretty a random non-trivial CUDA program using shared memory. I don't know why the code you give is rejected, but I've never even seen this syntax so I'm not even sure what it means.
Are you saying Nvidia's examples, as well as all the other programs using shared memory, are effectively borked? That can't be right. In fact, I'm pretty sure the size of the is recorded (provided it has a size -- again I don't know wth That reminds me, another way in which such intrinsic for |
That syntax is the dynamic shared memory allocation syntax.
The size of the shared memory region can be specified at run-time, at least in CUDA. Reading through the docs of the example you mention, if the size of the variables allocated in shared memory are compile-time constants, and no dynamic shared memory allocation occurs, it looks like one does not need to specify the memory to allocate during kernel launch because the compiler does it for you (but I've always used dynamic shared memory so I am not sure). |
Good to know that there's a dynamic allocation strategy as well. But it seems that you still specify the size, just at kernel invocation time? That seems like it would still allow the driver to make sure enough memory is available (i.e., as much memory as the kernel invocation specified; of course the kernel still needs to obtain and use that number correctly). For dynamic an intrinsic might be good, but since static shared memory allocation seems extremely common, we'd probably want to support it as well and an intrinsic can't really do that (well). |
Yes.
Exactly. The typical way in which this is used is by passing something that correlates with the allocated size as a run-time argument to the kernel.
Yes definitely. Since dynamic shared memory is more powerful (it allows doing everything that can be done with static shared memory and some more), has no drawbacks over static shared memory beyond ergonomics (shared memory is always allocated at run-time, whether the size is known are compile-time or not is pretty much irrelevant), and can probably just be an Adding support for AFAIK only fixed-size arrays are allowed in static shared memory and the memory must be uninitialized. So while something like |
Note that I'm interested in funding work on this: https://internals.rust-lang.org/t/nvptx-funding/7441 I'd like to get this to work out of the box on nightly. |
Hi, I've started using the NVPTX backend for some simple experiments, I'm listing my experiences here so far since I don't know what the proper protocol is. We can turn these into specific issues on the right repo's later on.
pub struct MyStruct {
data: u32,
}
impl PartialEq for MyStruct {
fn eq(&self, other: &Self) -> bool {
return self.data == other.data;
}
} Leads to invalid PTX since symbols are being generated with dots in them:
I haven't looked much into it but I have a feeling that it's due to https://github.com/rust-lang/rust/blob/master/src/librustc_codegen_utils/symbol_names.rs#L424
On the bright side: this has been a really pleasant GPU programming experience so far (other then actually getting it set up) because it's extremely valuable to share the same codebase between CPU and GPU. |
I have not been able to compile to PTX with cargo or xargo, I've only been able to do so using accel. Therefore, some of the following may be issues with Accel. That seems unlikely, so I'll report them here.
Most of these are probably due to references to missing functions in the final PTX, but they'll need to be dealt with somehow. I am interested in contributing to improve the state of GPGPU in Rust. Not sure where to start. |
I've met this issue while developing accel , and it prevent me to use libcore for nvptx target. accel cannot link libcore or other std libraries currently. I recently start to write a patch to rustc to enable nvptx target.
this seems to be a good information for me :) |
#53099 is relevant here. I haven't been able to compile any kernel recently, because of a segfault while compiling |
@bheisler I believe it somehow related to definition json. It doesn't happend to me with json from ptx-linker, but I saw the problem with another one. |
I'm finally proud to announce my progress on CUDA integration. I've made several tools to ease development and currently working on a tutorial and high-level crate (it will probably be a custom rustc driver because First one is a ptx-linker that solves several important problems:
I started work on the linker about a year ago, and today achieved important milestone: it doesn't depend on any external tools and libs anymore. So end users don't need to care about matching Rust's and system's LLVM versions (which became a problem when Rust switched to LLVM 7.0). The second crate is a ptx-builder that improves development convenience dramatically. It's a Also worth checking, an incomplete tutorial about CUDA development flow with more or less real example. |
@denzp Can ptx-linker link with libcore? I am creating a toolchain to link libcore using llvm-link in rust-accel/nvptx. Linking of libcore will cause the symbol name issue as reported by @Jasper-Bekkers due to the difference between GAS and PTX, and I avoid it by rewriting librustc_codegen_utils/symbol_names.rs. |
@termoshtt The linker suppose to fix this, it has a special "pass" that does renaming. The problem can happen not only with src_image.pixel(i, j)
I found the linker robust enough about solving the issue. But still, I'd probably prefer this to be fixed in rustc. |
@denzp - I don't think it's caused by the target JSON, unfortunately. When I add the |
#38824 is closed about a month ago. I think we can remove it from the list? |
NVPTX target specification This change adds a built-in `nvptx64-nvidia-cuda` GPGPU no-std target specification and a basic PTX assembly smoke tests. The approach is taken here and the target spec is based on `ptx-linker`, a project started about 1.5 years ago. Key feature: bitcode object files being linked with LTO into the final module on the linker's side. Prior to this change, the linker used a `ld` linker-flavor, but I think, having the special CLI convention is a more reliable way. Questions about further progress on reliable CUDA workflow with Rust: 1. Is it possible to create a test suite `codegen-asm` to verify end-to-end integration with LLVM backend? 1. How would it be better to organise no-std `compile-fail` tests: add `#![no_std]` where possible and mark others as `ignore-nvptx` directive, or alternatively, introduce `compile-fail-no-std` test suite? 1. Can we have the `ptx-linker` eventually be integrated as `rls` or `clippy`? Hopefully, this should allow to statically link against LLVM used in Rust and get rid of the [current hacky solution](https://github.com/denzp/rustc-llvm-proxy). 1. Am I missing some methods from `rustc_codegen_ssa::back::linker::Linker` that can be useful for bitcode-only linking? Currently, there are no major public CUDA projects written in Rust I'm aware of, but I'm expecting to have a built-in target will create a solid foundation for further experiments and awesome crates. Related to #38789 Fixes #38787 Fixes #38786
Triage: this is a metabug. Not aware of anything particular going on with this target lately. |
Visited for T-compiler backlog bonanza. It seems like there are some unresolved questions about scope and design, with respect to the concerns that were raised on this thread between @hanna-kruppe and @gnzlbg . (Basically, its not clear to me whether the work remaining here is "just" more implementation and fixing bugs, or if there's some design stuff that needs to be revisited.) @rustbot label: +S-tracking-needs-summary |
@kjetilkjeka Thank you for all your excellent recent work on Rust PTX support. What, in your opinion, is required to bring Rust CUDA support to parity with C++? Is it even possible? I'm curious to hear from others too. |
I think the most prominent issue with using CUDA from Rust is referenced by this issue. It's the lacking support for shared memory. In the bright future, I think that Rust with CUDA could in some ways even surpass using CUDA from C++. To avoid adding a lot of discussion around what CUDA using Rust should and should not be in this issue, I think we should continue discussing it in the newly created zulip thread for gpgpu. https://rust-lang.zulipchat.com/#narrow/stream/422870-t-compiler.2Fgpgpu-backend If you start a thread there I'm happy to discuss it further 😄 |
To respond somewhat to the way this initially started: while I'm not sure the ABI string is the most ideal mechanism, I definitely don't believe that The biggest giveaway is that "device function" and "host function" are mutually exclusive. In general, |
The way I've managed this is to put device functions and entry points ( This works relatively well, though RFC 3176: |
Agreed that single-source conveniences are cool and not critical. I mostly was remarking to make it clear that we shouldn't be using I believe that all remaining details should be split into new sub-issues from here. If we are happy to bury the single-source concern for now, then it seems the biggest remaining subissues are
|
Shared memory can be exposed using inline assembly (reference). Would something like this be welcome in core::arch::global_asm!(".extern .shared .align 16 .b8 _shared_data[];");
#[inline(always)]
pub fn _dynamic_smem() -> (*mut u8, u32) {
// Dynamic shared memory.
let size: u32;
let saddr: u64;
let ptr: *mut u8;
unsafe {
asm!("mov.u32 {}, %dynamic_smem_size;", out(reg32) size);
asm!("mov.u64 {}, _shared_data;", out(reg64) saddr);
asm!("cvta.shared.u64 {ptr}, {saddr};", ptr = out(reg64) ptr, saddr = in(reg64) saddr);
}
(ptr, size)
} It has also been suggested that this be implemented as an LLVM intrinsic, something like |
Yes, a new issue to discuss it would be nice. |
Then, closing this! |
The NVPTX backend has been available since: nightly-2017-01-XX
This is a collections of bugs and TODOs related to it.
Documentation
Bugs
LLVM assertion when compiling
core
to PTX. LLVM assertion when usingi128
with the NVPTX #38824LLVM error when emitting PTX code with debuginfo. LLVM error when emitting PTX code with debuginfo #38785
NVPTX: No "undefined reference" error is raised when it should be. NVPTX: No "undefined reference" error is raised when it should be #38786
NVPTX: non-inlined functions can't be used cross crate. NVPTX: non-inlined functions can't be used cross crate #38787
Missing features
__shared__
modifier. Probably needs an RFC to landin the compiler as we don't have anything similar to it (AFAIK).
Stabilization
All the non-trivial kernels make use of intrinsics like
blockIdx.x
. These will have to be stabilized. Right now these intrinsics are implemented as"plaform-intrinsics"
but that feature is unstable.Stabilize the
"ptx-kernel"
ABI. Tracking issue for the "ptx-kernel" ABI #38788cc @rkruppe
The text was updated successfully, but these errors were encountered: