Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NVPTX backend metabug #38789

Closed
japaric opened this issue Jan 2, 2017 · 37 comments
Closed

NVPTX backend metabug #38789

japaric opened this issue Jan 2, 2017 · 37 comments
Labels
C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC metabug Issues about issues themselves ("bugs about bugs") O-NVPTX Target: the NVPTX LLVM backend for running rust on GPUs, https://llvm.org/docs/NVPTXUsage.html S-tracking-needs-summary Status: It's hard to tell what's been done and what hasn't! Someone should do some investigation. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@japaric
Copy link
Member

japaric commented Jan 2, 2017

The NVPTX backend has been available since: nightly-2017-01-XX

This is a collections of bugs and TODOs related to it.

Documentation

Bugs

Missing features

  • The equivalent to CUDA's __shared__ modifier. Probably needs an RFC to land
    in the compiler as we don't have anything similar to it (AFAIK).

Stabilization

  • Stabilize the nvptx targets. IOW, add them to the compiler. Candidates for merging.
  • All the non-trivial kernels make use of intrinsics like blockIdx.x. These will have to be stabilized. Right now these intrinsics are implemented as "plaform-intrinsics" but that feature is unstable.

  • Stabilize the "ptx-kernel" ABI. Tracking issue for the "ptx-kernel" ABI #38788

cc @rkruppe

@japaric japaric added metabug Issues about issues themselves ("bugs about bugs") O-NVPTX Target: the NVPTX LLVM backend for running rust on GPUs, https://llvm.org/docs/NVPTXUsage.html labels Jan 2, 2017
@Mark-Simulacrum Mark-Simulacrum added the C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC label Jul 26, 2017
@gnzlbg
Copy link
Contributor

gnzlbg commented Feb 21, 2018

It would be better if instead of having to create a new module / crate, adding the #![feature(abi_ptx)] feature, and then declaring the kernels as extern "ptx-kernel" fn foo() {}, we could just handle this using #[target_feature], so that one can add kernels within a non-kernel Rust module. It would be great if this attribute could be applied to closures as well.

That way we could just write our kernels inline with normal Rust code.

#[target_feature(enabled = "ptx_device_fn")] unsafe fn a_device_fn(...) { ... }
#[target_feature(enabled = "ptx_kernel")] unsafe fn a_kernel(...) { ... a_device_fn(...) ... }

unsafe fn bar() {
   cuda::driver::launch(x, y, z, w, a_kernel);
   cuda::driver::launch(x, y, z, w, #[target_feature(enabled = "ptx_kernel")] |...| a_kernel(...) );
}

This way users can use the typical target_feature facilities to write portable code.

@gnzlbg
Copy link
Contributor

gnzlbg commented Feb 21, 2018

@japaric @rkruppe @alexcrichton I'd like to work on this.

@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented Feb 21, 2018

You seem to assume a "single source" model? We don't have that currently. You have to compile your kernels as one crate for a nvptx-* target, then embed the resulting PTX asm as string/resource into a crate that's compiled for the host target. The latter crate passes it to the driver for compilation. There's no way to mix host and device code in one compilation unit like nvcc allows.

Supporting that would require novel frontend integration (novel for Rust; clang has something like this already). For example rustc would have to decide for each translation item whether it should be compiled for the host, for the device, or both -- and then combining the resulting PTX and host object files.

Additionally, even if/once we have "single source", target_feature seems inappropriate, since target_feature is for modifying the available instruction set within one target, while this here requires compiling the code for a different target altogether (and in many cases, for two targets as mentioned before). A custom attribute might be more appropriate but since this is really about "entry point for device code or not" rather than "can run on device or not" an ABI seems fine too.

@gnzlbg
Copy link
Contributor

gnzlbg commented Feb 21, 2018

Additionally, even if/once we have "single source", target_feature seems inappropriate, since target_feature is for modifying the available instruction set within one target,

Indeed, makes sense.

Supporting that would require novel frontend integration (novel for Rust; clang has something like this already). For example rustc would have to decide for each translation item whether it should be compiled for the host, for the device, or both -- and then combining the resulting PTX and host object files.

A custom attribute might be more appropriate but since this is really about "entry point for device code or not" rather than "can run on device or not" an ABI seems fine too.

I'd like to work on enabling this via a mixture of the ABI solution to choose, e.g., the ptx or ptx-kernel ABIs, and target_feature to choose, e.g., sm30 vs sm70.

If we could have these multiple ABIs into a single source file, we could have #[target_device] and #[target_device_kernel] procedural macros that just generates copies of a function for different ABIs:

#[inline]
fn baz() { }

#[target_device(host, nvptx(sm = "40"), spirv(version = "1.0"))]
fn bar(...) { ... }

#[target_device_kernel(host, nvptx(sm = "40"), spirv(version = "1.0"))]
fn foo(...) { 
  #[device] bar(...); // device attribute indicates that this fn is a device fn
  baz(); // this function will be used as is
}

that expand to:

fn baz_host(...) { bar_host(...); baz(); }
#[target_feature(enabled = "sm40")]
extern "ptx" fn baz_nvptx(...) { bar_nvptx(...); baz();  }
extern "spriv" fn baz_spirv(...) { bar_spirv(...); baz();  }

fn foo_host(...) { bar_host(...); baz(); }
#[target_feature(enabled = "sm40")]
extern "ptx-kernel" fn foo_nvptx(...) { bar_nvptx(...); baz();  }
extern "spirv-kernell" fn foo_spirv(...) { bar_device(...); baz();  }

And then just launch the kernels using another procedural macro, e.g., kernel_launch!(foo, args...).

One cool feature of clang and nvcc is to allow whoever builds the library to easily choose the devices to target. Procedural macros could allow these via feature flags: --features target_device_nvptx_sm_35, nvptx_sm_70, ... (it's not nice, but should be at least doable).

This approach leaves the door open to doing something nicer in the language in the future, while allowing libraries to experiment with better APIs. I wonder whether these two building blocks (extern ABIs in a single source file, and #[target_feature]) are a good way to do this, and whether this is something that a nicer approach is going to need in one form or another, or whether this approach is completely wrong and there is a better alternative.

@hanna-kruppe
Copy link
Contributor

Frankly, I don't see how any of the tools we have in the language now (target_feature, proc macros, ABIs) can help at all with single source support. Right now, one crate is compiled for one target, period. Subsets of the crate can tweak some parts of the target (e.g., use non-standard ABIs or enable/disable instruction set extensions) but that's a far cry from slicing out a subset of the crate, compiling it for a completely different target, and then stitching the results back together -- and that's precisely what is necessary for single-source offloading (not just CUDA, but also everything else along these lines that I've seen).

In fact the assumption that one crate == one target goes as far as rustc being literally unable to store more than one target per compilation session (Session). Not to mention how you need to generate a whole separate LLVM module for the PTX code.

Even if one attempts to minimize the amount of compiler changes needed during prototyping for faster iteration (generally a good idea) by e.g. splitting the crate into two crates with an external tool and invoking rustc twice, there is ample room for compiler hacking. Even the bare minimum of single source support requires name resolution information, and being able to use generic library code will require type system integration as well.

So in my opinion, this is a rather big feature with at least as much need for experimentation and compiler hacking and design work as SIMD intrinsics. I say this not to discourage you but because your posts so far ignore the technical challenges that are, in my opinion, the biggest obstacle to single source support.

I'm also rather puzzled by the priorities here. Before experimenting with the best way to allow users to compile their single-source applications not just for multiple CUDA devices but also for entirely different targets like SPIR-V, basic features like an equivalent to __shared__ seem like a simpler and more important first step. Again, not trying to discourage, but even the greatest most ergonomic portable offloading solution seems pointless if the kernels can't even use group-shared memory.

@gnzlbg
Copy link
Contributor

gnzlbg commented Feb 21, 2018

Frankly, I don't see how any of the tools we have in the language now (target_feature, proc macros, ABIs) can help at all with single source support.

Oh no, I think I expressed myself wrong. I meant that once we get single source support using extern ABIs, the combination of tools that we have already available in the language can allow for pretty nice APIs.

basic features like an equivalent to shared seem like a simpler and more important first step.

I think that __shared__ can be implemented as a core::intrinsic and is thus not a big deal: fn shared() -> *mut u8 would do.

@hanna-kruppe
Copy link
Contributor

I meant that once we get single source support using extern ABIs,

What does this mean? I am not aware of any plans for any kind of single source support. And what does "using extern ABIs" mean? It seems to presuppose some strategy for single source support but it's not clear to me which one (and it doesn't sound like any of the strategies that I am aware of). Finally, assuming I'm correct that single source support is not on the horizon, I'm puzzled why we're hashing out details of how it could be exposed better to the user if the basic technical prerequsites aren't even on the horizon.

I think that shared can be implemented as a core::intrinsic and is thus not a big deal: fn shared() -> *mut u8 would do.

IIUC such an intrinsic would be basically like an alloca intrinsic, which Rust has rejected in favor of better support for unsized (DST) values. So while this is a possible strategy (although I can think of some technical challenges as well) that a prototype implementation might choose, it is far from clear to me that it's the approach we'd want to adopt.

@gnzlbg
Copy link
Contributor

gnzlbg commented Feb 21, 2018

IIUC such an intrinsic would be basically like an alloca intrinsic

How so? The kernel does not allocate anything: __shared__ just initializes a pointer with a value.

Finally, assuming I'm correct that single source support is not on the horizon

I'd like to work on enabling single source support and I'd like to enable it in such a way that it is useful.

And what does "using extern ABIs" mean?

In a single source model:

extern "ptx-kernel" unsafe fn foo(...) { ... is compiled to a ptx kernel ... }
fn bar(...) { ... is compiled for the host ... }

@hanna-kruppe
Copy link
Contributor

How so? The kernel does not allocate anything: shared just initializes a pointer with a value.

It's a storage specifies in C parlance. You declare variables to live in shared memory as opposed to thread-private memory or global memory or constant memory. In pointer types it's just an optional hint that the pointee lives in shared memory, that aspect isn't even needed. What is absolutely necessary is to be able to do declare locals like __shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; (from http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory) which is like let As: [[f32; BLOCK_SIZE]; BLOCK_SIZE]; except there's just one array per thread group, accessed by all threads in the group.

@gnzlbg
Copy link
Contributor

gnzlbg commented Feb 21, 2018

It's a storage specifies in C parlance.

Sure, but what's the point of making it a storage specifier in Rust? You can't have two variables on shared memory, that is, the following is not valid CUDA C:

__global__ void foo(float* foo) {
  __shared__ a float[];
  __shared__ b float[];  // ERROR: you can only have one pointer to shared memory per kernel
  foo[0] = a[0] + b[t0]; 
}

What is absolutely necessary is to be able to do declare locals like shared float As[BLOCK_SIZE][BLOCK_SIZE]; (from http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory)

In particular, here you are not allocating a float [BLOCK_SIZE][BLOCK_SIZE]; region of shared memory, but assigning the pointer to shared memory to a float [BLOCK_SIZE][BLOCK_SIZE] array (C arrays are just pointers). This might lead to UB if no memory actually has been allocated, or if some other kernel reinterpreted it to have another type, or if less memory than the one required here was allocated, etc.

So IMO, independently of what the spec says, we should focus on the actual semantics of __shared__. The only thing __shared__ does is initializing some pointer to point to the shared memory region with the restriction that one cannot initialize two pointers to it. Nothing more, nothing less. It does not allocate anything, it does not guarantee that any memory exists, or was allocated, or that the data in the memory has some "type", ... no nothin'.

This

extern __shared__ a float[];

is just:

float* a = __get_ptr_to_shared_memory();

@hanna-kruppe
Copy link
Contributor

I want to point out that we're pretty badly derailing this metabug. You should probably open a thread on internals.rlo if you want to take this discussion much further.

But first I would invite you to double check your facts. Many things you've said go against everything I've ever heard and seen about CUDA (and other offloading solutions, for that matter).

Sure, but what's the point of making it a storage specifier in Rust?

I'm not saying it should be a storage specifier in Rust. We don't even have such a concept at the moment. I'm just saying, it's fundamentally a variation on variable declaration (like DST locals), not something about pointers.

You can't have two variables on shared memory, that is, the following is not valid CUDA C:

???? Check the code I linked earlier, it has at least two such variables. Or pretty a random non-trivial CUDA program using shared memory. I don't know why the code you give is rejected, but I've never even seen this syntax so I'm not even sure what it means.

For all you know, As[0] will fail because no shared memory actually has been allocated.

Are you saying Nvidia's examples, as well as all the other programs using shared memory, are effectively borked? That can't be right. In fact, I'm pretty sure the size of the is recorded (provided it has a size -- again I don't know wth __shared__ a float[]; is) in the binary and the scheduler makes sure to only place as many thread groups on one core as the available shared memory permits. I know this because it means the amount of shared memory you use impacts occupancy and thus performance.

That reminds me, another way in which such intrinsic for __shared__ would be weird is that it should probably require the size to be a compile time constant (unlike alloca).

@gnzlbg
Copy link
Contributor

gnzlbg commented Feb 21, 2018

but I've never even seen this syntax so I'm not even sure what it means.

That syntax is the dynamic shared memory allocation syntax.

That reminds me, another way in which such intrinsic for shared would be weird is that it should probably require the size to be a compile time constant (unlike alloca).

The size of the shared memory region can be specified at run-time, at least in CUDA.

Reading through the docs of the example you mention, if the size of the variables allocated in shared memory are compile-time constants, and no dynamic shared memory allocation occurs, it looks like one does not need to specify the memory to allocate during kernel launch because the compiler does it for you (but I've always used dynamic shared memory so I am not sure).

@hanna-kruppe
Copy link
Contributor

Good to know that there's a dynamic allocation strategy as well. But it seems that you still specify the size, just at kernel invocation time? That seems like it would still allow the driver to make sure enough memory is available (i.e., as much memory as the kernel invocation specified; of course the kernel still needs to obtain and use that number correctly). For dynamic an intrinsic might be good, but since static shared memory allocation seems extremely common, we'd probably want to support it as well and an intrinsic can't really do that (well).

@gnzlbg
Copy link
Contributor

gnzlbg commented Feb 21, 2018

But it seems that you still specify the size, just at kernel invocation time?

Yes.

That seems like it would still allow the driver to make sure enough memory is available (i.e., as much memory as the kernel invocation specified; of course the kernel still needs to obtain and use that number correctly).

Exactly. The typical way in which this is used is by passing something that correlates with the allocated size as a run-time argument to the kernel.

For dynamic an intrinsic might be good, but since static shared memory allocation seems extremely common, we'd probably want to support it as well and an intrinsic can't really do that (well).

Yes definitely. Since dynamic shared memory is more powerful (it allows doing everything that can be done with static shared memory and some more), has no drawbacks over static shared memory beyond ergonomics (shared memory is always allocated at run-time, whether the size is known are compile-time or not is pretty much irrelevant), and can probably just be an nvptx intrinsic that returns a pointer to the shared memory region, I think that implementing support for it would be the tiniest incremental step that delivers the most value.

Adding support for __shared__ static memory would be a nice ergonomic addition. If you have device-only functions that need to be called from kernels, using static shared memory in them allows you to bump the size of the allocated memory region transparently (using dynamic shared memory these functions would need to take as argument a pointer into a suitable part of the shared memory region).

AFAIK only fixed-size arrays are allowed in static shared memory and the memory must be uninitialized. So while something like let floats: #[shared] [f32; N] = mem::unitialized(); might work, I think here it would be better to just provide an nvptx::shared_array<[T; N]>type, implemented using compiler magic to put it always on static shared memory, that provides a minimal API that makes sense for shared memory array since things like bounds checking by default (as provided by arrays) make little sense in device kernels where you might not have a way to panic, abort, print anything, etc.

@lilith
Copy link

lilith commented Jul 28, 2018

Note that I'm interested in funding work on this: https://internals.rust-lang.org/t/nvptx-funding/7441

I'd like to get this to work out of the box on nightly.

@Jasper-Bekkers
Copy link

Hi,

I've started using the NVPTX backend for some simple experiments, I'm listing my experiences here so far since I don't know what the proper protocol is. We can turn these into specific issues on the right repo's later on.

  • Name mangling is sometimes incorrect because the LLVM backend outputs names with . in some cases and PTX only accepts [a-zA-Z0-9_$]+
pub struct MyStruct {
    data: u32,
}

impl PartialEq for MyStruct {
    fn eq(&self, other: &Self) -> bool {
        return self.data == other.data;
    }
}

Leads to invalid PTX since symbols are being generated with dots in them:


.visible .func  (.param .b32 func_retval0) _ZN59_$LT$nvkernel..MyStruct$u20$as$u20$core..cmp..PartialEq$GT$2eq17hebe67210677cfb7eE(
	.param .b64 _ZN59_$LT$nvkernel..MyStruct$u20$as$u20$core..cmp..PartialEq$GT$2eq17hebe67210677cfb7eE_param_0,
	.param .b64 _ZN59_$LT$nvkernel..MyStruct$u20$as$u20$core..cmp..PartialEq$GT$2eq17hebe67210677cfb7eE_param_1
)
{
	.reg .pred 	%p<2>;
	.reg .b32 	%r<4>;
	.reg .b64 	%rd<3>;

	ld.param.u64 	%rd1, [_ZN59_$LT$nvkernel..MyStruct$u20$as$u20$core..cmp..PartialEq$GT$2eq17hebe67210677cfb7eE_param_0];
	ld.u32 	%r1, [%rd1];
	ld.param.u64 	%rd2, [_ZN59_$LT$nvkernel..MyStruct$u20$as$u20$core..cmp..PartialEq$GT$2eq17hebe67210677cfb7eE_param_1];
	ld.u32 	%r2, [%rd2];
	setp.eq.s32 	%p1, %r1, %r2;
	selp.u32 	%r3, 1, 0, %p1;
	st.param.b32 	[func_retval0+0], %r3;
	ret;
}

I haven't looked much into it but I have a feeling that it's due to https://github.com/rust-lang/rust/blob/master/src/librustc_codegen_utils/symbol_names.rs#L424

  • So far I've only been able to get this to run and compile with the nightly-2018-02-12 compiler (due to this PR Corrected auto trait syntax as previous became obsolete japaric-archived/core64#4 ) anything newer seems to spew quite a few compile errors after trying to manually create my own core64 lib.
  • It would be nice to see if we can add a "kernel" attribute to expose some (all) of these https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#performance-tuning-directives to tweak occupancy & register counts similar to CUDA
  • Find a way to expose vector loads & stores (ld.v4.f32 etc) since they're significantly faster. One way would be through the platform intrinsics
  • Expose __shared__ memory (as per discussion previous) though preferably as static kernel side memory - though ideally both options are available since this too factors into the kernel occupancy
  • Expose __constant__ memory similar to __shared__.
  • Expose cross-lane swizzle ops (shfl.bfly, shfl.up etc) as platform intrinsics
  • Right now the target json determines the sm target machine and fixes it at sm_20 which is extremely dated (Fermi gpus). It would be nice if this was more flexible - though for now patching the json works too.
  • I have some confusion whether repr(C) would be good enough of a layout to share data between GPU and CPU or if we'd eventually need another one.
  • It's not entirely clear to me if __syncthreads is also a compiler barrier or not but it ought to be since the compiler shouldn't schedule memory requests across it.

On the bright side: this has been a really pleasant GPU programming experience so far (other then actually getting it set up) because it's extremely valuable to share the same codebase between CPU and GPU.

@bheisler
Copy link

bheisler commented Aug 4, 2018

I have not been able to compile to PTX with cargo or xargo, I've only been able to do so using accel. Therefore, some of the following may be issues with Accel. That seems unlikely, so I'll report them here.

  • I see an internal compiler error if the PTX code references the core::intrinsics::sinf32 or cosf32 functions
  • INVALID_PTX error if I index into a slice (I saw the same error if I used get_unchecked) in the PTX code
  • INVALID_PTX error if I use the for x in 0..y loop in the PTX code
  • INVALID_PTX error if any struct in the PTX code implements core::ops traits like Add or Mul
  • INVALID_PTX error if any struct in the PTX code derives Debug or Display

Most of these are probably due to references to missing functions in the final PTX, but they'll need to be dealt with somehow.

I am interested in contributing to improve the state of GPGPU in Rust. Not sure where to start.

@termoshtt
Copy link

Name mangling is sometimes incorrect
Leads to invalid PTX since symbols are being generated with dots in them:

I've met this issue while developing accel , and it prevent me to use libcore for nvptx target. accel cannot link libcore or other std libraries currently.

I recently start to write a patch to rustc to enable nvptx target.

I haven't looked much into it but I have a feeling that it's due to https://github.com/rust-lang/rust/blob/master/src/librustc_codegen_utils/symbol_names.rs#L424

this seems to be a good information for me :)

@bheisler
Copy link

#53099 is relevant here. I haven't been able to compile any kernel recently, because of a segfault while compiling libcore. This happens on all machines where I've tried it (Windows and Linux). However, other people say that they've been able to compile NVPTX kernels without running into this issue. Has anyone else seen this?

@denzp
Copy link
Contributor

denzp commented Aug 19, 2018

@bheisler I believe it somehow related to definition json. It doesn't happend to me with json from ptx-linker, but I saw the problem with another one.

@denzp
Copy link
Contributor

denzp commented Aug 19, 2018

I'm finally proud to announce my progress on CUDA integration. I've made several tools to ease development and currently working on a tutorial and high-level crate (it will probably be a custom rustc driver because compiler plugins are de-facto deprecated).

First one is a ptx-linker that solves several important problems:

I started work on the linker about a year ago, and today achieved important milestone: it doesn't depend on any external tools and libs anymore. So end users don't need to care about matching Rust's and system's LLVM versions (which became a problem when Rust switched to LLVM 7.0).

The second crate is a ptx-builder that improves development convenience dramatically. It's a build.rs helper that ensures all needed tools (xargo and ptx-linker atm) are present. It also manages a build environment and xargo runs.

Also worth checking, an incomplete tutorial about CUDA development flow with more or less real example.
The tutorial evolved as mentioned before tools did. And more chapters are yet to come :)
Sometimes I run tests from there, to ensure PTX assembly is correct and the whole thing still works :)

@termoshtt
Copy link

@denzp Can ptx-linker link with libcore? I am creating a toolchain to link libcore using llvm-link in rust-accel/nvptx. Linking of libcore will cause the symbol name issue as reported by @Jasper-Bekkers due to the difference between GAS and PTX, and I avoid it by rewriting librustc_codegen_utils/symbol_names.rs.

@denzp
Copy link
Contributor

denzp commented Aug 20, 2018

@termoshtt The linker suppose to fix this, it has a special "pass" that does renaming. The problem can happen not only with libcore though: consts or structs can also produce invalid PTX sometimes:

src_image.pixel(i, j)
call.uni (retval0), 
_ZN32_$LT$example..Image$LT$T$GT$$GT$5pixel17h81db5ad692bcf640E, 
(param0, param1, param2);

I found the linker robust enough about solving the issue. But still, I'd probably prefer this to be fixed in rustc.

@bheisler
Copy link

@denzp - I don't think it's caused by the target JSON, unfortunately. When I add the obj-is-bitcode flag to my target file it compiles without segfaulting, but that means it no longer writes PTX files but instead writes LLVM bitcode.

@denzp
Copy link
Contributor

denzp commented Aug 21, 2018

@bheisler I can confirm that obj-is-bitcode indeed helps to avoid segfault (that's the reason I've never seen the problem before, I always use the flag). But it doesn't affect assembly file creation for me. Let's move further discussion into #53099

@juchiast
Copy link
Contributor

#38824 is closed about a month ago. I think we can remove it from the list?

bors added a commit that referenced this issue Feb 1, 2019
NVPTX target specification

This change adds a built-in `nvptx64-nvidia-cuda` GPGPU no-std target specification and a basic PTX assembly smoke tests.

The approach is taken here and the target spec is based on `ptx-linker`, a project started about 1.5 years ago. Key feature: bitcode object files being linked with LTO into the final module on the linker's side.

Prior to this change, the linker used a `ld` linker-flavor, but I think, having the special CLI convention is a more reliable way.

Questions about further progress on reliable CUDA workflow with Rust:
1. Is it possible to create a test suite `codegen-asm` to verify end-to-end integration with LLVM backend?
1. How would it be better to organise no-std `compile-fail` tests: add `#![no_std]` where possible and mark others as `ignore-nvptx` directive, or alternatively, introduce `compile-fail-no-std` test suite?
1. Can we have the `ptx-linker` eventually be integrated as `rls` or `clippy`? Hopefully, this should allow to statically link against LLVM used in Rust and get rid of the [current hacky solution](https://github.com/denzp/rustc-llvm-proxy).
1. Am I missing some methods from `rustc_codegen_ssa::back::linker::Linker` that can be useful for bitcode-only linking?

Currently, there are no major public CUDA projects written in Rust I'm aware of, but I'm expecting to have a built-in target will create a solid foundation for further experiments and awesome crates.

Related to #38789
Fixes #38787
Fixes #38786
@steveklabnik
Copy link
Member

Triage: this is a metabug. Not aware of anything particular going on with this target lately.

@pnkfelix
Copy link
Member

Visited for T-compiler backlog bonanza. It seems like there are some unresolved questions about scope and design, with respect to the concerns that were raised on this thread between @hanna-kruppe and @gnzlbg . (Basically, its not clear to me whether the work remaining here is "just" more implementation and fixing bugs, or if there's some design stuff that needs to be revisited.)

@rustbot label: +S-tracking-needs-summary

@rustbot rustbot added the S-tracking-needs-summary Status: It's hard to tell what's been done and what hasn't! Someone should do some investigation. label Mar 18, 2022
@OwenTrokeBillard
Copy link

@kjetilkjeka Thank you for all your excellent recent work on Rust PTX support.

What, in your opinion, is required to bring Rust CUDA support to parity with C++? Is it even possible?

I'm curious to hear from others too.

@kjetilkjeka
Copy link
Contributor

I think the most prominent issue with using CUDA from Rust is referenced by this issue. It's the lacking support for shared memory.

In the bright future, I think that Rust with CUDA could in some ways even surpass using CUDA from C++. To avoid adding a lot of discussion around what CUDA using Rust should and should not be in this issue, I think we should continue discussing it in the newly created zulip thread for gpgpu.

https://rust-lang.zulipchat.com/#narrow/stream/422870-t-compiler.2Fgpgpu-backend

If you start a thread there I'm happy to discuss it further 😄

@fmease fmease added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label May 12, 2024
@workingjubilee
Copy link
Member

workingjubilee commented Jan 14, 2025

To respond somewhat to the way this initially started: while I'm not sure the ABI string is the most ideal mechanism, I definitely don't believe that target_feature is an appropriate mechanism to distinguish between "device" and "host" functions.

The biggest giveaway is that "device function" and "host function" are mutually exclusive. In general, target_feature should be additive.

@jedbrown
Copy link
Contributor

The way I've managed this is to put device functions and entry points ("ptx-kernel", which is like __global__ in CUDA C++) into no_std crates that are built as artifact dependencies by a build.rs (with --target=nvptx64-nvidia-cuda --crate-type=cdylib -Clinker-flavor=llbc, which produces PTX in the target directory). Sometimes one wants functions available on both host and device, in which case I put them in a no_std crate that is included by the host code as a regular dependency (in addition to the artifact dependency for the device). This does require organizing that host-and-device code into crates, but I think this is usually good organization anyway.

This works relatively well, though RFC 3176: cargo-multi-dep-artifacts would be a significantly better user experience with respect to error messages (we wouldn't need the build.rs). I see single-source multi-target as a convenience for projects with simple kernels, but not on the critical path for usability of the NVPTX backend.

@workingjubilee
Copy link
Member

workingjubilee commented Jan 14, 2025

Agreed that single-source conveniences are cool and not critical. I mostly was remarking to make it clear that we shouldn't be using target_feature for this.

I believe that all remaining details should be split into new sub-issues from here. If we are happy to bury the single-source concern for now, then it seems the biggest remaining subissues are

@jedbrown
Copy link
Contributor

Shared memory can be exposed using inline assembly (reference). Would something like this be welcome in core::arch::nvptx? Should I make a new tracking issue to discuss it?

core::arch::global_asm!(".extern .shared .align 16 .b8 _shared_data[];");

#[inline(always)]
pub fn _dynamic_smem() -> (*mut u8, u32) {
    // Dynamic shared memory.
    let size: u32;
    let saddr: u64;
    let ptr: *mut u8;
    unsafe {
        asm!("mov.u32 {}, %dynamic_smem_size;", out(reg32) size);
        asm!("mov.u64 {}, _shared_data;", out(reg64) saddr);
        asm!("cvta.shared.u64 {ptr}, {saddr};", ptr = out(reg64) ptr, saddr = in(reg64) saddr);
    }
    (ptr, size)
}

It has also been suggested that this be implemented as an LLVM intrinsic, something like llvm.nvvm.read.ptx.sreg.dynamic_smem_size. That could in principle allow accessing the shared memory base pointer without introducing a new symbol (_shared_data above).

@workingjubilee
Copy link
Member

Yes, a new issue to discuss it would be nice.

@jedbrown
Copy link
Contributor

#135516

@workingjubilee
Copy link
Member

Then, closing this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC metabug Issues about issues themselves ("bugs about bugs") O-NVPTX Target: the NVPTX LLVM backend for running rust on GPUs, https://llvm.org/docs/NVPTXUsage.html S-tracking-needs-summary Status: It's hard to tell what's been done and what hasn't! Someone should do some investigation. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests