Re-use pre-converted kernel arguments when launching kernels. #2472
Addresses part of #2456, by avoiding a redundant `cudaconvert` call on kernel launches. The idea is that, instead of first calling `cudaconvert` + `typeof` to drive compilation, and then again on the kernel's arguments when doing the actual launch (see below for why this split is necessary), we only call `cudaconvert` at launch time when there's a mismatch between an argument's type and what the kernel was compiled for. This makes it possible for `@cuda` to forward the already-converted arguments to the kernel, avoiding a secondary conversion at launch time.
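
To illustrate the idea, here is a minimal sketch of the launch-time check (with a hypothetical helper name, not the actual code in `@cuda`): an argument is only run through `cudaconvert` when its type no longer matches the compiled signature.

```julia
using CUDA

# Hypothetical helper, for illustration only: forward an argument untouched
# when its type already matches what the kernel was compiled for, and fall
# back to `cudaconvert` only on a mismatch. An already-converted argument
# (e.g. a CuDeviceArray) passes through unchanged; a host-side CuArray
# still gets converted.
maybe_convert(compiled_type::Type, arg) =
    typeof(arg) === compiled_type ? arg : cudaconvert(arg)
```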
Note that the whole distinction between compile-time and launch-time `cudaconvert` exists to make it possible to do the following:
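
For example (a minimal sketch, using a hypothetical `vadd` kernel):

```julia
using CUDA

function vadd(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    i <= length(c) && (@inbounds c[i] = a[i] + b[i])
    return
end

a, b = CUDA.rand(1024), CUDA.rand(1024)
c = similar(a)

# `@cuda launch=false` converts the arguments only to derive the type
# signature to compile for, and returns a callable kernel object ...
kernel = @cuda launch=false vadd(c, a, b)

# ... which is then invoked with the *unconverted* CuArrays, so the kernel
# object has to `cudaconvert` them again at call time.
kernel(c, a, b; threads=256, blocks=cld(length(c), 256))
```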
That is, to make `launch=false` user-friendly and let users pass non-converted arguments into a pre-compiled kernel, the API has to call `cudaconvert` twice: once when compiling as part of `@cuda launch=false`, and once when calling the kernel. The alternative would be to have users explicitly call `cudaconvert(args)` when launching the kernel, but I didn't want to expose the implementation detail that arguments are converted.
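
For reference, that rejected alternative would have looked roughly like this (a sketch, continuing the hypothetical `vadd` example above):

```julia
kernel = @cuda launch=false vadd(c, a, b)

# Callers would need to know that arguments get converted, and do it themselves:
kernel(cudaconvert(c), cudaconvert(a), cudaconvert(b);
       threads=256, blocks=cld(length(c), 256))
```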
cc @vchuravy @pxl-th @tgymnich: this pattern probably applies to other GPU back-ends as well. CUDA.jl's `cudaconvert` is somewhat expensive because of our automatic memory tracking (https://juliagpu.org/post/2024-05-28-cuda_5.4/#tracked_memory_allocations, `CUDA.jl/src/memory.jl` lines 494 to 598 at 76e2972).