A custom CUDA kernel that accelerates masked 3D convolutions, intended to be used as a raw kernel in CuPy. I currently use this project to learn CUDA, so the code is not production-ready and might not even be correct yet. Please use it at your own risk.
Masked convolutions are convolutions where the input array contains "invalid" elements which should not participate in the end result. For example, consider an array with 3 elements and a normalized kernel with 3 elements. Let's assume that the left element in the input array is invalid, i.e. the validity mask is [0,1,1].
array = [1,2,1]
kernel = [0.25,0.5,0.25]
mask = [0,1,1]
array * kernel =
[
First element: invalid (since mask is false)
Second element: 0.25 * invalid + 0.5 * 2 + 0.25 * 1 = 1.25
Third element: 0.25 * 2 + 0.5 * 1 + 0 (zero padding) * 0.25 = 1.0
]
=> [invalid, 1.25, 1.0]
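The example above can be reproduced with a small pure-Python reference. This is only a 1D sketch for checking values by hand, not the CUDA kernel from this repository. The function name masked_conv1d is made up for illustration, invalid outputs are represented as None, and an odd kernel length is assumed. The kernel is applied without flipping, which makes no difference for the symmetric kernel used here.

```python
def masked_conv1d(array, kernel, mask):
    """Reference 1D masked convolution with zero padding (odd kernel length assumed)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(array)):
        if not mask[i]:
            # An invalid input element yields an invalid output element.
            out.append(None)
            continue
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half
            # Skip positions in the zero padding and positions marked invalid.
            if 0 <= j < len(array) and mask[j]:
                acc += weight * array[j]
        out.append(acc)
    return out


result = masked_conv1d([1, 2, 1], [0.25, 0.5, 0.25], [0, 1, 1])
print(result)  # [None, 1.25, 1.0]
```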
Additionally, in some use cases (or maybe just in mine), the convolution with a normalized kernel should always produce a normalized output, i.e. the kernel weights that participate in the computation of an element should sum to 1, even if some input elements are invalid. In that case, the masked convolution gets more complex:
First element: invalid (since mask is false)
Second element: 0.25 * invalid + 0.5 * 2 + 0.25 * 1
Only two kernel elements are "active", i.e. placed on valid array elements.
These kernel elements are divided by their sum (0.75) so that they again sum to 1: (0.5/0.75, 0.25/0.75) = (0.67, 0.33)
Thus the calculation becomes: 0.67 * 2 + 0.33 * 1 = 1.67
Third element: 0.25 * 2 + 0.5 * 1 + 0.25 * 0 (zero padding)
Zero padding is treated like an invalid element, so the same rescaling applies: (0.25/0.75, 0.5/0.75) = (0.33, 0.67)
Thus the calculation becomes: 0.33 * 2 + 0.67 * 1 = 1.33
=> [invalid, 1.67, 1.33]
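The renormalization can be sketched in pure Python by accumulating the sum of the active kernel weights next to the weighted sum and dividing at the end. Again, this is a 1D illustration under assumptions, not the CUDA implementation: the name normalized_masked_conv1d is hypothetical, invalid outputs are None, and an odd kernel length is assumed. For the example above it gives the exact values 5/3 and 4/3, i.e. about 1.67 and 1.33.

```python
def normalized_masked_conv1d(array, kernel, mask):
    """Reference 1D masked convolution that rescales the active kernel weights to sum to 1."""
    half = len(kernel) // 2
    out = []
    for i in range(len(array)):
        if not mask[i]:
            out.append(None)  # invalid input -> invalid output
            continue
        acc = 0.0
        weight_sum = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half
            # Only weights placed on valid, in-bounds elements are "active".
            if 0 <= j < len(array) and mask[j]:
                acc += weight * array[j]
                weight_sum += weight
        # Dividing by the active weight sum is the same as rescaling each weight.
        out.append(acc / weight_sum if weight_sum != 0 else None)
    return out


out = normalized_masked_conv1d([1, 2, 1], [0.25, 0.5, 0.25], [0, 1, 1])
print([None if v is None else round(v, 2) for v in out])  # [None, 1.67, 1.33]
```

Dividing once per output element avoids building a rescaled copy of the kernel for every position.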
You can read more about this use case in the following publication
The source code can be passed to CuPy as a raw kernel (cupy.RawKernel). Alternatively, the kernels can be compiled ahead of time and then loaded as a raw module (cupy.RawModule).
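Both loading paths might look roughly like the sketch below. To stay short it uses a trivial stand-in kernel named scale, which is an assumption for illustration and not the masked-convolution kernel from this repository. The helper names are hypothetical as well. CuPy is imported inside the functions because running them requires a CUDA-capable GPU.

```python
# Stand-in CUDA source: a trivial element-wise kernel, NOT the masked convolution.
KERNEL_SOURCE = r"""
extern "C" __global__
void scale(const float* x, float* y, float factor, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        y[i] = factor * x[i];
    }
}
"""
KERNEL_NAME = "scale"


def _launch(kernel, values, factor):
    """Launch the kernel on a non-empty 1D float32 array and copy the result back."""
    import cupy as cp  # requires a CUDA-capable GPU

    x = cp.asarray(values, dtype=cp.float32)
    y = cp.empty_like(x)
    threads = 128
    blocks = (x.size + threads - 1) // threads
    # Scalars must match the C types of the kernel signature (float, int).
    kernel((blocks,), (threads,), (x, y, cp.float32(factor), cp.int32(x.size)))
    return cp.asnumpy(y)


def run_with_raw_kernel(values, factor):
    """Compile the source at first use via cupy.RawKernel."""
    import cupy as cp

    kernel = cp.RawKernel(KERNEL_SOURCE, KERNEL_NAME)
    return _launch(kernel, values, factor)


def run_with_raw_module(values, factor):
    """Load the same source via cupy.RawModule and look the kernel up by name."""
    import cupy as cp

    # A precompiled cubin or PTX file can be loaded with RawModule(path=...) instead.
    module = cp.RawModule(code=KERNEL_SOURCE)
    kernel = module.get_function(KERNEL_NAME)
    return _launch(kernel, values, factor)
```

On a machine with a GPU, run_with_raw_kernel([1.0, 2.0, 3.0], 2.0) and run_with_raw_module([1.0, 2.0, 3.0], 2.0) should return the same array.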
- It might be worth trying General Matrix Multiplication (GEMM) with masking for even more acceleration. However, I did not find any masked GEMM implementation in cuBLAS.
- The Thrust library is great for using std-like features (vector, unique_ptr etc.). However, it complicates the usage with CuPy, hence I did not use it here.
- I have not tested even kernel lengths yet. They might produce wrong results.
- Constant memory is not used here, since the kernel size is not known at compile time. If the kernel size is always below a certain length, a constant-memory buffer of that maximum size could be allocated and used for the kernel.