CUDA - masked convolutions

Custom CUDA kernel to accelerate masked 3D convolutions, used as a raw kernel in CuPy. I currently use this project to learn CUDA, so the code is not production-ready and might not even be correct yet. Use at your own risk.

Motivation

Masked convolutions are convolutions where the input array contains "invalid" elements which should not participate in the end result. For example, consider an array with 3 elements and a normalized kernel with 3 elements. Let's assume that the left element in the input array is invalid, i.e. the validity mask is [0,1,1].

array = [1,2,1]
kernel = [0.25,0.5,0.25]
mask = [0,1,1]

Normal masked convolution (zero padding):

array * kernel =
[
First element: invalid (since mask is false)
Second element: 0.25 * invalid + 0.5 * 2 + 0.25 * 1 = 1.25
Third element:  0.25 * 2 + 0.5 * 1 + 0 (zero padding) * 0.25  = 1.0
]
=> [invalid, 1.25, 1.0]
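
The example above can be reproduced with a short NumPy sketch. It is only a CPU reference for the 1D case, not the CUDA kernel from this repository; the function name `masked_conv1d` and the use of NaN to mark invalid outputs are assumptions made for illustration.

```python
import numpy as np

def masked_conv1d(array, kernel, mask):
    """Masked 1D convolution with zero padding (illustrative reference)."""
    array = np.asarray(array, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    valid = np.asarray(mask, dtype=bool)
    # Invalid inputs contribute nothing, so zero them before convolving.
    # mode="same" pads with zeros and keeps the input length.
    result = np.convolve(np.where(valid, array, 0.0), kernel, mode="same")
    # Outputs at invalid positions stay invalid (marked as NaN here).
    result[~valid] = np.nan
    return result

out = masked_conv1d([1, 2, 1], [0.25, 0.5, 0.25], [0, 1, 1])
print(out)  # first element is NaN, followed by 1.25 and 1.0
```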

Additionally, in some use cases (or maybe just in mine), convolving with a normalized kernel should always yield a normalized output, i.e. the kernel weights that participate in the computation of an element should sum to 1, even if some input elements are invalid. In that case, the masked convolution gets more complex:

Normalized masked convolution (zero padding):

First element: invalid (since mask is false)

Second element: 0.25 * invalid + 0.5 * 2 + 0.25 * 1
    Only two kernel elements are "active", i.e. placed on valid array elements.
    Thus, these kernel elements are divided by their sum (0.75) so that they again sum to 1: 0.5/0.75 ≈ 0.667 and 0.25/0.75 ≈ 0.333.
    The calculation becomes: 0.667 * 2 + 0.333 * 1 ≈ 1.67

Third element:  0.25 * 2 + 0.5 * 1 + 0 (zero padding) * 0.25
    Zero-padded positions are treated like invalid elements, so the same rescaling applies.
    0.333 * 2 + 0.667 * 1 ≈ 1.33

=> [invalid, 1.67, 1.33]
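
One way to express this renormalization is to divide the masked convolution by the convolution of the mask itself, which gives the kernel weight that landed on valid elements. The NumPy sketch below assumes this formulation for the 1D case; it is a CPU reference with hypothetical names, not the repository's CUDA kernel. The exact values for the example are 5/3 and 4/3.

```python
import numpy as np

def normalized_masked_conv1d(array, kernel, mask):
    """Normalized masked 1D convolution with zero padding (illustrative)."""
    array = np.asarray(array, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    valid = np.asarray(mask, dtype=bool)
    # Weighted sum over valid elements only (zero padding at the borders).
    weighted = np.convolve(np.where(valid, array, 0.0), kernel, mode="same")
    # Sum of the kernel weights that were placed on valid elements.
    active = np.convolve(valid.astype(float), kernel, mode="same")
    # Invalid positions (and positions without any active weight) stay NaN.
    result = np.full(array.shape, np.nan)
    ok = valid & (active > 0)
    result[ok] = weighted[ok] / active[ok]
    return result

out = normalized_masked_conv1d([1, 2, 1], [0.25, 0.5, 0.25], [0, 1, 1])
print(out)  # first element is NaN, followed by roughly 1.667 and 1.333
```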

You can read more about that use case in the following publication

Usage

The source code can be used as a raw kernel in CuPy. Alternatively, the kernels can be compiled and then used as raw modules in CuPy.
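
As a rough illustration of the raw-kernel route, the sketch below wraps a small 1D kernel with `cupy.RawKernel`. The kernel source, the name `masked_conv1d`, and the wrapper `masked_conv1d_gpu` are hypothetical and much simpler than the 3D kernels in this repository. The sketch assumes an odd kernel length, float32 data, non-negative kernel weights, and a CUDA-capable GPU with CuPy installed. Invalid outputs are written as 0, so the mask remains the source of truth for validity.

```python
import numpy as np

# Hypothetical 1D kernel for illustration only; not the repository's source.
KERNEL_SOURCE = r"""
extern "C" __global__
void masked_conv1d(const float* arr, const float* mask, const float* kern,
                   float* out, int n, int k)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= n) return;
    out[i] = 0.0f;                          // placeholder for invalid outputs
    if (mask[i] == 0.0f) return;            // invalid input stays invalid
    float acc = 0.0f, wsum = 0.0f;
    for (int j = 0; j < k; ++j) {
        int idx = i + k / 2 - j;            // flipped index: true convolution
        if (idx < 0 || idx >= n) continue;  // zero padding
        if (mask[idx] == 0.0f) continue;    // skip invalid neighbours
        acc += kern[j] * arr[idx];
        wsum += kern[j];                    // weight placed on valid elements
    }
    if (wsum > 0.0f) out[i] = acc / wsum;   // renormalize by active weight
}
"""

def masked_conv1d_gpu(arr, mask, kern, threads=128):
    """Launch the sketch kernel via CuPy and return a NumPy array."""
    import cupy as cp  # lazy import: only needed when a GPU is available
    kernel = cp.RawKernel(KERNEL_SOURCE, "masked_conv1d")
    d_arr = cp.asarray(arr, dtype=cp.float32)
    d_mask = cp.asarray(mask, dtype=cp.float32)
    d_kern = cp.asarray(kern, dtype=cp.float32)
    d_out = cp.zeros_like(d_arr)
    n = d_arr.size
    blocks = (n + threads - 1) // threads   # one thread per output element
    kernel((blocks,), (threads,),
           (d_arr, d_mask, d_kern, d_out, np.int32(n), np.int32(d_kern.size)))
    return cp.asnumpy(d_out)
```

On a machine with a GPU, `masked_conv1d_gpu([1, 2, 1], [0, 1, 1], [0.25, 0.5, 0.25])` should reproduce the normalized result from the Motivation section, with 0 in place of the invalid first element.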

Notes:

It might be worth trying General Matrix Multiplication (GEMM) with masking for even more acceleration. However, I did not find any masked GEMM implementation in cuBLAS.
The Thrust library is great for using std-like features (vector, unique_ptr, etc.). However, it complicates usage with CuPy, hence I did not use it here.
I have not tested even kernel lengths yet. They might give wrong results.
Constant memory is not used here, since the kernel size is not known at compile time. If the kernel size is always below a certain length, we could allocate constant memory for the kernel and use it.
