How could I use cuda block shared memory in the self-defined operator? #59
Comments
I'm pretty sure you need to specify a size for the shared array. You should probably define these numbers at the beginning of your file so that the size is set at compile time, but you still have some flexibility. See an example here: https://github.com/ClementPinard/Pytorch-Correlation-extension/blob/master/Correlation_Module/correlation_cuda_kernel.cu#L47
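A statically sized shared array of the kind suggested here can be sketched as follows (the kernel and constant names are illustrative, not taken from the linked file):

```cuda
#include <cuda_runtime.h>

// Size fixed at compile time, defined near the top of the file
// so it is easy to tweak before rebuilding.
constexpr int BLOCK_SIZE = 256;

template <typename scalar_t>
__global__ void copy_via_smem(const scalar_t* in, scalar_t* out, int n) {
    // Static shared array: its size must be a compile-time constant.
    __shared__ scalar_t sdata[BLOCK_SIZE];

    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = (i < n) ? in[i] : scalar_t(0);
    __syncthreads();

    if (i < n)
        out[i] = sdata[threadIdx.x];
}
```

With this approach no third launch-configuration argument is needed, since the size is baked into the kernel at compile time.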
@ClementPinard Thanks for replying!! As far as I know, CUDA seems to support dynamically allocated shared memory within a block, which is defined like this:
Ah, actually you are right, you can use dynamic shared arrays.
But I did call the kernel function with the shared memory size assigned. I called it like this: `TestForward<scalar_t><<<grid, block, 4096, at::cuda::getCurrentCUDAStream()>>>();` I assigned 4 KB of shared memory to each block this way. Would you please tell me why this does not work?
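For reference, the dynamic variant pairs an `extern __shared__` declaration inside the kernel with the shared-memory size passed as the third launch parameter, as in the call above. This standalone kernel is only a sketch, not the poster's actual code:

```cuda
#include <cuda_runtime.h>

__global__ void dynamic_smem_copy(const float* in, float* out, int n) {
    // Dynamic shared array: no size in the declaration; the number of
    // bytes comes from the third <<<grid, block, bytes>>> argument.
    extern __shared__ float sdata[];

    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = (i < n) ? in[i] : 0.f;
    __syncthreads();

    if (i < n)
        out[i] = sdata[threadIdx.x];
}

// Example launch: 4096 bytes of shared memory per block.
// dynamic_smem_copy<<<grid, block, 4096>>>(in, out, n);
```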
OK, sorry about misleading you, your code is mostly fine. I tried it, and the problem seems to come from the template and the fact that you use three different specializations of it (float, double and half): CUDA doesn't allow two differently typed dynamic shared arrays with the same name ¯\_(ツ)_/¯ See here for more info: https://stackoverflow.com/questions/27570552/templated-cuda-kernel-with-dynamic-shared-memory In the end you need to replace the line

```cuda
extern __shared__ scalar_t sdata[];
```

with the two lines

```cuda
extern __shared__ __align__(sizeof(scalar_t)) unsigned char sdata_uchar[];
scalar_t *sdata = reinterpret_cast<scalar_t *>(sdata_uchar);
```
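Putting that fix into a complete templated kernel gives something like the following sketch (the kernel body is illustrative; only the two declaration lines come from the comment above):

```cuda
#include <cuda_runtime.h>

template <typename scalar_t>
__global__ void TestForward(const scalar_t* in, scalar_t* out, int n) {
    // `extern __shared__ scalar_t sdata[];` would give the same symbol
    // three conflicting types once the template is instantiated for
    // float, double and half. Declaring the buffer once as raw bytes
    // and casting per specialization sidesteps the conflict.
    extern __shared__ __align__(sizeof(scalar_t)) unsigned char sdata_uchar[];
    scalar_t* sdata = reinterpret_cast<scalar_t*>(sdata_uchar);

    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        sdata[threadIdx.x] = in[i];
        out[i] = sdata[threadIdx.x];
    }
}
```

The kernel is still launched with the byte count as before, e.g. `TestForward<scalar_t><<<grid, block, 4096, at::cuda::getCurrentCUDAStream()>>>(...)`.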
Thanks!!! It works now, but I have two more warnings:
Will my code be fine with these two warnings?
I am working on Ubuntu 16.04 with PyTorch 1.3 installed from conda.
My CUDA version is 10.1.243 and my cuDNN version is 7.
I have 8 T4 GPUs on my server, and the gcc version is the default 5.4.
When submitting a bug report, please include the following information (where relevant):
The simplified version of my code, `main.cu`, is like this:

and the `setup.py` is like this:

I compiled it with the command `python setup.py install`. And the error message is like this:

What is the cause of this, and how can I cope with this problem, please?
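For reference, a minimal `setup.py` for building a CUDA extension with PyTorch usually follows this pattern; the module and source names below are placeholders, not the poster's actual files:

```python
# Hypothetical build script for a PyTorch CUDA extension; the module
# name and source list are assumptions, not the poster's real code.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="test_op",
    ext_modules=[
        CUDAExtension(
            name="test_op",
            sources=["main.cu"],
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
)
```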