it seems like the sync-batchnorm (in syncbn_kernel.cu) can't match the pytorch1.0.0 ? #168

cqq0505 · 2019-01-22T10:25:01Z

When I try to train or test the model, it seems that the code in "syncbn_kernel.cu" is incompatible with PyTorch 1.0.0 (I have installed CUDA 9.2, ninja 1.8, and PyTorch 1.0.0).
The error in train mode looks like this:
RuntimeError: cudaGetLastError() == cudaSuccess ASSERT FAILED at /home/qingqing/anaconda3/envs/pytorch/lib/python3.6/site-packages/encoding/lib/gpu/syncbn_kernel.cu:424, please report a bug to PyTorch. (Expectation_Forward_CUDA at /home/qingqing/anaconda3/envs/pytorch/lib/python3.6/site-packages/encoding/lib/gpu/syncbn_kernel.cu:424)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fa21f2a3cc5 in /home/qingqing/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: Expectation_Forward_CUDA(at::Tensor) + 0x281 (0x7fa2133b2c86 in /home/qingqing/anaconda3/envs/pytorch/lib/python3.6/site-packages/encoding/lib/gpu/enclib_gpu.so)
frame #2: + 0x8a6b5 (0x7fa21338b6b5 in /home/qingqing/anaconda3/envs/pytorch/lib/python3.6/site-packages/encoding/lib/gpu/enclib_gpu.so)
frame #3: + 0x838f6 (0x7fa2133848f6 in /home/qingqing/anaconda3/envs/pytorch/lib/python3.6/site-packages/encoding/lib/gpu/enclib_gpu.so)
frame #4: + 0x7c181 (0x7fa21337d181 in /home/qingqing/anaconda3/envs/pytorch/lib/python3.6/site-packages/encoding/lib/gpu/enclib_gpu.so)
frame #5: + 0x7c2ed (0x7fa21337d2ed in /home/qingqing/anaconda3/envs/pytorch/lib/python3.6/site-packages/encoding/lib/gpu/enclib_gpu.so)
frame #6: + 0x69872 (0x7fa21336a872 in /home/qingqing/anaconda3/envs/pytorch/lib/python3.6/site-packages/encoding/lib/gpu/enclib_gpu.so)

frame #15: THPFunction_apply(_object*, _object*) + 0x5dd (0x7fa24f73c40d in /home/qingqing/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

huanghoujing · 2019-01-23T14:26:02Z

@cqq0505 I found it necessary to (1) install PyTorch 1.0 from source and (2) modify some include directives in PyTorch-Encoding. My installation guide is this and then this.

d-li14 · 2019-01-29T16:14:14Z

Thanks for @huanghoujing's suggestion, but I wonder what the real cause is. It will be hard to find clues from an entirely new installation process. In fact, I have successfully installed torch-encoding on two servers, but hit the aforementioned issue on a machine with a Tesla M40 GPU and almost the same software configuration.

huanghoujing · 2019-01-30T05:16:32Z

@d-li14 Would this post solve your problem? It says that when installing PyTorch from source, we have to build it with support for the various CUDA architectures we want to run on.
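A rough sketch of how one might check this, assuming the post refers to the `TORCH_CUDA_ARCH_LIST` environment variable that PyTorch's source build and `torch.utils.cpp_extension` read. The helper names `arch_list` and `detect_capabilities` are made up for illustration; the fallback value `(5, 2)` is the compute capability of a Tesla M40.

```python
def arch_list(capabilities):
    """Format (major, minor) compute capabilities as a TORCH_CUDA_ARCH_LIST value."""
    unique = sorted(set(capabilities))  # drop duplicates from identical GPUs
    return ";".join(f"{major}.{minor}" for major, minor in unique)


def detect_capabilities():
    """Return the compute capability of every visible GPU, or [] if none can be queried."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_capability(i)
        for i in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    # Hypothetical fallback: assume a Tesla M40 (compute capability 5.2).
    caps = detect_capabilities() or [(5, 2)]
    print('export TORCH_CUDA_ARCH_LIST="' + arch_list(caps) + '"')
```

The printed line could be run in the shell before rebuilding PyTorch and the extension, so that kernels are compiled for the GPU actually present. Whether this alone explains the `cudaGetLastError()` failure above is not confirmed in this thread.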

d-li14 · 2019-01-31T13:17:48Z

@huanghoujing Sorry for the late reply. I just tried installing PyTorch by building it from source and kept all other steps unchanged. The errors are gone now.

zhanghang1989 · 2019-02-18T18:44:43Z

It works with PyTorch 1.0.0, but not 1.0.1

qiulesun · 2019-07-31T08:24:09Z

@zhanghang1989 Does this repo work with PyTorch 1.1 now?

yinjunbo · 2019-10-02T15:01:05Z

@zhanghang1989 @qiulesun Same question. Can this be used in PyTorch 1.1 now?

zhanghang1989 · 2019-10-02T16:13:33Z

I am not maintaining the code any more, because I have moved to MXNet development.
I believe both PyTorch and MXNet have a built-in synchronized batch norm now.
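On the PyTorch side, a minimal sketch of what using the built-in layer might look like, assuming a PyTorch release recent enough to ship `torch.nn.SyncBatchNorm` and its `convert_sync_batchnorm` helper. The toy model here is made up for illustration and is not taken from this repo.

```python
import torch
import torch.nn as nn

# Hypothetical toy model with an ordinary BatchNorm layer.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

# Recursively replace every BatchNorm*d layer with SyncBatchNorm.
# The conversion itself needs neither a GPU nor a process group.
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(sync_model)
```

Training with the converted model is a separate matter: as far as I know, `SyncBatchNorm` only synchronizes statistics when the model runs on GPUs under `DistributedDataParallel` with an initialized process group, so the snippet above covers only the conversion step.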

zhanghang1989 added the compatibility label Feb 18, 2019

zhanghang1989 mentioned this issue Aug 2, 2020

ADD Docker #305

Merged

zhanghang1989 closed this as completed in #305 Aug 2, 2020
