Sync BatchNorm (in syncbn_kernel.cu) seems incompatible with PyTorch 1.0.0 #168

Closed
cqq0505 opened this issue Jan 22, 2019 · 8 comments · Fixed by #305

Comments

cqq0505 commented Jan 22, 2019

When I try to train or test the model, the code in "syncbn_kernel.cu" seems to be incompatible with PyTorch 1.0.0. (I have installed CUDA 9.2, ninja 1.8, and PyTorch 1.0.0.)
The error in training mode looks like this:
RuntimeError: cudaGetLastError() == cudaSuccess ASSERT FAILED at /home/qingqing/anaconda3/envs/pytorch/lib/python3.6/site-packages/encoding/lib/gpu/syncbn_kernel.cu:424, please report a bug to PyTorch. (Expectation_Forward_CUDA at /home/qingqing/anaconda3/envs/pytorch/lib/python3.6/site-packages/encoding/lib/gpu/syncbn_kernel.cu:424)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fa21f2a3cc5 in /home/qingqing/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: Expectation_Forward_CUDA(at::Tensor) + 0x281 (0x7fa2133b2c86 in /home/qingqing/anaconda3/envs/pytorch/lib/python3.6/site-packages/encoding/lib/gpu/enclib_gpu.so)
frame #2: + 0x8a6b5 (0x7fa21338b6b5 in /home/qingqing/anaconda3/envs/pytorch/lib/python3.6/site-packages/encoding/lib/gpu/enclib_gpu.so)
frame #3: + 0x838f6 (0x7fa2133848f6 in /home/qingqing/anaconda3/envs/pytorch/lib/python3.6/site-packages/encoding/lib/gpu/enclib_gpu.so)
frame #4: + 0x7c181 (0x7fa21337d181 in /home/qingqing/anaconda3/envs/pytorch/lib/python3.6/site-packages/encoding/lib/gpu/enclib_gpu.so)
frame #5: + 0x7c2ed (0x7fa21337d2ed in /home/qingqing/anaconda3/envs/pytorch/lib/python3.6/site-packages/encoding/lib/gpu/enclib_gpu.so)
frame #6: + 0x69872 (0x7fa21336a872 in /home/qingqing/anaconda3/envs/pytorch/lib/python3.6/site-packages/encoding/lib/gpu/enclib_gpu.so)

frame #15: THPFunction_apply(_object*, _object*) + 0x5dd (0x7fa24f73c40d in /home/qingqing/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

@huanghoujing

@cqq0505 I found it necessary to (1) install PyTorch 1.0 from source and (2) modify some include directives in PyTorch-Encoding. My installation guide is this and then this.

d-li14 commented Jan 29, 2019

Thanks for the suggestion, @huanghoujing, but I wonder what the real cause is. It is hard to find clues from a completely new installation process. In fact, I have successfully installed torch-encoding on two servers, but I hit the aforementioned issue on a machine with a Tesla M40 GPU and almost the same software configuration.

@huanghoujing

@d-li14 Would this post solve your problem? It says that when installing PyTorch from source, we have to build it with compatibility for various CUDA architectures.
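
For readers who want to check whether a mismatch like this could apply to their machine, here is a minimal sketch. It assumes the build was configured through the `TORCH_CUDA_ARCH_LIST` environment variable with numeric entries only (for example `5.2;6.1;7.0+PTX`), and it applies the general CUDA rule that a binary built for capability X.y runs on X.z with z >= y, while PTX can be JIT-compiled for any equal or newer capability. The helper names `parse_arch_list` and `device_is_covered` are hypothetical and not part of PyTorch or PyTorch-Encoding, and the thread does not confirm that this was the actual cause.

```python
def parse_arch_list(spec):
    """Parse a TORCH_CUDA_ARCH_LIST-style string such as "5.2;6.1;7.0+PTX".

    Assumption: only numeric entries are present (no names like "Maxwell").
    Returns a list of ((major, minor), has_ptx) tuples.
    """
    entries = []
    # Entries may be separated by semicolons or whitespace.
    for token in spec.replace(";", " ").split():
        has_ptx = token.upper().endswith("+PTX")
        if has_ptx:
            token = token[: -len("+PTX")]
        major, minor = token.split(".")
        entries.append(((int(major), int(minor)), has_ptx))
    return entries


def device_is_covered(capability, spec):
    """Return True if a GPU with `capability` can run code built for `spec`."""
    for arch, has_ptx in parse_arch_list(spec):
        # Binary code runs on the same major version with an equal or higher minor.
        binary_ok = arch[0] == capability[0] and arch[1] <= capability[1]
        # PTX can be JIT-compiled for any equal or newer capability.
        ptx_ok = has_ptx and arch <= capability
        if binary_ok or ptx_ok:
            return True
    return False


# The Tesla M40 mentioned above has compute capability 5.2.
TESLA_M40 = (5, 2)
print(device_is_covered(TESLA_M40, "6.0;6.1;7.0"))  # False
print(device_is_covered(TESLA_M40, "5.2;6.1"))      # True
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple that can be passed as `capability`.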

d-li14 commented Jan 31, 2019

@huanghoujing Sorry for the late reply. I just installed PyTorch by building it from source and kept all other steps unchanged. The errors no longer occur.

@zhanghang1989
Owner

It works with PyTorch 1.0.0, but not 1.0.1.
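
A small guard can make this constraint explicit before the extension is compiled. The sketch below is only an illustration: `KNOWN_GOOD`, `parse_version`, and `is_confirmed_version` are hypothetical names, and the only data point it encodes is the one stated in this thread (1.0.0 confirmed, 1.0.1 not). Note that it treats a pre-release string such as `1.0.0a0` as 1.0.0, which may or may not be appropriate.

```python
import re

# The only release confirmed as working in this thread.
KNOWN_GOOD = (1, 0, 0)


def parse_version(version):
    """Turn a string like '1.0.1.post2' or '1.0.0a0+abc123' into (major, minor, patch)."""
    numbers = []
    # Drop any local build suffix after '+', then read up to three components.
    for piece in version.split("+")[0].split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    # Pad short versions such as '1.1' to three components.
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def is_confirmed_version(version):
    """Return True only for the release reported as working in this thread."""
    return parse_version(version) == KNOWN_GOOD


print(is_confirmed_version("1.0.0"))        # True
print(is_confirmed_version("1.0.1.post2"))  # False
```

In practice the string would come from `torch.__version__`.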

qiulesun commented Jul 31, 2019

@zhanghang1989 Does this repo work with PyTorch 1.1 now?

yinjunbo commented Oct 2, 2019

@zhanghang1989 @qiulesun Same question. Can this be used with PyTorch 1.1 now?

@zhanghang1989
Owner

I am no longer maintaining this code because I have moved to MXNet development.
I believe both PyTorch and MXNet have a built-in SyncBatchNorm now.
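
For readers who want to try the built-in PyTorch layer instead of the custom kernel, here is a minimal sketch. It assumes a PyTorch release that ships `torch.nn.SyncBatchNorm` and its `convert_sync_batchnorm` class method. The wrapper `convert_to_builtin_syncbn` and the toy model from `build_demo_model` are hypothetical and only illustrate the conversion step. This is not the repository's own implementation.

```python
def convert_to_builtin_syncbn(model):
    """Replace every BatchNorm*d layer in `model` with torch.nn.SyncBatchNorm."""
    # Imported lazily so this module can be loaded without PyTorch installed.
    import torch.nn as nn

    return nn.SyncBatchNorm.convert_sync_batchnorm(model)


def build_demo_model():
    """Hypothetical toy network used only to demonstrate the conversion."""
    import torch.nn as nn

    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.BatchNorm2d(8),
        nn.ReLU(),
    )


if __name__ == "__main__":
    converted = convert_to_builtin_syncbn(build_demo_model())
    print(converted)  # the BatchNorm2d layer is now a SyncBatchNorm layer
```

The conversion itself runs on CPU, but training with the converted model requires an initialized `torch.distributed` process group and wrapping the model in `DistributedDataParallel`.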
