RuntimeError: Error building extension 'fused_adam' #629

Closed
jeremytanjianle opened this issue Jan 4, 2021 · 2 comments

@jeremytanjianle

Hi, I'm trying to run the basic cifar_deepspeed.py example and have encountered the error below.
RuntimeError: Error building extension 'fused_adam'

I've narrowed the error down to simply initializing FusedAdam, but the traceback is hard to interpret.

To Replicate
Running PyTorch 1.7.0, CUDA 10.1, Python 3.6.9:

import torch.nn as nn
import torch.nn.functional as F
from deepspeed.ops.adam.fused_adam import FusedAdam

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()
parameters = filter(lambda p: p.requires_grad, net.parameters())
FusedAdam(parameters)
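
Side note for anyone debugging this: the op builder's load() accepts a verbose flag (visible in the frame headers of the traceback below), so the full compiler output can be surfaced directly. A minimal sketch, assuming the FusedAdamBuilder export from this DeepSpeed version:

from deepspeed.ops.op_builder import FusedAdamBuilder

# JIT-build the fused_adam extension on its own, printing the full
# ninja/nvcc output instead of the truncated RuntimeError message.
fused_adam_cuda = FusedAdamBuilder().load(verbose=True)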

Full Error Log

Using /home/vinitrinh/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/vinitrinh/.cache/torch_extensions/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)

---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
~/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
   1538                 check=True,
-> 1539                 env=env)
   1540         else:

~/anaconda3/envs/ds3/lib/python3.6/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
    437             raise CalledProcessError(retcode, process.args,
--> 438                                      output=stdout, stderr=stderr)
    439     return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
<ipython-input-15-74928986df69> in <module>
      1 from deepspeed.ops.adam.fused_adam import FusedAdam
      2 parameters = filter(lambda p: p.requires_grad, net.parameters())
----> 3 FusedAdam(parameters)

~/anaconda3/envs/ds3/lib/python3.6/site-packages/deepspeed/ops/adam/fused_adam.py in __init__(self, params, lr, bias_correction, betas, eps, adam_w_mode, weight_decay, amsgrad, set_grad_none)
     70         self.set_grad_none = set_grad_none
     71 
---> 72         fused_adam_cuda = FusedAdamBuilder().load()
     73         # Skip buffer
     74         self._dummy_overflow_buf = torch.cuda.IntTensor([0])

~/anaconda3/envs/ds3/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in load(self, verbose)
    174             return importlib.import_module(self.absolute_name())
    175         else:
--> 176             return self.jit_load(verbose)
    177 
    178     def jit_load(self, verbose=True):

~/anaconda3/envs/ds3/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in jit_load(self, verbose)
    210             extra_cflags=self.cxx_args(),
    211             extra_cuda_cflags=self.nvcc_args(),
--> 212             verbose=verbose)
    213         build_duration = time.time() - start_build
    214         if verbose:

~/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in load(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, keep_intermediates)
    995         with_cuda,
    996         is_python_module,
--> 997         keep_intermediates=keep_intermediates)
    998 
    999 

~/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, keep_intermediates)
   1200                         build_directory=build_directory,
   1201                         verbose=verbose,
-> 1202                         with_cuda=with_cuda)
   1203             finally:
   1204                 baton.release()

~/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _write_ninja_file_and_build_library(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda)
   1298         build_directory,
   1299         verbose,
-> 1300         error_prefix="Error building extension '{}'".format(name))
   1301 
   1302 

~/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
   1553         if hasattr(error, 'output') and error.output:  # type: ignore
   1554             message += ": {}".format(error.output.decode())  # type: ignore
-> 1555         raise RuntimeError(message) from e
   1556 
   1557 

RuntimeError: Error building extension 'fused_adam'

@jeremytanjianle (Author)

Looking deeper into it, this seems to be a ninja issue, so I'll be following #298.

I tried running the ninja build directly and still hit an error:

>>> import subprocess
>>> subprocess.run(['ninja','-v'], cwd="/home/vinitrinh/.cache/torch_extensions/fused_adam")
[1/2] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -isystem /home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/include -isystem /home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/include/TH -isystem /home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/vinitrinh/anaconda3/envs/ds3/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_61,code=compute_61 -std=c++14 -c /home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o 
FAILED: multi_tensor_adam.cuda.o 
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -isystem /home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/include -isystem /home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/include/TH -isystem /home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/vinitrinh/anaconda3/envs/ds3/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_61,code=compute_61 -std=c++14 -c /home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o 
/usr/include/c++/7/bits/basic_string.tcc: In instantiation of ‘static std::basic_string<_CharT, _Traits, _Alloc>::_Rep* std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_S_create(std::basic_string<_CharT, _Traits, _Alloc>::size_type, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’:
/usr/include/c++/7/bits/basic_string.tcc:578:28:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.h:5042:20:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, std::__false_type) [with _InIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.h:5063:24:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.tcc:656:134:   required from ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/7/bits/basic_string.h:6688:95:   required from here
/usr/include/c++/7/bits/basic_string.tcc:1067:16: error: cannot call member function ‘void std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_M_set_sharable() [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’ without object
       __p->_M_set_sharable();
       ~~~~~~~~~^~
/usr/include/c++/7/bits/basic_string.tcc: In instantiation of ‘static std::basic_string<_CharT, _Traits, _Alloc>::_Rep* std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_S_create(std::basic_string<_CharT, _Traits, _Alloc>::size_type, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’:
/usr/include/c++/7/bits/basic_string.tcc:578:28:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.h:5042:20:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, std::__false_type) [with _InIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.h:5063:24:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.tcc:656:134:   required from ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/7/bits/basic_string.h:6693:95:   required from here
/usr/include/c++/7/bits/basic_string.tcc:1067:16: error: cannot call member function ‘void std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_M_set_sharable() [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’ without object
ninja: build stopped: subcommand failed.
CompletedProcess(args=['ninja', '-v'], returncode=1)
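
The template errors above originate in the GCC 7 libstdc++ headers, which points to an nvcc/host-compiler incompatibility rather than a ninja problem. A quick check, assuming the standard toolkit path (stdout=PIPE is used instead of capture_output for Python 3.6 compatibility):

import subprocess

# Print the nvcc release and the host gcc version; the failing template
# instantiations come from the GCC 7 libstdc++ headers, so a mismatch
# between nvcc and the host compiler is the likely culprit.
for cmd in (['/usr/local/cuda/bin/nvcc', '--version'], ['gcc', '--version']):
    print(subprocess.run(cmd, stdout=subprocess.PIPE,
                         universal_newlines=True).stdout)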

@jeremytanjianle (Author)

Possibly a PyTorch problem.

The original poster claimed the solution was to upgrade the CUDA version from 10.1.105 to 10.1.243.
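
After upgrading, one way to sanity-check which toolkit PyTorch will use for JIT extension builds (both attributes exist in PyTorch 1.7):

import torch
from torch.utils.cpp_extension import CUDA_HOME

# The CUDA version PyTorch was built against, and the toolkit root that
# torch.utils.cpp_extension will use when JIT-compiling extensions.
print(torch.version.cuda)   # e.g. '10.1'
print(CUDA_HOME)            # e.g. '/usr/local/cuda'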
