RuntimeError: Error building extension 'fused_adam' #629

Closed
jeremytanjianle opened this issue Jan 4, 2021 · 2 comments

@jeremytanjianle

Hi, I'm trying to run the basic cifar_deepspeed.py example and have encountered the error below.
RuntimeError: Error building extension 'fused_adam'

I've narrowed the error down to simply initializing FusedAdam, but the traceback is hard to interpret.

To Replicate
Running PyTorch 1.7.0, CUDA 10.1, Python 3.6.9:

import torch.nn as nn
import torch.nn.functional as F
from deepspeed.ops.adam.fused_adam import FusedAdam

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()
parameters = filter(lambda p: p.requires_grad, net.parameters())
FusedAdam(parameters)
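
Side note for anyone debugging this: the op builder's load() accepts a verbose flag (visible in the frame headers of the traceback below), so the full compiler output can be surfaced directly. A minimal sketch, assuming the FusedAdamBuilder export from this DeepSpeed version:

from deepspeed.ops.op_builder import FusedAdamBuilder

# JIT-build the fused_adam extension on its own, printing the full
# ninja/nvcc output instead of the truncated RuntimeError message.
fused_adam_cuda = FusedAdamBuilder().load(verbose=True)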

Full Error Log

Using /home/vinitrinh/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/vinitrinh/.cache/torch_extensions/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)

---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
~/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
   1538                 check=True,
-> 1539                 env=env)
   1540         else:

~/anaconda3/envs/ds3/lib/python3.6/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
    437             raise CalledProcessError(retcode, process.args,
--> 438                                      output=stdout, stderr=stderr)
    439     return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
<ipython-input-15-74928986df69> in <module>
      1 from deepspeed.ops.adam.fused_adam import FusedAdam
      2 parameters = filter(lambda p: p.requires_grad, net.parameters())
----> 3 FusedAdam(parameters)

~/anaconda3/envs/ds3/lib/python3.6/site-packages/deepspeed/ops/adam/fused_adam.py in __init__(self, params, lr, bias_correction, betas, eps, adam_w_mode, weight_decay, amsgrad, set_grad_none)
     70         self.set_grad_none = set_grad_none
     71 
---> 72         fused_adam_cuda = FusedAdamBuilder().load()
     73         # Skip buffer
     74         self._dummy_overflow_buf = torch.cuda.IntTensor([0])

~/anaconda3/envs/ds3/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in load(self, verbose)
    174             return importlib.import_module(self.absolute_name())
    175         else:
--> 176             return self.jit_load(verbose)
    177 
    178     def jit_load(self, verbose=True):

~/anaconda3/envs/ds3/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in jit_load(self, verbose)
    210             extra_cflags=self.cxx_args(),
    211             extra_cuda_cflags=self.nvcc_args(),
--> 212             verbose=verbose)
    213         build_duration = time.time() - start_build
    214         if verbose:

~/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in load(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, keep_intermediates)
    995         with_cuda,
    996         is_python_module,
--> 997         keep_intermediates=keep_intermediates)
    998 
    999 

~/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, keep_intermediates)
   1200                         build_directory=build_directory,
   1201                         verbose=verbose,
-> 1202                         with_cuda=with_cuda)
   1203             finally:
   1204                 baton.release()

~/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _write_ninja_file_and_build_library(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda)
   1298         build_directory,
   1299         verbose,
-> 1300         error_prefix="Error building extension '{}'".format(name))
   1301 
   1302 

~/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
   1553         if hasattr(error, 'output') and error.output:  # type: ignore
   1554             message += ": {}".format(error.output.decode())  # type: ignore
-> 1555         raise RuntimeError(message) from e
   1556 
   1557 

RuntimeError: Error building extension 'fused_adam'

@jeremytanjianle (Author)

Looking deeper into it, this seems to be a ninja issue, so I'll be following #298.

I tried running the ninja build directly and still hit an error:

>>> import subprocess
>>> subprocess.run(['ninja','-v'], cwd="/home/vinitrinh/.cache/torch_extensions/fused_adam")
[1/2] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -isystem /home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/include -isystem /home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/include/TH -isystem /home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/vinitrinh/anaconda3/envs/ds3/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_61,code=compute_61 -std=c++14 -c /home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o 
FAILED: multi_tensor_adam.cuda.o 
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -isystem /home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/include -isystem /home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/include/TH -isystem /home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/vinitrinh/anaconda3/envs/ds3/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_61,code=compute_61 -std=c++14 -c /home/vinitrinh/anaconda3/envs/ds3/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o 
/usr/include/c++/7/bits/basic_string.tcc: In instantiation of ‘static std::basic_string<_CharT, _Traits, _Alloc>::_Rep* std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_S_create(std::basic_string<_CharT, _Traits, _Alloc>::size_type, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’:
/usr/include/c++/7/bits/basic_string.tcc:578:28:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.h:5042:20:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, std::__false_type) [with _InIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.h:5063:24:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.tcc:656:134:   required from ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/7/bits/basic_string.h:6688:95:   required from here
/usr/include/c++/7/bits/basic_string.tcc:1067:16: error: cannot call member function ‘void std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_M_set_sharable() [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’ without object
       __p->_M_set_sharable();
       ~~~~~~~~~^~
/usr/include/c++/7/bits/basic_string.tcc: In instantiation of ‘static std::basic_string<_CharT, _Traits, _Alloc>::_Rep* std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_S_create(std::basic_string<_CharT, _Traits, _Alloc>::size_type, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’:
/usr/include/c++/7/bits/basic_string.tcc:578:28:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.h:5042:20:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, std::__false_type) [with _InIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.h:5063:24:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.tcc:656:134:   required from ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/7/bits/basic_string.h:6693:95:   required from here
/usr/include/c++/7/bits/basic_string.tcc:1067:16: error: cannot call member function ‘void std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_M_set_sharable() [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’ without object
ninja: build stopped: subcommand failed.
CompletedProcess(args=['ninja', '-v'], returncode=1)
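
The template errors above originate in the GCC 7 libstdc++ headers, which points to an nvcc/host-compiler incompatibility rather than a ninja problem. A quick check, assuming the standard toolkit path (stdout=PIPE is used instead of capture_output for Python 3.6 compatibility):

import subprocess

# Print the nvcc release and the host gcc version; the failing template
# instantiations come from the GCC 7 libstdc++ headers, so a mismatch
# between nvcc and the host compiler is the likely culprit.
for cmd in (['/usr/local/cuda/bin/nvcc', '--version'], ['gcc', '--version']):
    print(subprocess.run(cmd, stdout=subprocess.PIPE,
                         universal_newlines=True).stdout)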

@jeremytanjianle (Author)

Possibly a PyTorch problem.

The original poster claimed the solution was to upgrade the CUDA version from 10.1.105 to 10.1.243.
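
After upgrading, one way to sanity-check which toolkit PyTorch will use for JIT extension builds (both attributes exist in PyTorch 1.7):

import torch
from torch.utils.cpp_extension import CUDA_HOME

# The CUDA version PyTorch was built against, and the toolkit root that
# torch.utils.cpp_extension will use when JIT-compiling extensions.
print(torch.version.cuda)   # e.g. '10.1'
print(CUDA_HOME)            # e.g. '/usr/local/cuda'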
