[BUG] SageMaker p3.16xlarge failure on running HuggingFace tutorial: FAILED: multi_tensor_adam.cuda.o
#1435
Comments
Full stack-trace:
Hey @franckjay, could you please tell us more about your environment? The title says you are using SageMaker, but from your reproduction steps it seems you are starting an EC2 instance and running training on it, which is not the same as running training on SageMaker. Can you please share which AMI you used?
@philschmid, I apologize for the confusion. Yes, we are running this in SageMaker.
Could you share your
Ah okay, so you are not using SageMaker Training jobs; you are using a SageMaker notebook instance and then executing the DeepSpeed command in it? Or which service are you using?
Exactly!
When creating your notebook instance, did you choose the AL1- or AL2-based image? It is possible that the notebook instance uses old/out-of-date dependencies, e.g. for gcc.
@philschmid, thanks for helping with this issue. @franckjay, similar to @philschmid's suspicions, I also think you are using old compiler tools. I noticed the following in your log. Can you please try newer versions of the compiler tools?
@philschmid, yes, I am on AL1. Will AL2 solve this issue?
@franckjay, it appears AL2 has gcc 7.3, which should be new enough to compile our kernels.
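For anyone hitting this, a quick way to sanity-check the toolchain before retrying is sketched below (a suggestion, not taken from this thread; it assumes DeepSpeed is already installed so that ds_report is on the PATH):

gcc --version          # DeepSpeed's JIT-built CUDA ops need a reasonably modern g++ with C++14 support
nvcc --version         # CUDA toolkit used to JIT-compile the multi_tensor_adam kernel
ds_report              # DeepSpeed's own summary of which ops can be built on this machine
python -c "import torch; print(torch.version.cuda)"   # should line up with the nvcc toolkit version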
@philschmid, @jeffra, and @tjruwase, thank you for the help! Spinning up an instance with AL2 worked perfectly. You are all wizards of the highest order.
I have the same bug; how can I solve the problem?
Describe the bug
I am trying to reproduce the HuggingFace + DeepSpeed training example from https://huggingface.co/transformers/main_classes/deepspeed.html on a SageMaker p3.16xlarge instance (8 Tesla V100s). However, we cannot seem to fix the
FAILED: fused_adam_frontend.o
and
FAILED: multi_tensor_adam.cuda.o
errors. It may also be related to our gcc version (?):

We have tried to install deepspeed from:
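One way to make this failure easier to diagnose (a suggestion on my part, not something reported in the issue) is to pre-build the failing op at install time via DeepSpeed's documented DS_BUILD_* flags, so the compiler error appears during pip install rather than at JIT time:

DS_BUILD_FUSED_ADAM=1 pip install deepspeed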
Unfortunately, for security reasons we do not have root access on this instance, so we cannot directly upgrade the CUDA/gcc versions, which is what appeared to resolve related issues such as #694 and microsoft/DeepSpeedExamples#85.
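If root access really is unavailable, one possible user-space workaround (an untested assumption, not something from this thread) is to install a newer compiler into a conda environment; the conda-forge compiler packages export CC/CXX on activation, which the PyTorch C++ extension JIT build can pick up:

conda create -n ds-build -c conda-forge python=3.8 gxx_linux-64 -y    # hypothetical environment name
conda activate ds-build
echo $CXX && $CXX --version    # confirm the newer g++ is the one being picked up before re-running deepspeed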
To Reproduce
Steps to reproduce the behavior:
translation.py
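For reference, a minimal sketch of how such a run is typically launched (assuming the HF run_translation.py example script and a local ds_config.json written per the linked documentation; the exact script and arguments used in this issue are not shown):

deepspeed run_translation.py \
    --deepspeed ds_config.json \
    --model_name_or_path t5-small \
    --source_lang en --target_lang ro \
    --dataset_name wmt16 --dataset_config_name ro-en \
    --do_train --output_dir /tmp/ds_output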
Expected behavior
A very fast training time. Using
python -m torch.distributed.launch
without DeepSpeed runs as expected in our environment.

ds_report output
System info (please complete the following information):
CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())"
-> (7, 0)

Launcher context