
[BUG] 1: error: must run as root and 2: raise RuntimeError("Ninja is required to load C++ extensions") #5627

Closed
YangBrooksHan opened this issue Jun 7, 2024 · 1 comment
Labels: bug (Something isn't working), training


@YangBrooksHan

Describe the bug
Encountered the following errors while training a large language model with DeepSpeed on multiple nodes

172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: die: error: must run as root
172.27.221.56: [rank5]: Traceback (most recent call last):
172.27.221.56: [rank5]: File "/data/hy_workspace/mSR_conda/safe-rlhf/test.py", line 88, in <module>
172.27.221.56: [rank5]: main()
172.27.221.56: [rank5]: File "/data/hy_workspace/mSR_conda/safe-rlhf/test.py", line 64, in main
172.27.221.56: [rank5]: optimizer = FusedAdam(
172.27.221.56: [rank5]: ^^^^^^^^^^
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
172.27.221.56: [rank5]: fused_adam_cuda = FusedAdamBuilder().load()
172.27.221.56: [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in load
172.27.221.56: [rank5]: return self.jit_load(verbose)
172.27.221.56: [rank5]: ^^^^^^^^^^^^^^^^^^^^^^
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 524, in jit_load
172.27.221.56: [rank5]: op_module = load(name=self.name,
172.27.221.56: [rank5]: ^^^^^^^^^^^^^^^^^^^^
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1309, in load
172.27.221.56: [rank5]: return _jit_compile(
172.27.221.56: [rank5]: ^^^^^^^^^^^^^
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1719, in _jit_compile
172.27.221.56: [rank5]: _write_ninja_file_and_build_library(
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1802, in _write_ninja_file_and_build_library
172.27.221.56: [rank5]: verify_ninja_availability()
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1853, in verify_ninja_availability
172.27.221.56: [rank5]: raise RuntimeError("Ninja is required to load C++ extensions")
172.27.221.56: [rank5]: RuntimeError: Ninja is required to load C++ extensions
172.27.221.62: Detected CUDA files, patching ldflags
172.27.221.62: Emitting ninja build file /home/hy/.cache/torch_extensions/py311_cu121/fused_adam/build.ninja...
172.27.221.62: /home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
172.27.221.62: If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
172.27.221.62: warnings.warn(
172.27.221.62: Building extension module fused_adam...
172.27.221.62: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
172.27.221.62: ninja: no work to do.
172.27.221.62: Loading extension module fused_adam...
172.27.221.62: Time to load fused_adam op: 0.041127920150756836 seconds
172.27.221.62: Loading extension module fused_adam...Loading extension module fused_adam...Loading extension module fused_adam...
172.27.221.62:
172.27.221.62:
172.27.221.62: Loading extension module fused_adam...
172.27.221.62: Loading extension module fused_adam...
172.27.221.62: Loading extension module fused_adam...
172.27.221.62: Loading extension module fused_adam...
172.27.221.62: Time to load fused_adam op: 0.10191988945007324 secondsTime to load fused_adam op: 0.1019446849822998 secondsTime to load fused_adam op: 0.10191512107849121 seconds
172.27.221.62:
172.27.221.62:
172.27.221.62: Time to load fused_adam op: 0.10192584991455078 seconds
172.27.221.62: Time to load fused_adam op: 0.10190796852111816 seconds
172.27.221.62: Time to load fused_adam op: 0.10191154479980469 seconds
172.27.221.62: Time to load fused_adam op: 0.1019294261932373 seconds
172.27.221.56: Loading extension module fused_adam...Loading extension module fused_adam...Loading extension module fused_adam...Loading extension module fused_adam...
172.27.221.56:
172.27.221.56:
172.27.221.56:
172.27.221.56: Loading extension module fused_adam...
172.27.221.56: Loading extension module fused_adam...
172.27.221.56: Loading extension module fused_adam...
172.27.221.56: Time to load fused_adam op: 0.10198068618774414 secondsTime to load fused_adam op: 0.10190844535827637 secondsTime to load fused_adam op: 0.10196852684020996 secondsTime to load fused_adam op: 0.10200953483581543 secondsTime to load fused_adam op: 0.10203671455383301 secondsTime to load fused_adam op: 0.10205721855163574 secondsTime to load fused_adam op: 0.10202836990356445 seconds
172.27.221.56:
172.27.221.56:
172.27.221.56:
172.27.221.56:
172.27.221.56:
172.27.221.56:
172.27.221.56: [2024-06-07 22:12:57,940] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.2, git-hash=unknown, git-branch=unknown
172.27.221.56: [2024-06-07 22:12:57,940] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
172.27.221.56: node56:56374:57122 [5] NCCL INFO [Service thread] Connection closed by localRank 5
172.27.221.56: node56:56374:57169 [0] NCCL INFO comm 0xa198190 rank 5 nranks 16 cudaDev 5 busId 8f000 - Abort COMPLETE
172.27.221.56: [2024-06-07 22:12:59,286] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56369
172.27.221.56: [2024-06-07 22:12:59,422] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56370
172.27.221.56: [2024-06-07 22:12:59,546] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56371
172.27.221.56: [2024-06-07 22:12:59,674] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56372
172.27.221.56: [2024-06-07 22:12:59,802] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56373
172.27.221.56: [2024-06-07 22:12:59,928] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56374
172.27.221.56: [2024-06-07 22:12:59,929] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56375
172.27.221.56: [2024-06-07 22:13:00,055] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56376
172.27.221.56: [2024-06-07 22:13:00,183] [ERROR] [launch.py:325:sigkill_handler] ['/home/hy/anaconda3/envs/algmnode1/bin/python', '-u', 'test.py', '--local_rank=7'] exits with return code = 1

@YangBrooksHan YangBrooksHan added bug Something isn't working training labels Jun 7, 2024
@GuanhuaWang GuanhuaWang self-assigned this Oct 23, 2024
@GuanhuaWang
Member

Hi @YangBrooksHan,

Thanks for reporting this issue.

  1. DeepSpeed does not require root privileges; I suspect the "must run as root" error comes from some environment-setting mistake on your local cluster.
  2. You need to install ninja on every node so the fused_adam op can be JIT-compiled. For installation, I found some tutorials online that you can take a look at, here and on Stack Overflow (a per-node preflight check is sketched below).
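A minimal sketch of such a preflight check (not part of DeepSpeed or the original comment; the function name is hypothetical). It reuses `torch.utils.cpp_extension.verify_ninja_availability()`, which raises the same `RuntimeError("Ninja is required to load C++ extensions")` seen on rank 5, and also checks that `nvcc` is on the PATH since fused_adam is a CUDA extension:

```python
# Hypothetical preflight script, assuming a standard PyTorch/DeepSpeed install.
# Run it on every node (e.g. with the same launcher used for training)
# before starting the job.
import shutil

from torch.utils.cpp_extension import verify_ninja_availability


def check_jit_build_env() -> None:
    # Raises RuntimeError("Ninja is required to load C++ extensions")
    # when the `ninja` binary is missing -- the failure seen on rank 5.
    verify_ninja_availability()

    # fused_adam is a CUDA extension, so nvcc must also be discoverable.
    if shutil.which("nvcc") is None:
        raise RuntimeError("nvcc not found on PATH; install or load the CUDA toolkit")

    print("ninja and nvcc found; DeepSpeed JIT op builds should work on this node")


if __name__ == "__main__":
    check_jit_build_env()
```

If JIT builds keep failing on some nodes, DeepSpeed also supports pre-building ops at install time (e.g. `DS_BUILD_FUSED_ADAM=1 pip install deepspeed`), which avoids the runtime compilation step entirely.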
