Hi @YangBrooksHan, thanks for reporting this issue.
GuanhuaWang
Describe the bug
I encountered the following errors while training a large language model with DeepSpeed on multiple nodes:
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: die: error: must run as root[rank5]: Traceback (most recent call last):
172.27.221.56: [rank5]: File "/data/hy_workspace/mSR_conda/safe-rlhf/test.py", line 88, in <module>
172.27.221.56: [rank5]: main()
172.27.221.56: [rank5]: File "/data/hy_workspace/mSR_conda/safe-rlhf/test.py", line 64, in main
172.27.221.56: [rank5]: optimizer = FusedAdam(
172.27.221.56: [rank5]: ^^^^^^^^^^
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
172.27.221.56: [rank5]: fused_adam_cuda = FusedAdamBuilder().load()
172.27.221.56: [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in load
172.27.221.56: [rank5]: return self.jit_load(verbose)
172.27.221.56: [rank5]: ^^^^^^^^^^^^^^^^^^^^^^
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 524, in jit_load
172.27.221.56: [rank5]: op_module = load(name=self.name,
172.27.221.56: [rank5]: ^^^^^^^^^^^^^^^^^^^^
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1309, in load
172.27.221.56: [rank5]: return _jit_compile(
172.27.221.56: [rank5]: ^^^^^^^^^^^^^
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1719, in _jit_compile
172.27.221.56: [rank5]: _write_ninja_file_and_build_library(
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1802, in _write_ninja_file_and_build_library
172.27.221.56: [rank5]: verify_ninja_availability()
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1853, in verify_ninja_availability
172.27.221.56: [rank5]: raise RuntimeError("Ninja is required to load C++ extensions")
172.27.221.56: [rank5]: RuntimeError: Ninja is required to load C++ extensions
172.27.221.62: Detected CUDA files, patching ldflags
172.27.221.62: Emitting ninja build file /home/hy/.cache/torch_extensions/py311_cu121/fused_adam/build.ninja...
172.27.221.62: /home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
172.27.221.62: If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
172.27.221.62: warnings.warn(
172.27.221.62: Building extension module fused_adam...
172.27.221.62: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
172.27.221.62: ninja: no work to do.
172.27.221.62: Loading extension module fused_adam...
172.27.221.62: Time to load fused_adam op: 0.041127920150756836 seconds
172.27.221.62: Loading extension module fused_adam...Loading extension module fused_adam...Loading extension module fused_adam...
172.27.221.62:
172.27.221.62:
172.27.221.62: Loading extension module fused_adam...
172.27.221.62: Loading extension module fused_adam...
172.27.221.62: Loading extension module fused_adam...
172.27.221.62: Loading extension module fused_adam...
172.27.221.62: Time to load fused_adam op: 0.10191988945007324 secondsTime to load fused_adam op: 0.1019446849822998 secondsTime to load fused_adam op: 0.10191512107849121 seconds
172.27.221.62:
172.27.221.62:
172.27.221.62: Time to load fused_adam op: 0.10192584991455078 seconds
172.27.221.62: Time to load fused_adam op: 0.10190796852111816 seconds
172.27.221.62: Time to load fused_adam op: 0.10191154479980469 seconds
172.27.221.62: Time to load fused_adam op: 0.1019294261932373 seconds
172.27.221.56: Loading extension module fused_adam...Loading extension module fused_adam...Loading extension module fused_adam...Loading extension module fused_adam...
172.27.221.56:
172.27.221.56:
172.27.221.56:
172.27.221.56: Loading extension module fused_adam...
172.27.221.56: Loading extension module fused_adam...
172.27.221.56: Loading extension module fused_adam...
172.27.221.56: Time to load fused_adam op: 0.10198068618774414 secondsTime to load fused_adam op: 0.10190844535827637 secondsTime to load fused_adam op: 0.10196852684020996 secondsTime to load fused_adam op: 0.10200953483581543 secondsTime to load fused_adam op: 0.10203671455383301 secondsTime to load fused_adam op: 0.10205721855163574 secondsTime to load fused_adam op: 0.10202836990356445 seconds
172.27.221.56:
172.27.221.56:
172.27.221.56:
172.27.221.56:
172.27.221.56:
172.27.221.56:
172.27.221.56: [2024-06-07 22:12:57,940] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.2, git-hash=unknown, git-branch=unknown
172.27.221.56: [2024-06-07 22:12:57,940] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
172.27.221.56: node56:56374:57122 [5] NCCL INFO [Service thread] Connection closed by localRank 5
172.27.221.56: node56:56374:57169 [0] NCCL INFO comm 0xa198190 rank 5 nranks 16 cudaDev 5 busId 8f000 - Abort COMPLETE
172.27.221.56: [2024-06-07 22:12:59,286] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56369
172.27.221.56: [2024-06-07 22:12:59,422] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56370
172.27.221.56: [2024-06-07 22:12:59,546] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56371
172.27.221.56: [2024-06-07 22:12:59,674] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56372
172.27.221.56: [2024-06-07 22:12:59,802] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56373
172.27.221.56: [2024-06-07 22:12:59,928] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56374
172.27.221.56: [2024-06-07 22:12:59,929] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56375
172.27.221.56: [2024-06-07 22:13:00,055] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56376
172.27.221.56: [2024-06-07 22:13:00,183] [ERROR] [launch.py:325:sigkill_handler] ['/home/hy/anaconda3/envs/algmnode1/bin/python', '-u', 'test.py', '--local_rank=7'] exits with return code = 1
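In the log above, node 172.27.221.62 builds and loads `fused_adam` with ninja, while rank 5 on 172.27.221.56 fails with "Ninja is required to load C++ extensions" right after a stray "die: error: must run as root" message. One possible reading (an assumption, not confirmed in this report) is that the `ninja` found on `PATH` on that node is missing or is not the build tool. As far as I can tell, PyTorch's check amounts to running `ninja --version`, so the sketch below does the same with only the standard library. The helper name `check_ninja` is hypothetical and exists only for this illustration.

```python
import shutil
import socket
import subprocess


def check_ninja(executable="ninja"):
    """Return (ok, detail) for whether `<executable> --version` runs cleanly.

    Hypothetical diagnostic helper; it only approximates the availability
    check that PyTorch performs before JIT-compiling an extension.
    """
    path = shutil.which(executable)
    if path is None:
        return False, f"{executable!r} not found on PATH"
    try:
        result = subprocess.run(
            [path, "--version"],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, f"{path}: failed to run ({exc})"
    # Some tools print their message on stderr, so fall back to it.
    output = (result.stdout or result.stderr).strip()
    if result.returncode != 0:
        return False, f"{path}: exit code {result.returncode}: {output}"
    return True, f"{path}: version {output}"


if __name__ == "__main__":
    ok, detail = check_ninja()
    print(f"{socket.gethostname()}: ninja ok={ok} ({detail})")
```

Running this with the same Python interpreter and environment on every node (here, the `algmnode1` environment on both 172.27.221.56 and 172.27.221.62) should show whether the nodes resolve `ninja` to the same working binary. A non-zero exit code or an unexpected path on one node would point at an environment mismatch rather than at DeepSpeed itself.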