[Bug] internvl2_8b, 4 3090 cards, CUDA OOM error #2993

Open · 3 tasks done
zhaowenZhou opened this issue Jan 7, 2025 · 1 comment

@zhaowenZhou

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

My command: torchrun --standalone --nnodes=1 --nproc_per_node=4 internvl_chat/tools/mm_reasoning_pipeline/internvl_lmdeploy_dropout_ntp.py --checkpoint ckpts/ckpts/internvl2_8b --prompt-path /home/qianyuan/InternVL/data/review/vlm_sft/annotations/jsonlines/R0_5/train_v8_mpo.jsonl --out-dir data/review/vlm_sft/MPO --batch-size 1 --num-workers 4 --num-return-sequences 1 --top-k 50 --temperature 1.0 --dynamic --sample-max-num 500000 --tp 4 --start-ratio 0.5

Error: [rank3]: RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:31

This code works fine with nproc_per_node=1 and a 1B model, so I guess the problem is related to multi-GPU?
Thanks in advance.
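
For reference, the failure happens inside pipeline(); below is a minimal sketch of the equivalent call using the LMDeploy API directly (assuming lmdeploy 0.6.x; the cache_max_entry_count value and the image path are placeholders, not what my script actually uses):

    from lmdeploy import pipeline, TurbomindEngineConfig
    from lmdeploy.vl import load_image

    # Sketch only (assumes lmdeploy 0.6.x): TurboMind engine with tensor
    # parallelism over the 4 RTX 3090s. cache_max_entry_count is a placeholder
    # that shrinks the KV-cache share of free VRAM; 0.8 is the library default.
    engine_cfg = TurbomindEngineConfig(tp=4, cache_max_entry_count=0.2)

    # The OOM in the traceback below is raised while this call converts and
    # loads the weights (TurboMind.from_pretrained -> process_weight).
    pipe = pipeline('ckpts/ckpts/internvl2_8b', backend_config=engine_cfg)

    image = load_image('path/to/example.jpg')  # placeholder image
    print(pipe(('describe this image', image)))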

Reproduction

export NCCL_P2P_DISABLE=1
torchrun --standalone --nnodes=1 --nproc_per_node=4 internvl_chat/tools/mm_reasoning_pipeline/internvl_lmdeploy_dropout_ntp.py --checkpoint ckpts/ckpts/internvl2_8b --prompt-path /home/qianyuan/InternVL/data/review/vlm_sft/annotations/jsonlines/R0_5/train_v8_mpo.jsonl --out-dir data/review/vlm_sft/MPO --batch-size 1 --num-workers 4 --num-return-sequences 1 --top-k 50 --temperature 1.0 --dynamic --sample-max-num 500000 --tp 4 --start-ratio 0.5
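
For completeness, the peer-access warnings in the traceback below can be cross-checked with plain PyTorch; this snippet is not part of my script, just a hypothetical query of what the driver reports for each GPU pair:

    import torch

    # Hypothetical check (standard PyTorch API): report whether direct peer
    # access is available between every pair of visible GPUs.
    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                ok = torch.cuda.can_device_access_peer(i, j)
                print(f'GPU {i} -> GPU {j}: peer access '
                      f'{"available" if ok else "not available"}')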

Environment

sys.platform: linux
Python: 3.10.15 (main, Oct  3 2024, 07:27:34) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 2.4.0+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.4.2 (Git Hash 1137e04ec0b5251ca2b4400a4fd3c667ce843d67)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 90.1  (built against CUDA 12.4)
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.4.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.19.0+cu121
LMDeploy: 0.6.4+a0fe6ed
transformers: 4.46.2
gradio: 5.7.1
fastapi: 0.115.4
pydantic: 2.9.2
triton: 3.0.0
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     SYS     SYS     0-13,28-41      0               N/A
GPU1    PHB      X      SYS     SYS     0-13,28-41      0               N/A
GPU2    SYS     SYS      X      PHB     14-27,42-55     1               N/A
GPU3    SYS     SYS     PHB      X      14-27,42-55     1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Error traceback

[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][INFO] TM_FUSE_SILU_ACT=1
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
2025-01-07 16:39:40,644 - lmdeploy - WARNING - turbomind.py:231 - get 713 model params
Convert to turbomind format:   0%|                                                                                                                                                    | 0/32 [00:00<?, ?it/s][TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][INFO] TM_FUSE_SILU_ACT=1
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
Convert to turbomind format:   6%|████████▊                                                                                                                                   | 2/32 [00:00<00:03,  7.73it/s]2025-01-07 16:39:40,918 - lmdeploy - WARNING - turbomind.py:231 - get 713 model params
Convert to turbomind format:   9%|█████████████▏                                                                                                                              | 3/32 [00:00<00:03,  7.45it/s][TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][INFO] TM_FUSE_SILU_ACT=1
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][INFO] TM_FUSE_SILU_ACT=1
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
2025-01-07 16:39:41,174 - lmdeploy - WARNING - turbomind.py:231 - get 713 model params
Convert to turbomind format:   6%|████████▊                                                                                                                                   | 2/32 [00:00<00:03,  7.54it/s]2025-01-07 16:39:41,230 - lmdeploy - WARNING - turbomind.py:231 - get 713 model params
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256                                                                                                                                                

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][WARNING] Devicle 0 peer access Device 1 is not available.
[TM][WARNING] Devicle 0 peer access Device 2 is not available.
[TM][WARNING] Devicle 1 peer access Device 0 is not available.
[TM][WARNING] Devicle 1 peer access Device 2 is not available.
[TM][WARNING] Devicle 1 peer access Device 3 is not available.
[TM][WARNING] Devicle 0 peer access Device 3 is not available.
[TM][WARNING] Devicle 2 peer access Device 0 is not available.
[TM][WARNING] Devicle 2 peer access Device 1 is not available.
[TM][WARNING] Devicle 2 peer access Device 3 is not available.
[TM][WARNING] Devicle 3 peer access Device 0 is not available.
[TM][WARNING] Devicle 3 peer access Device 1 is not available.
[TM][WARNING] Devicle 3 peer access Device 2 is not available.
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] [BlockManager] block_size = 2 MB
[TM][INFO] [BlockManager] max_block_count = 1112
[TM][INFO] [BlockManager] chunk_size = 1112
[TM][INFO] [BlockManager] block_size = 2 MB
[TM][INFO] [BlockManager] max_block_count = 1112
[TM][INFO] [BlockManager] block_size = 2 MB
[TM][INFO] [BlockManager] chunk_size = 1112
[TM][INFO] [BlockManager] block_size = 2 MB
[TM][INFO] [BlockManager] max_block_count = 1112
[TM][INFO] [BlockManager] chunk_size = 1112
[TM][INFO] [BlockManager] max_block_count = 1112
[TM][INFO] [BlockManager] chunk_size = 1112
[TM][INFO] LlamaBatch<T>::Start()
[TM][INFO] LlamaBatch<T>::Start()
[TM][INFO] LlamaBatch<T>::Start()
[TM][INFO] LlamaBatch<T>::Start()
[TM][INFO] [Gemm2] Tuning sequence: 8, 16, 32, 48, 64, 96, 128, 192, 256, 384, 512, 768, 1024, 1536, 2048, 3072, 4096, 6144, 8192
[TM][INFO] [Gemm2] 8
[TM][INFO] [Gemm2] 16
[TM][INFO] [Gemm2] 32
[TM][INFO] [Gemm2] 48
[TM][INFO] [Gemm2] 64
[TM][INFO] [Gemm2] 96
[TM][INFO] [Gemm2] 128
[TM][INFO] [Gemm2] 192
[TM][INFO] [Gemm2] 256
[TM][INFO] [Gemm2] 384
[TM][INFO] [Gemm2] 512
[TM][INFO] [Gemm2] 768
[TM][INFO] [Gemm2] 1024
[TM][INFO] [Gemm2] 1536
[TM][INFO] [Gemm2] 2048
[TM][INFO] [Gemm2] 3072
[TM][INFO] [Gemm2] 4096                                                                                                                                                                                      
[TM][INFO] [Gemm2] 6144
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/qianyuan/InternVL/internvl_chat/tools/mm_reasoning_pipeline/internvl_lmdeploy_dropout_ntp.py", line 379, in <module>
[rank3]:     pipe = pipeline(
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/api.py", line 85, in pipeline
[rank3]:     return pipeline_class(model_path,
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/vl_async_engine.py", line 27, in __init__
[rank3]:     super().__init__(model_path, **kwargs)
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 159, in __init__
[rank3]:     self._build_turbomind(model_path=model_path,
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 198, in _build_turbomind
[rank3]:     self.engine = tm.TurboMind.from_pretrained(
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 302, in from_pretrained
[rank3]:     return cls(model_path=pretrained_model_name_or_path,
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 121, in __init__
[rank3]:     for _ in e.map(self.model_comm.process_weight,
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
[rank3]:     yield _result_or_cancel(fs.pop())
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
[rank3]:     return fut.result(timeout)
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 451, in result
[rank3]:     return self.__get_result()
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
[rank3]:     raise self._exception
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/thread.py", line 58, in run
[rank3]:     result = self.fn(*self.args, **self.kwargs)
[rank3]: RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:31 

[TM][INFO] [Gemm2] 8192
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][INFO] [InternalThreadEntry] stop requested.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [TM][ERROR] pointer_mapping_ does not have information of ptr at 0x2b0f73f800. Assertion fail: /lmdeploy/src/turbomind/utils/allocator.h:284 

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256                                                                                                                                                

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/qianyuan/InternVL/internvl_chat/tools/mm_reasoning_pipeline/internvl_lmdeploy_dropout_ntp.py", line 379, in <module>
[rank2]:     pipe = pipeline(
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/api.py", line 85, in pipeline
[rank2]:     return pipeline_class(model_path,
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/vl_async_engine.py", line 27, in __init__
[rank2]:     super().__init__(model_path, **kwargs)
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 159, in __init__
[rank2]:     self._build_turbomind(model_path=model_path,
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 198, in _build_turbomind
[rank2]:     self.engine = tm.TurboMind.from_pretrained(
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 302, in from_pretrained
[rank2]:     return cls(model_path=pretrained_model_name_or_path,
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 121, in __init__
[rank2]:     for _ in e.map(self.model_comm.process_weight,
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
[rank2]:     yield _result_or_cancel(fs.pop())
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
[rank2]:     return fut.result(timeout)
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 451, in result
[rank2]:     return self.__get_result()
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
[rank2]:     raise self._exception
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/thread.py", line 58, in run
[rank2]:     result = self.fn(*self.args, **self.kwargs)
[rank2]: RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:31 

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/qianyuan/InternVL/internvl_chat/tools/mm_reasoning_pipeline/internvl_lmdeploy_dropout_ntp.py", line 379, in <module>
[rank0]:     pipe = pipeline(
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/api.py", line 85, in pipeline
[rank0]:     return pipeline_class(model_path,
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/vl_async_engine.py", line 27, in __init__
[rank0]:     super().__init__(model_path, **kwargs)
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 159, in __init__
[rank0]:     self._build_turbomind(model_path=model_path,
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 198, in _build_turbomind
[rank0]:     self.engine = tm.TurboMind.from_pretrained(
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 302, in from_pretrained
[rank0]:     return cls(model_path=pretrained_model_name_or_path,
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 121, in __init__
[rank0]:     for _ in e.map(self.model_comm.process_weight,
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
[rank0]:     yield _result_or_cancel(fs.pop())
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
[rank0]:     return fut.result(timeout)
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 451, in result
[rank0]:     return self.__get_result()
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
[rank0]:     raise self._exception
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/thread.py", line 58, in run
[rank0]:     result = self.fn(*self.args, **self.kwargs)
[rank0]: RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:31 

W0107 16:39:49.598000 139783256338560 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 47910 closing signal SIGTERM
W0107 16:39:49.598000 139783256338560 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 47912 closing signal SIGTERM
W0107 16:39:49.599000 139783256338560 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 47913 closing signal SIGTERM
E0107 16:39:50.228000 139783256338560 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 1 (pid: 47911) of binary: /media/sde1/qianyuan/miniconda3/envs/aishenhe/bin/python
Traceback (most recent call last):
  File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
internvl_chat/tools/mm_reasoning_pipeline/internvl_lmdeploy_dropout_ntp.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-07_16:39:49
  host      : hz-02
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 47911)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 47911
============================================================
@starevelyn

Same bug when I use 8 GPUs (80 GB each) to run inference on a 76B model...
