[Bug] internvl2_8b, 4 3090 cards, CUDA OOM error #2993

Open · 3 tasks done
zhaowenZhou opened this issue Jan 7, 2025 · 1 comment

@zhaowenZhou

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

My command: torchrun --standalone --nnodes=1 --nproc_per_node=4 internvl_chat/tools/mm_reasoning_pipeline/internvl_lmdeploy_dropout_ntp.py --checkpoint ckpts/ckpts/internvl2_8b --prompt-path /home/qianyuan/InternVL/data/review/vlm_sft/annotations/jsonlines/R0_5/train_v8_mpo.jsonl --out-dir data/review/vlm_sft/MPO --batch-size 1 --num-workers 4 --num-return-sequences 1 --top-k 50 --temperature 1.0 --dynamic --sample-max-num 500000 --tp 4 --start-ratio 0.5

Error: [rank3]: RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:31

This code works fine with nproc_per_node=1 and a 1B model, so I guess the problem is related to multi-GPU?
Thanks in advance.
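
For reference, the failure happens inside pipeline(); below is a minimal sketch of the equivalent call using the LMDeploy API directly (assuming lmdeploy 0.6.x; the cache_max_entry_count value and the image path are placeholders, not what my script actually uses):

    from lmdeploy import pipeline, TurbomindEngineConfig
    from lmdeploy.vl import load_image

    # Sketch only (assumes lmdeploy 0.6.x): TurboMind engine with tensor
    # parallelism over the 4 RTX 3090s. cache_max_entry_count is a placeholder
    # that shrinks the KV-cache share of free VRAM; 0.8 is the library default.
    engine_cfg = TurbomindEngineConfig(tp=4, cache_max_entry_count=0.2)

    # The OOM in the traceback below is raised while this call converts and
    # loads the weights (TurboMind.from_pretrained -> process_weight).
    pipe = pipeline('ckpts/ckpts/internvl2_8b', backend_config=engine_cfg)

    image = load_image('path/to/example.jpg')  # placeholder image
    print(pipe(('describe this image', image)))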

Reproduction

export NCCL_P2P_DISABLE=1
torchrun --standalone --nnodes=1 --nproc_per_node=4 internvl_chat/tools/mm_reasoning_pipeline/internvl_lmdeploy_dropout_ntp.py --checkpoint ckpts/ckpts/internvl2_8b --prompt-path /home/qianyuan/InternVL/data/review/vlm_sft/annotations/jsonlines/R0_5/train_v8_mpo.jsonl --out-dir data/review/vlm_sft/MPO --batch-size 1 --num-workers 4 --num-return-sequences 1 --top-k 50 --temperature 1.0 --dynamic --sample-max-num 500000 --tp 4 --start-ratio 0.5
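
For completeness, the peer-access warnings in the traceback below can be cross-checked with plain PyTorch; this snippet is not part of my script, just a hypothetical query of what the driver reports for each GPU pair:

    import torch

    # Hypothetical check (standard PyTorch API): report whether direct peer
    # access is available between every pair of visible GPUs.
    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                ok = torch.cuda.can_device_access_peer(i, j)
                print(f'GPU {i} -> GPU {j}: peer access '
                      f'{"available" if ok else "not available"}')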

Environment

sys.platform: linux
Python: 3.10.15 (main, Oct  3 2024, 07:27:34) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 2.4.0+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.4.2 (Git Hash 1137e04ec0b5251ca2b4400a4fd3c667ce843d67)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 90.1  (built against CUDA 12.4)
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.4.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.19.0+cu121
LMDeploy: 0.6.4+a0fe6ed
transformers: 4.46.2
gradio: 5.7.1
fastapi: 0.115.4
pydantic: 2.9.2
triton: 3.0.0
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     SYS     SYS     0-13,28-41      0               N/A
GPU1    PHB      X      SYS     SYS     0-13,28-41      0               N/A
GPU2    SYS     SYS      X      PHB     14-27,42-55     1               N/A
GPU3    SYS     SYS     PHB      X      14-27,42-55     1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Error traceback

[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][INFO] TM_FUSE_SILU_ACT=1
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
2025-01-07 16:39:40,644 - lmdeploy - WARNING - turbomind.py:231 - get 713 model params
Convert to turbomind format:   0%|                                                                                                                                                    | 0/32 [00:00<?, ?it/s][TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][INFO] TM_FUSE_SILU_ACT=1
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
Convert to turbomind format:   6%|████████▊                                                                                                                                   | 2/32 [00:00<00:03,  7.73it/s]2025-01-07 16:39:40,918 - lmdeploy - WARNING - turbomind.py:231 - get 713 model params
Convert to turbomind format:   9%|█████████████▏                                                                                                                              | 3/32 [00:00<00:03,  7.45it/s][TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][INFO] TM_FUSE_SILU_ACT=1
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][INFO] TM_FUSE_SILU_ACT=1
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
2025-01-07 16:39:41,174 - lmdeploy - WARNING - turbomind.py:231 - get 713 model params
Convert to turbomind format:   6%|████████▊                                                                                                                                   | 2/32 [00:00<00:03,  7.54it/s]2025-01-07 16:39:41,230 - lmdeploy - WARNING - turbomind.py:231 - get 713 model params
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256                                                                                                                                                

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][WARNING] Devicle 0 peer access Device 1 is not available.
[TM][WARNING] Devicle 0 peer access Device 2 is not available.
[TM][WARNING] Devicle 1 peer access Device 0 is not available.
[TM][WARNING] Devicle 1 peer access Device 2 is not available.
[TM][WARNING] Devicle 1 peer access Device 3 is not available.
[TM][WARNING] Devicle 0 peer access Device 3 is not available.
[TM][WARNING] Devicle 2 peer access Device 0 is not available.
[TM][WARNING] Devicle 2 peer access Device 1 is not available.
[TM][WARNING] Devicle 2 peer access Device 3 is not available.
[TM][WARNING] Devicle 3 peer access Device 0 is not available.
[TM][WARNING] Devicle 3 peer access Device 1 is not available.
[TM][WARNING] Devicle 3 peer access Device 2 is not available.
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] [BlockManager] block_size = 2 MB
[TM][INFO] [BlockManager] max_block_count = 1112
[TM][INFO] [BlockManager] chunk_size = 1112
[TM][INFO] [BlockManager] block_size = 2 MB
[TM][INFO] [BlockManager] max_block_count = 1112
[TM][INFO] [BlockManager] block_size = 2 MB
[TM][INFO] [BlockManager] chunk_size = 1112
[TM][INFO] [BlockManager] block_size = 2 MB
[TM][INFO] [BlockManager] max_block_count = 1112
[TM][INFO] [BlockManager] chunk_size = 1112
[TM][INFO] [BlockManager] max_block_count = 1112
[TM][INFO] [BlockManager] chunk_size = 1112
[TM][INFO] LlamaBatch<T>::Start()
[TM][INFO] LlamaBatch<T>::Start()
[TM][INFO] LlamaBatch<T>::Start()
[TM][INFO] LlamaBatch<T>::Start()
[TM][INFO] [Gemm2] Tuning sequence: 8, 16, 32, 48, 64, 96, 128, 192, 256, 384, 512, 768, 1024, 1536, 2048, 3072, 4096, 6144, 8192
[TM][INFO] [Gemm2] 8
[TM][INFO] [Gemm2] 16
[TM][INFO] [Gemm2] 32
[TM][INFO] [Gemm2] 48
[TM][INFO] [Gemm2] 64
[TM][INFO] [Gemm2] 96
[TM][INFO] [Gemm2] 128
[TM][INFO] [Gemm2] 192
[TM][INFO] [Gemm2] 256
[TM][INFO] [Gemm2] 384
[TM][INFO] [Gemm2] 512
[TM][INFO] [Gemm2] 768
[TM][INFO] [Gemm2] 1024
[TM][INFO] [Gemm2] 1536
[TM][INFO] [Gemm2] 2048
[TM][INFO] [Gemm2] 3072
[TM][INFO] [Gemm2] 4096                                                                                                                                                                                      
[TM][INFO] [Gemm2] 6144
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/qianyuan/InternVL/internvl_chat/tools/mm_reasoning_pipeline/internvl_lmdeploy_dropout_ntp.py", line 379, in <module>
[rank3]:     pipe = pipeline(
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/api.py", line 85, in pipeline
[rank3]:     return pipeline_class(model_path,
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/vl_async_engine.py", line 27, in __init__
[rank3]:     super().__init__(model_path, **kwargs)
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 159, in __init__
[rank3]:     self._build_turbomind(model_path=model_path,
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 198, in _build_turbomind
[rank3]:     self.engine = tm.TurboMind.from_pretrained(
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 302, in from_pretrained
[rank3]:     return cls(model_path=pretrained_model_name_or_path,
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 121, in __init__
[rank3]:     for _ in e.map(self.model_comm.process_weight,
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
[rank3]:     yield _result_or_cancel(fs.pop())
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
[rank3]:     return fut.result(timeout)
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 451, in result
[rank3]:     return self.__get_result()
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
[rank3]:     raise self._exception
[rank3]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/thread.py", line 58, in run
[rank3]:     result = self.fn(*self.args, **self.kwargs)
[rank3]: RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:31 

[TM][INFO] [Gemm2] 8192
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][INFO] [InternalThreadEntry] stop requested.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [TM][ERROR] pointer_mapping_ does not have information of ptr at 0x2b0f73f800. Assertion fail: /lmdeploy/src/turbomind/utils/allocator.h:284 

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256                                                                                                                                                

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256

[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/qianyuan/InternVL/internvl_chat/tools/mm_reasoning_pipeline/internvl_lmdeploy_dropout_ntp.py", line 379, in <module>
[rank2]:     pipe = pipeline(
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/api.py", line 85, in pipeline
[rank2]:     return pipeline_class(model_path,
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/vl_async_engine.py", line 27, in __init__
[rank2]:     super().__init__(model_path, **kwargs)
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 159, in __init__
[rank2]:     self._build_turbomind(model_path=model_path,
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 198, in _build_turbomind
[rank2]:     self.engine = tm.TurboMind.from_pretrained(
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 302, in from_pretrained
[rank2]:     return cls(model_path=pretrained_model_name_or_path,
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 121, in __init__
[rank2]:     for _ in e.map(self.model_comm.process_weight,
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
[rank2]:     yield _result_or_cancel(fs.pop())
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
[rank2]:     return fut.result(timeout)
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 451, in result
[rank2]:     return self.__get_result()
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
[rank2]:     raise self._exception
[rank2]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/thread.py", line 58, in run
[rank2]:     result = self.fn(*self.args, **self.kwargs)
[rank2]: RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:31 

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/qianyuan/InternVL/internvl_chat/tools/mm_reasoning_pipeline/internvl_lmdeploy_dropout_ntp.py", line 379, in <module>
[rank0]:     pipe = pipeline(
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/api.py", line 85, in pipeline
[rank0]:     return pipeline_class(model_path,
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/vl_async_engine.py", line 27, in __init__
[rank0]:     super().__init__(model_path, **kwargs)
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 159, in __init__
[rank0]:     self._build_turbomind(model_path=model_path,
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 198, in _build_turbomind
[rank0]:     self.engine = tm.TurboMind.from_pretrained(
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 302, in from_pretrained
[rank0]:     return cls(model_path=pretrained_model_name_or_path,
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 121, in __init__
[rank0]:     for _ in e.map(self.model_comm.process_weight,
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
[rank0]:     yield _result_or_cancel(fs.pop())
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
[rank0]:     return fut.result(timeout)
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 451, in result
[rank0]:     return self.__get_result()
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
[rank0]:     raise self._exception
[rank0]:   File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/thread.py", line 58, in run
[rank0]:     result = self.fn(*self.args, **self.kwargs)
[rank0]: RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:31 

W0107 16:39:49.598000 139783256338560 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 47910 closing signal SIGTERM
W0107 16:39:49.598000 139783256338560 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 47912 closing signal SIGTERM
W0107 16:39:49.599000 139783256338560 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 47913 closing signal SIGTERM
E0107 16:39:50.228000 139783256338560 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 1 (pid: 47911) of binary: /media/sde1/qianyuan/miniconda3/envs/aishenhe/bin/python
Traceback (most recent call last):
  File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
internvl_chat/tools/mm_reasoning_pipeline/internvl_lmdeploy_dropout_ntp.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-07_16:39:49
  host      : hz-02
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 47911)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 47911
============================================================
@starevelyn

Same bug when I use 8 GPUs (80 GB each) to run inference on a 76B model...
