Checklist
1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Environment
sys.platform: linux
Python: 3.10.15 (main, Oct 3 2024, 07:27:34) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 2.4.0+cu121
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.4.2 (Git Hash 1137e04ec0b5251ca2b4400a4fd3c667ce843d67)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 12.1
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
- CuDNN 90.1 (built against CUDA 12.4)
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.4.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
TorchVision: 0.19.0+cu121
LMDeploy: 0.6.4+a0fe6ed
transformers: 4.46.2
gradio: 5.7.1
fastapi: 0.115.4
pydantic: 2.9.2
triton: 3.0.0
NVIDIA Topology:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     SYS     SYS     0-13,28-41      0               N/A
GPU1    PHB      X      SYS     SYS     0-13,28-41      0               N/A
GPU2    SYS     SYS      X      PHB     14-27,42-55     1               N/A
GPU3    SYS     SYS     PHB      X      14-27,42-55     1               N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
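Given the PHB/SYS-only topology above (no NVLink between the two NUMA domains), peer-to-peer access between the GPUs can be probed directly with PyTorch's standard CUDA API, which is relevant to the "peer access ... is not available" warnings in the traceback below. A minimal sketch, nothing lmdeploy-specific, just torch.cuda:

import torch

# Probe CUDA peer-to-peer access for every GPU pair on this machine.
# On PHB/SYS-only topologies, P2P across NUMA nodes is often unavailable,
# which would match the peer-access warnings in the log below.
num_gpus = torch.cuda.device_count()
for src in range(num_gpus):
    for dst in range(num_gpus):
        if src == dst:
            continue
        ok = torch.cuda.can_device_access_peer(src, dst)
        print(f"GPU{src} -> GPU{dst}: peer access {'available' if ok else 'not available'}")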
Error traceback
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][INFO] TM_FUSE_SILU_ACT=1
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
2025-01-07 16:39:40,644 - lmdeploy - WARNING - turbomind.py:231 - get 713 model params
Convert to turbomind format: 0%|| 0/32 [00:00<?, ?it/s][TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][INFO] TM_FUSE_SILU_ACT=1
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
Convert to turbomind format: 6%|████████▊ | 2/32 [00:00<00:03, 7.73it/s]2025-01-07 16:39:40,918 - lmdeploy - WARNING - turbomind.py:231 - get 713 model params
Convert to turbomind format: 9%|█████████████▏ | 3/32 [00:00<00:03, 7.45it/s][TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][INFO] TM_FUSE_SILU_ACT=1
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][INFO] TM_FUSE_SILU_ACT=1
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
[TM][WARNING] pad vocab size from 92553 to 92556
[TM][WARNING] pad embed size from 92556 to 92556
2025-01-07 16:39:41,174 - lmdeploy - WARNING - turbomind.py:231 - get 713 model params
Convert to turbomind format: 6%|████████▊ | 2/32 [00:00<00:03, 7.54it/s]2025-01-07 16:39:41,230 - lmdeploy - WARNING - turbomind.py:231 - get 713 model params
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256
[TM][WARNING] Devicle 0 peer access Device 1 is not available.
[TM][WARNING] Devicle 0 peer access Device 2 is not available.
[TM][WARNING] Devicle 1 peer access Device 0 is not available.
[TM][WARNING] Devicle 1 peer access Device 2 is not available.
[TM][WARNING] Devicle 1 peer access Device 3 is not available.
[TM][WARNING] Devicle 0 peer access Device 3 is not available.
[TM][WARNING] Devicle 2 peer access Device 0 is not available.
[TM][WARNING] Devicle 2 peer access Device 1 is not available.
[TM][WARNING] Devicle 2 peer access Device 3 is not available.
[TM][WARNING] Devicle 3 peer access Device 0 is not available.
[TM][WARNING] Devicle 3 peer access Device 1 is not available.
[TM][WARNING] Devicle 3 peer access Device 2 is not available.
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] [BlockManager] block_size = 2 MB
[TM][INFO] [BlockManager] max_block_count = 1112
[TM][INFO] [BlockManager] chunk_size = 1112
[TM][INFO] [BlockManager] block_size = 2 MB
[TM][INFO] [BlockManager] max_block_count = 1112
[TM][INFO] [BlockManager] block_size = 2 MB
[TM][INFO] [BlockManager] chunk_size = 1112
[TM][INFO] [BlockManager] block_size = 2 MB
[TM][INFO] [BlockManager] max_block_count = 1112
[TM][INFO] [BlockManager] chunk_size = 1112
[TM][INFO] [BlockManager] max_block_count = 1112
[TM][INFO] [BlockManager] chunk_size = 1112
[TM][INFO] LlamaBatch<T>::Start()
[TM][INFO] LlamaBatch<T>::Start()
[TM][INFO] LlamaBatch<T>::Start()
[TM][INFO] LlamaBatch<T>::Start()
[TM][INFO] [Gemm2] Tuning sequence: 8, 16, 32, 48, 64, 96, 128, 192, 256, 384, 512, 768, 1024, 1536, 2048, 3072, 4096, 6144, 8192
[TM][INFO] [Gemm2] 8
[TM][INFO] [Gemm2] 16
[TM][INFO] [Gemm2] 32
[TM][INFO] [Gemm2] 48
[TM][INFO] [Gemm2] 64
[TM][INFO] [Gemm2] 96
[TM][INFO] [Gemm2] 128
[TM][INFO] [Gemm2] 192
[TM][INFO] [Gemm2] 256
[TM][INFO] [Gemm2] 384
[TM][INFO] [Gemm2] 512
[TM][INFO] [Gemm2] 768
[TM][INFO] [Gemm2] 1024
[TM][INFO] [Gemm2] 1536
[TM][INFO] [Gemm2] 2048
[TM][INFO] [Gemm2] 3072
[TM][INFO] [Gemm2] 4096
[TM][INFO] [Gemm2] 6144
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256
[rank3]: Traceback (most recent call last):
[rank3]: File "/home/qianyuan/InternVL/internvl_chat/tools/mm_reasoning_pipeline/internvl_lmdeploy_dropout_ntp.py", line 379, in<module>
[rank3]: pipe = pipeline(
[rank3]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/api.py", line 85, in pipeline
[rank3]: return pipeline_class(model_path,
[rank3]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/vl_async_engine.py", line 27, in __init__
[rank3]: super().__init__(model_path, **kwargs)
[rank3]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 159, in __init__
[rank3]: self._build_turbomind(model_path=model_path,
[rank3]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 198, in _build_turbomind
[rank3]: self.engine = tm.TurboMind.from_pretrained(
[rank3]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 302, in from_pretrained
[rank3]: return cls(model_path=pretrained_model_name_or_path,
[rank3]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 121, in __init__
[rank3]: for _ in e.map(self.model_comm.process_weight,
[rank3]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
[rank3]: yield _result_or_cancel(fs.pop())
[rank3]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
[rank3]: return fut.result(timeout)
[rank3]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 451, in result
[rank3]: return self.__get_result()
[rank3]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
[rank3]: raise self._exception
[rank3]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/thread.py", line 58, in run
[rank3]: result = self.fn(*self.args, **self.kwargs)
[rank3]: RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:31
[TM][INFO] [Gemm2] 8192
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][INFO] [InternalThreadEntry] stop requested.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [TM][ERROR] pointer_mapping_ does not have information of ptr at 0x2b0f73f800. Assertion fail: /lmdeploy/src/turbomind/utils/allocator.h:284
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 58720256
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/qianyuan/InternVL/internvl_chat/tools/mm_reasoning_pipeline/internvl_lmdeploy_dropout_ntp.py", line 379, in<module>
[rank2]: pipe = pipeline(
[rank2]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/api.py", line 85, in pipeline
[rank2]: return pipeline_class(model_path,
[rank2]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/vl_async_engine.py", line 27, in __init__
[rank2]: super().__init__(model_path, **kwargs)
[rank2]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 159, in __init__
[rank2]: self._build_turbomind(model_path=model_path,
[rank2]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 198, in _build_turbomind
[rank2]: self.engine = tm.TurboMind.from_pretrained(
[rank2]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 302, in from_pretrained
[rank2]: return cls(model_path=pretrained_model_name_or_path,
[rank2]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 121, in __init__
[rank2]: for _ in e.map(self.model_comm.process_weight,
[rank2]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
[rank2]: yield _result_or_cancel(fs.pop())
[rank2]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
[rank2]: return fut.result(timeout)
[rank2]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 451, in result
[rank2]: return self.__get_result()
[rank2]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
[rank2]: raise self._exception
[rank2]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/thread.py", line 58, in run
[rank2]: result = self.fn(*self.args, **self.kwargs)
[rank2]: RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:31
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/qianyuan/InternVL/internvl_chat/tools/mm_reasoning_pipeline/internvl_lmdeploy_dropout_ntp.py", line 379, in<module>
[rank0]: pipe = pipeline(
[rank0]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/api.py", line 85, in pipeline
[rank0]: return pipeline_class(model_path,
[rank0]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/vl_async_engine.py", line 27, in __init__
[rank0]: super().__init__(model_path, **kwargs)
[rank0]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 159, in __init__
[rank0]: self._build_turbomind(model_path=model_path,
[rank0]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 198, in _build_turbomind
[rank0]: self.engine = tm.TurboMind.from_pretrained(
[rank0]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 302, in from_pretrained
[rank0]: return cls(model_path=pretrained_model_name_or_path,
[rank0]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 121, in __init__
[rank0]: for _ in e.map(self.model_comm.process_weight,
[rank0]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
[rank0]: yield _result_or_cancel(fs.pop())
[rank0]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
[rank0]: return fut.result(timeout)
[rank0]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 451, in result
[rank0]: return self.__get_result()
[rank0]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
[rank0]: raise self._exception
[rank0]: File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/concurrent/futures/thread.py", line 58, in run
[rank0]: result = self.fn(*self.args, **self.kwargs)
[rank0]: RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:31
W0107 16:39:49.598000 139783256338560 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 47910 closing signal SIGTERM
W0107 16:39:49.598000 139783256338560 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 47912 closing signal SIGTERM
W0107 16:39:49.599000 139783256338560 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 47913 closing signal SIGTERM
E0107 16:39:50.228000 139783256338560 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 1 (pid: 47911) of binary: /media/sde1/qianyuan/miniconda3/envs/aishenhe/bin/python
Traceback (most recent call last):
File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/bin/torchrun", line 8, in<module>sys.exit(main())
File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/media/sde1/qianyuan/miniconda3/envs/aishenhe/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
internvl_chat/tools/mm_reasoning_pipeline/internvl_lmdeploy_dropout_ntp.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time: 2025-01-07_16:39:49
host : hz-02
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 47911)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 47911
============================================================
Describe the bug
My command: torchrun --standalone --nnodes=1 --nproc_per_node=4 internvl_chat/tools/mm_reasoning_pipeline/internvl_lmdeploy_dropout_ntp.py --checkpoint ckpts/ckpts/internvl2_8b --prompt-path /home/qianyuan/InternVL/data/review/vlm_sft/annotations/jsonlines/R0_5/train_v8_mpo.jsonl --out-dir data/review/vlm_sft/MPO --batch-size 1 --num-workers 4 --num-return-sequences 1 --top-k 50 --temperature 1.0 --dynamic --sample-max-num 500000 --tp 4 --start-ratio 0.5
Error: [rank3]: RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:31
This code works fine with nproc_per_node=1 and a 1B model, so I guess the problem is related to the multi-GPU setup?
Thanks in advance.
Reproduction
export NCCL_P2P_DISABLE=1
torchrun --standalone --nnodes=1 --nproc_per_node=4 internvl_chat/tools/mm_reasoning_pipeline/internvl_lmdeploy_dropout_ntp.py --checkpoint ckpts/ckpts/internvl2_8b --prompt-path /home/qianyuan/InternVL/data/review/vlm_sft/annotations/jsonlines/R0_5/train_v8_mpo.jsonl --out-dir data/review/vlm_sft/MPO --batch-size 1 --num-workers 4 --num-return-sequences 1 --top-k 50 --temperature 1.0 --dynamic --sample-max-num 500000 --tp 4 --start-ratio 0.5
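For reference, a minimal sketch of how the lmdeploy pipeline with 4-way tensor parallelism is presumably constructed inside the script (the exact arguments in internvl_lmdeploy_dropout_ntp.py may differ; TurbomindEngineConfig with tp and cache_max_entry_count are standard lmdeploy options, and the latter controls how much GPU memory TurboMind reserves for the KV cache):

from lmdeploy import pipeline, TurbomindEngineConfig

# Sketch only: this construction step is where the CUDA OOM above is raised.
# tp=4 shards the weights across 4 GPUs; cache_max_entry_count sets the
# proportion of GPU memory reserved for the KV cache (defaults to 0.8 in
# recent lmdeploy versions), so lowering it shrinks the per-GPU allocation.
backend_config = TurbomindEngineConfig(tp=4, cache_max_entry_count=0.5)
pipe = pipeline('ckpts/ckpts/internvl2_8b', backend_config=backend_config)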