[Bug] OOM in jetson but not in x86 #3006

Open

quanfeifan opened this issue Jan 9, 2025 · 3 comments

quanfeifan commented Jan 9, 2025

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

I quantized qwen2.5-7b to w4a16 and ran it with lmdeploy/turbomind/chat.py. On x86 Ubuntu 20.04 (3060 Ti, 8 GB) it runs fine, but on a Jetson Orin NX (16 GB) it hits OOM, even with cache_max_entry_count set as low as 0.01. Monitoring with jtop shows memory usage is nowhere near the limit, so this is hard to troubleshoot at the moment.
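
For context, a minimal sketch (not the reporter's exact script) of how that option is lowered through lmdeploy's Python API; `pipeline` and `TurbomindEngineConfig` are the public entry points, and the model path here is a placeholder:

```python
# Minimal sketch: load the quantized model with a reduced KV-cache
# budget. cache_max_entry_count is the fraction of free GPU memory
# reserved for the KV cache; 0.01 is the value tried in this report.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(cache_max_entry_count=0.01,
                                   max_batch_size=1)
# Placeholder path: point this at the quantized qwen2.5 model.
pipe = pipeline('/path/to/qwen2.5-7b-w4a16', backend_config=engine_cfg)
print(pipe(['Hello!']))
```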

Reproduction

Run lmdeploy/turbomind/chat.py with model_path pointing at the quantized qwen2.5; the same code works on x86 but fails on Jetson.
Both the x86 and Jetson machines run the latest main of lmdeploy (0.6.5) with CUDA 12. The Jetson build was compiled from source following https://github.com/InternLM/lmdeploy/blob/main/docs/en/get_started/installation.md#install-from-source, and build/lib/_turbomind.cpython-310-aarch64-linux-gnu.so was generated successfully.
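
A minimal sketch of the failing call path, reconstructed from the traceback below; the model path is taken from that traceback, and passing the config via `engine_config=` is an assumption about `TurboMind.from_pretrained`'s keyword arguments:

```python
# Sketch of the call that OOMs on Jetson but not on x86. Note that
# the traceback shows the OOM is raised in create_shared_weights,
# i.e. while the weights are being allocated, before the KV cache
# is reserved, so lowering cache_max_entry_count cannot help there.
import lmdeploy.turbomind as tm
from lmdeploy import TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(tp=1, max_batch_size=1,
                                   session_len=32768,
                                   cache_max_entry_count=0.001)
tm_model = tm.TurboMind.from_pretrained(
    '/home/gac/Desktop/qwen2d5-turbomind',  # path from the traceback
    engine_config=engine_cfg)
```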

Environment

sys.platform: linux
Python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 21:44:20) [GCC 12.3.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: Orin
CUDA_HOME: /usr/local/cuda-12.2
NVCC: Cuda compilation tools, release 12.2, V12.2.140
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.3.0
PyTorch compiling details: PyTorch built with:
  - GCC 11.4
  - C++ Version: 201703
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: NO AVX
  - CUDA Runtime 12.2
  - NVCC architecture flags: -gencode;arch=compute_87,code=sm_87
  - CuDNN 8.9.4
  - Build settings: BLAS_INFO=open, BUILD_TYPE=Release, CUDA_VERSION=12.2, CUDNN_VERSION=8.9.4, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, LAPACK_INFO=open, TORCH_VERSION=2.3.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=OFF, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=ON, USE_NCCL=0, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.18.0a0+6043bc2
LMDeploy: 0.6.5+c5f4014
transformers: 4.45.0
gradio: 3.35.2
fastapi: 0.112.0
pydantic: 2.10.4
triton: 3.0.0

Error traceback

chat_template_config:
ChatTemplateConfig(model_name='qwen2d5', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, tool=None, eotool=None, separator=None, capability='chat', stop_words=None)
engine_cfg:
TurbomindEngineConfig(dtype='auto', model_format=None, tp=1, session_len=32768, max_batch_size=1, cache_max_entry_count=0.001, cache_chunk_size=1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
Traceback (most recent call last):
  File "/home/gac/lmdeploy/test.py", line 189, in <module>
    main('/home/gac/Desktop/qwen2d5-turbomind')
  File "/home/gac/lmdeploy/test.py", line 116, in main
    tm_model = tm.TurboMind.from_pretrained(model_path,
  File "/home/gac/lmdeploy/lmdeploy/turbomind/turbomind.py", line 303, in from_pretrained
    return cls(model_path=pretrained_model_name_or_path,
  File "/home/gac/lmdeploy/lmdeploy/turbomind/turbomind.py", line 106, in __init__
    self.model_comm = self._from_workspace(
  File "/home/gac/lmdeploy/lmdeploy/turbomind/turbomind.py", line 272, in _from_workspace
    self._create_weight(model_comm)
  File "/home/gac/lmdeploy/lmdeploy/turbomind/turbomind.py", line 153, in _create_weight
    future.result()
  File "/home/gac/miniforge3/envs/myenv/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/gac/miniforge3/envs/myenv/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/gac/miniforge3/envs/myenv/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/gac/lmdeploy/lmdeploy/turbomind/turbomind.py", line 146, in _create_weight_func
    model_comm.create_shared_weights(device_id, rank)
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /home/gac/lmdeploy/src/turbomind/utils/memory_utils.cu:31

@Shelly-zzz

Have you solved this problem? I hit the same issue: on aarch64 (lmdeploy installed from source) it OOMs, while on x86 the same program runs without error. The standard output also differs from x86: there is no "convert to turbomind engine format" line, even though I do use the turbomind engine. Is there any solution?

@quanfeifan (Author)

@Shelly-zzz Hi, I use lmdeploy v0.4.0 installed from source, and it works.
When you build v0.4.0 you must make some changes: in generate.sh, -DBUILD_MULTI_GPU=ON must be changed to OFF, and you may also need to add -DPYTHON_EXECUTABLE= set to the output of `which python`, as sketched below.
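
For anyone following along, a hedged sketch of that v0.4.0 source build on Jetson; the flag names come from the comment above, but the exact contents of generate.sh vary by version, so treat this as an outline rather than the literal script:

```bash
# Sketch: build lmdeploy v0.4.0 from source on aarch64 (Jetson).
# generate.sh wraps the cmake configure step; after the edits
# described above, the effective invocation looks roughly like this.
git clone -b v0.4.0 https://github.com/InternLM/lmdeploy.git
cd lmdeploy && mkdir -p build && cd build
cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_MULTI_GPU=OFF \
    -DPYTHON_EXECUTABLE="$(which python)"
make -j"$(nproc)"
```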

@Shelly-zzz

ok, thanks!
