[Bug] OOM in jetson but not in x86 #3006

Open

quanfeifan opened this issue Jan 9, 2025 · 3 comments

quanfeifan commented Jan 9, 2025

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

I quantized qwen2.5-7b to w4a16 and ran it with lmdeploy/turbomind/chat.py. On x86 Ubuntu 20.04 (3060 Ti, 8 GB) it runs fine, but on a Jetson Orin NX (16 GB) it hits OOM, even with cache_max_entry_count set as low as 0.01. Monitoring with jtop shows memory usage is nowhere near the limit, so this is hard to troubleshoot at the moment.
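
For context, a minimal sketch (not the reporter's exact script) of how that option is lowered through lmdeploy's Python API; `pipeline` and `TurbomindEngineConfig` are the public entry points, and the model path here is a placeholder:

```python
# Minimal sketch: load the quantized model with a reduced KV-cache
# budget. cache_max_entry_count is the fraction of free GPU memory
# reserved for the KV cache; 0.01 is the value tried in this report.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(cache_max_entry_count=0.01,
                                   max_batch_size=1)
# Placeholder path: point this at the quantized qwen2.5 model.
pipe = pipeline('/path/to/qwen2.5-7b-w4a16', backend_config=engine_cfg)
print(pipe(['Hello!']))
```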

Reproduction

Run lmdeploy/turbomind/chat.py with model_path pointing at the quantized qwen2.5; the same code works on x86 but fails on Jetson.
Both the x86 and Jetson machines run the latest main of lmdeploy (0.6.5) with CUDA 12. The Jetson build was compiled from source following https://github.com/InternLM/lmdeploy/blob/main/docs/en/get_started/installation.md#install-from-source, and build/lib/_turbomind.cpython-310-aarch64-linux-gnu.so was generated successfully.
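
A minimal sketch of the failing call path, reconstructed from the traceback below; the model path is taken from that traceback, and passing the config via `engine_config=` is an assumption about `TurboMind.from_pretrained`'s keyword arguments:

```python
# Sketch of the call that OOMs on Jetson but not on x86. Note that
# the traceback shows the OOM is raised in create_shared_weights,
# i.e. while the weights are being allocated, before the KV cache
# is reserved, so lowering cache_max_entry_count cannot help there.
import lmdeploy.turbomind as tm
from lmdeploy import TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(tp=1, max_batch_size=1,
                                   session_len=32768,
                                   cache_max_entry_count=0.001)
tm_model = tm.TurboMind.from_pretrained(
    '/home/gac/Desktop/qwen2d5-turbomind',  # path from the traceback
    engine_config=engine_cfg)
```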

Environment

sys.platform: linux
Python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 21:44:20) [GCC 12.3.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: Orin
CUDA_HOME: /usr/local/cuda-12.2
NVCC: Cuda compilation tools, release 12.2, V12.2.140
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.3.0
PyTorch compiling details: PyTorch built with:
  - GCC 11.4
  - C++ Version: 201703
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: NO AVX
  - CUDA Runtime 12.2
  - NVCC architecture flags: -gencode;arch=compute_87,code=sm_87
  - CuDNN 8.9.4
  - Build settings: BLAS_INFO=open, BUILD_TYPE=Release, CUDA_VERSION=12.2, CUDNN_VERSION=8.9.4, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, LAPACK_INFO=open, TORCH_VERSION=2.3.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=OFF, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=ON, USE_NCCL=0, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.18.0a0+6043bc2
LMDeploy: 0.6.5+c5f4014
transformers: 4.45.0
gradio: 3.35.2
fastapi: 0.112.0
pydantic: 2.10.4
triton: 3.0.0

Error traceback

chat_template_config:
ChatTemplateConfig(model_name='qwen2d5', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, tool=None, eotool=None, separator=None, capability='chat', stop_words=None)
engine_cfg:
TurbomindEngineConfig(dtype='auto', model_format=None, tp=1, session_len=32768, max_batch_size=1, cache_max_entry_count=0.001, cache_chunk_size=1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
Traceback (most recent call last):
  File "/home/gac/lmdeploy/test.py", line 189, in <module>
    main('/home/gac/Desktop/qwen2d5-turbomind')
  File "/home/gac/lmdeploy/test.py", line 116, in main
    tm_model = tm.TurboMind.from_pretrained(model_path,
  File "/home/gac/lmdeploy/lmdeploy/turbomind/turbomind.py", line 303, in from_pretrained
    return cls(model_path=pretrained_model_name_or_path,
  File "/home/gac/lmdeploy/lmdeploy/turbomind/turbomind.py", line 106, in __init__
    self.model_comm = self._from_workspace(
  File "/home/gac/lmdeploy/lmdeploy/turbomind/turbomind.py", line 272, in _from_workspace
    self._create_weight(model_comm)
  File "/home/gac/lmdeploy/lmdeploy/turbomind/turbomind.py", line 153, in _create_weight
    future.result()
  File "/home/gac/miniforge3/envs/myenv/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/gac/miniforge3/envs/myenv/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/gac/miniforge3/envs/myenv/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/gac/lmdeploy/lmdeploy/turbomind/turbomind.py", line 146, in _create_weight_func
    model_comm.create_shared_weights(device_id, rank)
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /home/gac/lmdeploy/src/turbomind/utils/memory_utils.cu:31

@Shelly-zzz

Have you solved this problem? I hit the same issue: on aarch64 (lmdeploy installed from source) it OOMs, while on x86 the same program runs without error. The standard output also differs from x86: there is no "convert to turbomind engine format" line, even though I do use the turbomind engine. Is there any solution?

@quanfeifan (Author)

@Shelly-zzz Hi, I use lmdeploy v0.4.0 installed from source, and it works.
When you build v0.4.0 you must make some changes: in generate.sh, -DBUILD_MULTI_GPU=ON must be changed to OFF, and you may also need to add -DPYTHON_EXECUTABLE= set to the output of `which python`, as sketched below.
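
For anyone following along, a hedged sketch of that v0.4.0 source build on Jetson; the flag names come from the comment above, but the exact contents of generate.sh vary by version, so treat this as an outline rather than the literal script:

```bash
# Sketch: build lmdeploy v0.4.0 from source on aarch64 (Jetson).
# generate.sh wraps the cmake configure step; after the edits
# described above, the effective invocation looks roughly like this.
git clone -b v0.4.0 https://github.com/InternLM/lmdeploy.git
cd lmdeploy && mkdir -p build && cd build
cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_MULTI_GPU=OFF \
    -DPYTHON_EXECUTABLE="$(which python)"
make -j"$(nproc)"
```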

@Shelly-zzz

ok, thanks!
