Checklist
1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submit lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve it, reducing the likelihood of receiving feedback.
Describe the bug
I quantized qwen2.5-7b to w4a16 and ran it with lmdeploy/turbomind/chat.py. On x86 Ubuntu 20.04 (3060 Ti, 8 GB) it runs fine, but on a Jetson Orin NX (16 GB) it hits OOM, even with cache_max_entry_count set to 0.01. Monitoring with jtop shows memory is nowhere near exhausted, so this is currently hard to debug.
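For reference, what chat.py does is roughly equivalent to the pipeline call below, with the memory knobs I tried turned down; a minimal sketch (the model path is the one from the traceback below, and the knob values are illustrative):

```python
# Rough equivalent of running lmdeploy/turbomind/chat.py with reduced
# memory settings; path and values are illustrative, not a fix.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(
    session_len=4096,            # far below the 32768 in the log below
    max_batch_size=1,
    cache_max_entry_count=0.01,  # fraction of free GPU memory for the KV cache
)
pipe = pipeline('/home/gac/Desktop/qwen2d5-turbomind', backend_config=engine_cfg)
print(pipe('hello'))
```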
Reproduction
Run lmdeploy/turbomind/chat.py with model_path set to the quantized qwen2.5 model; the same code works on x86 but not on Jetson. Both machines use the latest main of lmdeploy (0.6.5) with CUDA 12, built from source following https://github.com/InternLM/lmdeploy/blob/main/docs/en/get_started/installation.md#install-from-source; on Jetson the build produced build/lib/_turbomind.cpython-310-aarch64-linux-gnu.so.

Environment
No response

Error traceback
chat_template_config:
ChatTemplateConfig(model_name='qwen2d5', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, tool=None, eotool=None, separator=None, capability='chat', stop_words=None)
engine_cfg:
TurbomindEngineConfig(dtype='auto', model_format=None, tp=1, session_len=32768, max_batch_size=1, cache_max_entry_count=0.001, cache_chunk_size=1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
Traceback (most recent call last):
File "/home/gac/lmdeploy/test.py", line 189, in
main('/home/gac/Desktop/qwen2d5-turbomind')
File "/home/gac/lmdeploy/test.py", line 116, in main
tm_model = tm.TurboMind.from_pretrained(model_path,
File "/home/gac/lmdeploy/lmdeploy/turbomind/turbomind.py", line 303, in from_pretrained
return cls(model_path=pretrained_model_name_or_path,
File "/home/gac/lmdeploy/lmdeploy/turbomind/turbomind.py", line 106, in init
self.model_comm = self._from_workspace(
File "/home/gac/lmdeploy/lmdeploy/turbomind/turbomind.py", line 272, in _from_workspace
self._create_weight(model_comm)
File "/home/gac/lmdeploy/lmdeploy/turbomind/turbomind.py", line 153, in _create_weight
future.result()
File "/home/gac/miniforge3/envs/myenv/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/home/gac/miniforge3/envs/myenv/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/home/gac/miniforge3/envs/myenv/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/gac/lmdeploy/lmdeploy/turbomind/turbomind.py", line 146, in _create_weight_func
model_comm.create_shared_weights(device_id, rank)
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /home/gac/lmdeploy/src/turbomind/utils/memory_utils.cu:31
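Note that the failure is raised from create_shared_weights, i.e. while allocating model weights, before the KV cache is reserved, which may explain why lowering cache_max_entry_count changes nothing. Since jtop suggests memory is far from exhausted, it may help to ask the CUDA runtime directly how much memory it sees as free just before loading; a small sketch, assuming PyTorch with working CUDA in the same environment:

```python
# Query the CUDA runtime for free/total device memory right before
# TurboMind starts allocating weights; assumes a working torch + CUDA setup.
import torch

free, total = torch.cuda.mem_get_info()  # both values are in bytes
print(f'CUDA free: {free / 2**30:.2f} GiB, total: {total / 2**30:.2f} GiB')
```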
Have you solved this problem? I met the same issue: on aarch64 (lmdeploy compiled from source) it hits OOM, while on x86 the same program runs without error. The standard output also differs from x86: there is no "convert to turbomind engine format" message, even though I am using the TurboMind engine. Is there any solution?
@Shelly-zzz Hi, I use lmdeploy v0.4.0 installed from source, and it works. With v0.4.0 you need to make a couple of changes: in generate.sh, -DBUILD_MULTI_GPU=ON must be changed to OFF, and you may also need to add -DPYTHON_EXECUTABLE=(the output of which python).
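A quick sanity check that may also help on aarch64 is verifying that the extension produced by the source build actually loads. This is a hypothetical snippet, with the build/lib path taken from the report above (it may differ per setup):

```python
# Hypothetical check that the compiled TurboMind extension imports cleanly;
# the path below comes from the report above and may differ on your machine.
import sys
sys.path.insert(0, '/home/gac/lmdeploy/build/lib')

import _turbomind  # loads _turbomind.cpython-310-aarch64-linux-gnu.so
print('loaded TurboMind extension from', _turbomind.__file__)
```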