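[Bug] On T4, inference with the qwen2-14B AWQ 4-bit model takes far longer than the same model on vLLM. Is a parameter set incorrectly? With the same configuration, performance does improve on A800 #3012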

sundayKK · 2025-01-12T12:34:58Z

Checklist

1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

Model used: the quantized qwen2-14b-chat model.
Comparing single-request latency between vLLM and TurboMind: vLLM takes about 5 s, while TurboMind inference takes close to 10 s.

Reproduction

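The corresponding launch configuration is:
from lmdeploy import pipeline, TurbomindEngineConfig

# AWQ-quantized weights, single GPU, one request at a time
backend_config = TurbomindEngineConfig(model_format='awq',
                                       tp=1,
                                       max_batch_size=1,
                                       cache_max_entry_count=0.8)
pipe = pipeline(
    model_path=self.model_path,
    backend_config=backend_config
)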
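
The 5 s and 10 s figures above are single-request wall-clock times. A minimal sketch of how such a comparison could be timed identically for both backends is shown here; it is not the script used for this report. The names `measure_latency` and `summarize` are hypothetical helpers, and the stand-in workload is a placeholder for the real call, which would need the same prompt and the same output length limit on both sides.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(generate: Callable[[], object],
                    warmup: int = 1,
                    runs: int = 3) -> List[float]:
    """Return wall-clock latencies (seconds) for `runs` calls of `generate`.

    `warmup` calls are made first and not timed, so that one-off
    startup costs do not distort the comparison.
    """
    for _ in range(warmup):
        generate()
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies: List[float]) -> Dict[str, float]:
    """Collapse a list of latencies into median, min and max."""
    return {
        "median_s": statistics.median(latencies),
        "min_s": min(latencies),
        "max_s": max(latencies),
    }


if __name__ == "__main__":
    # Stand-in workload (assumption). With the pipeline above this
    # would be something like: lambda: pipe(["<same prompt>"])
    stand_in = lambda: time.sleep(0.01)
    print(summarize(measure_latency(stand_in, warmup=1, runs=3)))
```

Reporting the median of several runs after a warmup call, together with the number of generated tokens, makes it easier to tell whether the gap is in per-token decoding speed or in a fixed startup cost.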

Environment

T4 environment. Driver Version: 535.161.08, CUDA Version: 12.2

Error traceback

No response
