vLLM inference error: cannot get the factor field from rope_scaling #96
Comments
@Potato-wll Hello, this is caused by a mismatched vllm version; see #35 for details.
I get an error after starting with vllm: FlashAttention only supports Ampere GPUs or newer. My GPU is a T4, which cannot use FlashAttention. How and where do I turn it off?
@Potato-wll Hello, we have updated the vllm code and the corresponding image to fall back to xformers for inference when flash-attn is not supported. Please update to the latest code/image and try again.
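For anyone who wants to force the fallback explicitly rather than rely on auto-detection, here is a minimal sketch; it assumes this vllm build honors the VLLM_ATTENTION_BACKEND environment variable like upstream vllm, which the thread does not confirm, and the model path is a placeholder.

import os

# Assumption: this fork reads VLLM_ATTENTION_BACKEND like upstream vllm.
# Select the xformers backend for GPUs (e.g. T4) that FlashAttention does not support.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM  # import after setting the variable

llm = LLM(model="/path/to/Qwen2-VL-7B-Instruct")  # placeholder model path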
After installing with your latest vllm code, I still get this error:
Hello, judging from the error message, this does not appear to be the latest version matching https://github.com/fyabc/vllm/tree/add_qwen2_vl_new. Please check whether the git commit id is correct.
OK, thanks. I switched the branch and am reinstalling now. With this vllm version, does Qwen2-VL support multiple images in a single request?
Hello, after switching to this branch I still get the same error, only the line number is different. Please help take a look:
Qwen2-VL supports multiple images in a single request; see here for the specific calling convention.
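As a concrete illustration of one request carrying several images, here is a rough sketch; it assumes the OpenAI-compatible api_server used in this thread is listening on localhost:8000 and accepts the standard OpenAI vision message format, and the image URLs are placeholders.

from openai import OpenAI

# Point the standard OpenAI client at the locally running vllm server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen2-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/a.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/b.jpg"}},
            {"type": "text", "text": "Describe the differences between these two images."},
        ],
    }],
)
print(response.choices[0].message.content)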
Could you provide the config.json from the model files you downloaded?
{
"architectures": [
"Qwen2VLForConditionalGeneration"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"vision_start_token_id": 151652,
"vision_end_token_id": 151653,
"vision_token_id": 151654,
"image_token_id": 151655,
"video_token_id": 151656,
"hidden_act": "silu",
"hidden_size": 3584,
"initializer_range": 0.02,
"intermediate_size": 18944,
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen2_vl",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_theta": 1000000.0,
"sliding_window": 32768,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.41.2",
"use_cache": true,
"use_sliding_window": false,
"vision_config": {
"depth": 32,
"embed_dim": 1280,
"mlp_ratio": 4,
"num_heads": 16,
"in_chans": 3,
"hidden_size": 3584,
"patch_size": 14,
"spatial_merge_size": 2,
"spatial_patch_size": 14,
"temporal_patch_size": 2
},
"rope_scaling": {
"type": "mrope",
"mrope_section": [
16,
24,
24
]
},
"vocab_size": 152064
}
My installation steps are as follows; I am not sure whether there is a problem:
Same here. I installed the officially forked vllm version but still get this error.
Could it be caused by the issue described here: vllm-project/vllm#7905 (comment)?
+1, exact same problem
@xyfZzz @docShen @Potato-wll Hello, this should be a bug in the latest transformers version. I have filed an issue about it. For now, please install a version without the bug as follows: pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830
You can add a factor field to rope_scaling in the config file; alternatively, adding the line if rope_scaling and rope_scaling.get('type', 'default') == 'default': rope_scaling['type'] = 'mrope' just before https://github.com/fyabc/vllm/blob/6f3116c9dad0537b2858af703938aa9bf6c25bcf/vllm/model_executor/models/qwen2.py#L175 also temporarily works around the problem (see the sketch below for where that line goes).
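A minimal sketch of the intent of that second workaround follows; the surrounding code is paraphrased for illustration, not copied from the fork, and only the inserted guard comes from the comment above.

# Paraphrased context around the suggested patch in
# vllm/model_executor/models/qwen2.py.
rope_scaling = getattr(config, "rope_scaling", None)

# Suggested workaround: if the type has been rewritten to "default",
# restore it to "mrope" so the mrope code path is taken.
if rope_scaling and rope_scaling.get("type", "default") == "default":
    rope_scaling["type"] = "mrope"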
@lilin-git Thanks for the explanation. The first method mentioned here works; the second is not recommended (due to …). @xyfZzz @Potato-wll @docShen If reinstalling transformers is too much trouble, you can also use the method mentioned above.
It runs normally now, thanks!
After adding factor and modifying qwen2.py, I ran python -m vllm.entrypoints.openai.api_server --model /data/Qwen2-VL-7B-Instruct --served-model-name Qwen2-VL-7B --port 10000
Hello, please check the vllm version you are using; it does not seem to be the correct one.
Both 0.6.0 and 0.5.5 report this error.
Er, look at the link. That is not the official release; the official release does not support it yet. It is this project: https://github.com/fyabc/vllm/tree/add_qwen2_vl_new
@wuzhizhige Qwen2-VL support has not yet been merged into official vllm; please use this version: https://github.com/fyabc/vllm/tree/add_qwen2_vl_new
Hello, I installed the transformers version you specified and still have this problem. How exactly should the factor field be added in config.json?
@azuercici You can modify it as follows:
{
...
"rope_scaling": {
"type": "mrope",
"factor": 1,
"mrope_section": [
16,
24,
24
]
},
}
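If editing the JSON by hand feels error-prone, here is a small sketch that adds the field programmatically; the path is a placeholder and the factor value of 1 simply mirrors the snippet above.

import json

# Placeholder path to the downloaded model directory.
config_path = "/path/to/Qwen2-VL-7B-Instruct/config.json"

with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

# Add the missing "factor" field that the vllm assertion expects.
config.setdefault("rope_scaling", {}).setdefault("factor", 1)

with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)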
I modified both the config file and vllm, but I still get an error: NameError: name 'rod_scaling' is not defined
Note that the name should be 'rope_scaling', not 'rod_scaling'.
Using this version of transformers still produces the original error.
Passing a rope_scaling argument when initializing vLLM overrides the original config and can serve as a temporary workaround:
from vllm import LLM

model_dir = "/path/to/Qwen2-VL-7B-Instruct"  # local model directory

# Override the rope_scaling read from config.json with the values
# suggested in this thread.
llm = LLM(
    model=model_dir,
    rope_scaling={
        "type": "mrope",
        "mrope_section": [16, 24, 24],
    },
)
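For the OpenAI-compatible server used elsewhere in this thread, the same override can presumably be passed on the command line via --rope-scaling with a JSON string, for example --rope-scaling '{"type": "mrope", "factor": 1, "mrope_section": [16, 24, 24]}'; the argument shows up as rope_scaling=None in the args dump below, but the exact accepted format is not verified here.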
@fyabc It looks like even the latest transformers release has not merged this change; is there any other way?
Changing type to rope_type in config.json under the model directory works around this error.
This is the command I ran:
python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-VL-7B-Instruct --model /home/wangll/llm/model_download_demo/models/Qwen/Qwen2-VL-7B-Instruct
Here is the error output:
INFO 09-03 18:48:04 api_server.py:440] vLLM API server version 0.5.5
INFO 09-03 18:48:04 api_server.py:441] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/home/wangll/llm/model_download_demo/models/Qwen/Qwen2-VL-7B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='float16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen2-VL-7B-Instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
Traceback (most recent call last):
File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 476, in
asyncio.run(run_server(args))
File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 443, in run_server
async with build_async_engine_client(args) as async_engine_client:
File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/contextlib.py", line 199, in aenter
return await anext(self.gen)
File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 117, in build_async_engine_client
if (model_is_embedding(args.model, args.trust_remote_code,
File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 71, in model_is_embedding
return ModelConfig(model=model_name,
File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/site-packages/vllm/config.py", line 214, in init
self.max_model_len = _get_and_verify_max_len(
File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/site-packages/vllm/config.py", line 1650, in _get_and_verify_max_len
assert "factor" in rope_scaling
AssertionError
I checked the model's config.json, and its rope_scaling indeed has no factor field:
"rope_scaling": {
"type": "mrope",
"mrope_section": [
16,
24,
24
]
},
"vocab_size": 152064
}