[Bug]: Unable to serve Qwen2-audio in V1 #12168

superfan89 · 2025-01-17T14:30:27Z

Your current environment

The output of `python collect_env.py`

INFO 01-17 22:19:48 __init__.py:179] Automatically detected platform cuda.
Collecting environment information...
PyTorch version: 2.5.1+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 10.3.0-1ubuntu1~18.04~1) 10.3.0
Clang version: Could not collect
CMake version: version 3.31.2
Libc version: glibc-2.27

Python version: 3.12.8 | packaged by Anaconda, Inc. | (main, Dec 11 2024, 16:31:09) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-169-generic-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB

Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  1
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        4
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7763 64-Core Processor
Stepping:            1
CPU MHz:             2635.266
CPU max MHz:         2450.0000
CPU min MHz:         1500.0000
BogoMIPS:            4890.87
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-31
NUMA node1 CPU(s):   32-63
NUMA node2 CPU(s):   64-95
NUMA node3 CPU(s):   96-127
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca sme sev sev_es

Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] nvidia-cublas-cu11==11.11.3.6
[pip3] nvidia-cuda-cupti-cu11==11.8.87
[pip3] nvidia-cuda-nvrtc-cu11==11.8.89
[pip3] nvidia-cuda-runtime-cu11==11.8.89
[pip3] nvidia-cudnn-cu11==9.1.0.70
[pip3] nvidia-cufft-cu11==10.9.0.58
[pip3] nvidia-curand-cu11==10.3.0.86
[pip3] nvidia-cusolver-cu11==11.4.1.48
[pip3] nvidia-cusparse-cu11==11.7.5.86
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu11==2.21.5
[pip3] nvidia-nvtx-cu11==11.8.86
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.1+cu118
[pip3] torchaudio==2.5.1+cu118
[pip3] torchvision==0.20.1+cu118
[pip3] transformers==4.48.0
[pip3] triton==3.1.0
[conda] numpy                     1.26.3                   pypi_0    pypi
[conda] nvidia-cublas-cu11        11.11.3.6                pypi_0    pypi
[conda] nvidia-cuda-cupti-cu11    11.8.87                  pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu11    11.8.89                  pypi_0    pypi
[conda] nvidia-cuda-runtime-cu11  11.8.89                  pypi_0    pypi
[conda] nvidia-cudnn-cu11         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu11         10.9.0.58                pypi_0    pypi
[conda] nvidia-curand-cu11        10.3.0.86                pypi_0    pypi
[conda] nvidia-cusolver-cu11      11.4.1.48                pypi_0    pypi
[conda] nvidia-cusparse-cu11      11.7.5.86                pypi_0    pypi
[conda] nvidia-ml-py              12.560.30                pypi_0    pypi
[conda] nvidia-nccl-cu11          2.21.5                   pypi_0    pypi
[conda] nvidia-nvtx-cu11          11.8.86                  pypi_0    pypi
[conda] pyzmq                     26.2.0                   pypi_0    pypi
[conda] torch                     2.5.1+cu118              pypi_0    pypi
[conda] torchaudio                2.5.1+cu118              pypi_0    pypi
[conda] torchvision               0.20.1+cu118             pypi_0    pypi
[conda] transformers              4.48.0                   pypi_0    pypi
[conda] triton                    3.1.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.6.post2.dev249+gb8bfa46a
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    NIC2    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    SYS     PXB     SYS     0-31    0               N/A
GPU1    NV12     X      NV12    NV12    SYS     PXB     SYS     0-31    0               N/A
GPU2    NV12    NV12     X      NV12    SYS     SYS     PXB     96-127  3               N/A
GPU3    NV12    NV12    NV12     X      SYS     SYS     PXB     96-127  3               N/A
NIC0    SYS     SYS     SYS     SYS      X      SYS     SYS
NIC1    PXB     PXB     SYS     SYS     SYS      X      SYS
NIC2    SYS     SYS     PXB     PXB     SYS     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2

LD_LIBRARY_PATH=/xxx/.conda/envs/vllm_v1/lib/python3.12/site-packages/cv2/../../lib64:/xxx/.local/bin:/usr/local/cuda-11.7/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NVIDIA_VISIBLE_DEVICES=GPU-dad83af5-de81-eafa-7fd4-ff1b5e460e6e,GPU-2fd3e2ae-f180-a811-3484-2ec565c2d55c,GPU-0d90c382-438d-0572-8fc0-751f6d5fcc69,GPU-77f75057-c7a9-a82f-5a47-3b17a4bc973e
NVIDIA_PRODUCT_NAME=CUDA
NCCL_VERSION=2.13.4-1
NVIDIA_CUDA_END_OF_LIFE=1
PYTORCH_VERSION=v2.0.0
CUDA_VERSION=11.7.0
NVIDIA_DRIVER_CAPABILITIES=video,compute,utility,graphics
NVIDIA_REQUIRE_CUDA=cuda>=11.7 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=510,driver<511 brand=unknown,driver>=510,driver<511 brand=nvidia,driver>=510,driver<511 brand=nvidiartx,driver>=510,driver<511 brand=quadro,driver>=510,driver<511 brand=quadrortx,driver>=510,driver<511 brand=titan,driver>=510,driver<511 brand=titanrtx,driver>=510,driver<511 brand=geforce,driver>=510,driver<511 brand=geforcertx,driver>=510,driver<511
CUDA_MODULE_LOADING=LAZY

Model Input Dumps

No response

🐛 Describe the bug

Failed to serve Qwen2-Audio with the V1 engine (I would like to use V1 to enable prefix caching):

VLLM_TRACE_FUNCTION=1 NCCL_DEBUG=TRACE VLLM_LOGGING_LEVEL=DEBUG VLLM_USE_V1=1 VLLM_ENABLE_V1_MULTIPROCESSING=1 vllm serve /xxx/omni/Qwen2-Audio/Qwen2-Audio-7B-Instruct --limit_mm_per_prompt 'audio=5'

Traceback:

INFO 01-17 22:16:00 __init__.py:179] Automatically detected platform cuda.                        
INFO 01-17 22:16:03 api_server.py:768] vLLM API server version 0.6.6.post2.dev249+gb8bfa46a       
INFO 01-17 22:16:03 api_server.py:769] args: Namespace(subparser='serve', model_tag='/xxx/omni/Qwen2-Audio/Qwen2-Audio-7B-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/xxx/omni/Qwen2-Audio/Qwen2-Audio-7B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt={'audio': 5}, mm_processor_kwargs=None, 
disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7fde8a988b80>)                                                                       
WARNING 01-17 22:16:03 arg_utils.py:1283] Setting max_num_batched_tokens to 2048 for OPENAI_API_SERVER usage context.                  
INFO 01-17 22:16:21 config.py:520] This model supports multiple tasks: {'embed', 'score', 'classify', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 01-17 22:16:21 config.py:1482] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 01-17 22:16:35 __init__.py:179] Automatically detected platform cuda.                                                                                                                                                                                                  
INFO 01-17 22:16:38 core.py:45] Initializing an LLM engine (v0.6.6.post2.dev249+gb8bfa46a) with config: model='/xxx/omni/Qwen2-Audio/Qwen2-Audio-7B-Instruct', speculative_config=None, tokenizer='/xxx/omni/Qwen2-Audio/Qwen2-Audio-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/xxx/omni/Qwen2-Audio/Qwen2-Audio-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"candidate_compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"compile_sizes":[],"capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}                                                                                                                                   
INFO 01-17 22:16:41 gpu_model_runner.py:688] Starting to load model /xxx/omni/Qwen2-Audio/Qwen2-Audio-7B-Instruct...                                                                                                                                            
INFO 01-17 22:16:42 cuda.py:179] Using Flash Attention backend on V1 engine.                                                                                                                                                                                                
WARNING 01-17 22:16:42 topk_topp_sampler.py:44] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.                                                              
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]                                                                                                                                                                                                
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:02,  1.52it/s]                                                                                                                                                                                        
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:02,  1.36it/s]                                                                                                                                                                                        
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01,  1.18it/s]                                                                                                                                                                                        
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:03<00:00,  1.17it/s]                                                                                                                                                                                        
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.45it/s]                                                                                                                                                                                        
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.35it/s]                                                                                                                                                                                        
                                                                                                                                                                                                                                                                            
INFO 01-17 22:16:46 gpu_model_runner.py:693] Loading model weights took 15.6454 GB                                                                                                                                                                                          
INFO 01-17 22:16:46 gpu_model_runner.py:767] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 3 audio items of the maximum feature size.                                                                                                   
ERROR 01-17 22:16:46 core.py:205] EngineCore hit an exception: Traceback (most recent call last):                                                                                                                                                                           
ERROR 01-17 22:16:46 core.py:205]   File "/xxx/code/vllm_v1/vllm/inputs/registry.py", line 160, in call_hf_processor                                                                                                                                            
ERROR 01-17 22:16:46 core.py:205]     return hf_processor(**data, **merged_kwargs, return_tensors="pt")                                                                                                                                                                     
ERROR 01-17 22:16:46 core.py:205]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                     
ERROR 01-17 22:16:46 core.py:205]   File "/home/yyy/.conda/envs/vllm_v1/lib/python3.12/site-packages/transformers/models/qwen2_audio/processing_qwen2_audio.py", line 115, in __call__                                                                                
ERROR 01-17 22:16:46 core.py:205]     num_audios = 1 if type(audios) == np.ndarray else len(audios)                                                                                                                                                                         
ERROR 01-17 22:16:46 core.py:205]                                                       ^^^^^^^^^^^                                                                                                                                                                         
ERROR 01-17 22:16:46 core.py:205] TypeError: object of type 'NoneType' has no len()                                                                                                                                                                                         
ERROR 01-17 22:16:46 core.py:205]                                                                                                                                                                                                                                           
ERROR 01-17 22:16:46 core.py:205] The above exception was the direct cause of the following exception:                                                                                                                                                                      
ERROR 01-17 22:16:46 core.py:205] 
ERROR 01-17 22:16:46 core.py:205] Traceback (most recent call last):
ERROR 01-17 22:16:46 core.py:205]   File "/xxx/code/vllm_v1/vllm/v1/engine/core.py", line 197, in run_engine_core
ERROR 01-17 22:16:46 core.py:205]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 01-17 22:16:46 core.py:205]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 22:16:46 core.py:205]   File "/xxx/code/vllm_v1/vllm/v1/engine/core.py", line 151, in __init__
ERROR 01-17 22:16:46 core.py:205]     super().__init__(vllm_config, executor_class)
ERROR 01-17 22:16:46 core.py:205]   File "/xxx/code/vllm_v1/vllm/v1/engine/core.py", line 52, in __init__
ERROR 01-17 22:16:46 core.py:205]     num_gpu_blocks, num_cpu_blocks = self._initialize_kv_caches(
ERROR 01-17 22:16:46 core.py:205]                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 22:16:46 core.py:205]   File "/xxx/code/vllm_v1/vllm/v1/engine/core.py", line 77, in _initialize_kv_caches
ERROR 01-17 22:16:46 core.py:205]     availble_gpu_memory = self.model_executor.determine_available_memory()
ERROR 01-17 22:16:46 core.py:205]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 22:16:46 core.py:205]   File "/xxx/code/vllm_v1/vllm/v1/executor/uniproc_executor.py", line 57, in determine_available_memory
ERROR 01-17 22:16:46 core.py:205]     return self.worker.determine_available_memory()
ERROR 01-17 22:16:46 core.py:205]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 22:16:46 core.py:205]   File "/home/yyy/.conda/envs/vllm_v1/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 01-17 22:16:46 core.py:205]     return func(*args, **kwargs)
ERROR 01-17 22:16:46 core.py:205]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 22:16:46 core.py:205]   File "/xxx/code/vllm_v1/vllm/v1/worker/gpu_worker.py", line 134, in determine_available_memory
ERROR 01-17 22:16:46 core.py:205]     self.model_runner.profile_run()
ERROR 01-17 22:16:46 core.py:205]   File "/xxx/code/vllm_v1/vllm/v1/worker/gpu_model_runner.py", line 773, in profile_run
ERROR 01-17 22:16:46 core.py:205]     dummy_request_data = self.input_registry.dummy_data_for_profiling(
ERROR 01-17 22:16:46 core.py:205]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 22:16:46 core.py:205]   File "/xxx/code/vllm_v1/vllm/inputs/registry.py", line 333, in dummy_data_for_profiling
ERROR 01-17 22:16:46 core.py:205]     dummy_data = profiler.get_dummy_data(seq_len)
ERROR 01-17 22:16:46 core.py:205]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 22:16:46 core.py:205]   File "/xxx/code/vllm_v1/vllm/multimodal/profiling.py", line 161, in get_dummy_data
ERROR 01-17 22:16:46 core.py:205]     mm_inputs = self._get_dummy_mm_inputs(seq_len, mm_counts)
ERROR 01-17 22:16:46 core.py:205]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 22:16:46 core.py:205]   File "/xxx/code/vllm_v1/vllm/multimodal/profiling.py", line 139, in _get_dummy_mm_inputs
ERROR 01-17 22:16:46 core.py:205]     return self.processor.apply(
ERROR 01-17 22:16:46 core.py:205]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 22:16:46 core.py:205]   File "/xxx/code/vllm_v1/vllm/multimodal/processing.py", line 1104, in apply
ERROR 01-17 22:16:46 core.py:205]     prompt_ids, mm_kwargs = self._cached_apply_hf_processor(
ERROR 01-17 22:16:46 core.py:205]                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 22:16:46 core.py:205]   File "/xxx/code/vllm_v1/vllm/multimodal/processing.py", line 880, in _cached_apply_hf_processor
ERROR 01-17 22:16:46 core.py:205]     prompt_ids, mm_missing_kwargs = self._apply_hf_processor_main(
ERROR 01-17 22:16:46 core.py:205]                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 22:16:46 core.py:205]   File "/xxx/code/vllm_v1/vllm/multimodal/processing.py", line 826, in _apply_hf_processor_main
ERROR 01-17 22:16:46 core.py:205]     prompt_ids = self._apply_hf_processor_text_only(prompt)
ERROR 01-17 22:16:46 core.py:205]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 22:16:46 core.py:205]   File "/xxx/code/vllm_v1/vllm/multimodal/processing.py", line 753, in _apply_hf_processor_text_only
ERROR 01-17 22:16:46 core.py:205]     prompt_ids, _ = self._apply_hf_processor_text_mm(
ERROR 01-17 22:16:46 core.py:205]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 22:16:46 core.py:205]   File "/xxx/code/vllm_v1/vllm/multimodal/processing.py", line 729, in _apply_hf_processor_text_mm
ERROR 01-17 22:16:46 core.py:205]     processed_data = self._call_hf_processor(
ERROR 01-17 22:16:46 core.py:205]                      ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 22:16:46 core.py:205]   File "/xxx/code/vllm_v1/vllm/model_executor/models/qwen2_audio.py", line 171, in _call_hf_processor
ERROR 01-17 22:16:46 core.py:205]     processed_outputs = super()._call_hf_processor(
ERROR 01-17 22:16:46 core.py:205]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 22:16:46 core.py:205]   File "/xxx/code/vllm_v1/vllm/multimodal/processing.py", line 711, in _call_hf_processor
ERROR 01-17 22:16:46 core.py:205]     return self.info.ctx.call_hf_processor(
ERROR 01-17 22:16:46 core.py:205]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-17 22:16:46 core.py:205]   File "/xxx/code/vllm_v1/vllm/inputs/registry.py", line 165, in call_hf_processor
ERROR 01-17 22:16:46 core.py:205]     raise RuntimeError(msg) from exc
ERROR 01-17 22:16:46 core.py:205] RuntimeError: Failed to apply Qwen2AudioProcessor on data={'text': '<|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|>'} with kwargs={}
ERROR 01-17 22:16:46 core.py:205]
CRITICAL 01-17 22:16:46 core_client.py:146] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
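
For readers following the traceback above: the first exception comes from `len(audios)` being evaluated while `audios` is `None`, which appears to happen because the profiling run hands the processor text only (`data={'text': ...}`). Below is a minimal Python sketch of that failure mode. `Waveform`, `count_audios`, and `count_audios_guarded` are hypothetical stand-ins (`Waveform` replaces `np.ndarray` so the snippet needs no third-party package); the guard only illustrates one possible defensive check and is not taken from transformers or from any vLLM fix.

```python
class Waveform:
    """Hypothetical stand-in for np.ndarray (keeps the sketch dependency-free)."""


def count_audios(audios):
    # Mirrors the expression shown in the traceback
    # (processing_qwen2_audio.py, line 115): a single array counts as one
    # audio, anything else is assumed to support len().
    return 1 if type(audios) == Waveform else len(audios)


def count_audios_guarded(audios):
    # Hypothetical guard (an assumption, not the upstream fix):
    # treat a missing audio argument as "zero audios".
    if audios is None:
        return 0
    return count_audios(audios)


# A text-only call leaves `audios` as None, reproducing the TypeError.
try:
    count_audios(None)
except TypeError as exc:
    error_message = str(exc)

print(error_message)
print(count_audios_guarded(None))
```

The point of the sketch is only that a text-only call reaches `len(None)`; whether the right remedy is a guard in the processor or a change in how the caller invokes it is outside what the traceback alone shows.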

commit id=87a0c076afafb93dd082ff3876bea08adca56c56

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

DarkLight1337 · 2025-01-17T15:05:35Z

~~This should be fixed if you install the latest code (not the latest release).~~ Let me look into this...

DarkLight1337 · 2025-01-17T15:15:42Z

Hmm, I'm able to run this model if I set --max-model-len 4096. Do you get a similar result?

DarkLight1337 · 2025-01-17T15:16:31Z

Maybe you have to update your local HF repo as the HF processor for this model changed recently.

superfan89 · 2025-01-18T03:15:27Z

> Maybe you have to update your local HF repo as the HF processor for this model changed recently.

Thanks @DarkLight1337! I was actually using transformers=4.48.0 and the latest local vLLM build when I encountered the above issue. After I downgraded to transformers=4.47.1, the model loaded without any issue. I think this is caused by this HF change introduced in 4.48.0?
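
To check quickly whether an environment matches the combination reported here, a small version test can help. This is only a sketch: the thread confirms that 4.48.0 failed and 4.47.1 worked on a vLLM build predating #12187, so treating every release from 4.48.0 upward as affected is an assumption, and `release_tuple`, `may_hit_issue`, and `installed_transformers_version` are hypothetical helper names.

```python
from importlib import metadata


def release_tuple(version: str) -> tuple[int, ...]:
    """Parse the leading 'major.minor.patch' digits of a version string."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as '+cu118' or 'rc1'
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def may_hit_issue(transformers_version: str) -> bool:
    # Assumption: the behaviour change arrived in 4.48.0, as suggested in this
    # thread. Only relevant for vLLM builds without the fix from #12187.
    return release_tuple(transformers_version) >= (4, 48, 0)


def installed_transformers_version():
    # Returns None when transformers is not installed in this environment.
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


found = installed_transformers_version()
if found is not None:
    print(found, "-> may hit issue:", may_hit_issue(found))
```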

DarkLight1337 · 2025-01-18T14:17:42Z

This issue should be fixed in #12187, can you try it out?

superfan89 · 2025-01-20T13:50:51Z

> This issue should be fixed in #12187, can you try it out?

Thanks for taking action. I verified that the issue was fixed with #12187.

superfan89 added the bug label Jan 17, 2025

DarkLight1337 mentioned this issue Jan 19, 2025

[Bugfix] Fix multi-modal processors for transformers 4.48 #12187

Merged

simon-mo closed this as completed in #12187 Jan 19, 2025
