[Bug]: Ultravox audio doesn't work with auto tool choice #14209

erkintelnyx opened this issue Mar 4, 2025 · 1 comment
Labels
bug Something isn't working


erkintelnyx commented Mar 4, 2025

Your current environment

The output of `python collect_env.py`
Collecting environment information...
INFO 03-04 12:10:58 [__init__.py:207] Automatically detected platform rocm.
PyTorch version: 2.7.0a0+git3a58512
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.3.42133-1b9c17779

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 18.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.3.1 24491 1e0fda770a2079fbd71e4b70974d74f62fd3af10)
CMake version: version 3.31.4
Libc version: glibc-2.35

Python version: 3.12.9 (main, Feb  5 2025, 08:49:00) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-127-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI100 (gfx908:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.3.42133
MIOpen runtime version: 3.3.0
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        48 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               16
On-line CPU(s) list:                  0-15
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 7713 64-Core Processor
CPU family:                           25
Model:                                1
Thread(s) per core:                   1
Core(s) per socket:                   16
Socket(s):                            1
Stepping:                             1
BogoMIPS:                             3999.99
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean flushbyasid pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid fsrm arch_capabilities
Virtualization:                       AMD-V
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            1 MiB (16 instances)
L1i cache:                            1 MiB (16 instances)
L2 cache:                             8 MiB (16 instances)
L3 cache:                             256 MiB (16 instances)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-15
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Mitigation; safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==26.2.1
[pip3] torch==2.7.0a0+git3a58512
[pip3] torchvision==0.19.1a0+6194369
[pip3] transformers==4.49.0
[pip3] triton==3.2.0+gite5be006a
[conda] Could not collect
ROCM Version: 6.3.42133-1b9c17779
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

🐛 Describe the bug

When I run Ultravox v0.5 via:

$ VLLM_USE_V1=1 vllm serve fixie-ai/ultravox-v0_5-llama-3_3-70b --tensor-parallel-size 8 --download-dir /app/data/models --trust-remote-code --enable-auto-tool-choice --chat-template-content-format openai --chat-template /app/vllm/examples/tool_chat_template_llama3.1_json.jinja --tool-call-parser llama3_json --enable-chunked-prefill --max-model-len 9000

INFO 03-03 18:56:52 [__init__.py:207] Automatically detected platform rocm.
INFO 03-03 18:57:09 [api_server.py:912] vLLM API server version 0.7.4.dev181+gf35f8e22.d20250303
INFO 03-03 18:57:09 [api_server.py:913] args: Namespace(subparser='serve', model_tag='fixie-ai/ultravox-v0_5-llama-3_3-70b', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template='/app/vllm/examples/tool_chat_template_llama3.1_json.jinja', chat_template_content_format='openai', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=True, tool_call_parser='llama3_json', tool_parser_plugin='', model='fixie-ai/ultravox-v0_5-llama-3_3-70b', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir='/app/data/models', load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=9000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, 
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=True, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f2598eb9e40>)
WARNING 03-03 18:57:09 [arg_utils.py:1434] Setting max_num_batched_tokens to 2048 for OPENAI_API_SERVER usage context.
INFO 03-03 18:57:34 [config.py:576] This model supports multiple tasks: {'classify', 'reward', 'score', 'embed', 'generate'}. Defaulting to 'generate'.
INFO 03-03 18:57:34 [config.py:1486] Defaulting to use mp for distributed inference
INFO 03-03 18:57:34 [config.py:1519] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 03-03 18:57:34 [config.py:1661] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-03 18:57:40 [__init__.py:207] Automatically detected platform rocm.
INFO 03-03 18:57:57 [core.py:50] Initializing a V1 LLM engine (v0.7.4.dev181+gf35f8e22.d20250303) with config: model='fixie-ai/ultravox-v0_5-llama-3_3-70b', speculative_config=None, tokenizer='fixie-ai/ultravox-v0_5-llama-3_3-70b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=9000, download_dir='/app/data/models', load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=fixie-ai/ultravox-v0_5-llama-3_3-70b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 03-03 18:57:57 [multiproc_worker_utils.py:309] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 03-03 18:57:57 [custom_cache_manager.py:19] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 03-03 18:57:57 [shm_broadcast.py:258] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3, 4, 5, 6, 7], buffer_handle=(8, 10485760, 10, 'psm_c2d15247'), local_subscribe_addr='ipc:///tmp/42d28b49-4bba-4d95-a4fd-1d3640a6b3a0', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 03-03 18:58:02 [__init__.py:207] Automatically detected platform rocm.

Both chat completion and tool calling work.
However, audio input does not work in this configuration. Here is the client script (with the base64 audio pasted inline for ease of running):

from openai import OpenAI

# Connection settings for the local vLLM OpenAI-compatible server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

text = "Tell me a fun fact!"

# audio_base64 = generate_audio(text)

# mp3 base64 for "Tell me a fun fact!"
audio_base64 = ""


client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "fixie-ai/ultravox-v0_5-llama-3_3-70b"

# Send a single user message whose only content part is the audio clip.
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": [
            {
                "type": "audio_url",
                "audio_url": {
                    # Any format supported by librosa is supported
                    "url": f"data:audio/mp3;base64,{audio_base64}"
                },
            },
        ]},
    ],
    model=model,
    stream=False,
)

print(chat_completion)
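
The `audio_base64` value in the script above is meant to hold the base64-encoded MP3. As a minimal sketch of how such a value could be produced from a local file (the helper name `audio_file_to_data_url` and the file path in the usage note are hypothetical, not part of the original report):

```python
import base64
from pathlib import Path


def audio_file_to_data_url(path: str, mime: str = "audio/mp3") -> str:
    """Read a local audio file and return it as a base64 data URL.

    Hypothetical helper for illustration; the MIME type default matches
    the "data:audio/mp3;base64,..." URL used in the request above.
    """
    raw = Path(path).read_bytes()
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

For example, `audio_file_to_data_url("fun_fact.mp3")` (an assumed local file) would give a string that can be passed directly as the `url` field of the `audio_url` content part.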

On the vLLM side, I get:

ERROR 03-03 19:06:24 [serving_chat.py:664] Error in chat completion stream generator.
ERROR 03-03 19:06:24 [serving_chat.py:664] Traceback (most recent call last):
ERROR 03-03 19:06:24 [serving_chat.py:664]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 362, in chat_completion_stream_generator
ERROR 03-03 19:06:24 [serving_chat.py:664]     async for res in result_generator:
ERROR 03-03 19:06:24 [serving_chat.py:664]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 208, in _generate
ERROR 03-03 19:06:24 [serving_chat.py:664]     q = await self.add_request(
ERROR 03-03 19:06:24 [serving_chat.py:664]         ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-03 19:06:24 [serving_chat.py:664]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 153, in add_request
ERROR 03-03 19:06:24 [serving_chat.py:664]     request = self.processor.process_inputs(request_id, prompt, params,
ERROR 03-03 19:06:24 [serving_chat.py:664]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-03 19:06:24 [serving_chat.py:664]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/processor.py", line 129, in process_inputs
ERROR 03-03 19:06:24 [serving_chat.py:664]     preprocessed_inputs = self.input_preprocessor.preprocess(
ERROR 03-03 19:06:24 [serving_chat.py:664]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-03 19:06:24 [serving_chat.py:664]   File "/usr/local/lib/python3.12/dist-packages/vllm/inputs/preprocess.py", line 766, in preprocess
ERROR 03-03 19:06:24 [serving_chat.py:664]     return self._process_decoder_only_prompt(
ERROR 03-03 19:06:24 [serving_chat.py:664]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-03 19:06:24 [serving_chat.py:664]   File "/usr/local/lib/python3.12/dist-packages/vllm/inputs/preprocess.py", line 715, in _process_decoder_only_prompt
ERROR 03-03 19:06:24 [serving_chat.py:664]     prompt_comps = self._prompt_to_llm_inputs(
ERROR 03-03 19:06:24 [serving_chat.py:664]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-03 19:06:24 [serving_chat.py:664]   File "/usr/local/lib/python3.12/dist-packages/vllm/inputs/preprocess.py", line 347, in _prompt_to_llm_inputs
ERROR 03-03 19:06:24 [serving_chat.py:664]     return self._process_multimodal(
ERROR 03-03 19:06:24 [serving_chat.py:664]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-03 19:06:24 [serving_chat.py:664]   File "/usr/local/lib/python3.12/dist-packages/vllm/inputs/preprocess.py", line 277, in _process_multimodal
ERROR 03-03 19:06:24 [serving_chat.py:664]     return mm_processor.apply(prompt, mm_data, mm_processor_kwargs)
ERROR 03-03 19:06:24 [serving_chat.py:664]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-03 19:06:24 [serving_chat.py:664]   File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/processing.py", line 1513, in apply
ERROR 03-03 19:06:24 [serving_chat.py:664]     self._validate_mm_placeholders(mm_placeholders, mm_item_counts)
ERROR 03-03 19:06:24 [serving_chat.py:664]   File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/processing.py", line 1423, in _validate_mm_placeholders
ERROR 03-03 19:06:24 [serving_chat.py:664]     raise RuntimeError(
ERROR 03-03 19:06:24 [serving_chat.py:664] RuntimeError: Expected there to be 1 prompt updates corresponding to 1 audio items, but instead found 0 prompt updates! Either the prompt text has missing/incorrect tokens for multi-modal inputs, or there is a problem with your implementation of merged multi-modal processor for this model (usually arising from an inconsistency between `_call_hf_processor` and `_get_prompt_updates`).
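
The error above says that one audio item was supplied but the rendered prompt contained no matching placeholder for it. The following is only an illustrative sketch of that kind of consistency check, not vLLM's actual implementation; the function name is hypothetical and the `<|audio|>` placeholder string is an assumption about how the audio slot appears in the prompt text.

```python
def validate_audio_placeholders(
    prompt: str,
    num_audio_items: int,
    placeholder: str = "<|audio|>",  # assumed placeholder string
) -> None:
    """Raise if the prompt does not contain one placeholder per audio item.

    Illustrative only: mirrors the shape of the check reported in the
    traceback, not the real logic in vllm/multimodal/processing.py.
    """
    found = prompt.count(placeholder)
    if found != num_audio_items:
        raise RuntimeError(
            f"Expected there to be {num_audio_items} prompt updates "
            f"corresponding to {num_audio_items} audio items, "
            f"but instead found {found} prompt updates!"
        )
```

Under this sketch, a prompt rendered without any audio placeholder fails the check for a request carrying one audio item, which matches the "found 0 prompt updates" wording in the log. Inspecting the final prompt string that the chat template produces may therefore help narrow down where the placeholder is lost.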

The result is the same without specifying a chat template.

This works when auto tool choice is not enabled (that is, without --enable-auto-tool-choice).

I found this thread, which mentions that it should work.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@DarkLight1337 (Member) commented:

cc @farzadab
