
[Bug]: vLLM 0.5.5 and FlashInfer 0.1.6 #8091

Open
wlwqq opened this issue Sep 2, 2024 · 8 comments
Labels: bug (Something isn't working), unstale

Comments

wlwqq commented Sep 2, 2024

Your current environment

The relevant output of `python collect_env.py`:

vllm == 0.5.5
FlashInfer==0.1.6+cu121torch2.4
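
For reference, a quick way to confirm the installed versions from the same Python environment (the distribution name `flashinfer` is an assumption; it may differ depending on which wheel was installed):

```python
# Print the installed vllm and flashinfer versions from the active environment.
from importlib.metadata import version

print("vllm:", version("vllm"))              # expected: 0.5.5
print("flashinfer:", version("flashinfer"))  # expected: 0.1.6+cu121torch2.4
```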

🐛 Describe the bug

When I use vLLM 0.5.5 with FlashInfer 0.1.6 to run Gemma-2-2b on a T4, the engine fails to start.
FlashInfer 0.1.6 supports the T4: https://github.com/flashinfer-ai/flashinfer/releases
But I see:

INFO 09-02 16:07:55 model_runner.py:890] Loading model weights took 4.8999 GB
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 230, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 31, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 740, in from_engine_args
    engine = cls(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 636, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 840, in _init_engine
    return engine_class(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 272, in __init__
    super().__init__(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 284, in __init__
    self._initialize_kv_caches()
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 390, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/opt/conda/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 113, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 222, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1097, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1415, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/gemma2.py", line 342, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/gemma2.py", line 281, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/gemma2.py", line 225, in forward
    hidden_states = self.self_attn(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/gemma2.py", line 165, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/attention/layer.py", line 98, in forward
    return self.impl.forward(query,
  File "/opt/conda/lib/python3.10/site-packages/vllm/attention/backends/flashinfer.py", line 688, in forward
    output = torch.ops.vllm.flash_attn_varlen_func(
  File "/opt/conda/lib/python3.10/site-packages/torch/_ops.py", line 1061, in __call__
    return self_._op(*args, **(kwargs or {}))
  File "/opt/conda/lib/python3.10/site-packages/torch/_library/custom_ops.py", line 236, in backend_impl
    result = self._backend_fns[device_type](*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/attention/backends/flash_attn.py", line 48, in flash_attn_varlen_func
    return _flash_attn_varlen_func(
  File "/opt/conda/lib/python3.10/site-packages/vllm_flash_attn/flash_attn_interface.py", line 1154, in flash_attn_varlen_func
    return FlashAttnVarlenFunc.apply(
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 574, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/lib/python3.10/site-packages/vllm_flash_attn/flash_attn_interface.py", line 632, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
  File "/opt/conda/lib/python3.10/site-packages/vllm_flash_attn/flash_attn_interface.py", line 90, in _flash_attn_varlen_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
RuntimeError: FlashAttention only supports Ampere GPUs or newer.

I'm not sure whether this is a problem with vLLM's integration with FlashInfer.
@youkaichao @LiuXiaoxuanPKU
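
For reference, the T4 is a Turing GPU (compute capability 7.5), while the flash-attn kernel in the traceback requires Ampere (compute capability 8.0) or newer. A quick way to check the device, using plain PyTorch (this check is a sketch, not part of the original report):

```python
# Report the GPU's compute capability; the flash-attn varlen kernel requires >= (8, 0),
# while FlashInfer's own kernels also support Turing (7, 5), e.g. the T4.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")  # a T4 prints 7.5
print("meets flash-attn requirement (Ampere/SM80+):", (major, minor) >= (8, 0))
```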

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
wlwqq added the bug (Something isn't working) label Sep 2, 2024

jonzhep commented Sep 3, 2024

RuntimeError: FlashAttention only supports Ampere GPUs or newer.


yzh119 commented Sep 5, 2024

If you hit this error, you are not using the FlashInfer backend; the message is raised by the flash-attn package.

Try setting VLLM_ATTENTION_BACKEND=FLASHINFER and running the script again.
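
For example, with the offline entrypoint (the model id is a placeholder; for the OpenAI server the variable has to be exported in the server process's environment before launch):

```python
# The attention backend is chosen at engine initialization, so the environment
# variable must be set before vLLM builds the engine.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-2b")  # placeholder model id
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```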

youkaichao (Member) commented:

see #8189 (comment)

@yzh119 we still have some TODOs for the FlashInfer backend; until those are done, the FlashInfer backend still depends on flash-attn for prefill :(
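
To illustrate the point (a toy sketch, not vLLM's actual code): even with `VLLM_ATTENTION_BACKEND=FLASHINFER`, vLLM 0.5.5 still routes prefill through the vllm-flash-attn varlen kernel (the `flash_attn_varlen_func` call visible in the traceback), which is why a pre-Ampere T4 still fails; only the decode path uses FlashInfer's kernels, which do support Turing.

```python
# Toy sketch of the dispatch described above; not vLLM's real implementation.
from dataclasses import dataclass

@dataclass
class BatchMetadata:
    num_prefill_tokens: int
    num_decode_tokens: int

def kernel_family(batch: BatchMetadata) -> str:
    """Which kernel family the vLLM 0.5.5 FlashInfer backend ends up calling."""
    if batch.num_prefill_tokens > 0:
        # Prefill still goes through flash_attn_varlen_func (Ampere/SM80+ only).
        return "vllm-flash-attn varlen kernel"
    # Decode uses FlashInfer kernels, which support Turing (SM75, e.g. T4).
    return "FlashInfer decode kernel"

print(kernel_family(BatchMetadata(num_prefill_tokens=512, num_decode_tokens=0)))  # prefill batch
print(kernel_family(BatchMetadata(num_prefill_tokens=0, num_decode_tokens=16)))   # decode batch
```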


yzh119 commented Sep 5, 2024

Hi @youkaichao, thanks for letting me know!
FlashInfer v0.1.7 will be fully JIT, and I'll make it a PyPI package which can be added as a vLLM dependency. I'll keep you posted on the progress.


github-actions bot commented Dec 5, 2024

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label Dec 5, 2024
VMatrix1900 commented:

> Hi @youkaichao, thanks for letting me know! FlashInfer v0.1.7 will be fully JIT, and I'll make it a PyPI package which can be added as a vLLM dependency. I'll keep you posted on the progress.

Any progress?

github-actions bot added the unstale label and removed the stale label Dec 9, 2024
VMatrix1900 commented:

> Hi @youkaichao, thanks for letting me know! FlashInfer v0.1.7 will be fully JIT, and I'll make it a PyPI package which can be added as a vLLM dependency. I'll keep you posted on the progress.

Now that flashinfer 0.2 is released, any progress on this?

youkaichao (Member) commented:

@yzh119 is working on publishing the wheels to PyPI.
