[Bug]: vLLM 0.5.5 and FlashInfer 0.1.6 #8091
Comments
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
If you encounter this error, you are not using the flashinfer backend; the message is reported by the flash-attn package. Try setting VLLM_ATTENTION_BACKEND=FLASHINFER.
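For example, a minimal sketch of forcing the backend (the `VLLM_ATTENTION_BACKEND` environment variable is read when the engine starts; the model id below is only illustrative, taken from this report):

```python
# Sketch: ask vLLM to use the FlashInfer attention backend instead of flash-attn.
# The environment variable must be set before the engine is constructed.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-2b")  # illustrative model id from this report
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```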
See #8189 (comment). @yzh119 we do have some TODOs for the flashinfer backend; until those land, flashinfer still depends on flash-attention for prefill :(
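For context on why a T4 still fails here: flash-attn's kernels require compute capability 8.0 (Ampere) or newer, while the T4 is Turing (7.5). A quick check, assuming only PyTorch is available:

```python
# Sketch: report the GPU's compute capability; flash-attn needs SM >= 8.0,
# but a T4 reports SM 7.5, so the prefill path raises the RuntimeError above.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
if (major, minor) < (8, 0):
    print("flash-attn kernels are not supported on this device")
```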
Hi @youkaichao, thanks for letting me know!
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Any progress?
Now that flashinfer 0.2 is released, any progress on this?
@yzh119 is working on publishing the wheels to PyPI.
Your current environment
The output of `python collect_env.py` (abridged):
vllm==0.5.5
flashinfer==0.1.6+cu121torch2.4
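As a sanity check, a small sketch for confirming the installed wheels match the versions above (it assumes both packages expose `__version__`, which is not guaranteed):

```python
# Sketch: print the installed vLLM / FlashInfer / torch / CUDA versions
# to compare against the environment reported in this issue.
import torch
import vllm
import flashinfer  # assumed to expose __version__ like most wheels

print("vllm      :", vllm.__version__)
print("flashinfer:", flashinfer.__version__)
print("torch     :", torch.__version__, "| cuda:", torch.version.cuda)
```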
🐛 Describe the bug
When I use vLLM 0.5.5 and FlashInfer 0.1.6 to run Gemma-2-2b on a T4, it fails.
FlashInfer 0.1.6 supports the T4 (https://github.com/flashinfer-ai/flashinfer/releases), but I see:
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
I'm not sure if it's a problem with vLLM's integration with FlashInfer.
@youkaichao @LiuXiaoxuanPKU