[Bug]: VLLM crashes when prefix caching is enabled #7003
Comments
With vllm=0.5.3.post, the only thing I changed was enabling prefix caching, and it crashed with illegal CUDA access errors.
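For context, a minimal sketch of the kind of change described here, i.e. turning on prefix caching on the offline `LLM` entry point. The model name and prompt are placeholders, not the commenter's actual setup:

```python
# Minimal sketch, assuming a generic text model: the only setting changed
# is enable_prefix_caching=True, as in the report above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",      # placeholder model, not the one in the report
    enable_prefix_caching=True,     # the flag the commenter enabled
)

params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```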
@zachzzc @raywanb Still facing the same issue when using the following model:
My vllm version is
Can you share the input that caused this error?
@raywanb Here is my input:
This would trigger the issue:
It seems this is a multi-modal model? I do see some related issues. Can someone confirm whether multi-modal + prefix caching is supported or not?
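To make the question concrete, here is a hypothetical sketch of the combination being asked about: a multi-modal request served with prefix caching enabled. The model name, image path, and image placeholder token are assumptions and vary by model; this is not the reporter's actual input.

```python
# Hypothetical sketch, assuming a LLaVA-style model: a multi-modal prompt
# served by an engine that has prefix caching turned on.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",   # placeholder multi-modal model
    enable_prefix_caching=True,
)

image = Image.open("example.jpg")        # placeholder image

outputs = llm.generate(
    {
        # The <image> placeholder and chat format are model-dependent.
        "prompt": "USER: <image>\nWhat is in this picture? ASSISTANT:",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```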
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
There seems to be a bug that makes prefix caching and prompt_logprobs incompatible. I was able to reproduce the crash.
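A minimal sketch of the incompatibility described in that comment, assuming a placeholder model and prompt: requesting prompt_logprobs from an engine that has prefix caching enabled, with the same prompt sent twice so the second request hits the cache. On affected versions this combination was reported to crash.

```python
# Sketch of the reported prefix caching + prompt_logprobs combination.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",      # placeholder model
    enable_prefix_caching=True,
)

params = SamplingParams(max_tokens=16, prompt_logprobs=1)

# Repeat a long prompt so the second call reuses cached prefix blocks.
prompt = "The quick brown fox jumps over the lazy dog. " * 8
for _ in range(2):
    out = llm.generate([prompt], params)[0]
    print(out.outputs[0].text)
```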
Your current environment
🐛 Describe the bug
VLLM crashes 100% of the time when using an async engine initialized with `enable_prefix_caching=True`. The stacktrace is:

The problem goes away completely when `enable_prefix_caching=False`. This is on VLLM version 0.4.3.
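For reference, a sketch of the setup the report describes: an async engine created with `enable_prefix_caching=True`. The model name, prompt, and request handling here are assumptions for illustration, not the reporter's code.

```python
# Sketch, assuming a placeholder model: async engine with prefix caching on,
# which is the configuration the report says crashes.
import asyncio
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="facebook/opt-125m",     # placeholder model
        enable_prefix_caching=True,    # the setting tied to the crash
    )
)

async def run():
    params = SamplingParams(max_tokens=32)
    # generate() returns an async stream of partial RequestOutputs.
    stream = engine.generate("Hello, my name is", params, request_id=str(uuid.uuid4()))
    final = None
    async for output in stream:
        final = output
    print(final.outputs[0].text)

asyncio.run(run())
```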