Generation with prefix cache is slower than without it? #3154
Comments
The automatic prefix caching commit seems to have been merged very recently and is labeled for the 0.3.4 release, so I assume some of the changes are not available in 0.3.3. Update: I just tested PR #2762 and I can also confirm it's slower than the original. I think the PR mentions that performance is not optimized currently.
Although I have not tested the original prefix cache, I am seeing something strange that may point to the benchmark data being skewed. One would assume, like @vin136, that the second request onward would be fastest (due to caching). In my simple tests, the second request is always the slowest by a significant margin, at least for the master vllm checkout I did today with automatic prefix caching enabled. So my suggestion is to collect data from request No. 3 onward rather than No. 2 and see whether the new implementation is slower than the old one.
Yup - we are working this week on optimizing the performance. The original PR focused on correctness. Once we have the performance, we will focus on enabling it by default.
@robertgshaw2-neuralmagic Do you have any idea why, with prefix caching on, the second request is actually slower by a significant margin? I am repeating the request 3 times serially, with a several-second delay between each request. The order of speed from fastest to slowest is 3, 1, 2, where 2 is slowest by a huge margin. This seems counter-intuitive to most new users, who expect it to behave like normal caching.
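(For concreteness, a minimal sketch of the kind of timing loop being described, assuming the offline `LLM` entry point and the `enable_prefix_caching` engine argument rather than whatever client was actually used:)

```python
import time

from vllm import LLM, SamplingParams

# Assumed setup: offline engine with automatic prefix caching turned on.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)
prompt = "You are an expert school principal. Draft ten interview questions for a math teacher."

# Send the identical request several times in a row, with a short delay between
# them, and record each latency so request 1 vs. 2 vs. 3 can be compared.
for i in range(3):
    start = time.perf_counter()
    llm.generate([prompt], sampling_params)
    print(f"request {i + 1}: {time.perf_counter() - start:.3f}s")
    time.sleep(2)
```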
It's hard to tell without a bit more info about the request pattern. Can you share a snippet of the client code? In general, you should not expect to see any speedup at the moment. It's still experimental, and we are working on the performance of the eviction data structure.
@robertgshaw2-neuralmagic thanks, we're really looking forward to the optimization! Also, could you clarify the behavior of this feature: …

Just to confirm, not only is it not optimized, it's also not enabled by default, right? When I run

```python
import vllm

model_id = 'meta-llama/Llama-2-7b-hf'
llm = vllm.LLM(model=model_id)
cache_config = llm.llm_engine.cache_config
print(cache_config.__dict__.keys())
```

I get … The …
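(A small sketch of how one could check and opt in explicitly, assuming a vLLM version where the `enable_prefix_caching` engine argument and the matching `CacheConfig` attribute exist; they may not be present in 0.3.3:)

```python
import vllm

# Assumes a vLLM build that exposes enable_prefix_caching; on older releases
# this argument/attribute may simply not exist.
llm = vllm.LLM(model='meta-llama/Llama-2-7b-hf', enable_prefix_caching=True)
print(llm.llm_engine.cache_config.enable_prefix_caching)  # expected: True
```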
With the latest version (0.4.0), it seems we cannot enable prefix caching with Mistral-type models (sliding window attention). Any workaround or insight into why this is the case, please?
@vin136 I think I saw this mentioned in a different issue, but for now you can go into your model config and manually change the sliding window size to null.
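(A rough sketch of that workaround, assuming a locally downloaded model whose path here is made up; the point is just to set `sliding_window` to null in the model's `config.json`:)

```python
import json
from pathlib import Path

# Hypothetical path to a local copy of a Mistral-style model.
config_path = Path("models/Mistral-7B-v0.1/config.json")

config = json.loads(config_path.read_text())
config["sliding_window"] = None  # written out as null in JSON
config_path.write_text(json.dumps(config, indent=2))
```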
In my test, even though I sent the same request three times, the response time was about the same. What does your test code look like? @Qubitium
I also ran into the same situation, do you have any idea 🤔?
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
I'm running the tutorial vllm/offline_inference_with_prefix.py and measuring the generation times. Below is the same code along with the measured generation times.
```python
import argparse
import time
from typing import List, Tuple

from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import EngineArgs, LLM, LLMEngine, RequestOutput, SamplingParams

prefix = (
    "You are an expert school principal, skilled in effectively managing "
    "faculty and staff. Draft 10-15 questions for a potential first grade "
    "Head Teacher for my K-12, all-girls', independent school that emphasizes "
    "community, joyful discovery, and life-long learning. The candidate is "
    "coming in for a first-round panel interview for a 8th grade Math "
    "teaching role. They have 5 years of previous teaching experience "
    "as an assistant teacher at a co-ed, public school with experience "
    "in middle school math teaching. Based on these information, fulfill "
    "the following paragraph: ")

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0)

if __name__ == '__main__':
    # Create an LLM.
    llm = LLM(model="facebook/opt-125m")
```
Output:
with caching time: 1.9611055543646216
without caching time: 0.07439832389354706
vLLM version: vllm==0.3.3
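(For comparison, a sketch of how the cold vs. warm timing could be measured with automatic prefix caching turned on; the `enable_prefix_caching` flag and the warm-up pass are assumptions, not part of the original script:)

```python
import time

from vllm import LLM, SamplingParams

prefix = ("You are an expert school principal, skilled in effectively "
          "managing faculty and staff. ")
prompts = ["Hello, my name is", "The capital of France is"]
generating_prompts = [prefix + p for p in prompts]
sampling_params = SamplingParams(temperature=0.0)

# Assumed flag: enable_prefix_caching turns on automatic prefix caching.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

# First pass: nothing is cached yet, so this also pays the cost of populating the cache.
start = time.perf_counter()
llm.generate(generating_prompts, sampling_params)
print(f"cold pass (cache being filled): {time.perf_counter() - start:.3f}s")

# Second pass: blocks for the shared prefix should now be reusable.
start = time.perf_counter()
llm.generate(generating_prompts, sampling_params)
print(f"warm pass (prefix cached): {time.perf_counter() - start:.3f}s")
```

This separates the one-time cost of filling the cache from the steady-state behavior, which is roughly what the earlier suggestion to measure from request No. 3 onward is getting at.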