Generation with prefix cache is slower than without it? #3154

Closed
vin136 opened this issue Mar 2, 2024 · 13 comments

vin136 commented Mar 2, 2024

I'm running the tutorial vllm/offline_inference_with_prefix.py and measuring the generation times; below is the same code with the timing added.

```python
import time

from vllm import LLM, SamplingParams

prefix = (
    "You are an expert school principal, skilled in effectively managing "
    "faculty and staff. Draft 10-15 questions for a potential first grade "
    "Head Teacher for my K-12, all-girls', independent school that emphasizes "
    "community, joyful discovery, and life-long learning. The candidate is "
    "coming in for a first-round panel interview for a 8th grade Math "
    "teaching role. They have 5 years of previous teaching experience "
    "as an assistant teacher at a co-ed, public school with experience "
    "in middle school math teaching. Based on these information, fulfill "
    "the following paragraph: ")

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0)

if __name__ == '__main__':
    # Create an LLM.
    llm = LLM(model="facebook/opt-125m")

    generating_prompts = [prefix + prompt for prompt in prompts]

    # Generate texts from the prompts. The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    st = time.perf_counter()
    outputs = llm.generate(generating_prompts, sampling_params)
    end = time.perf_counter()
    print(f"without caching time:{end-st}")

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    print("-" * 80)

    # -1 since the last token can change when concatenating prompts.
    prefix_pos = len(llm.llm_engine.tokenizer.encode(prefix)) - 1

    # The llm.generate call will batch all prompts and send the batch at once if resources allow.
    # The prefix will only be cached after the first batch is processed, so we need to call
    # generate once to calculate the prefix and cache it.
    outputs = llm.generate(generating_prompts[0],
                           sampling_params,
                           prefix_pos=[prefix_pos])

    # Subsequent batches can leverage the cached prefix.
    st = time.perf_counter()
    outputs = llm.generate(generating_prompts,
                           sampling_params,
                           prefix_pos=[prefix_pos] * len(generating_prompts))
    end = time.perf_counter()
    print(f"with caching time:{end-st}")

    # Print the outputs. You should see the same outputs as before.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Output:
```
with caching time:1.9611055543646216
without caching time:0.07439832389354706
```

VLLM: vllm==0.3.3


shixianc commented Mar 3, 2024

The automatic prefix caching commit seems to have been merged only recently and is labeled for the 0.3.4 release, so I assume some of the changes are not available in 0.3.3.

Update: I just tested PR #2762 and can also confirm it's slower than the original. I think the PR mentions that performance is not yet optimized.
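
For anyone landing here later: in releases where automatic prefix caching has shipped, it is switched on with an engine argument rather than the per-call `prefix_pos` used in the script above. A minimal sketch, assuming a vLLM version (0.3.4 or newer) that accepts the flag; the timing loop is only illustrative:

```python
import time

from vllm import LLM, SamplingParams

# Assumes a vLLM release that accepts enable_prefix_caching (0.3.4 or newer).
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
sampling_params = SamplingParams(temperature=0.0)

shared = ("You are an expert school principal, skilled in effectively managing "
          "faculty and staff. ")
prompts = [shared + p for p in ["Hello, my name is", "The capital of France is"]]

for run in range(1, 4):
    st = time.perf_counter()
    llm.generate(prompts, sampling_params)
    # Later runs can reuse the KV blocks computed for the shared prefix.
    print(f"run {run}: {time.perf_counter() - st:.3f}s")
```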


Qubitium commented Mar 5, 2024

Although I have not tested the original prefix cache, I am seeing something strange that may mean the benchmark data is skewed.

One would assume, like @vin136, that the second request onward would be the fastest (due to caching). In my simple tests, the second request is always the slowest by a significant margin, at least for the master vLLM checkout I did today with automatic prefix caching enabled.

So my suggestion is to collect data from request No. 3 onward rather than No. 2, and see whether the new implementation is slower than the old one.
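
A minimal sketch of that measurement with the offline `LLM` API and automatic prefix caching enabled; the number of repeats, the delay, and `max_tokens` are illustrative choices, not something from this thread:

```python
import time

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

prompt = ("You are an expert school principal, skilled in effectively managing "
          "faculty and staff. Draft 10-15 questions for a potential first grade "
          "Head Teacher.")

# Send the identical request several times serially and time each one, so any
# warm-up cost on the first (or second) request is visible per request.
for i in range(1, 6):
    st = time.perf_counter()
    llm.generate([prompt], sampling_params)
    print(f"request {i}: {time.perf_counter() - st:.3f}s")
    time.sleep(2)  # a few seconds between requests, as in the serial test described above
```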


robertgshaw2-redhat commented Mar 10, 2024

Yup - we are working this week on optimizing the performance. The original PR focused on correctness.

Once we have the performance, we will focus on enabling it by default.


Qubitium commented Mar 11, 2024

@robertgshaw2-neuralmagic Do you have any idea why, with prefix caching on, the second request is actually slower by a significant margin? I am repeating the request 3 times serially, with a several-second delay between each request. The order of speed from fastest to slowest is 3, 1, 2, where 2 is the slowest by a huge margin. This seems counter-intuitive to new users expecting normal caching behavior.


robertgshaw2-redhat commented Mar 11, 2024

> @robertgshaw2-neuralmagic Do you have any idea why, with prefix caching on, the second request is actually slower by a significant margin? I am repeating the request 3 times serially, with a several-second delay between each request. The order of speed from fastest to slowest is 3, 1, 2, where 2 is the slowest by a huge margin. This seems counter-intuitive to new users expecting normal caching behavior.

It's hard to tell without a bit more info about the request pattern. Can you share a snippet from the client code?

In general, you should not expect to see any speedup at the moment. It's still experimental, and we are working on the performance of the eviction data structure.


shixianc commented Mar 12, 2024

@robertgshaw2-neuralmagic thanks, we're really looking forward to the optimization!

Also, could you clarify the behavior of this feature:

  1. Within the same batch, the first N tokens of the requests are shared.
  2. In the second batch, the first N tokens of the requests are shared with the requests from the first batch.

Which of the above is the expected behavior? The main difference is whether we need to let the vLLM engine complete the prompt-processing phase for one request first, and only then send the remaining common-prefix requests (a rough sketch of the two patterns is below).
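
To make the two interpretations concrete, here is a rough sketch of both call patterns with automatic prefix caching; the model name and the `enable_prefix_caching` flag are assumptions on my side, not something confirmed for this version:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)  # assumes vLLM >= 0.3.4
params = SamplingParams(temperature=0.0)

prefix = ("You are an expert school principal, skilled in effectively managing "
          "faculty and staff. ")
requests = [prefix + p for p in ["Hello, my name is",
                                 "The capital of France is",
                                 "The future of AI is"]]

# Interpretation 1: send everything in one batch, so prefix sharing would have
# to happen between requests scheduled together in that same batch.
outputs = llm.generate(requests, params)

# Interpretation 2: one warm-up request fills the prefix cache first, and the
# remaining requests (a later batch) reuse the already-computed prefix blocks.
llm.generate(requests[:1], params)
outputs = llm.generate(requests[1:], params)
```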

@thefirebanks

Just to confirm: not only is it not optimized, it is also not enabled by default, right?
If I run:

```python
import vllm

model_id = 'meta-llama/Llama-2-7b-hf'
llm = vllm.LLM(model=model_id)
cache_config = llm.llm_engine.cache_config
print(cache_config.__dict__.keys())
```

I get

```
dict_keys(['block_size', 'gpu_memory_utilization', 'swap_space_bytes', 'cache_dtype', 'sliding_window', 'num_gpu_blocks', 'num_cpu_blocks'])
```

The enable_prefix_caching argument is not there... and when I try to initialize an LLM object with the parameter, I get:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input [In [21]] llm2 = vllm.LLM(model=model_id, enable_prefix_caching=True)
...
TypeError: __init__() got an unexpected keyword argument 'enable_prefix_caching'
```


vin136 commented Apr 6, 2024

With the latest version (0.4.0), it seems we cannot enable prefix caching with Mistral-type models (sliding window attention):

```python
if enable_caching and sliding_window is not None:
    raise NotImplementedError(
        "Sliding window is not allowed with prefix caching enabled!")
```

Any workaround, or insight into why this is the case, please?

@DavidPeleg6

@vin136 I think I saw this mentioned in a different issue, but for now you can go into your model config and manually change the sliding window size to null (rough sketch below).
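
A rough sketch of that workaround, patching a local copy of the model's config.json; the model id, the local path, and whether this actually satisfies vLLM's sliding-window check are assumptions on my side:

```python
import json
from pathlib import Path

from huggingface_hub import snapshot_download

# Download (or reuse) a local copy of the model so its config can be edited.
local_dir = snapshot_download("mistralai/Mistral-7B-v0.1",
                              local_dir="./mistral-7b-patched")

config_path = Path(local_dir) / "config.json"
config = json.loads(config_path.read_text())
config["sliding_window"] = None  # written out as null in config.json
config_path.write_text(json.dumps(config, indent=2))

# Then point vLLM at the patched local directory:
# from vllm import LLM
# llm = LLM(model=local_dir, enable_prefix_caching=True)
```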

@Maxppddcsz

> Do you have any idea why, with prefix caching on, the second request is actually slower by a significant margin? I am repeating the request 3 times serially, with a several-second delay between each request. The order of speed from fastest to slowest is 3, 1, 2, where 2 is the slowest by a huge margin. This seems counter-intuitive to new users expecting normal caching behavior.

In my test, even though I sent the same request three times, the response time was about the same. What does your test code look like? @Qubitium

@HillZhang1999

> Do you have any idea why, with prefix caching on, the second request is actually slower by a significant margin? I am repeating the request 3 times serially, with a several-second delay between each request. The order of speed from fastest to slowest is 3, 1, 2, where 2 is the slowest by a huge margin. This seems counter-intuitive to new users expecting normal caching behavior.
>
> In my test, even though I sent the same request three times, the response time was about the same. What does your test code look like? @Qubitium

I also ran into the same situation, do you have any idea 🤔?


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Oct 30, 2024

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

@github-actions github-actions bot closed this as not planned Nov 30, 2024