Generation with prefix cache is slower than without it? #3154

Closed
vin136 opened this issue Mar 2, 2024 · 13 comments

vin136 commented Mar 2, 2024

I'm running the tutorial vllm/offline_inference_with_prefix.py and measuring the generation times; below is the same code with the timing added.

```python
import time

from vllm import LLM, SamplingParams

prefix = (
    "You are an expert school principal, skilled in effectively managing "
    "faculty and staff. Draft 10-15 questions for a potential first grade "
    "Head Teacher for my K-12, all-girls', independent school that emphasizes "
    "community, joyful discovery, and life-long learning. The candidate is "
    "coming in for a first-round panel interview for a 8th grade Math "
    "teaching role. They have 5 years of previous teaching experience "
    "as an assistant teacher at a co-ed, public school with experience "
    "in middle school math teaching. Based on these information, fulfill "
    "the following paragraph: ")

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0)

if __name__ == '__main__':
    # Create an LLM.
    llm = LLM(model="facebook/opt-125m")

    generating_prompts = [prefix + prompt for prompt in prompts]

    # Generate texts from the prompts. The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    st = time.perf_counter()
    outputs = llm.generate(generating_prompts, sampling_params)
    end = time.perf_counter()
    print(f"without caching time:{end-st}")

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    print("-" * 80)

    # -1 since the last token can change when concatenating prompts.
    prefix_pos = len(llm.llm_engine.tokenizer.encode(prefix)) - 1

    # The llm.generate call will batch all prompts and send the batch at once if resources allow.
    # The prefix will only be cached after the first batch is processed, so we need to call
    # generate once to calculate the prefix and cache it.
    outputs = llm.generate(generating_prompts[0],
                           sampling_params,
                           prefix_pos=[prefix_pos])

    # Subsequent batches can leverage the cached prefix.
    st = time.perf_counter()
    outputs = llm.generate(generating_prompts,
                           sampling_params,
                           prefix_pos=[prefix_pos] * len(generating_prompts))
    end = time.perf_counter()
    print(f"with caching time:{end-st}")

    # Print the outputs. You should see the same outputs as before.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Output:
```
with caching time:1.9611055543646216
without caching time:0.07439832389354706
```

VLLM: vllm==0.3.3


shixianc commented Mar 3, 2024

The automatic prefix caching commit seems to have been merged only recently and is labeled for the 0.3.4 release, so I assume some of the changes are not available in 0.3.3.

Update: I just tested PR #2762 and can also confirm it's slower than the original. I think the PR mentions that performance is not yet optimized.
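
For anyone landing here later: in releases where automatic prefix caching has shipped, it is switched on with an engine argument rather than the per-call `prefix_pos` used in the script above. A minimal sketch, assuming a vLLM version (0.3.4 or newer) that accepts the flag; the timing loop is only illustrative:

```python
import time

from vllm import LLM, SamplingParams

# Assumes a vLLM release that accepts enable_prefix_caching (0.3.4 or newer).
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
sampling_params = SamplingParams(temperature=0.0)

shared = ("You are an expert school principal, skilled in effectively managing "
          "faculty and staff. ")
prompts = [shared + p for p in ["Hello, my name is", "The capital of France is"]]

for run in range(1, 4):
    st = time.perf_counter()
    llm.generate(prompts, sampling_params)
    # Later runs can reuse the KV blocks computed for the shared prefix.
    print(f"run {run}: {time.perf_counter() - st:.3f}s")
```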


Qubitium commented Mar 5, 2024

Although I have not tested the original prefix cache, I am seeing something strange that may mean the benchmark data is skewed.

One would assume, like @vin136, that the second request onward would be the fastest (due to caching). In my simple tests, the second request is always the slowest by a significant margin, at least for the master vLLM checkout I did today with automatic prefix caching enabled.

So my suggestion is to collect data from request No. 3 onward rather than No. 2, and see whether the new implementation is slower than the old one.
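
A minimal sketch of that measurement with the offline `LLM` API and automatic prefix caching enabled; the number of repeats, the delay, and `max_tokens` are illustrative choices, not something from this thread:

```python
import time

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

prompt = ("You are an expert school principal, skilled in effectively managing "
          "faculty and staff. Draft 10-15 questions for a potential first grade "
          "Head Teacher.")

# Send the identical request several times serially and time each one, so any
# warm-up cost on the first (or second) request is visible per request.
for i in range(1, 6):
    st = time.perf_counter()
    llm.generate([prompt], sampling_params)
    print(f"request {i}: {time.perf_counter() - st:.3f}s")
    time.sleep(2)  # a few seconds between requests, as in the serial test described above
```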


robertgshaw2-redhat commented Mar 10, 2024

Yup - we are working this week on optimizing the performance. The original PR focused on correctness.

Once we have the performance, we will focus on enabling it by default.


Qubitium commented Mar 11, 2024

@robertgshaw2-neuralmagic Do you have any idea why, with prefix caching on, the second request is actually slower by a significant margin? I am repeating the request 3 times serially, with a several-second delay between each request. The order of speed from fastest to slowest is 3, 1, 2, where 2 is the slowest by a huge margin. This seems counter-intuitive to new users expecting normal caching behavior.


robertgshaw2-redhat commented Mar 11, 2024

> @robertgshaw2-neuralmagic Do you have any idea why, with prefix caching on, the second request is actually slower by a significant margin? I am repeating the request 3 times serially, with a several-second delay between each request. The order of speed from fastest to slowest is 3, 1, 2, where 2 is the slowest by a huge margin. This seems counter-intuitive to new users expecting normal caching behavior.

It's hard to tell without a bit more info about the request pattern. Can you share a snippet from the client code?

In general, you should not expect to see any speedup at the moment. It's still experimental, and we are working on the performance of the eviction data structure.


shixianc commented Mar 12, 2024

@robertgshaw2-neuralmagic thanks, we're really looking forward to the optimization!

Also, could you clarify the behavior of this feature:

  1. Within the same batch, the first N tokens of the requests are shared.
  2. In the second batch, the first N tokens of the requests are shared with the requests from the first batch.

Which of the above is the expected behavior? The main difference is whether we need to let the vLLM engine complete the prompt-processing phase for one request first, and only then send the remaining common-prefix requests (a rough sketch of the two patterns is below).
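
To make the two interpretations concrete, here is a rough sketch of both call patterns with automatic prefix caching; the model name and the `enable_prefix_caching` flag are assumptions on my side, not something confirmed for this version:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)  # assumes vLLM >= 0.3.4
params = SamplingParams(temperature=0.0)

prefix = ("You are an expert school principal, skilled in effectively managing "
          "faculty and staff. ")
requests = [prefix + p for p in ["Hello, my name is",
                                 "The capital of France is",
                                 "The future of AI is"]]

# Interpretation 1: send everything in one batch, so prefix sharing would have
# to happen between requests scheduled together in that same batch.
outputs = llm.generate(requests, params)

# Interpretation 2: one warm-up request fills the prefix cache first, and the
# remaining requests (a later batch) reuse the already-computed prefix blocks.
llm.generate(requests[:1], params)
outputs = llm.generate(requests[1:], params)
```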

@thefirebanks

Just to confirm: not only is it not optimized, it is also not enabled by default, right?
If I run:

```python
import vllm

model_id = 'meta-llama/Llama-2-7b-hf'
llm = vllm.LLM(model=model_id)
cache_config = llm.llm_engine.cache_config
print(cache_config.__dict__.keys())
```

I get

```
dict_keys(['block_size', 'gpu_memory_utilization', 'swap_space_bytes', 'cache_dtype', 'sliding_window', 'num_gpu_blocks', 'num_cpu_blocks'])
```

The enable_prefix_caching argument is not there... and when I try to initialize an LLM object with the parameter, I get:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input [In [21]] llm2 = vllm.LLM(model=model_id, enable_prefix_caching=True)
...
TypeError: __init__() got an unexpected keyword argument 'enable_prefix_caching'
```


vin136 commented Apr 6, 2024

With the latest version (0.4.0), it seems we cannot enable prefix caching with Mistral-type models (sliding window attention):

```python
if enable_caching and sliding_window is not None:
    raise NotImplementedError(
        "Sliding window is not allowed with prefix caching enabled!")
```

Any workaround, or insight into why this is the case, please?

@DavidPeleg6

@vin136 I think I saw this mentioned in a different issue, but for now you can go into your model config and manually change the sliding window size to null (rough sketch below).
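
A rough sketch of that workaround, patching a local copy of the model's config.json; the model id, the local path, and whether this actually satisfies vLLM's sliding-window check are assumptions on my side:

```python
import json
from pathlib import Path

from huggingface_hub import snapshot_download

# Download (or reuse) a local copy of the model so its config can be edited.
local_dir = snapshot_download("mistralai/Mistral-7B-v0.1",
                              local_dir="./mistral-7b-patched")

config_path = Path(local_dir) / "config.json"
config = json.loads(config_path.read_text())
config["sliding_window"] = None  # written out as null in config.json
config_path.write_text(json.dumps(config, indent=2))

# Then point vLLM at the patched local directory:
# from vllm import LLM
# llm = LLM(model=local_dir, enable_prefix_caching=True)
```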

@Maxppddcsz

> Do you have any idea why, with prefix caching on, the second request is actually slower by a significant margin? I am repeating the request 3 times serially, with a several-second delay between each request. The order of speed from fastest to slowest is 3, 1, 2, where 2 is the slowest by a huge margin. This seems counter-intuitive to new users expecting normal caching behavior.

In my test, even though I sent the same request three times, the response time was about the same. What does your test code look like? @Qubitium

@HillZhang1999

> Do you have any idea why, with prefix caching on, the second request is actually slower by a significant margin? I am repeating the request 3 times serially, with a several-second delay between each request. The order of speed from fastest to slowest is 3, 1, 2, where 2 is the slowest by a huge margin. This seems counter-intuitive to new users expecting normal caching behavior.
>
> In my test, even though I sent the same request three times, the response time was about the same. What does your test code look like? @Qubitium

I also ran into the same situation, do you have any idea 🤔?


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Oct 30, 2024

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

@github-actions github-actions bot closed this as not planned Nov 30, 2024