initial works on enabling automatic prefix caching #162
Conversation
While reviewing the changes myself, I found some issues with the cache operation implementations.
Force-pushed from b9aedd2 to 60753bc
Force-pushed from 60753bc to c05e969
Force-pushed from a203d1a to 5e6d30b
Please note that this PR contains only the very basics of enabling automatic prefix caching on Gaudi, covering the following.
For full support, the following work still needs to be done.
Further optimizations could be done to boost performance.
vllm/worker/habana_model_runner.py (outdated)
_PAD_SLOT_ID = torch.iinfo(torch.int).max - 256  # out-of-range slot id used for padded positions
_PAD_BLOCK_ID = torch.iinfo(torch.int).max       # out-of-range block id used for padded blocks
We shouldn't be using block 0 and slot 0 for padding, as that block and slot can actually be in use. With prefix caching enabled, the block manager allocates blocks starting from 0, and even without prefix caching there are cases where we use 100% of the blocks.
So instead, I suggest using int max for the padded value: we can leverage the fact that the TPC implementations of index select and index put on Gaudi tolerate out-of-range indices.
import torch
import habana_frameworks.torch  # registers the HPU backend

foo = torch.zeros(4, device="hpu")
idx = torch.tensor([1203], device="hpu")
print(foo[idx])  # prints tensor([0.], device="hpu") - out-of-range reads return the padded value
foo[idx] = 1203  # this is a no-op: out-of-range writes are silently dropped
print(foo)       # prints tensor([0., 0., 0., 0.], device="hpu")
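
As an aside, here is a minimal sketch (not code from this PR) of how a slot mapping padded with such an out-of-range id would interact with a cache update; the cache shape and the direct index_put_ call are simplified assumptions for illustration only:

import torch
import habana_frameworks.torch  # registers the HPU backend

_PAD_SLOT_ID = torch.iinfo(torch.int).max - 256  # assumed padding id, far outside any real slot

key_cache = torch.zeros(16, 128, device="hpu")   # toy cache: 16 slots x 128 dims
key = torch.randn(4, 128, device="hpu")          # 4 tokens; the last two are padding
slot_mapping = torch.tensor([3, 7, _PAD_SLOT_ID, _PAD_SLOT_ID], device="hpu")

# Per the behaviour shown above, the out-of-range rows are silently dropped on Gaudi,
# so padded tokens never overwrite a real cache slot. On CPU/GPU this would raise instead.
key_cache.index_put_((slot_mapping,), key)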
Force-pushed from c71f35f to 79a093c
@kzawora-intel
Hi @huijjj, #12 is now merged, please update https://github.com/HabanaAI/vllm-fork/blob/habana_main/requirements-hpu.txt to 79f3aa7
@michalkuligowski Thanks, I've updated the requirements.
There are some conflicts to resolve on the branch, do you want me to look into it, or do you want to resolve them?
Force-pushed from 7331919 to 4d6417f
Conflicts are resolved.
@huijjj there are some static code analysis issues, please run the format.sh script from the main directory
@michalkuligowski
cpu-test seems to be failing due to the issue I mentioned above. As blocks are used from 0 with prefix caching enabled, we shouldn't be using block 0 for padding. So instead, I leveraged the fact that SynapseAI silently suppresses the index-out-of-range error, returning 0 (the padded value set in the TPC kernel) for index selects and doing nothing for index puts. However, as this is not the default behaviour of torch (on both CPU and GPU), this seems to be troublesome. Does this issue need to be resolved for my PR to get merged?
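
For reference, a tiny illustration (not code from the PR) of why the cpu-test objects: stock PyTorch on CPU raises for the same out-of-range access instead of ignoring it:

import torch

foo = torch.zeros(4)
idx = torch.tensor([1203])
try:
    foo[idx]  # out-of-range advanced indexing
except (IndexError, RuntimeError) as err:  # PyTorch raises here rather than returning a padded value
    print(err)  # e.g. index 1203 is out of bounds for dimension 0 with size 4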
Hi there, I've looked through the code and I think it's great and mostly in a mergeable state. There are two things I have some comments on - padding slot id and recompilations. For padding, let's not use out-of-bound blocks and let's stick to block 0. Block 0 is handled correctly since #313 was merged. I've tested your PR with both and it works fine. For recompilations, it's a bit more complicated: I've used the following script to test if prefix caching is working properly (taken mostly from https://docs.vllm.ai/en/latest/automatic_prefix_caching/apc.html):

import time
from vllm import LLM, SamplingParams
# A prompt containing a large markdown table. The table is randomly generated by GPT-4.
LONG_PROMPT = (
"You are a helpful assistant in recognizes the content of tables in markdown format. Here is a table as follows.\n# Table\n"
+ """
| ID | Name | Age | Occupation | Country | Email | Phone Number | Address |
|-----|---------------|-----|---------------|---------------|------------------------|----------------|------------------------------|
| 1 | John Doe | 29 | Engineer | USA | [email protected] | 555-1234 | 123 Elm St, Springfield, IL |
| 2 | Jane Smith | 34 | Doctor | Canada | [email protected] | 555-5678 | 456 Oak St, Toronto, ON |
| 3 | Alice Johnson | 27 | Teacher | UK | [email protected] | 555-8765 | 789 Pine St, London, UK |
| 4 | Bob Brown | 45 | Artist | Australia | [email protected] | 555-4321 | 321 Maple St, Sydney, NSW |
| 5 | Carol White | 31 | Scientist | New Zealand | [email protected] | 555-6789 | 654 Birch St, Wellington, NZ |
| 6 | Dave Green | 28 | Lawyer | Ireland | [email protected] | 555-3456 | 987 Cedar St, Dublin, IE |
| 7 | Emma Black | 40 | Musician | USA | [email protected] | 555-1111 | 246 Ash St, New York, NY |
| 8 | Frank Blue | 37 | Chef | Canada | [email protected] | 555-2222 | 135 Spruce St, Vancouver, BC |
| 9 | Grace Yellow | 50 | Engineer | UK | [email protected] | 555-3333 | 864 Fir St, Manchester, UK |
| 10 | Henry Violet | 32 | Artist | Australia | [email protected] | 555-4444 | 753 Willow St, Melbourne, VIC|
| 11 | Irene Orange | 26 | Scientist | New Zealand | [email protected] | 555-5555 | 912 Poplar St, Auckland, NZ |
| 12 | Jack Indigo | 38 | Teacher | Ireland | [email protected] | 555-6666 | 159 Elm St, Cork, IE |
| 13 | Karen Red | 41 | Lawyer | USA | [email protected] | 555-7777 | 357 Cedar St, Boston, MA |
| 14 | Leo Brown | 30 | Chef | Canada | [email protected] | 555-8888 | 246 Oak St, Calgary, AB |
| 15 | Mia Green | 33 | Musician | UK | [email protected] | 555-9999 | 975 Pine St, Edinburgh, UK |
| 16 | Noah Yellow | 29 | Doctor | Australia | [email protected] | 555-0000 | 864 Birch St, Brisbane, QLD |
| 17 | Olivia Blue | 35 | Engineer | New Zealand | [email protected] | 555-1212 | 753 Maple St, Hamilton, NZ |
| 18 | Peter Black | 42 | Artist | Ireland | [email protected] | 555-3434 | 912 Fir St, Limerick, IE |
| 19 | Quinn White | 28 | Scientist | USA | [email protected] | 555-5656 | 159 Willow St, Seattle, WA |
| 20 | Rachel Red | 31 | Teacher | Canada | [email protected] | 555-7878 | 357 Poplar St, Ottawa, ON |
| 21 | Steve Green | 44 | Lawyer | UK | [email protected] | 555-9090 | 753 Elm St, Birmingham, UK |
| 22 | Tina Blue | 36 | Musician | Australia | [email protected] | 555-1213 | 864 Cedar St, Perth, WA |
| 23 | Umar Black | 39 | Chef | New Zealand | [email protected] | 555-3435 | 975 Spruce St, Christchurch, NZ|
| 24 | Victor Yellow | 43 | Engineer | Ireland | [email protected] | 555-5657 | 246 Willow St, Galway, IE |
| 25 | Wendy Orange | 27 | Artist | USA | [email protected] | 555-7879 | 135 Elm St, Denver, CO |
| 26 | Xavier Green | 34 | Scientist | Canada | [email protected] | 555-9091 | 357 Oak St, Montreal, QC |
| 27 | Yara Red | 41 | Teacher | UK | [email protected] | 555-1214 | 975 Pine St, Leeds, UK |
| 28 | Zack Blue | 30 | Lawyer | Australia | [email protected] | 555-3436 | 135 Birch St, Adelaide, SA |
| 29 | Amy White | 33 | Musician | New Zealand | [email protected] | 555-5658 | 159 Maple St, Wellington, NZ |
| 30 | Ben Black | 38 | Chef | Ireland | [email protected] | 555-7870 | 246 Fir St, Waterford, IE |
"""
)
def get_generation_time(llm, sampling_params, prompts):
    # time the generation
    start_time = time.time()
    output = llm.generate(prompts, sampling_params=sampling_params)
    end_time = time.time()
    # print the output and generation time
    print(f"Output: {output[0].outputs[0].text}")
    print(f"Generation time: {end_time - start_time} seconds.")
# set enable_prefix_caching=True to enable APC
llm = LLM(
model="meta-llama/Llama-3.1-8B",
enable_prefix_caching=True,
max_num_seqs=1,
max_model_len=8192,
dtype="bfloat16",
)
sampling_params = SamplingParams(temperature=0, max_tokens=100)
names = [
"John Doe",
"Jane Smith",
"Alice Johnson",
"Bob Brown",
"Carol White",
"Dave Green",
"Emma Black",
"Frank Blue",
"Grace Yellow",
"Henry Violet",
"Irene Orange",
"Jack Indigo",
"Karen Red",
"Leo Brown",
"Mia Green",
"Noah Yellow",
"Olivia Blue",
"Peter Black",
"Quinn White",
"Rachel Red",
"Steve Green",
"Tina Blue",
"Umar Black",
"Victor Yellow",
"Wendy Orange",
"Xavier Green",
"Yara Red",
"Zack Blue",
"Amy White",
"Ben Black",
]
for name in names:
    get_generation_time(
        llm,
        sampling_params,
        LONG_PROMPT
        + f"Question: what is the age of {name}? Your answer: The age of {name} is ",
    )

With
With
So, overall, in this case we see a ~2x performance boost in steady state. That's great! But here's the thing that worries me: non-prefix caching achieves steady performance after the first iteration (graph compilation), while prefix caching does an additional graph compilation in the second step. The graph with prefix-cached prefills is never warmed up and will need to be compiled on the fly. This is something we currently avoid with the warmup phase for regular prefills and decodes, but there is no such warmup mechanism for prefix caching, meaning that if we enable it, we're guaranteed to have recompilations at runtime. Even worse, at the executor level we don't even know that we recompiled, and we can't throw a warning to the user, since cached and uncached prefills are treated as the same phase.

We should probably make a distinction between prompt_uncached and prompt_cached phases and perform warmup on both (if prefix caching is enabled). That said, I don't think we should include warmup in the scope of this PR. We can merge this as is (with the padding ids changed) and create follow-up PRs for the mentioned features. This is already very good.
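
As a rough sketch of that follow-up idea (the bucket shapes and the warmup hook are hypothetical placeholders, not existing vLLM APIs), warming up cached prefills could mean enumerating an extra set of shapes that also varies the number of prefix-cached blocks:

from itertools import product

def prompt_buckets(batch_sizes, seq_lens, cached_block_counts):
    # Enumerate (batch_size, seq_len, num_cached_blocks) shapes to pre-compile.
    return list(product(batch_sizes, seq_lens, cached_block_counts))

# Existing behaviour: uncached prefills only (num_cached_blocks == 0).
uncached = prompt_buckets([1, 2, 4], [128, 512, 1024], [0])
# Proposed addition: also warm up cached-prefill shapes when APC is enabled.
cached = prompt_buckets([1, 2, 4], [128, 512, 1024], [4, 16, 64])

for batch_size, seq_len, num_cached_blocks in uncached + cached:
    # A real implementation would run a dummy forward pass here; we just log the bucket.
    print(f"warmup prefill: bs={batch_size} len={seq_len} cached_blocks={num_cached_blocks}")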
Force-pushed from cc7d5b8 to b84fc09
@kzawora-intel I updated the pad values to 0 as you mentioned. The PR is now ready to be merged, thanks. However, I still have some concerns about using block 0 like this, and I see similar concerns raised in the vllm-project/vllm upstream PR. Please let me know if any further actions are taken regarding this.
Please fix formatting with format.sh so that all checks pass, and we can merge it.
Code formatted, thanks
@huijjj @kzawora-intel, I tried APC and hit an error caused by vllm-hpu-extension. Please help to review.
This PR enables automatic prefix caching on Intel Gaudi HPUs.
Please refer to this RFC for detailed information about prefix caching.
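
For readers landing on this PR, a minimal usage sketch (the model name and dtype are just examples, mirroring the benchmark script above):

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B",  # example model, as used in the benchmark above
    enable_prefix_caching=True,       # turn on automatic prefix caching (APC)
    dtype="bfloat16",
)
out = llm.generate(["Hello, Gaudi!"], SamplingParams(temperature=0, max_tokens=16))
print(out[0].outputs[0].text)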