
initial work on enabling automatic prefix caching #162

Merged: 8 commits into HabanaAI:habana_main from enable-prefix-caching on Oct 29, 2024

Conversation

@huijjj commented Aug 6, 2024

This PR enables automatic prefix caching on Intel Gaudi HPUs.
Please refer to this RFC for detailed information about prefix caching.

@huijjj (Author) commented Aug 7, 2024

While reviewing the changes myself, I found some issues with the cache operation implementations.
The habana_main branch currently has a modified version of the cache engine implementation, but those changes do not seem to be reflected in hpu/cache_ops.py yet. Since those cache operations are needed in the prefix caching scenario, I will add the required changes to this PR.
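For reference, here is a minimal sketch of the kind of cache op involved; this is illustrative only, not the actual hpu/cache_ops.py code. A reshape_and_cache-style op scatters new key/value entries into the paged KV cache at the positions given by a slot mapping:

import torch

def reshape_and_cache(key, value, key_cache, value_cache, slot_mapping, block_size):
    # key/value: [num_tokens, num_heads, head_size]
    # key_cache/value_cache: [num_blocks, block_size, num_heads, head_size]
    block_idx = torch.div(slot_mapping, block_size, rounding_mode="floor")
    block_offset = slot_mapping % block_size
    # advanced indexing scatters each token's K/V into its (block, offset) slot
    key_cache[block_idx, block_offset] = key
    value_cache[block_idx, block_offset] = value

key_cache = torch.zeros(8, 4, 2, 16)   # 8 blocks, block_size 4, 2 heads, head_size 16
value_cache = torch.zeros_like(key_cache)
key = torch.randn(3, 2, 16)            # 3 new tokens
value = torch.randn(3, 2, 16)
reshape_and_cache(key, value, key_cache, value_cache,
                  torch.tensor([4, 5, 6]), block_size=4)  # slots 4-6 live in block 1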

@huijjj force-pushed the enable-prefix-caching branch from b9aedd2 to 60753bc on August 21, 2024
@kzawora-intel added the external label (Issues or PRs submitted by external users) on Aug 29, 2024
@huijjj force-pushed the enable-prefix-caching branch from 60753bc to c05e969 on September 6, 2024
@huijjj force-pushed the enable-prefix-caching branch from a203d1a to 5e6d30b on September 15, 2024
@huijjj (Author) commented Sep 19, 2024

Please note that this PR covers only the very basics of enabling automatic prefix caching on Gaudi, namely:

  • prefix-trie-based block manager for automatic prefix caching - already implemented in code from vllm-project/vllm (see the sketch after these lists)
  • HabanaPagedAttention.forward_prefix (this PR)

For full support, the following work remains:

  • pad inputs to proper buckets
  • warmup logic for prompts with context
  • enable fully supported HpuGraph

Further optimizations could be done to boost performance:

  • reduce host overhead
  • enable fused SDPA in forward_prefix
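To illustrate the block-manager idea referenced above, here is a hedged, self-contained sketch of how prefix-aware block allocation can share physical blocks between sequences with a common prefix (illustrative only; the real vllm-project/vllm block manager is more involved):

from typing import Dict, List

class PrefixBlockCache:
    # Toy prefix cache: each block is keyed by a hash of all tokens up to
    # and including that block, so shared prefixes map to shared blocks.
    def __init__(self, block_size: int):
        self.block_size = block_size
        self.cached: Dict[int, int] = {}  # prefix hash -> physical block id
        self.next_block = 0

    def allocate(self, token_ids: List[int]) -> List[int]:
        blocks = []
        full = len(token_ids) - len(token_ids) % self.block_size
        for i in range(0, full, self.block_size):
            prefix_hash = hash(tuple(token_ids[:i + self.block_size]))
            if prefix_hash not in self.cached:  # cache miss: allocate a fresh block
                self.cached[prefix_hash] = self.next_block
                self.next_block += 1
            blocks.append(self.cached[prefix_hash])
        return blocks

cache = PrefixBlockCache(block_size=4)
print(cache.allocate(list(range(12))))                    # [0, 1, 2]
print(cache.allocate(list(range(8)) + [99, 98, 97, 96]))  # [0, 1, 3]: first two blocks reused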

Comment on lines 57 to 58
_PAD_SLOT_ID = torch.iinfo(torch.int).max - 256
_PAD_BLOCK_ID = torch.iinfo(torch.int).max
@huijjj (Author) commented:

We shouldn't be using block 0 and slot 0 for padding, as that block and slot can be legitimately in use. With prefix caching enabled, the block manager allocates blocks starting from 0, and even without prefix caching there are cases where we use 100% of the blocks.
So instead, I suggest using int max for the padded value: we can leverage the fact that the TPC implementations of index select and index put on Gaudi tolerate out-of-range indices.

import torch
import habana_frameworks.torch

foo = torch.zeros(4, device="hpu")
idx = torch.tensor([1203], device="hpu")

print(foo[idx]) # prints tensor([0.], device="hpu"): out-of-range gather returns 0
foo[idx] = 1203 # this is a no-op: the out-of-range index put is silently dropped
print(foo) # prints tensor([0., 0., 0., 0.], device="hpu")

@huijjj force-pushed the enable-prefix-caching branch from c71f35f to 79a093c on October 16, 2024
@huijjj (Author) commented Oct 16, 2024

@kzawora-intel
Split the PR in two, just as you requested. Please take a look at the PR in vllm-hpu-extension first.
After that one is merged, I will bump requirements-hpu.txt to install the proper commit of vllm-hpu-extension in this PR.

@michalkuligowski commented:

Hi @huijjj, #12 is now merged; please update https://github.com/HabanaAI/vllm-fork/blob/habana_main/requirements-hpu.txt to 79f3aa7.

@huijjj (Author) commented Oct 17, 2024

@michalkuligowski Thanks, I've updated the requirements.

@michalkuligowski commented:

There are some conflicts to resolve on the branch; do you want me to look into them, or will you resolve them?

@huijjj force-pushed the enable-prefix-caching branch from 7331919 to 4d6417f on October 17, 2024
@huijjj (Author) commented Oct 17, 2024

Conflicts are resolved.

@michalkuligowski commented:

@huijjj there are some static code analysis issues; please run the format.sh script from the main directory.

@huijjj (Author) commented Oct 17, 2024

@michalkuligowski
Sorry, I forgot to format the code after the rebase. The code is now formatted with format.sh (merge-base set to origin/habana_main).

@huijjj (Author) commented Oct 17, 2024

The cpu-test seems to be failing due to the issue I mentioned above.

Since blocks are used starting from 0 when prefix caching is enabled, we shouldn't be using block 0 for padding. Instead, I leveraged the fact that SynapseAI silently suppresses out-of-range index errors: index selects return 0 (the padded value set in the TPC kernel) and index puts do nothing. However, since this is not the default behaviour of torch (on both CPU and GPU), it causes trouble here.

Does this issue need to be resolved for my PR to get merged? To properly handle it, I think this PR should also be revisited.
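For context, the default torch behaviour that trips up cpu-test can be shown with plain PyTorch (no HPU involved); unlike the Gaudi TPC kernels, CPU/GPU indexing raises on out-of-range indices:

import torch

foo = torch.zeros(4)        # plain CPU tensor
idx = torch.tensor([1203])  # out-of-range index

try:
    foo[idx] = 1.0          # index put: raises on CPU (and device-asserts on CUDA)
except IndexError as e:
    print(e)                # index 1203 is out of bounds for dimension 0 with size 4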

@kzawora-intel commented Oct 28, 2024

Hi there, I've looked through the code and I think it's great and mostly in a mergeable state. There are two things I have comments on: the padding slot id and recompilations.

For padding, let's not use out-of-bound blocks; let's stick to block 0. Block 0 has been handled correctly since #313 was merged. I've tested your PR with both, and it works fine.

For recompilations, it's a bit more complicated:

I've used the following script to test whether prefix caching is working properly (taken mostly from https://docs.vllm.ai/en/latest/automatic_prefix_caching/apc.html):

import time
from vllm import LLM, SamplingParams


# A prompt containing a large markdown table. The table is randomly generated by GPT-4.
LONG_PROMPT = (
    "You are a helpful assistant in recognizes the content of tables in markdown format. Here is a table as follows.\n# Table\n"
    + """
| ID  | Name          | Age | Occupation    | Country       | Email                  | Phone Number   | Address                       |
|-----|---------------|-----|---------------|---------------|------------------------|----------------|------------------------------|
| 1   | John Doe      | 29  | Engineer      | USA           | [email protected]   | 555-1234       | 123 Elm St, Springfield, IL  |
| 2   | Jane Smith    | 34  | Doctor        | Canada        | [email protected] | 555-5678       | 456 Oak St, Toronto, ON      |
| 3   | Alice Johnson | 27  | Teacher       | UK            | [email protected]    | 555-8765       | 789 Pine St, London, UK      |
| 4   | Bob Brown     | 45  | Artist        | Australia     | [email protected]      | 555-4321       | 321 Maple St, Sydney, NSW    |
| 5   | Carol White   | 31  | Scientist     | New Zealand   | [email protected]    | 555-6789       | 654 Birch St, Wellington, NZ |
| 6   | Dave Green    | 28  | Lawyer        | Ireland       | [email protected]     | 555-3456       | 987 Cedar St, Dublin, IE     |
| 7   | Emma Black    | 40  | Musician      | USA           | [email protected]     | 555-1111       | 246 Ash St, New York, NY     |
| 8   | Frank Blue    | 37  | Chef          | Canada        | [email protected]    | 555-2222       | 135 Spruce St, Vancouver, BC |
| 9   | Grace Yellow  | 50  | Engineer      | UK            | [email protected]    | 555-3333       | 864 Fir St, Manchester, UK   |
| 10  | Henry Violet  | 32  | Artist        | Australia     | [email protected]    | 555-4444       | 753 Willow St, Melbourne, VIC|
| 11  | Irene Orange  | 26  | Scientist     | New Zealand   | [email protected]    | 555-5555       | 912 Poplar St, Auckland, NZ  |
| 12  | Jack Indigo   | 38  | Teacher       | Ireland       | [email protected]     | 555-6666       | 159 Elm St, Cork, IE         |
| 13  | Karen Red     | 41  | Lawyer        | USA           | [email protected]    | 555-7777       | 357 Cedar St, Boston, MA     |
| 14  | Leo Brown     | 30  | Chef          | Canada        | [email protected]      | 555-8888       | 246 Oak St, Calgary, AB      |
| 15  | Mia Green     | 33  | Musician      | UK            | [email protected]      | 555-9999       | 975 Pine St, Edinburgh, UK   |
| 16  | Noah Yellow   | 29  | Doctor        | Australia     | [email protected]     | 555-0000       | 864 Birch St, Brisbane, QLD  |
| 17  | Olivia Blue   | 35  | Engineer      | New Zealand   | [email protected]   | 555-1212       | 753 Maple St, Hamilton, NZ   |
| 18  | Peter Black   | 42  | Artist        | Ireland       | [email protected]    | 555-3434       | 912 Fir St, Limerick, IE     |
| 19  | Quinn White   | 28  | Scientist     | USA           | [email protected]    | 555-5656       | 159 Willow St, Seattle, WA   |
| 20  | Rachel Red    | 31  | Teacher       | Canada        | [email protected]   | 555-7878       | 357 Poplar St, Ottawa, ON    |
| 21  | Steve Green   | 44  | Lawyer        | UK            | [email protected]    | 555-9090       | 753 Elm St, Birmingham, UK   |
| 22  | Tina Blue     | 36  | Musician      | Australia     | [email protected]     | 555-1213       | 864 Cedar St, Perth, WA      |
| 23  | Umar Black    | 39  | Chef          | New Zealand   | [email protected]     | 555-3435       | 975 Spruce St, Christchurch, NZ|
| 24  | Victor Yellow | 43  | Engineer      | Ireland       | [email protected]   | 555-5657       | 246 Willow St, Galway, IE    |
| 25  | Wendy Orange  | 27  | Artist        | USA           | [email protected]    | 555-7879       | 135 Elm St, Denver, CO       |
| 26  | Xavier Green  | 34  | Scientist     | Canada        | [email protected]   | 555-9091       | 357 Oak St, Montreal, QC     |
| 27  | Yara Red      | 41  | Teacher       | UK            | [email protected]     | 555-1214       | 975 Pine St, Leeds, UK       |
| 28  | Zack Blue     | 30  | Lawyer        | Australia     | [email protected]     | 555-3436       | 135 Birch St, Adelaide, SA   |
| 29  | Amy White     | 33  | Musician      | New Zealand   | [email protected]      | 555-5658       | 159 Maple St, Wellington, NZ |
| 30  | Ben Black     | 38  | Chef          | Ireland       | [email protected]      | 555-7870       | 246 Fir St, Waterford, IE    |
"""
)


def get_generation_time(llm, sampling_params, prompts):
    # time the generation
    start_time = time.time()
    output = llm.generate(prompts, sampling_params=sampling_params)
    end_time = time.time()
    # print the output and generation time
    print(f"Output: {output[0].outputs[0].text}")
    print(f"Generation time: {end_time - start_time} seconds.")


# set enable_prefix_caching=True to enable APC
llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    enable_prefix_caching=True,
    max_num_seqs=1,
    max_model_len=8192,
    dtype="bfloat16",
)

sampling_params = SamplingParams(temperature=0, max_tokens=100)

names = [
    "John Doe",
    "Jane Smith",
    "Alice Johnson",
    "Bob Brown",
    "Carol White",
    "Dave Green",
    "Emma Black",
    "Frank Blue",
    "Grace Yellow",
    "Henry Violet",
    "Irene Orange",
    "Jack Indigo",
    "Karen Red",
    "Leo Brown",
    "Mia Green",
    "Noah Yellow",
    "Olivia Blue",
    "Peter Black",
    "Quinn White",
    "Rachel Red",
    "Steve Green",
    "Tina Blue",
    "Umar Black",
    "Victor Yellow",
    "Wendy Orange",
    "Xavier Green",
    "Yara Red",
    "Zack Blue",
    "Amy White",
    "Ben Black",
]
for name in names:
    get_generation_time(
        llm,
        sampling_params,
        LONG_PROMPT
        + f"Question: what is the age of {name}? Your answer: The age of {name} is ",
    )

With enable_prefix_caching=False I can see that the performance is steady:

Processed prompts:   0%|                                                                                                                       | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]WARNING 10-28 17:27:29 hpu_model_runner.py:1887] Configuration: (prompt, 1, 1408) was not warmed-up!
WARNING 10-28 17:27:30 hpu_model_runner.py:1887] Configuration: (decode, 1, 128) was not warmed-up!
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.15it/s, est. speed input: 1587.83 toks/s, output: 3.44 toks/s]
Output: 29.
Generation time: 0.8759636878967285 seconds.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  8.76it/s, est. speed input: 12139.45 toks/s, output: 26.31 toks/s]
Output: 34.
Generation time: 0.11672687530517578 seconds.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  8.82it/s, est. speed input: 12218.54 toks/s, output: 26.48 toks/s]
Output: 27.
Generation time: 0.11556267738342285 seconds.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  8.97it/s, est. speed input: 12427.35 toks/s, output: 26.94 toks/s]
Output: 45.
Generation time: 0.11353206634521484 seconds.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  8.85it/s, est. speed input: 12273.23 toks/s, output: 26.60 toks/s]
Output: 31.
Generation time: 0.11492085456848145 seconds.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  8.86it/s, est. speed input: 12281.22 toks/s, output: 26.62 toks/s]
Output: 28.
...

With enable_prefix_caching=True I see the following:

Processed prompts:   0%|                                                                                                                       | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]WARNING 10-28 17:28:22 hpu_model_runner.py:1887] Configuration: (prompt, 1, 1408) was not warmed-up!
WARNING 10-28 17:28:23 hpu_model_runner.py:1887] Configuration: (decode, 1, 128) was not warmed-up!
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.15it/s, est. speed input: 1596.23 toks/s, output: 3.46 toks/s]
Output: 29.
Generation time: 0.871584415435791 seconds.
Processed prompts:   0%|                                                                                                                       | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]WARNING 10-28 17:28:23 hpu_model_runner.py:1887] Configuration: (prompt, 1, 128) was not warmed-up!
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.96it/s, est. speed input: 4098.56 toks/s, output: 8.88 toks/s]
Output: 34.
Generation time: 0.3404667377471924 seconds.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 19.52it/s, est. speed input: 27051.96 toks/s, output: 58.62 toks/s]
Output: 27.
Generation time: 0.053774118423461914 seconds.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 20.30it/s, est. speed input: 28120.92 toks/s, output: 60.94 toks/s]
Output: 45.
Generation time: 0.05145907402038574 seconds.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 20.31it/s, est. speed input: 28131.82 toks/s, output: 60.96 toks/s]
Output: 31.
Generation time: 0.051267385482788086 seconds.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 20.35it/s, est. speed input: 28190.57 toks/s, output: 61.09 toks/s]
Output: 28.
...

So, overall, in this case we see a ~2x performance boost in steady state. That's great! But here's the thing that worries me: non-prefix caching achieves steady performance after the first iteration (graph compilation), while prefix caching does an additional graph compilation in the second step. The graph with prefix-cached prefills is never warmed up and will need to be compiled on the go.

This is something we currently avoid with the warmup phase for regular prefills and decodes, but there is no such warmup mechanism for prefix caching, meaning that if we enable it, we're guaranteed to hit recompilations at runtime. Even worse, at the executor level we don't even know that we recompiled, so we can't throw a warning to the user, since cached and uncached prefills are treated as the same phase. We should probably make a distinction between prompt_uncached and prompt_cached phases and perform warmup on both (if prefix caching is enabled); a rough sketch of that idea follows.
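As a purely hypothetical sketch of that warmup split (warmup_buckets and run_forward are made-up placeholders, not the actual hpu_model_runner API):

def warmup_buckets(phase):
    # per-phase (batch_size, seq_len) buckets; values are illustrative
    return [(1, 128), (1, 1024), (4, 128)]

def run_forward(phase, batch_size, seq_len):
    # stand-in for a dummy forward pass that triggers graph compilation
    print(f"warming up {phase}: bs={batch_size}, seq_len={seq_len}")

prefix_caching_enabled = True
phases = ["prompt_uncached", "decode"]
if prefix_caching_enabled:
    phases.append("prompt_cached")  # prefills that reuse cached context blocks

for phase in phases:
    for batch_size, seq_len in warmup_buckets(phase):
        run_forward(phase, batch_size, seq_len)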

That said, I don't think we should include warmup in the scope of this PR. We can merge this as is (with the padding ids changed) and create follow-up PRs for the mentioned features. This is already very good.

@huijjj force-pushed the enable-prefix-caching branch from cc7d5b8 to b84fc09 on October 29, 2024
@huijjj (Author) commented Oct 29, 2024

@kzawora-intel I updated the pad values to 0 as you suggested. The PR is now ready to be merged. Thanks.

However, I still have some concerns about using block 0 like this; I see similar concerns in the vllm-project/vllm upstream PR. Please let me know if any further actions are taken regarding this.

@kzawora-intel commented:

Please fix formatting with format.sh so that all checks pass, and we can merge it.

@huijjj (Author) commented Oct 29, 2024

Code formatted, thanks.

@kzawora-intel kzawora-intel merged commit 1dcdb37 into HabanaAI:habana_main Oct 29, 2024
17 checks passed
@xuechendi commented Nov 15, 2024

@huijjj @kzawora-intel I tried APC and hit an error caused by vllm-hpu-extension. I proposed a fix, and APC is now working.
Fix: HabanaAI/vllm-hpu-extension#33

Please help to review.
