
initial work on enabling automatic prefix caching #162

Merged: 8 commits into HabanaAI:habana_main from enable-prefix-caching on Oct 29, 2024

Conversation

@huijjj commented Aug 6, 2024

This PR enables automatic prefix caching on Intel Gaudi HPUs.
Please refer to this RFC for detailed information about prefix caching.

@huijjj (Author) commented Aug 7, 2024

While reviewing the changes myself, I found some issues with the cache operation implementations.
The habana_main branch currently has a modified version of the cache engine implementation, but those changes do not seem to be reflected in hpu/cache_ops.py yet. Since those cache operations are needed in the prefix caching scenario, I will add the required changes to this PR.
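For reference, here is a minimal sketch of the kind of cache op involved; this is illustrative only, not the actual hpu/cache_ops.py code. A reshape_and_cache-style op scatters new key/value entries into the paged KV cache at the positions given by a slot mapping:

import torch

def reshape_and_cache(key, value, key_cache, value_cache, slot_mapping, block_size):
    # key/value: [num_tokens, num_heads, head_size]
    # key_cache/value_cache: [num_blocks, block_size, num_heads, head_size]
    block_idx = torch.div(slot_mapping, block_size, rounding_mode="floor")
    block_offset = slot_mapping % block_size
    # advanced indexing scatters each token's K/V into its (block, offset) slot
    key_cache[block_idx, block_offset] = key
    value_cache[block_idx, block_offset] = value

key_cache = torch.zeros(8, 4, 2, 16)   # 8 blocks, block_size 4, 2 heads, head_size 16
value_cache = torch.zeros_like(key_cache)
key = torch.randn(3, 2, 16)            # 3 new tokens
value = torch.randn(3, 2, 16)
reshape_and_cache(key, value, key_cache, value_cache,
                  torch.tensor([4, 5, 6]), block_size=4)  # slots 4-6 live in block 1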

@huijjj force-pushed the enable-prefix-caching branch from b9aedd2 to 60753bc on August 21, 2024
@kzawora-intel added the external label (Issues or PRs submitted by external users) on Aug 29, 2024
@huijjj force-pushed the enable-prefix-caching branch from 60753bc to c05e969 on September 6, 2024
@huijjj force-pushed the enable-prefix-caching branch from a203d1a to 5e6d30b on September 15, 2024
@huijjj (Author) commented Sep 19, 2024

Please note that this PR covers only the very basics of enabling automatic prefix caching on Gaudi, namely:

  • prefix-trie-based block manager for automatic prefix caching - already implemented in code from vllm-project/vllm (see the sketch after these lists)
  • HabanaPagedAttention.forward_prefix (this PR)

For full support, the following work remains:

  • pad inputs to proper buckets
  • warmup logic for prompts with context
  • enable fully supported HpuGraph

Further optimizations could be done to boost performance:

  • reduce host overhead
  • enable fused SDPA in forward_prefix
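To illustrate the block-manager idea referenced above, here is a hedged, self-contained sketch of how prefix-aware block allocation can share physical blocks between sequences with a common prefix (illustrative only; the real vllm-project/vllm block manager is more involved):

from typing import Dict, List

class PrefixBlockCache:
    # Toy prefix cache: each block is keyed by a hash of all tokens up to
    # and including that block, so shared prefixes map to shared blocks.
    def __init__(self, block_size: int):
        self.block_size = block_size
        self.cached: Dict[int, int] = {}  # prefix hash -> physical block id
        self.next_block = 0

    def allocate(self, token_ids: List[int]) -> List[int]:
        blocks = []
        full = len(token_ids) - len(token_ids) % self.block_size
        for i in range(0, full, self.block_size):
            prefix_hash = hash(tuple(token_ids[:i + self.block_size]))
            if prefix_hash not in self.cached:  # cache miss: allocate a fresh block
                self.cached[prefix_hash] = self.next_block
                self.next_block += 1
            blocks.append(self.cached[prefix_hash])
        return blocks

cache = PrefixBlockCache(block_size=4)
print(cache.allocate(list(range(12))))                    # [0, 1, 2]
print(cache.allocate(list(range(8)) + [99, 98, 97, 96]))  # [0, 1, 3]: first two blocks reused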

Comment on lines 57 to 58
_PAD_SLOT_ID = torch.iinfo(torch.int).max - 256
_PAD_BLOCK_ID = torch.iinfo(torch.int).max
@huijjj (Author) commented:

We shouldn't be using block 0 and slot 0 for padding, as that block and slot can be legitimately in use. With prefix caching enabled, the block manager allocates blocks starting from 0, and even without prefix caching there are cases where we use 100% of the blocks.
So instead, I suggest using int max for the padded value: we can leverage the fact that the TPC implementations of index select and index put on Gaudi tolerate out-of-range indices.

import torch
import habana_frameworks.torch

foo = torch.zeros(4, device="hpu")
idx = torch.tensor([1203], device="hpu")

print(foo[idx]) # prints tensor([0.], device="hpu"): out-of-range gather returns 0
foo[idx] = 1203 # this is a no-op: the out-of-range index put is silently dropped
print(foo) # prints tensor([0., 0., 0., 0.], device="hpu")

@huijjj force-pushed the enable-prefix-caching branch from c71f35f to 79a093c on October 16, 2024
@huijjj (Author) commented Oct 16, 2024

@kzawora-intel
Split the PR in two, just as you requested. Please take a look at the PR in vllm-hpu-extension first.
After that one is merged, I will bump requirements-hpu.txt to install the proper commit of vllm-hpu-extension in this PR.

@michalkuligowski commented:

Hi @huijjj, #12 is now merged; please update https://github.com/HabanaAI/vllm-fork/blob/habana_main/requirements-hpu.txt to 79f3aa7.

@huijjj (Author) commented Oct 17, 2024

@michalkuligowski Thanks, I've updated the requirements.

@michalkuligowski commented:

There are some conflicts to resolve on the branch; do you want me to look into them, or will you resolve them?

@huijjj force-pushed the enable-prefix-caching branch from 7331919 to 4d6417f on October 17, 2024
@huijjj (Author) commented Oct 17, 2024

Conflicts are resolved.

@michalkuligowski commented:

@huijjj there are some static code analysis issues; please run the format.sh script from the main directory.

@huijjj (Author) commented Oct 17, 2024

@michalkuligowski
Sorry, I forgot to format the code after the rebase. The code is now formatted with format.sh (merge-base set to origin/habana_main).

@huijjj (Author) commented Oct 17, 2024

The cpu-test seems to be failing due to the issue I mentioned above.

Since blocks are used starting from 0 when prefix caching is enabled, we shouldn't be using block 0 for padding. Instead, I leveraged the fact that SynapseAI silently suppresses out-of-range index errors: index selects return 0 (the padded value set in the TPC kernel) and index puts do nothing. However, since this is not the default behaviour of torch (on both CPU and GPU), it causes trouble here.

Does this issue need to be resolved for my PR to get merged? To properly handle it, I think this PR should also be revisited.
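For context, the default torch behaviour that trips up cpu-test can be shown with plain PyTorch (no HPU involved); unlike the Gaudi TPC kernels, CPU/GPU indexing raises on out-of-range indices:

import torch

foo = torch.zeros(4)        # plain CPU tensor
idx = torch.tensor([1203])  # out-of-range index

try:
    foo[idx] = 1.0          # index put: raises on CPU (and device-asserts on CUDA)
except IndexError as e:
    print(e)                # index 1203 is out of bounds for dimension 0 with size 4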

@kzawora-intel commented Oct 28, 2024

Hi there, I've looked through the code and I think it's great and mostly in a mergeable state. There are two things I have comments on: the padding slot id and recompilations.

For padding, let's not use out-of-bound blocks; let's stick to block 0. Block 0 has been handled correctly since #313 was merged. I've tested your PR with both, and it works fine.

For recompilations, it's a bit more complicated:

I've used the following script to test whether prefix caching is working properly (taken mostly from https://docs.vllm.ai/en/latest/automatic_prefix_caching/apc.html):

import time
from vllm import LLM, SamplingParams


# A prompt containing a large markdown table. The table is randomly generated by GPT-4.
LONG_PROMPT = (
    "You are a helpful assistant in recognizes the content of tables in markdown format. Here is a table as follows.\n# Table\n"
    + """
| ID  | Name          | Age | Occupation    | Country       | Email                  | Phone Number   | Address                       |
|-----|---------------|-----|---------------|---------------|------------------------|----------------|------------------------------|
| 1   | John Doe      | 29  | Engineer      | USA           | [email protected]   | 555-1234       | 123 Elm St, Springfield, IL  |
| 2   | Jane Smith    | 34  | Doctor        | Canada        | [email protected] | 555-5678       | 456 Oak St, Toronto, ON      |
| 3   | Alice Johnson | 27  | Teacher       | UK            | [email protected]    | 555-8765       | 789 Pine St, London, UK      |
| 4   | Bob Brown     | 45  | Artist        | Australia     | [email protected]      | 555-4321       | 321 Maple St, Sydney, NSW    |
| 5   | Carol White   | 31  | Scientist     | New Zealand   | [email protected]    | 555-6789       | 654 Birch St, Wellington, NZ |
| 6   | Dave Green    | 28  | Lawyer        | Ireland       | [email protected]     | 555-3456       | 987 Cedar St, Dublin, IE     |
| 7   | Emma Black    | 40  | Musician      | USA           | [email protected]     | 555-1111       | 246 Ash St, New York, NY     |
| 8   | Frank Blue    | 37  | Chef          | Canada        | [email protected]    | 555-2222       | 135 Spruce St, Vancouver, BC |
| 9   | Grace Yellow  | 50  | Engineer      | UK            | [email protected]    | 555-3333       | 864 Fir St, Manchester, UK   |
| 10  | Henry Violet  | 32  | Artist        | Australia     | [email protected]    | 555-4444       | 753 Willow St, Melbourne, VIC|
| 11  | Irene Orange  | 26  | Scientist     | New Zealand   | [email protected]    | 555-5555       | 912 Poplar St, Auckland, NZ  |
| 12  | Jack Indigo   | 38  | Teacher       | Ireland       | [email protected]     | 555-6666       | 159 Elm St, Cork, IE         |
| 13  | Karen Red     | 41  | Lawyer        | USA           | [email protected]    | 555-7777       | 357 Cedar St, Boston, MA     |
| 14  | Leo Brown     | 30  | Chef          | Canada        | [email protected]      | 555-8888       | 246 Oak St, Calgary, AB      |
| 15  | Mia Green     | 33  | Musician      | UK            | [email protected]      | 555-9999       | 975 Pine St, Edinburgh, UK   |
| 16  | Noah Yellow   | 29  | Doctor        | Australia     | [email protected]     | 555-0000       | 864 Birch St, Brisbane, QLD  |
| 17  | Olivia Blue   | 35  | Engineer      | New Zealand   | [email protected]   | 555-1212       | 753 Maple St, Hamilton, NZ   |
| 18  | Peter Black   | 42  | Artist        | Ireland       | [email protected]    | 555-3434       | 912 Fir St, Limerick, IE     |
| 19  | Quinn White   | 28  | Scientist     | USA           | [email protected]    | 555-5656       | 159 Willow St, Seattle, WA   |
| 20  | Rachel Red    | 31  | Teacher       | Canada        | [email protected]   | 555-7878       | 357 Poplar St, Ottawa, ON    |
| 21  | Steve Green   | 44  | Lawyer        | UK            | [email protected]    | 555-9090       | 753 Elm St, Birmingham, UK   |
| 22  | Tina Blue     | 36  | Musician      | Australia     | [email protected]     | 555-1213       | 864 Cedar St, Perth, WA      |
| 23  | Umar Black    | 39  | Chef          | New Zealand   | [email protected]     | 555-3435       | 975 Spruce St, Christchurch, NZ|
| 24  | Victor Yellow | 43  | Engineer      | Ireland       | [email protected]   | 555-5657       | 246 Willow St, Galway, IE    |
| 25  | Wendy Orange  | 27  | Artist        | USA           | [email protected]    | 555-7879       | 135 Elm St, Denver, CO       |
| 26  | Xavier Green  | 34  | Scientist     | Canada        | [email protected]   | 555-9091       | 357 Oak St, Montreal, QC     |
| 27  | Yara Red      | 41  | Teacher       | UK            | [email protected]     | 555-1214       | 975 Pine St, Leeds, UK       |
| 28  | Zack Blue     | 30  | Lawyer        | Australia     | [email protected]     | 555-3436       | 135 Birch St, Adelaide, SA   |
| 29  | Amy White     | 33  | Musician      | New Zealand   | [email protected]      | 555-5658       | 159 Maple St, Wellington, NZ |
| 30  | Ben Black     | 38  | Chef          | Ireland       | [email protected]      | 555-7870       | 246 Fir St, Waterford, IE    |
"""
)


def get_generation_time(llm, sampling_params, prompts):
    # time the generation
    start_time = time.time()
    output = llm.generate(prompts, sampling_params=sampling_params)
    end_time = time.time()
    # print the output and generation time
    print(f"Output: {output[0].outputs[0].text}")
    print(f"Generation time: {end_time - start_time} seconds.")


# set enable_prefix_caching=True to enable APC
llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    enable_prefix_caching=True,
    max_num_seqs=1,
    max_model_len=8192,
    dtype="bfloat16",
)

sampling_params = SamplingParams(temperature=0, max_tokens=100)

names = [
    "John Doe",
    "Jane Smith",
    "Alice Johnson",
    "Bob Brown",
    "Carol White",
    "Dave Green",
    "Emma Black",
    "Frank Blue",
    "Grace Yellow",
    "Henry Violet",
    "Irene Orange",
    "Jack Indigo",
    "Karen Red",
    "Leo Brown",
    "Mia Green",
    "Noah Yellow",
    "Olivia Blue",
    "Peter Black",
    "Quinn White",
    "Rachel Red",
    "Steve Green",
    "Tina Blue",
    "Umar Black",
    "Victor Yellow",
    "Wendy Orange",
    "Xavier Green",
    "Yara Red",
    "Zack Blue",
    "Amy White",
    "Ben Black",
]
for name in names:
    get_generation_time(
        llm,
        sampling_params,
        LONG_PROMPT
        + f"Question: what is the age of {name}? Your answer: The age of {name} is ",
    )

With enable_prefix_caching=False I can see that the performance is steady:

Processed prompts:   0%|                                                                                                                       | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]WARNING 10-28 17:27:29 hpu_model_runner.py:1887] Configuration: (prompt, 1, 1408) was not warmed-up!
WARNING 10-28 17:27:30 hpu_model_runner.py:1887] Configuration: (decode, 1, 128) was not warmed-up!
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.15it/s, est. speed input: 1587.83 toks/s, output: 3.44 toks/s]
Output: 29.
Generation time: 0.8759636878967285 seconds.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  8.76it/s, est. speed input: 12139.45 toks/s, output: 26.31 toks/s]
Output: 34.
Generation time: 0.11672687530517578 seconds.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  8.82it/s, est. speed input: 12218.54 toks/s, output: 26.48 toks/s]
Output: 27.
Generation time: 0.11556267738342285 seconds.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  8.97it/s, est. speed input: 12427.35 toks/s, output: 26.94 toks/s]
Output: 45.
Generation time: 0.11353206634521484 seconds.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  8.85it/s, est. speed input: 12273.23 toks/s, output: 26.60 toks/s]
Output: 31.
Generation time: 0.11492085456848145 seconds.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  8.86it/s, est. speed input: 12281.22 toks/s, output: 26.62 toks/s]
Output: 28.
...

With enable_prefix_caching=True I see the following:

Processed prompts:   0%|                                                                                                                       | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]WARNING 10-28 17:28:22 hpu_model_runner.py:1887] Configuration: (prompt, 1, 1408) was not warmed-up!
WARNING 10-28 17:28:23 hpu_model_runner.py:1887] Configuration: (decode, 1, 128) was not warmed-up!
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.15it/s, est. speed input: 1596.23 toks/s, output: 3.46 toks/s]
Output: 29.
Generation time: 0.871584415435791 seconds.
Processed prompts:   0%|                                                                                                                       | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]WARNING 10-28 17:28:23 hpu_model_runner.py:1887] Configuration: (prompt, 1, 128) was not warmed-up!
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.96it/s, est. speed input: 4098.56 toks/s, output: 8.88 toks/s]
Output: 34.
Generation time: 0.3404667377471924 seconds.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 19.52it/s, est. speed input: 27051.96 toks/s, output: 58.62 toks/s]
Output: 27.
Generation time: 0.053774118423461914 seconds.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 20.30it/s, est. speed input: 28120.92 toks/s, output: 60.94 toks/s]
Output: 45.
Generation time: 0.05145907402038574 seconds.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 20.31it/s, est. speed input: 28131.82 toks/s, output: 60.96 toks/s]
Output: 31.
Generation time: 0.051267385482788086 seconds.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 20.35it/s, est. speed input: 28190.57 toks/s, output: 61.09 toks/s]
Output: 28.
...

So, overall, in this case we see a ~2x performance boost in steady state. That's great! But here's the thing that worries me: non-prefix caching achieves steady performance after the first iteration (graph compilation), while prefix caching does an additional graph compilation in the second step. The graph with prefix-cached prefills is never warmed up and will need to be compiled on the go.

This is something we currently avoid with the warmup phase for regular prefills and decodes, but there is no such warmup mechanism for prefix caching, meaning that if we enable it, we're guaranteed to hit recompilations at runtime. Even worse, at the executor level we don't even know that we recompiled, so we can't throw a warning to the user, since cached and uncached prefills are treated as the same phase. We should probably make a distinction between prompt_uncached and prompt_cached phases and perform warmup on both (if prefix caching is enabled); a rough sketch of that idea follows.
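As a purely hypothetical sketch of that warmup split (warmup_buckets and run_forward are made-up placeholders, not the actual hpu_model_runner API):

def warmup_buckets(phase):
    # per-phase (batch_size, seq_len) buckets; values are illustrative
    return [(1, 128), (1, 1024), (4, 128)]

def run_forward(phase, batch_size, seq_len):
    # stand-in for a dummy forward pass that triggers graph compilation
    print(f"warming up {phase}: bs={batch_size}, seq_len={seq_len}")

prefix_caching_enabled = True
phases = ["prompt_uncached", "decode"]
if prefix_caching_enabled:
    phases.append("prompt_cached")  # prefills that reuse cached context blocks

for phase in phases:
    for batch_size, seq_len in warmup_buckets(phase):
        run_forward(phase, batch_size, seq_len)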

That said, I don't think we should include warmup in the scope of this PR. We can merge this as is (with the padding ids changed) and create follow-up PRs for the mentioned features. This is already very good.

@huijjj force-pushed the enable-prefix-caching branch from cc7d5b8 to b84fc09 on October 29, 2024
@huijjj (Author) commented Oct 29, 2024

@kzawora-intel I updated the pad values to 0 as you suggested. The PR is now ready to be merged. Thanks.

However, I still have some concerns about using block 0 like this; I see similar concerns in the vllm-project/vllm upstream PR. Please let me know if any further actions are taken regarding this.

@kzawora-intel commented:

Please fix formatting with format.sh so that all checks pass, and we can merge it.

@huijjj (Author) commented Oct 29, 2024

Code formatted, thanks.

@kzawora-intel kzawora-intel merged commit 1dcdb37 into HabanaAI:habana_main Oct 29, 2024
17 checks passed
@xuechendi commented Nov 15, 2024

@huijjj @kzawora-intel I tried APC and hit an error caused by vllm-hpu-extension. I proposed a fix, and APC is now working.
Fix: HabanaAI/vllm-hpu-extension#33

Please help to review.
