
Torch engine prefix caching #1393

Closed
grimoire wants to merge 9 commits

Conversation

grimoire
Collaborator

@grimoire grimoire commented Apr 4, 2024

Enable by setting shared_cache=True.
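A minimal usage sketch, assuming the flag is exposed through PytorchEngineConfig (the exact entry point wired up in this PR may differ):

```python
# Hypothetical usage sketch: assumes `shared_cache` is exposed via
# PytorchEngineConfig; the actual wiring in this PR may differ.
from lmdeploy import pipeline, PytorchEngineConfig

backend_config = PytorchEngineConfig(shared_cache=True)  # enable prefix caching
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
print(pipe(['Hi, please introduce yourself.']))
```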

@zhyncs
Collaborator

zhyncs commented Apr 7, 2024

Hi @grimoire @lvhan028, why did you choose the radix tree implementation? Have you considered using a hash table implementation? What factors did you consider, such as scalability or performance? Thanks.

@grimoire
Collaborator Author

grimoire commented Apr 7, 2024

Any details about the hash table implementation?
Honestly, I do not like my radix tree implementation in this PR.

@zhyncs
Collaborator

zhyncs commented Apr 7, 2024

Any details about the hash table implementation? Honestly, I do not like my radix tree implementation in this PR.

@ispobock may follow up. So far we have researched the implementations of vLLM, RTP-LLM, and SGLang.

@ispobock
Contributor

ispobock commented Apr 7, 2024

@grimoire We compared the prefix cache implementations of other projects:

  • vllm

    • Hash Table
    • computes a hash key for each block: hash(prefix tokens, tokens in this block)
    • block-level reuse: if seq1 is xxxxyyyy, seq2 is xxxxzzzz, seq3 is xxxxyyyyzzzz, and each block contains 4 tokens, then seq2 can reuse the first block of seq1 and seq3 can reuse 2 blocks of seq1
    • currently only supports prefix cache (xxxxxoooo), but plans to support general cache (xxxoooxxxooo) in the future
    • hash collisions may need to be considered
    • Complexity:
      • Assume N is the number of sequences and L is the length of a sequence
      • Time (Find & Insert): O(N*(L^2)), because computing the hash keys needs O(L^2), as mentioned here
      • Space: O(N*L)
  • rtp-llm

    • Hash Table
    • computes a hash key for each sequence: hash(tokens in sequence)
    • block-level reuse, like vllm
    • Complexity:
      • Time
        • Find: O((N^2)*L), due to token-level matching
        • Insert: O(N*L)
      • Space: O(N*L)
  • sglang

    • Radix Tree
    • can only support prefix cache (xxxxxoooo), not general cache (xxxoooxxxooo)
    • Complexity:
      • Time (Find & Insert): O(N*L)
      • Space: worst O(N*L), if no shared part
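
A minimal sketch of the vLLM-style block hashing described in the first bullet group above (illustrative only, not vLLM's actual code). Each block's key hashes every token up to and including that block, so sequences sharing a prefix map to the same keys:

```python
# Illustrative block-level hash prefix cache; not vLLM's actual implementation.
from typing import Dict, List, Tuple

BLOCK_SIZE = 4

def block_keys(tokens: List[int]) -> List[int]:
    """One key per full block: hash(prefix tokens + tokens in this block)."""
    return [hash(tuple(tokens[:end]))
            for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE)]

class HashPrefixCache:
    def __init__(self) -> None:
        self.key_to_block: Dict[int, int] = {}
        self.next_block_id = 0

    def match_or_allocate(self, tokens: List[int]) -> Tuple[List[int], int]:
        """Return (block ids for this sequence, number of reused blocks)."""
        block_ids, reused = [], 0
        for key in block_keys(tokens):
            if key in self.key_to_block:
                reused += 1
            else:
                self.key_to_block[key] = self.next_block_id
                self.next_block_id += 1
            block_ids.append(self.key_to_block[key])
        return block_ids, reused

cache = HashPrefixCache()
seq1 = [1, 1, 1, 1, 2, 2, 2, 2]               # xxxxyyyy
seq2 = [1, 1, 1, 1, 3, 3, 3, 3]               # xxxxzzzz
seq3 = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]   # xxxxyyyyzzzz
print(cache.match_or_allocate(seq1))  # ([0, 1], 0): two new blocks
print(cache.match_or_allocate(seq2))  # ([0, 2], 1): reuses the first block of seq1
print(cache.match_or_allocate(seq3))  # ([0, 1, 3], 2): reuses two blocks of seq1
```

Hashing the growing prefix for every block is what gives the O(L^2) per-sequence cost noted above.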

@grimoire
Collaborator Author

grimoire commented Apr 7, 2024

When do we need a general cache?

@grimoire
Collaborator Author

grimoire commented Apr 7, 2024

@ispobock Do they support window attention? How do they evict blocks? Would it take a long time if we have a large number of blocks?

S-LoRA would increase the number of blocks (by using a small block size), and window attention would make block eviction more complex. I have failed to find a good solution.

@zhyncs
Collaborator

zhyncs commented Apr 7, 2024

@ispobock Do they support window attention? How do they evict blocks? Would it take a long time if we have a large number of blocks?

S-LoRA would increase the number of blocks (by using a small block size), and window attention would make block eviction more complex. I have failed to find a good solution.

In mistralai-sf24/hackathon, the sliding window has been removed: https://x.com/mistralailabs/status/1771670765521281370

@zhyncs
Collaborator

zhyncs commented Apr 7, 2024

And I think this approach is acceptable for now.

```python
if self.window_size > 1 and self.shared_cache:
    logger.warning(
        'Shared cache is not available for window attention.')
    self.shared_cache = False
```

@ispobock
Contributor

ispobock commented Apr 7, 2024

@grimoire

When do we need a general cache?

For example, with seq1: xxxxyyyyzzzz and seq2: yyyyzzzz, at 4 tokens per block, a general cache would let seq2 reuse the last 2 cached blocks of seq1.
It's mentioned in vllm's design, but I'm not sure about the real usage and implementation.

How do they evict blocks? Would it take a long time if we have a large number of blocks?

It seems all of them use reference counting + LRU as the eviction policy.
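
A minimal sketch of that reference-count + LRU idea (illustrative only; the block and step bookkeeping here is hypothetical and not taken from any of the projects above):

```python
# Illustrative reference-count + LRU block eviction.
from collections import OrderedDict

class BlockPool:
    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))
        self.ref_count = {}             # block id -> number of sequences using it
        self.evictable = OrderedDict()  # unreferenced cached blocks, oldest first

    def allocate(self) -> int:
        if not self.free:
            # Evict the least recently released, unreferenced block.
            # Raises KeyError if every block is still referenced (cache full).
            victim, _ = self.evictable.popitem(last=False)
            self.free.append(victim)
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> None:
        """A new sequence reuses a cached block (prefix hit)."""
        self.evictable.pop(block, None)  # referenced blocks must not be evicted
        self.ref_count[block] = self.ref_count.get(block, 0) + 1

    def release(self, block: int, step: int) -> None:
        """A sequence finishes; keep the block's KV cache around for reuse."""
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            self.evictable[block] = step  # becomes an LRU eviction candidate
```

Referenced blocks are never evicted; only cached-but-unreferenced blocks compete under LRU.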

@zhyncs
Collaborator

zhyncs commented Apr 7, 2024

And I think this approach is acceptable for now.

```python
if self.window_size > 1 and self.shared_cache:
    logger.warning(
        'Shared cache is not available for window attention.')
    self.shared_cache = False
```

ref https://github.com/vllm-project/vllm/pull/2762/files#r1495331586

@grimoire
Collaborator Author

grimoire commented Apr 7, 2024

Sure, let's ignore the sliding window for now.

It seems that the hash map does not bring much benefit to prefix matching. Eviction by blocks takes more time than eviction by node (sort by visit time, update ref-count/visit-time, update sequence status, ...).

But adding the new node concept into the scheduler made the code error-prone and hard to maintain.
Any advice?

@ispobock
Contributor

ispobock commented Apr 7, 2024

vllm didn't take the radix tree implementation due to the maintenance burden:

Major benefits of this design over a KV block Trie

  • Sometimes, caching is not limited to prefix caching:
    • With Mistral's sliding window attention, we only need to cache the last tokens in the sliding window.
    • With attention sinks, we need to cache the first few tokens and the latest tokens.
  • Maintaining hash table is simpler than maintaining a tree.
  • Extensible to more advanced caching policy (the one above is just an example).

In sglang, there is actually no block concept because each page holds exactly one token, which simplifies the implementation.
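
With one token per page, the radix tree degenerates to a token-level trie for prefix matching. A minimal sketch of that idea (illustrative only, not sglang's actual RadixCache, which compresses edges and tracks eviction metadata):

```python
# Token-level trie sketch of sglang-style prefix matching (illustrative only).
class TrieNode:
    def __init__(self) -> None:
        self.children = {}   # token id -> TrieNode
        self.kv_slot = None  # hypothetical handle to this token's cached KV

class TokenTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def match_prefix(self, tokens):
        """Length of the leading run of tokens whose KV is already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched

    def insert(self, tokens, kv_slots):
        """Record cached KV slots for a finished sequence, token by token."""
        node = self.root
        for tok, slot in zip(tokens, kv_slots):
            node = node.children.setdefault(tok, TrieNode())
            node.kv_slot = slot

trie = TokenTrie()
trie.insert([1, 1, 2, 2], range(4))
print(trie.match_prefix([1, 1, 3]))  # -> 2: only the shared prefix is reused
```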

@lzhangzz
Collaborator

lzhangzz commented Apr 7, 2024

For example, with seq1: xxxxyyyyzzzz and seq2: yyyyzzzz, at 4 tokens per block, a general cache would let seq2 reuse the last 2 cached blocks of seq1.

In this case

  1. The positional embedding used for yyyyzzzz is offset by 4 positions (instead of starting from 0).
  2. xxxx, which was involved in the computation of xxxxyyyyzzzz, is ignored.

The result will differ from computing yyyyzzzz directly. The outcome may be similar, but there is no guarantee.
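
A tiny numeric illustration of the mismatch (hypothetical positions, 4 tokens per block):

```python
# Why reusing non-prefix blocks changes the result (illustrative only).
# In seq1 = xxxxyyyyzzzz, the cached KV for "yyyyzzzz" was computed at positions
# 4..11 and attended to the leading "xxxx"; computing seq2 = yyyyzzzz from
# scratch uses positions 0..7 and has no "xxxx" to attend to.
cached_positions = list(range(4, 12))   # positions baked into seq1's cached blocks
fresh_positions = list(range(8))        # positions seq2 actually needs
assert cached_positions != fresh_positions
```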

@zhyncs
Collaborator

zhyncs commented Apr 8, 2024

vllm didn't take the radix tree implementation due to the maintenance burden:

Major benefits of this design over a KV block Trie

  • Sometimes, caching is not limited to prefix caching:

    • With Mistral's sliding window attention, we only need to cache the last tokens in the sliding window.
    • With attention sinks, we need to cache the first few tokens and the latest tokens.
  • Maintaining hash table is simpler than maintaining a tree.

  • Extensible to more advanced caching policy (the one above is just an example).

In sglang, there is actually no block concept because each page holds exactly one token, which simplifies the implementation.

Hi @grimoire, do you have any suggestions?

@grimoire
Collaborator Author

grimoire commented Apr 8, 2024

Maintaining hash table is simpler than maintaining a tree.

That's true, especially when the block size is not 1. In this PR, a node is a wrapper around a sequence with meta info. I wanted to share the same block management code to ease the implementation, but it ... sucks.

I want to try the block-based strategy. I guess it would take a long time to design and prototype since I don't want to break any features that already exist.

@zhyncs
Collaborator

zhyncs commented Apr 8, 2024

Hi @grimoire, I would like to know: is this PR currently ready for normal use? Thanks.

@grimoire
Collaborator Author

grimoire commented Apr 8, 2024

@zhyncs Yes, this is not a draft.

@zhyncs
Collaborator

zhyncs commented Apr 9, 2024

ref #1407 (comment)

@grimoire grimoire changed the title Torch engine prefix cacheing Torch engine prefix caching Apr 11, 2024
@grimoire grimoire marked this pull request as draft April 12, 2024 07:40
@zhyncs
Collaborator

zhyncs commented Apr 18, 2024

@grimoire We compared the prefix cache implementations of other projects:

  • vllm

    • Hash Table

    • computes a hash key for each block: hash(prefix tokens, tokens in this block)

    • block-level reuse: if seq1 is xxxxyyyy, seq2 is xxxxzzzz, seq3 is xxxxyyyyzzzz, and each block contains 4 tokens, then seq2 can reuse the first block of seq1 and seq3 can reuse 2 blocks of seq1

    • currently only supports prefix cache (xxxxxoooo), but plans to support general cache (xxxoooxxxooo) in the future

    • hash collisions may need to be considered

    • Complexity:

      • Assume N is the number of sequences and L is the length of a sequence
      • Time (Find & Insert): O(N*(L^2)), because computing the hash keys needs O(L^2), as mentioned here
      • Space: O(N*L)
  • rtp-llm

    • Hash Table

    • computes a hash key for each sequence: hash(tokens in sequence)

    • block-level reuse, like vllm

    • Complexity:

      • Time

        • Find: O((N^2)*L), due to token-level matching
        • Insert: O(N*L)
      • Space: O(N*L)

  • sglang

    • Radix Tree

    • can only support prefix cache (xxxxxoooo), not general cache (xxxoooxxxooo)

    • Complexity:

      • Time (Find & Insert): O(N*L)
      • Space: worst O(N*L), if no shared part

After sgl-project/sglang#364, the RPS of SGLang's radix tree implementation increased by nearly 10%.

@lvhan028 lvhan028 closed this May 7, 2024
@merrymercy

Very good discussion here. ref vllm-project/vllm#2614 (comment)
