Torch engine prefix caching #1393
Conversation
Commits:
- backend one loop
- optimize performance
- auto status
- prefill only when necessary
- update message
Any details about the hash table implementation?
@ispobock may follow up. We are currently looking into the implementations of vLLM, RTP-LLM, and SGLang.
@grimoire We compared the prefix cache implementations of the other projects (vLLM, RTP-LLM, and SGLang).
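As a rough illustration of the hash-table idea asked about above (a minimal sketch, not the actual code of vLLM, RTP-LLM, or SGLang; the block size and hashing scheme are assumptions): each full block of token ids is hashed together with its prefix, so a hash hit means the entire prefix up to and including that block matches.

```python
from typing import Dict, List

BLOCK_SIZE = 16  # assumed block size, purely for illustration


def block_hashes(token_ids: List[int]) -> List[int]:
    """Hash every full block together with its prefix, so equal hashes imply equal prefixes."""
    hashes: List[int] = []
    prev = None
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = tuple(token_ids[start:start + BLOCK_SIZE])
        prev = hash((prev, block))
        hashes.append(prev)
    return hashes


class HashPrefixCache:
    """Prefix hash -> physical block id."""

    def __init__(self) -> None:
        self._table: Dict[int, int] = {}

    def match(self, token_ids: List[int]) -> List[int]:
        """Return the physical blocks of the longest cached full-block prefix."""
        matched: List[int] = []
        for h in block_hashes(token_ids):
            block_id = self._table.get(h)
            if block_id is None:
                break
            matched.append(block_id)
        return matched

    def insert(self, token_ids: List[int], block_ids: List[int]) -> None:
        """Register the blocks of a finished prefill so later requests can reuse them."""
        for h, block_id in zip(block_hashes(token_ids), block_ids):
            self._table.setdefault(h, block_id)
```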
When do we need a general cache?
@ispobock Do they support window attention? How do they evict blocks? Would eviction take a long time if we have a large number of blocks? S-LoRA would increase the number of blocks (by using a small block size), and window attention would make block eviction more complex. I have failed to find a good solution.
In mistralai-sf24/hackathon, the sliding window has been removed: https://x.com/mistralailabs/status/1771670765521281370
And I think this approach is acceptable for now (see lmdeploy/lmdeploy/pytorch/config.py, lines 63 to 66 at 137d106).
For example seq1:
It seems all of them are using reference count + LRU as the eviction policy.
ref https://github.com/vllm-project/vllm/pull/2762/files#r1495331586
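A hedged sketch of that "reference count + LRU" policy (an assumption-based illustration, not any project's actual code): a block can only be evicted when no running sequence references it, and unreferenced blocks are freed in least-recently-used order.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CachedBlock:
    block_id: int
    ref_count: int = 0                                   # how many live sequences use it
    last_access: float = field(default_factory=time.monotonic)


class RefCountLRUEvictor:
    """Evict only unreferenced blocks, least recently used first."""

    def __init__(self) -> None:
        self._blocks: Dict[int, CachedBlock] = {}

    def access(self, block_id: int) -> None:
        """A sequence starts using this block."""
        block = self._blocks.setdefault(block_id, CachedBlock(block_id))
        block.ref_count += 1
        block.last_access = time.monotonic()

    def release(self, block_id: int) -> None:
        """A sequence stops using this block; it stays cached until evicted."""
        self._blocks[block_id].ref_count -= 1

    def evict(self, num_blocks: int) -> List[int]:
        """Free up to num_blocks unreferenced blocks in LRU order."""
        candidates = sorted(
            (b for b in self._blocks.values() if b.ref_count == 0),
            key=lambda b: b.last_access,
        )
        evicted = [b.block_id for b in candidates[:num_blocks]]
        for block_id in evicted:
            del self._blocks[block_id]
        return evicted
```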
Sure, let's ignore the sliding window for now. It seems that the hash map does not bring much benefit to prefix matching. Eviction by blocks takes more time than eviction by node (sorting by visit time, updating ref-count/visit-time, updating sequence status, ...). But adding a new concept ...
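For contrast with block-level eviction, a hedged sketch of node-level eviction on a radix tree (SGLang-style in spirit only; names and structure are assumptions): evictable nodes are unreferenced leaves, freed in order of last visit time, and a parent becomes evictable once all of its children are gone.

```python
import heapq
import time
from typing import Dict, List, Optional


class RadixNode:
    def __init__(self, parent: Optional["RadixNode"] = None) -> None:
        self.parent = parent
        self.children: Dict[int, "RadixNode"] = {}   # first token id -> child
        self.block_ids: List[int] = []               # KV blocks owned by this node
        self.ref_count = 0
        self.last_visit = time.monotonic()

    def is_evictable_leaf(self) -> bool:
        return not self.children and self.ref_count == 0


def evict_nodes(root: RadixNode, num_blocks: int) -> List[int]:
    """Free at least num_blocks KV blocks by repeatedly evicting the LRU leaf."""
    # Collect the current evictable leaves into a min-heap keyed by last visit time.
    heap = []
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children.values())
        if node is not root and node.is_evictable_leaf():
            heapq.heappush(heap, (node.last_visit, id(node), node))

    freed: List[int] = []
    while heap and len(freed) < num_blocks:
        _, _, node = heapq.heappop(heap)
        freed.extend(node.block_ids)
        parent = node.parent
        # Detach the evicted leaf from its parent.
        for key, child in list(parent.children.items()):
            if child is node:
                del parent.children[key]
        # The parent may now itself become an evictable leaf.
        if parent is not root and parent.is_evictable_leaf():
            heapq.heappush(heap, (parent.last_visit, id(parent), parent))
    return freed
```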
vLLM didn't adopt the radix tree implementation because it is hard to maintain:
In SGLang, actually there is no ...
In this case
The result will be different from computing ...
Hi @grimoire, do you have any suggestions?
That's true, especially when the block size is not 1. In this PR, I want to try the block-based strategy. I guess it will take a long time to design and prototype, since I don't want to break any existing features.
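To make the block-granularity point concrete (a toy example, not the code of this PR): with a block-based strategy only completely filled blocks can be shared, so the matched prefix is truncated to a multiple of the block size and the tail tokens are prefilled again.

```python
BLOCK_SIZE = 16  # assumed block size


def reusable_prefix_length(common_prefix_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Only whole blocks can be shared; the tail of a partially filled block is recomputed."""
    return (common_prefix_len // block_size) * block_size


# Two requests share a 70-token prefix: 4 full blocks (64 tokens) are reused,
# and the remaining 6 tokens are prefilled again for the new request.
assert reusable_prefix_length(70) == 64
```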
Hi @grimoire, I would like to know: is this PR now complete and ready for normal use? Thanks.
@zhyncs Yes, this is not a draft.
ref #1407 (comment)
After sgl-project/sglang#364, the RPS of SGLang's radix tree implementation increased by nearly 10%.
Very good discussion here. ref vllm-project/vllm#2614 (comment)
Enable it by setting `shared_cache=True`.
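A hypothetical usage sketch: the `shared_cache` flag name comes from this PR, but whether it is exposed through `PytorchEngineConfig` (rather than only the internal cache config) is an assumption, and the model path is just an example.

```python
# Hypothetical usage sketch: `shared_cache` comes from this PR; exposing it on
# PytorchEngineConfig and the model path below are assumptions for illustration.
from lmdeploy import PytorchEngineConfig, pipeline

engine_config = PytorchEngineConfig(shared_cache=True)  # enable prefix caching
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)
print(pipe(['Hello, prefix caching!']))
```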