Question: Does paged attention demonstrate prefix sharing? #2354
Comments
Same question. Is there any update?

Is it related to the PR?

Thanks @franklyd, but is there any detailed document/API regarding this mechanism? For example, how exactly are the prefixes stored, how long do they last, how is matching done, etc.? New to vllm here :)

I believe issue #2614 can resolve your question! (It was also merged yesterday.)
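For context, a minimal sketch of what turning this on looks like, assuming a vLLM release that ships automatic prefix caching via the `enable_prefix_caching` flag (the model name and prompts below are placeholders):

```python
from vllm import LLM, SamplingParams

# Assumes a vLLM release that supports automatic prefix caching;
# older versions exposed prefix reuse differently.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)

# A large shared prefix, e.g. a system prompt plus a document.
shared_prefix = "You are a helpful assistant.\n" + "<document text>\n" * 100

prompts = [
    shared_prefix + "Summarize the document.",
    shared_prefix + "List the key dates mentioned.",
]

# With prefix caching on, the KV blocks for shared_prefix are computed
# once; later requests reuse them instead of re-running attention over
# the shared tokens.
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
for out in outputs:
    print(out.outputs[0].text)
```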
Reading https://arxiv.org/abs/2311.04934 and wondering if I would gain anything from a prompt cache.
My use case involves prompts with overlapping prefixes (mostly a few big ones), and I already use vLLM's paged attention.
Assume I would only want to cache KV states for prefixes (not for segments positioned anywhere, as in the paper).
Would there be any gain from caching attention prefix states, or are paged attention and vLLM indeed already doing this?
So, with paged attention, do we already skip attention over the shared inputs, or is there anything to be gained from additionally caching prefix KVs?
If it already caches across requests, what mechanism keeps kv-cache entries from being evicted?
Wondering whether there are still tweaks to be made to ensure that certain prefixes stay in the kv-cache.
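To make the matching and eviction questions concrete, here is a toy, from-scratch sketch (not vLLM's actual code) of the block-level idea behind automatic prefix caching: each fixed-size KV block is keyed by a hash of all tokens up to and including that block, so identical prefixes map to the same physical blocks, and least-recently-used blocks are evicted when the pool fills, which is what keeps hot prefixes resident:

```python
from __future__ import annotations
from collections import OrderedDict

BLOCK_SIZE = 16  # tokens per KV block, as in paged attention


class ToyPrefixCache:
    """Toy model of hash-based prefix block reuse with LRU eviction.

    Purely illustrative; a real block manager also needs ref-counting,
    copy-on-write, GPU block allocation, and care not to evict a block
    that a longer cached prefix still depends on.
    """

    def __init__(self, num_blocks: int):
        self.num_blocks = num_blocks
        # hash(all prefix tokens up to block end) -> block id, in LRU order.
        self.blocks: OrderedDict[int, int] = OrderedDict()
        self.next_block_id = 0

    def lookup_or_allocate(self, tokens: list[int]) -> list[int]:
        """Return block ids for a prompt, reusing cached prefix blocks.

        Only full blocks are considered; a trailing partial block is
        never cached.
        """
        block_ids = []
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            # The key covers *all* tokens up to here, so a block only
            # matches when the entire prefix before it matches too.
            key = hash(tuple(tokens[:end]))
            if key in self.blocks:
                self.blocks.move_to_end(key)         # cache hit: refresh LRU
            else:
                if len(self.blocks) >= self.num_blocks:
                    self.blocks.popitem(last=False)  # evict LRU block
                self.blocks[key] = self.next_block_id
                self.next_block_id += 1              # cache miss: "compute" KV
            block_ids.append(self.blocks[key])
        return block_ids


cache = ToyPrefixCache(num_blocks=8)
shared = list(range(32))                     # two blocks of shared prefix
a = cache.lookup_or_allocate(shared + [100] * 16)
b = cache.lookup_or_allocate(shared + [200] * 16)
assert a[:2] == b[:2]                        # shared prefix blocks are reused
```

Under this design, a prefix that is hit frequently keeps getting moved to the back of the LRU queue, so it stays resident as long as the block pool is not overwhelmed by other traffic; pinning specific prefixes would require an extra mechanism on top.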