[Feature]: PD separation supports prefix caching #12257

skyCreateXian · 2025-01-21T09:20:56Z

🚀 The feature, motivation and pitch

The KV transfer agent's recv_kv_caches_and_hidden_states and send_kv_caches_and_hidden_states do not support prefix caching.

This is mainly due to the following code in simple_connector.py (L159, L215):
'seq_lens = model_input.attn_metadata.seq_lens'

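When prefix caching is enabled and a request hits the cache, the matched prefix is marked as already computed, and input_tokens contains only the uncomputed part. seq_lens still holds the full sequence lengths, so it no longer matches the tokens actually present in input_tokens.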
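To make the mismatch concrete, here is a minimal sketch in plain Python. The data and the helper slice_by_lengths are hypothetical and are not taken from vLLM; they only imitate how a flattened token list might be split per sequence using seq_lens.

```python
# Hypothetical toy data, not taken from vLLM.
seq_lens = [6, 4]       # full length of each prompt
context_lens = [4, 0]   # tokens already computed thanks to a prefix-cache hit

# With prefix caching, the flattened input holds only the uncomputed tokens:
# 2 tokens for the first sequence and 4 for the second.
input_tokens = [105, 106, 201, 202, 203, 204]


def slice_by_lengths(tokens, lengths):
    """Split a flattened token list into consecutive chunks of the given lengths."""
    chunks, start = [], 0
    for n in lengths:
        chunks.append(tokens[start:start + n])
        start += n
    return chunks


# Slicing by the full seq_lens misattributes tokens: the first chunk swallows
# the second sequence's tokens and the second chunk comes out empty.
naive = slice_by_lengths(input_tokens, seq_lens)
print(naive)
```

In this toy case the first chunk contains all six tokens and the second is empty, which is the kind of misalignment the issue describes.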

Alternatives

Idea: with prefix caching enabled, prefill computes only the uncached increment, and only that increment is transferred to the decode instance. Would subtracting context_lens from seq_lens solve this problem?
'seq_lens = (model_input.attn_metadata.seq_lens_tensor - model_input.attn_metadata.context_lens_tensor).tolist()'
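The subtraction can be sketched with plain lists instead of tensors. The function new_token_ranges below is a hypothetical helper, not part of vLLM; it assumes context_lens gives the number of already-computed tokens per sequence and that the flattened input holds only the remaining tokens, in order.

```python
def new_token_ranges(seq_lens, context_lens):
    """Return (start, end) offsets of each sequence's uncomputed tokens
    inside a flattened input that contains only uncomputed tokens."""
    if len(seq_lens) != len(context_lens):
        raise ValueError("seq_lens and context_lens must have the same length")

    ranges, start = [], 0
    for seq_len, ctx_len in zip(seq_lens, context_lens):
        n_new = seq_len - ctx_len  # tokens not covered by the prefix cache
        if n_new < 0:
            raise ValueError("context length exceeds sequence length")
        ranges.append((start, start + n_new))
        start += n_new
    return ranges


# Hypothetical example: the first prompt has 4 of its 6 tokens cached,
# the second prompt has no cache hit.
print(new_token_ranges([6, 4], [4, 0]))
```

When no prefix is cached, context_lens is all zeros and the ranges reduce to slicing by seq_lens, so the change would not alter the non-cached path under these assumptions.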

Additional context

No response

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

skyCreateXian added the feature request label Jan 21, 2025

skyCreateXian added a commit to skyCreateXian/vllm that referenced this issue Jan 21, 2025

[Feature]: PD separation supports prefix caching vllm-project#12257

96b6993
