[CPU] Remove the limitation that requires to memset zero for KVCache of PagedAttention #28681
Conversation
@@ -2356,6 +2356,28 @@ struct AttentionExecutor : public PagedAttentionExecutor {
            _slot_mapping.ptr<int32_t>()[idx++] =
                block_number * _helper._block_size + block_offset % _helper._block_size;
        }
        // To simplify tails of the kernels for Q*K and W*V:
This is a WA (workaround) for the first-token kernels, which can't correctly support tails (matmul(attn_score, value), to be exact). Why isn't this WA placed near the kernel code?
It would be useful if the zero-padding logic were merged into exec_loop_mixed/pack kv, but if merged, the parallel work items would drop from (head number * zero-padded tokens) to just (head number). So keeping it here is still reasonable, since this is the centralized logic that handles the destination KV cache.
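For illustration, a minimal sketch of what such centralized tail padding could look like. Only `kv_len` and the block size appear in the diff; the buffer layout, the function name, and `bytes_per_token` are hypothetical, not the actual OpenVINO implementation:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch: zero the unused slots of the last (partial) K/V cache block so that
// block-granular first-token kernels can safely read whole blocks.
// `k_tail_block` / `v_tail_block` point at the partial block; `bytes_per_token`
// is head_size * element size. All names here are hypothetical.
void zero_pad_tail_block(uint8_t* k_tail_block,
                         uint8_t* v_tail_block,
                         size_t kv_len,
                         size_t block_size,
                         size_t bytes_per_token) {
    const size_t used = kv_len % block_size;  // valid tokens in the tail block
    if (used == 0)
        return;  // the last block is full, nothing to pad
    const size_t pad = (block_size - used) * bytes_per_token;
    std::memset(k_tail_block + used * bytes_per_token, 0, pad);
    std::memset(v_tail_block + used * bytes_per_token, 0, pad);
}
```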
LGTM!
        // W*V aka [m, k1] * [n1, k1]', there is no tails handling for n1, so tails of v_cache need to be set to
        // zero.
        // for the second token, the kernels have tail-handling logic
        if (q_len != 1 && kv_len % _helper._block_size != 0) {
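To see why zeroing the tail is sufficient for W*V: every padded slot contributes `w[i] * 0` to the accumulator, so a kernel that always iterates over full blocks produces the same output as one with tail handling. A self-contained toy illustration (plain C++ with made-up numbers, not the actual kernel):

```cpp
#include <array>
#include <cstddef>
#include <iostream>

int main() {
    constexpr size_t block_size = 4;  // pretend a block holds 4 tokens
    constexpr size_t kv_len = 3;      // only 3 valid tokens -> 1 padded slot
    // attention weights for one query row, one per token slot in the block
    std::array<float, block_size> w = {0.2f, 0.3f, 0.5f, 0.7f};  // w[3] is garbage
    // one value-cache column; the padded slot v[3] has been memset to zero
    std::array<float, block_size> v = {1.0f, 2.0f, 3.0f, 0.0f};

    float tail_aware = 0.0f, full_block = 0.0f;
    for (size_t i = 0; i < kv_len; ++i)
        tail_aware += w[i] * v[i];      // kernel with tail handling
    for (size_t i = 0; i < block_size; ++i)
        full_block += w[i] * v[i];      // tail-free kernel over the full block
    std::cout << tail_aware << " == " << full_block << '\n';  // 2.3 == 2.3
}
```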
So in serving scenarios (where prompt processing is interleaved with second-token generation) and in beam-search or speculative-decoding cases (where even the second token is processed with q_len != 1), we will have memsets on each iteration?
Details:
1 (batch) * 32 (head number) * 31 (tokens to pad) * 128 (head size) * 2 bytes (f16) * 2 (K+V) * 32 (layers) = 16.25 MB. The cost is 16.25 MB / (50 GB/s) ≈ 0.3 ms, which should have a small impact compared to the cost of the first token, typically hundreds or thousands of milliseconds.
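For reference, the same back-of-the-envelope estimate written out; the shapes and the 50 GB/s memory bandwidth are the assumptions stated above, not measurements:

```cpp
#include <cstdio>

int main() {
    // Numbers from the estimate above: assumed shapes and bandwidth.
    const double batch = 1, heads = 32, pad_tokens = 31, head_size = 128;
    const double elem_bytes = 2;  // f16
    const double kv = 2;          // both K and V caches
    const double layers = 32;
    const double bytes = batch * heads * pad_tokens * head_size * elem_bytes * kv * layers;
    const double bandwidth = 50e9;  // assumed 50 GB/s
    // Prints roughly "16.25 MB, 0.33 ms"
    std::printf("%.2f MB, %.2f ms\n", bytes / 1e6, bytes / bandwidth * 1e3);
}
```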
Tickets: