
[Web] Compatibility with PagedKVCache in WebGPU #16554

Merged — 5 commits, Feb 12, 2024

Conversation

@CharlieFRuan (Contributor) commented Feb 11, 2024

This PR introduces various WebGPU changes to accommodate the new PagedKVCache interface. All changes below are essential for making models that use PagedKVCache runnable under WebGPU:

  • Require an exact same-dtype match for WebGPU shared-memory (smem) reuse in storage_rewrite.cc
  • Rename AttentionKVCache to AttentionKVCacheLegacy for the old KVCache interface in lm_support.cc; include paged_kv_cache.cc when building wasm_runtime
  • In WebGPU codegen:
    • Declare local variables within the function scope rather than the module scope
    • Generate while (true) rather than while (1)
  • Request a maxStorageBuffersPerShaderStage limit of 10 (rather than the default 8) from the WebGPU device when initializing the runtime; this is required by the new kernels introduced with PagedKVCache
  • In deviceCopyToCPU(), when the raw bytes to write are not a multiple of 4, pad them, as required by WebGPU's writeBuffer()
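The raised storage-buffer limit is requested when the WebGPU device is created. A minimal sketch of the descriptor (plain data following the standard adapter.requestDevice() API; the surrounding initialization code is assumed, not taken from this PR):

```typescript
// Sketch of the device-limit request described above. The value 10 is what
// the PagedKVCache kernels need; the WebGPU default is 8.
const deviceDescriptor = {
  requiredLimits: {
    maxStorageBuffersPerShaderStage: 10,
  },
};

// In a browser context this would be passed as:
//   const device = await adapter.requestDevice(deviceDescriptor);
```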

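The writeBuffer() padding in the last bullet can be sketched as follows. The helper name is hypothetical; the actual deviceCopyToCPU() implementation lives in the tvmjs runtime:

```typescript
// WebGPU's GPUQueue.writeBuffer() requires the number of bytes written to be
// a multiple of 4, so raw bytes are copied into a zero-padded buffer first.
// padToMultipleOfFour is a hypothetical helper name, not the PR's own code.
function padToMultipleOfFour(raw: Uint8Array): Uint8Array {
  const paddedLength = Math.ceil(raw.length / 4) * 4;
  if (paddedLength === raw.length) {
    return raw; // already 4-byte aligned, no copy needed
  }
  const padded = new Uint8Array(paddedLength); // zero-initialized
  padded.set(raw); // copy the original bytes; the tail stays zero
  return padded;
}
```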
Co-authored-by: Rick Zhou [email protected]

@tqchen tqchen requested a review from MasterJH5574 February 12, 2024 17:34
@MasterJH5574 (Contributor) left a comment

LGTM. Thank you @CharlieFRuan and @rickzx for carrying through!

@MasterJH5574 MasterJH5574 merged commit b04b1ac into apache:main Feb 12, 2024
19 checks passed
CharlieFRuan added a commit to mlc-ai/web-llm that referenced this pull request Feb 13, 2024
PagedKVCache was introduced in MLC-LLM a while back to unify the KVCache interface. This PR makes WebLLM compatible with the new PagedKVCache interface, encapsulating it so that WebLLM users will not notice any difference.

This PR is equivalent to the changes to `llm_chat.cc` in
mlc-ai/mlc-llm#1651, and should address issues
like mlc-ai/mlc-llm#1628.

There are still model-compilation issues regarding `workgroup_size` (since WebGPU, unlike most other backends, supports at most 256 threads per workgroup by default). We will address this issue more elegantly soon; for now, compiling Llama-based models requires manually changing kernel sizes as shown in [this branch](https://github.com/CharlieFRuan/mlc-llm/tree/local-workgroupSize-webLLM-kvCache).
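As a hedged illustration of that constraint (not code from either PR): WebGPU's default maxComputeInvocationsPerWorkgroup limit is 256, so a kernel's 3-D workgroup size must satisfy x * y * z <= 256:

```typescript
// Hypothetical check illustrating WebGPU's default limit of 256 total
// threads per workgroup (maxComputeInvocationsPerWorkgroup). Kernels tuned
// for other backends often use larger workgroups and must be re-tuned.
function fitsDefaultWebGPUWorkgroup(x: number, y: number, z: number): boolean {
  const MAX_INVOCATIONS = 256; // WebGPU default limit
  return x * y * z <= MAX_INVOCATIONS;
}
```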

This PR is also largely dependent on
apache/tvm#16554.