
[Web] Compatibility with PagedKVCache in WebGPU #16554

Merged — 5 commits, Feb 12, 2024

Conversation

@CharlieFRuan (Contributor) commented Feb 11, 2024

This PR introduces various WebGPU changes to accommodate the new PagedKVCache interface. All changes below are essential for making models that use PagedKVCache runnable under WebGPU:

  • Require an exact same-dtype match for WebGPU shared-memory (smem) reuse in storage_rewrite.cc
  • Rename AttentionKVCache to AttentionKVCacheLegacy for the old KVCache interface in lm_support.cc; include paged_kv_cache.cc when building wasm_runtime
  • In WebGPU codegen:
    • Declare local variables within the function scope rather than the module scope
    • Generate while (true) rather than while (1)
  • Request a maxStorageBuffersPerShaderStage limit of 10 (rather than the default 8) from the WebGPU device when initializing the runtime; this is required by the new kernels introduced with PagedKVCache
  • In deviceCopyToCPU(), when the raw bytes to write are not a multiple of 4, pad them, as required by WebGPU's writeBuffer()
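The raised storage-buffer limit is requested when the WebGPU device is created. A minimal sketch of the descriptor (plain data following the standard adapter.requestDevice() API; the surrounding initialization code is assumed, not taken from this PR):

```typescript
// Sketch of the device-limit request described above. The value 10 is what
// the PagedKVCache kernels need; the WebGPU default is 8.
const deviceDescriptor = {
  requiredLimits: {
    maxStorageBuffersPerShaderStage: 10,
  },
};

// In a browser context this would be passed as:
//   const device = await adapter.requestDevice(deviceDescriptor);
```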

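The writeBuffer() padding in the last bullet can be sketched as follows. The helper name is hypothetical; the actual deviceCopyToCPU() implementation lives in the tvmjs runtime:

```typescript
// WebGPU's GPUQueue.writeBuffer() requires the number of bytes written to be
// a multiple of 4, so raw bytes are copied into a zero-padded buffer first.
// padToMultipleOfFour is a hypothetical helper name, not the PR's own code.
function padToMultipleOfFour(raw: Uint8Array): Uint8Array {
  const paddedLength = Math.ceil(raw.length / 4) * 4;
  if (paddedLength === raw.length) {
    return raw; // already 4-byte aligned, no copy needed
  }
  const padded = new Uint8Array(paddedLength); // zero-initialized
  padded.set(raw); // copy the original bytes; the tail stays zero
  return padded;
}
```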
Co-authored-by: Rick Zhou [email protected]

@tqchen tqchen requested a review from MasterJH5574 February 12, 2024 17:34
@MasterJH5574 (Contributor) left a comment

LGTM. Thank you @CharlieFRuan and @rickzx for carrying through!

@MasterJH5574 MasterJH5574 merged commit b04b1ac into apache:main Feb 12, 2024
19 checks passed
CharlieFRuan added a commit to mlc-ai/web-llm that referenced this pull request Feb 13, 2024
PagedKVCache was introduced in MLC-LLM a while back to unify the KVCache interface. This PR makes WebLLM compatible with the new PagedKVCache interface, encapsulating it so that WebLLM users will not notice any difference.

This PR is equivalent to the changes to `llm_chat.cc` in
mlc-ai/mlc-llm#1651, and should address issues
like mlc-ai/mlc-llm#1628.

There are still model-compilation issues regarding `workgroup_size` (since WebGPU, unlike most other backends, supports at most 256 threads per workgroup by default). We will address this issue more elegantly soon; for now, compiling Llama-based models requires manually changing kernel sizes as shown in [this branch](https://github.com/CharlieFRuan/mlc-llm/tree/local-workgroupSize-webLLM-kvCache).
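As a hedged illustration of that constraint (not code from either PR): WebGPU's default maxComputeInvocationsPerWorkgroup limit is 256, so a kernel's 3-D workgroup size must satisfy x * y * z <= 256:

```typescript
// Hypothetical check illustrating WebGPU's default limit of 256 total
// threads per workgroup (maxComputeInvocationsPerWorkgroup). Kernels tuned
// for other backends often use larger workgroups and must be re-tuned.
function fitsDefaultWebGPUWorkgroup(x: number, y: number, z: number): boolean {
  const MAX_INVOCATIONS = 256; // WebGPU default limit
  return x * y * z <= MAX_INVOCATIONS;
}
```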

This PR is also largely dependent on
apache/tvm#16554.