Implement single_query_cached_kv_attention kernel #3

Merged: 14 commits, Mar 1, 2023

WoosukKwon (Collaborator)

This PR adds the single_query_cached_kv_attention kernel.

Supported data types:

  • half
  • float

Tested models:

  • OPT-125M
  • OPT-350M
  • OPT-1.3B
  • OPT-2.7B
  • OPT-6.7B
  • OPT-13B

Tested GPUs:

  • A100
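
For readers unfamiliar with the operation the kernel's name suggests, here is a minimal pure-Python reference sketch of attention for a single query vector over a block-organized key/value cache. It is not the CUDA kernel from this PR: the function name `reference_single_query_attention`, the cache layout (a list of fixed-size blocks), and the `block_table` / `context_len` / `block_size` parameters are all assumptions made for illustration.

```python
import math


def reference_single_query_attention(query, key_blocks, value_blocks,
                                     block_table, context_len, block_size):
    """Illustrative attention for ONE query vector of ONE head.

    Assumed (hypothetical) layout, not taken from the PR:
      query:        list of head_size floats
      key_blocks:   key_blocks[b][i] is the key vector in slot i of block b
      value_blocks: same layout as key_blocks, holding value vectors
      block_table:  physical block index for each logical block of the sequence
      context_len:  number of valid cached tokens
    """
    head_size = len(query)
    scale = 1.0 / math.sqrt(head_size)

    def slot(pos):
        # Map a logical token position to (physical block, offset in block).
        return block_table[pos // block_size], pos % block_size

    # 1) Scaled dot product between the query and every cached key.
    logits = []
    for pos in range(context_len):
        b, off = slot(pos)
        key = key_blocks[b][off]
        logits.append(scale * sum(q * k for q, k in zip(query, key)))

    # 2) Numerically stable softmax over the cached positions.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 3) Probability-weighted sum of the cached values.
    out = [0.0] * head_size
    for pos, p in enumerate(probs):
        b, off = slot(pos)
        value = value_blocks[b][off]
        for i in range(head_size):
            out[i] += p * value[i]
    return out
```

A GPU kernel would typically parallelize these loops across threads and read the cache in half or float precision; the sketch only pins down the arithmetic that such a kernel could be checked against.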

WoosukKwon merged commit 0deacbc into main Mar 1, 2023
WoosukKwon deleted the attention-kernel branch March 1, 2023 23:11
v1nc3nt27 pushed a commit to v1nc3nt27/vllm that referenced this pull request Sep 12, 2023
xiangyuT pushed a commit to xiangyuT/vllm that referenced this pull request Oct 18, 2023
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
Spycsh pushed a commit to Spycsh/vllm that referenced this pull request Feb 27, 2024
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 12, 2024
Passing alibi_slopes and sliding_window to PagedAttention extension
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 20, 2024
mujjingun added a commit to gmlwns2000/vllm-timber that referenced this pull request Apr 15, 2024
mzusman added a commit to mzusman/vllm that referenced this pull request Apr 16, 2024
* Remove assertion

* adapting jamba vllm to changes after hf release, working on weight loading in modeling file

* splitting the JambaDecoderLayer to JambaMambaDecoderLayer and JambaAttentionDecoderLayer

* weight loading from hf checkpoint supposedly works, might be a mixup in the MoE between the gated and non-gated weights

* Add mamba from jamba modeling file

* Remove slow forward

* Modifications to mamba_mixer

* Save changes, WIP

* Fix cache placement

* Debugging

* Additions and logging

* Jamba with mamba cache handling

* Clean up

* Another cleanup

* Use vllm's RMSNorm instead of JambaRMSNorm; their implementation uses a fused kernel

* Clean up and organization of the objects to handle the mamba cache

* Shorten the code for kv cache mem

* Move cache handling inside the Mixer

* Add mamba to the wheel requirements

* Add mamba to the requirements script

* Add mamba_metadata

* Add to __init__ __all__

* Revert 2 commits

ad1a3db 'Add mamba to the requirements script'
75ed2c8 'Add mamba to the wheel requirements'

* Clean up

* Naming

* Apply whitespace suggestions from code review

* pass tie_word_embeddings to PretrainedConfig init

* Replace repeat with expand as expand doesn't require more mem

* Allocate really small cache if needed, don't use meta

* Fix for expanded

---------

Co-authored-by: Mor Zusman <[email protected]>
Co-authored-by: Erez Schwartz <[email protected]>
Co-authored-by: tomeras91 <[email protected]>
linxihui pushed a commit to linxihui/vllm that referenced this pull request May 14, 2024
…ope-type

minor change for LongRoPE config to account for rename from longrope …
alixiaodi mentioned this pull request Aug 2, 2024
PanJason added a commit to PanJason/vllm that referenced this pull request Sep 21, 2024
* Add example disk swap config. Add unit tests for CC with memory tiering

* Layered transfer for DRAM. Transfer in cuda streams

* Fix the missing arg

* Fix context caching online serving

This commit enables layered transmission for DRAM first. Now the
transmission is done in different cuda streams. xformers, flash infer
and flash attention are supported. Optimized transfer for disk
is still pending.

Cherry-pick Yangshen's commit

---------

Co-authored-by: yangshen <[email protected]>
zeroorhero pushed a commit to zeroorhero/vllm that referenced this pull request Sep 23, 2024
Optimize the KV transfer pipe implementation
MengqingCao pushed a commit to MengqingCao/vllm that referenced this pull request Jan 24, 2025