Fix the rushed out multi-query kernel #44

Closed
zhuohan123 opened this issue Apr 22, 2023 · 2 comments

@zhuohan123
Member

  1. Fix the correctness issue in the current FlashAttention-copy-based kernel. Make sure we call the FlashAttention kernel correctly. Evaluate the performance of this kernel.
  2. Reduce the memory usage of the current kernel by limiting the buffer size and calling the kernel multiple times.
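
As a rough illustration of item 2, the sketch below bounds the scratch buffer by splitting the queries into fixed-size chunks and calling a kernel once per chunk. It is only a sketch under stated assumptions: `attention_kernel` is a hypothetical pure-Python stand-in for the real kernel (plain softmax attention, no causal mask, no paged KV cache), and `chunked_attention` and `max_chunk_queries` are illustrative names, not part of the project's API.

```python
import math


def attention_kernel(queries, keys, values):
    """Hypothetical stand-in for the real kernel: plain softmax attention.

    Builds a scores buffer of len(queries) x len(keys) floats, so its
    memory use grows with the number of queries passed in one call.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    # The scratch buffer whose size we want to bound.
    scores = [
        [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        for q in queries
    ]
    dim = len(values[0])
    outputs = []
    for row in scores:
        peak = max(row)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in row]
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)
        ])
    return outputs


def chunked_attention(queries, keys, values, max_chunk_queries):
    """Call the kernel once per query chunk so the buffer stays bounded.

    Each call sees at most max_chunk_queries queries, so the scores
    buffer never exceeds max_chunk_queries x len(keys) entries.
    """
    if max_chunk_queries < 1:
        raise ValueError("max_chunk_queries must be at least 1")
    outputs = []
    for start in range(0, len(queries), max_chunk_queries):
        chunk = queries[start:start + max_chunk_queries]
        outputs.extend(attention_kernel(chunk, keys, values))
    return outputs
```

Because each query's output depends only on that query and the full set of keys and values, the chunked result should match a single large call; the cost is extra kernel launches in exchange for a smaller peak buffer.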
@tmm1
Contributor

tmm1 commented Aug 3, 2023

> current FlashAttention-copy-based kernel

I believe this refers to #4; however, as of #70, flash-attn is no longer used.

@hmellor
Collaborator

hmellor commented Mar 8, 2024

Closing based on @tmm1's comment about flash-attn no longer being used.

@hmellor hmellor closed this as completed Mar 8, 2024