Fix the rushed out multi-query kernel #44

zhuohan123 · 2023-04-22T03:49:18Z

- Fix the correctness issue in the current FlashAttention-copy-based kernel. Make sure we call the FlashAttention kernel correctly. Evaluate the performance of this kernel.
- Reduce the memory usage of the current kernel by limiting the buffer size and calling the kernel multiple times.
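
The memory-reduction idea above (limit the buffer size and call the kernel multiple times) could look roughly like the following. This is only a sketch, not the project's kernel: `attention_rows`, `chunked_attention`, and `max_rows` are hypothetical names, the "kernel" is a plain-Python stand-in for single-head attention, and causal masking and the KV-cache layout are ignored. It relies on each query row being independent of the others, so splitting the queries into chunks does not change the result.

```python
import math


def attention_rows(queries, keys, values):
    """Hypothetical stand-in for the real kernel: softmax(q.k / sqrt(d)) weighted sum of values."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))
        ])
    return out


def chunked_attention(queries, keys, values, max_rows):
    """Call the kernel several times so that no call sees more than max_rows query rows."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    out = []
    for start in range(0, len(queries), max_rows):
        # Only this slice needs an intermediate buffer during the call.
        out.extend(attention_rows(queries[start:start + max_rows], keys, values))
    return out
```

The trade-off in this sketch is more kernel launches in exchange for a bounded intermediate buffer; whether that pays off would still need the performance evaluation mentioned in the first item.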

tmm1 · 2023-08-03T17:31:26Z

> current FlashAttention-copy-based kernel

I believe this refers to #4; however, as of #70, flash-attn is no longer used.

hmellor · 2024-03-08T10:19:19Z

Closing based on @tmm1's comment about flash-attn no longer being used.

WoosukKwon self-assigned this May 2, 2023

zhuohan123 mentioned this issue Jun 25, 2023

[Roadmap] vLLM Development Roadmap: H2 2023 #244

Closed

76 tasks

shanshanpt mentioned this issue Nov 17, 2023

Run long conetxt error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this issue Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hmellor closed this as completed Mar 8, 2024

yuhuixu1993 mentioned this issue Jun 2, 2024

[Bug]: loading squeezellm model #5190

Closed

yukavio pushed a commit to yukavio/vllm that referenced this issue Jul 3, 2024

Update setup.py naming (vllm-project#44)

a477771

ZHJ19970917 mentioned this issue Jul 14, 2024

[Bug]: When using qwen-32b-chat-awq with multi-threaded access, errors occur after approximately several hundred visits.”vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.“ #6421

Closed

alixiaodi mentioned this issue Aug 2, 2024

[Bug]: #7072

Closed
