
Replace FlashAttention with xformers #70

Merged 12 commits into main on May 5, 2023
Conversation

WoosukKwon (Collaborator)

This PR replaces FlashAttention with xformers.

Pros:

  • Richer features & higher compatibility. xformers supports attention bias, FP32, head size 256, and old GPUs (such as V100) while FlashAttention does not.
  • xformers provides pre-compiled Python wheels, while FlashAttention compiles all of its CUDA code during installation.
  • Future-proof, as the repository is maintained by many developers from Meta.

Cons:

  • xformers can be slower than FlashAttention for small inputs, because it incurs higher CPU overheads.
  • xformers internally creates a new tensor for the attention output. In our case, this leads to an extra copy overhead, because we concatenate the outputs of the two attention ops (see the sketch after this list).
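
A minimal sketch of the second drawback, assuming xformers' memory_efficient_attention API and hypothetical tensor shapes (illustrative only, not the PR's actual code):

```python
import torch
import xformers.ops as xops

# Hypothetical shapes for illustration only.
batch, seq_len, num_heads, head_size = 2, 128, 8, 64
q = torch.randn(batch, seq_len, num_heads, head_size,
                device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Pre-allocated buffer meant to hold the concatenated outputs of two attention ops.
out_buffer = torch.empty(2, batch, seq_len, num_heads, head_size,
                         device="cuda", dtype=torch.float16)

# xformers allocates and returns a fresh output tensor...
attn_out = xops.memory_efficient_attention(q, k, v)

# ...so writing it into the shared buffer requires an extra copy.
out_buffer[0].copy_(attn_out)
```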

@WoosukKwon requested a review from @zhuohan123 on May 4, 2023 at 10:31
-pip install sentencepiece # Required for LlamaTokenizer.
-pip install ninja # To parallelize the compilation of flash-attn.
-pip install flash-attn # This may take up to 10 mins.
+pip install ninja psutil numpy sentencepiece ray torch transformers xformers
WoosukKwon (Collaborator, Author):

TODO (in the next PR): specify the exact dependencies in setup.py.
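
A rough sketch of what that could look like, using a hypothetical package name and the dependency list from the install command above (the real list and version pins were deferred to the follow-up PR):

```python
from setuptools import setup, find_packages

setup(
    name="cacheflow",  # hypothetical; use the project's actual package name
    packages=find_packages(),
    install_requires=[
        "ninja",
        "psutil",
        "numpy",
        "sentencepiece",
        "ray",
        "torch",
        "transformers",
        "xformers",
    ],
)
```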

@zhisbug (Collaborator) commented May 4, 2023

Is the memory footprint the same as with FlashAttention?

@zhisbug (Collaborator) commented May 5, 2023

I did a test myself and found the memory saving is almost the same.
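
For context, a comparison like this can be sketched with PyTorch's peak-memory counters. This is illustrative only, not necessarily how the test above was run; run_with_xformers and run_with_flash_attn are hypothetical callables:

```python
import torch

def peak_memory_gib(run_generation):
    """Run a generation workload and report peak GPU memory in GiB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    run_generation()  # e.g. a fixed batch of prompts through the server
    return torch.cuda.max_memory_allocated() / 1024**3

# Compare the two backends by running the same workload under each build:
# print(peak_memory_gib(run_with_xformers), peak_memory_gib(run_with_flash_attn))
```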

@WoosukKwon (Collaborator, Author)

It seems the memory usage is comparable to FlashAttention's. @zhuohan123, please review the PR.

@zhuohan123 (Member) left a comment

LGTM! Thanks!

@@ -213,7 +213,7 @@ def add_server_arguments(parser: argparse.ArgumentParser):
parser.add_argument('--use-np-cache', action='store_true',
help='save a numpy copy of model weights for faster loading')
parser.add_argument('--use-dummy-weights', action='store_true', help='use dummy values for model weights')
-# NOTE(woosuk): FlashAttention does not support float32.
+# TODO(woosuk): Support FP32 for debugging.
@zhuohan123 (Member):

Does xformers support FP32?

WoosukKwon (Collaborator, Author):

Yes, it does. It is our own attention kernel that does not support FP32; more precisely, it currently does not support some block sizes when FP32 is used. I will fix this in the future.
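
A quick way to confirm xformers' FP32 support (a sketch, not code from this PR):

```python
import torch
import xformers.ops as xops

# FP32 query/key/value; xformers dispatches to an op that supports float32,
# as the PR description notes.
q = torch.randn(1, 16, 4, 64, device="cuda", dtype=torch.float32)
k, v = torch.randn_like(q), torch.randn_like(q)

out = xops.memory_efficient_attention(q, k, v)
print(out.dtype)  # torch.float32
```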

@WoosukKwon mentioned this pull request May 5, 2023
@WoosukKwon merged commit c9d5b6d into main May 5, 2023
@WoosukKwon deleted the xformers branch May 5, 2023 09:01
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
yukavio pushed a commit to yukavio/vllm that referenced this pull request Jul 3, 2024
SUMMARY:
for Apache License 2.0, Section 4(b): "You must cause any modified files to carry prominent
notices stating that You changed the files"
https://www.apache.org/licenses/LICENSE-2.0

TEST PLAN:
GHA
dllehr-amd pushed a commit to dllehr-amd/vllm that referenced this pull request Jul 22, 2024
* Enabling some basic tests for ROCm 6.2

Use strict xfail for ROCm 6.2 test repairs

* Use lenient xfail instead

---------

Co-authored-by: Alexei V. Ivanov <[email protected]>
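
The commit message above distinguishes strict from lenient xfail; in pytest terms the difference looks roughly like this (test names and reasons are illustrative):

```python
import pytest

@pytest.mark.xfail(strict=True, reason="known ROCm 6.2 failure")
def test_strict_case():
    # An unexpected pass fails the test run.
    ...

@pytest.mark.xfail(strict=False, reason="flaky on ROCm 6.2")
def test_lenient_case():
    # An unexpected pass is only reported as XPASS.
    ...
```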