Replace FlashAttention with xformers #70
Conversation
-pip install sentencepiece  # Required for LlamaTokenizer.
-pip install ninja  # To parallelize the compilation of flash-attn.
-pip install flash-attn  # This may take up to 10 mins.
+pip install ninja psutil numpy sentencepiece ray torch transformers xformers
TODO (in the next PR): specify the exact dependencies in setup.py.
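As a rough illustration of that follow-up, here is a minimal sketch of how the dependencies from the README diff above could be declared in setup.py; the project name and the absence of version pins are assumptions, not what the repository actually ships:

```python
# setup.py (sketch; package list taken from the README diff above,
# project name and lack of version bounds are illustrative assumptions)
from setuptools import setup, find_packages

setup(
    name="cacheflow",  # assumed project name, for illustration only
    packages=find_packages(),
    install_requires=[
        "ninja",          # speeds up compilation of the CUDA kernels
        "psutil",
        "numpy",
        "sentencepiece",  # required for LlamaTokenizer
        "ray",
        "torch",
        "transformers",
        "xformers",
    ],
)
```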
Is the memory footprint the same as with FlashAttention?
I did a test myself and found the memory savings are almost the same.
It seems the memory usage is comparable to FlashAttention's. @zhuohan123 Please review the PR.
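For reference, a hedged sketch of how such a comparison could be run: it measures peak GPU memory of xformers' memory-efficient attention against a naive PyTorch attention that materializes the full score matrix. The shapes are illustrative assumptions and this is not the benchmark used in the thread.

```python
# Sketch: compare peak GPU memory of xformers attention vs. naive attention.
import torch
import xformers.ops as xops

B, S, H, D = 1, 2048, 32, 128  # batch, seq len, heads, head dim (assumed)
q = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

def peak_mem_mib(fn):
    torch.cuda.reset_peak_memory_stats()
    fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

def naive_attention():
    # Materializes the full S x S attention matrix.
    qh, kh, vh = (t.permute(0, 2, 1, 3) for t in (q, k, v))  # (B, H, S, D)
    scores = qh @ kh.transpose(-1, -2) / D**0.5
    return torch.softmax(scores, dim=-1) @ vh

def xformers_attention():
    return xops.memory_efficient_attention(q, k, v)

print("naive   :", peak_mem_mib(naive_attention), "MiB")
print("xformers:", peak_mem_mib(xformers_attention), "MiB")
```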
LGTM! Thanks!
@@ -213,7 +213,7 @@ def add_server_arguments(parser: argparse.ArgumentParser):
     parser.add_argument('--use-np-cache', action='store_true',
                         help='save a numpy copy of model weights for faster loading')
     parser.add_argument('--use-dummy-weights', action='store_true', help='use dummy values for model weights')
-    # NOTE(woosuk): FlashAttention does not support float32.
+    # TODO(woosuk): Support FP32 for debugging.
Does xformers support FP32?
Yes, it does. It is our attention kernel that does not support FP32; more precisely, it currently does not support some block sizes when FP32 is used. I will fix this in the future.
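To make that limitation explicit, here is a hedged sketch of a Python-side guard one could add; the set of supported block sizes is an assumption for illustration, not the kernel's actual support matrix:

```python
import torch

# Assumed for illustration: block sizes the custom attention kernel handles in FP32.
_FP32_SUPPORTED_BLOCK_SIZES = {8, 16}


def check_attention_config(dtype: torch.dtype, block_size: int) -> None:
    """Reject (dtype, block_size) combinations the kernel cannot handle yet."""
    if dtype == torch.float32 and block_size not in _FP32_SUPPORTED_BLOCK_SIZES:
        raise ValueError(
            f"FP32 attention is not supported for block size {block_size}; "
            f"use FP16/BF16 or one of {sorted(_FP32_SUPPORTED_BLOCK_SIZES)}."
        )
```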
This PR replaces FlashAttention with xformers.
Pros:
- Simpler installation: no need to build flash-attn from source, which can take up to 10 minutes.
- Memory usage is comparable to FlashAttention's (discussed in this thread).
Cons:
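For context, a minimal sketch of what the replacement looks like at the call site, assuming standard causal self-attention over half-precision CUDA tensors; this is illustrative and not the PR's actual attention module:

```python
import torch
import xformers.ops as xops

def causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q, k, v: (batch, seq_len, num_heads, head_dim) half-precision CUDA tensors."""
    # xformers' memory-efficient attention stands in for the flash-attn call;
    # LowerTriangularMask applies the causal mask without materializing it.
    return xops.memory_efficient_attention(
        q, k, v, attn_bias=xops.LowerTriangularMask()
    )

# Usage (shapes are illustrative assumptions):
q = torch.randn(2, 1024, 16, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
out = causal_attention(q, k, v)  # (2, 1024, 16, 64)
```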