Add flash attention support for inference #367
Merged
Description
Flash-attention released an update that optimizes for inference. This PR adds the adaptation for speeding up inference with flash-attention.
Specifically, this PR includes:
- Add flash attention support for `inference_hf.py` and `gradio_demo.py` by adding the `--flash_attn` parameter (see the sketch after this list).
- Add `padding_mask` to the flash attention and xformers patches. This parameter was added to the `LlamaAttention.forward` function in transformers v4.34 (see the signature sketch after this list).
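A minimal sketch of how the `--flash_attn` flag could be wired into the inference scripts; the module name `flash_attn_patch` and the function `replace_llama_attn_with_flash_attn` are illustrative assumptions, not necessarily the names used in this PR:

```python
# Sketch: add a --flash_attn switch and apply the attention patch before model loading.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--flash_attn', action='store_true',
                    help="Use flash attention to speed up inference")
args = parser.parse_args()

if args.flash_attn:
    # Apply the monkey patch before the model is instantiated so that
    # LlamaAttention.forward is replaced with the flash-attention version.
    # (Module and function names here are hypothetical.)
    from flash_attn_patch import replace_llama_attn_with_flash_attn
    replace_llama_attn_with_flash_attn()
```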
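For the second item, a sketch of the signature change implied by transformers v4.34, which passes an extra `padding_mask` keyword to `LlamaAttention.forward`; a monkey-patched forward therefore has to accept it to avoid a `TypeError`. The body below is a placeholder, not this PR's implementation:

```python
from typing import Optional, Tuple
import torch

def patched_llama_attention_forward(
    self,
    hidden_states: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_value: Optional[Tuple[torch.Tensor]] = None,
    output_attentions: bool = False,
    use_cache: bool = False,
    padding_mask: Optional[torch.Tensor] = None,  # new keyword in transformers v4.34
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
    # ... flash-attention / xformers computation would go here ...
    raise NotImplementedError  # placeholder body for illustration only
```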
Related Issue
Add `padding_mask`: #326