Add flash attention support for inference #367
Merged
Description
Flash-attention released an update that optimizes for inference. This PR adds the adaptation for speeding up inference with flash-attention.
Specifically, this PR includes:
- Add flash attention support for `inference_hf.py` and `gradio_demo.py` by adding the `--flash_attn` parameter (see the sketch after this list).
- Add `padding_mask` to the flash attention and xformers patches. This parameter was added to the `LlamaAttention.forward` function in transformers v4.34 (see the signature sketch after this list).
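A minimal sketch of how the `--flash_attn` flag could be wired into the inference scripts; the module name `flash_attn_patch` and the function `replace_llama_attn_with_flash_attn` are illustrative assumptions, not necessarily the names used in this PR:

```python
# Sketch: add a --flash_attn switch and apply the attention patch before model loading.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--flash_attn', action='store_true',
                    help="Use flash attention to speed up inference")
args = parser.parse_args()

if args.flash_attn:
    # Apply the monkey patch before the model is instantiated so that
    # LlamaAttention.forward is replaced with the flash-attention version.
    # (Module and function names here are hypothetical.)
    from flash_attn_patch import replace_llama_attn_with_flash_attn
    replace_llama_attn_with_flash_attn()
```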
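For the second item, a sketch of the signature change implied by transformers v4.34, which passes an extra `padding_mask` keyword to `LlamaAttention.forward`; a monkey-patched forward therefore has to accept it to avoid a `TypeError`. The body below is a placeholder, not this PR's implementation:

```python
from typing import Optional, Tuple
import torch

def patched_llama_attention_forward(
    self,
    hidden_states: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_value: Optional[Tuple[torch.Tensor]] = None,
    output_attentions: bool = False,
    use_cache: bool = False,
    padding_mask: Optional[torch.Tensor] = None,  # new keyword in transformers v4.34
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
    # ... flash-attention / xformers computation would go here ...
    raise NotImplementedError  # placeholder body for illustration only
```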
Related Issue
Add `padding_mask`: #326