Up to 2x speedup on GPUs using memory efficient attention #532
Conversation
Hi @MatthieuTPHR, nice PR! On the xFormers side, we are working on improving the packaging so that it can be more easily installed by users, while also shipping pre-compiled binaries. We are also continuing to optimize the kernel for some configurations; we will keep the …
And about more optimized kernels for K=40: @danthe3rd has been looking very closely at further optimizations and has some ideas for optimizing our current kernels for smaller K. I'll let him chime in, but contributions are more than welcome!
I will put this PR in draft until the dependency issues are resolved.
Hey @MatthieuTPHR, thanks a lot for opening the PR - it looks very cool! Trying it out now :-) Generally we're quite careful about not adding new dependencies to …
Hi @MatthieuTPHR - this looks like a great improvement!
We've been improving the forward (including fairly recently, in facebookresearch/xformers#388 for instance). Do you mind sharing the other parameters you use (datatype, sequence length, number of heads) so we can add them to our benchmarks?
I've tried running the code in this PR, but I'm getting the following error when installing:

Do I need a specific version of triton? cc @MatthieuTPHR
Typically with stable diffusion, a 512x512 input yields a 64x64 latent space -> 4096 tokens (but higher resolutions would be even better). Edit: folding the number of heads into the batch dimension, to better give a sense of the tensor sizes in practice.
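As a back-of-the-envelope check of those numbers, here is a small sketch. The head count (8) and head dimension (40) are assumptions based on the SD v1 UNet defaults; head-dim=40 is also quoted later in this thread.

```python
# Back-of-the-envelope sketch of the attention sizes discussed above.
# Assumptions: VAE downscale factor 8, 8 attention heads, head dim 40 (SD v1 defaults).
image_size = 512
downscale = 8
heads, head_dim = 8, 40

seq_len = (image_size // downscale) ** 2             # 64 * 64 = 4096 tokens
batch_with_heads = 1 * heads                         # heads folded into the batch dimension
q_shape = (batch_with_heads, seq_len, head_dim)      # (8, 4096, 40)

# The naive attention path materializes a full (seq_len x seq_len) score matrix per batch*head.
attn_scores = batch_with_heads * seq_len * seq_len
print(q_shape, f"~{attn_scores * 2 / 2**20:.0f} MiB of fp16 attention scores")  # ~256 MiB
```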
I use … For 1024x1024 on the A6000 I get 4 iterations per second.
Specifically … Edit: but no triton installed should also work, actually.
I'm testing on a
Note: When I import xformers I'm getting:
With this setup I'm running:

from diffusers import StableDiffusionPipeline
import numpy as np
import torch

model_id = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    use_auth_token=True,
)
pipe.to("cuda")

prompt = "A fantasy landscape, trending on artstation"
generator = torch.Generator(device="cuda").manual_seed(0)

with torch.autocast("cuda"):
    output = pipe(prompt=prompt, guidance_scale=7.5, generator=generator, output_type="np")

print(np.sum(np.abs(output.images[:3, :3, :3, :3])))

mem_bytes = torch.cuda.max_memory_allocated()
print(mem_bytes)
And I'm getting the exact same speed. Any ideas what could be the problem here?
It looks like the CUDA extensions may not have been compiled when installing xformers. Can you try running:

import torch, xformers.ops
print(torch.ops.xformers.efficient_attention_forward_generic)

and see if it prints something like … If that doesn't print what I mentioned, there are a few reasons why this might not be compiled.
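For convenience, a small self-contained version of that check (a sketch; the op name is the one quoted above, and how PyTorch reports a missing custom op may vary by version):

```python
# Diagnostic sketch: verify the xformers CUDA extensions were actually built.
import torch
import xformers.ops  # noqa: F401  # importing registers the custom ops if they were compiled

try:
    op = torch.ops.xformers.efficient_attention_forward_generic
    print("xformers CUDA extensions are available:", op)
except (AttributeError, RuntimeError):
    print("xformers appears to have been installed without its CUDA extensions")
```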
Hi @patrickvonplaten, here is my full setup:
Once this is done I see the following speedup: 10 iterations per second without it and 21 with it, both at 512x512 in fp16.
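For anyone wanting to reproduce the iterations-per-second comparison, here is a rough timing sketch. It is not the author's exact benchmark and times the whole pipeline call (text encoding and VAE decoding included), so it only approximates the per-step rate; the fp16 revision and auth token usage reflect the era of this PR.

```python
# Rough benchmark sketch: approximate it/s for a 512x512 fp16 generation.
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
    use_auth_token=True,
).to("cuda")

num_steps = 50
torch.cuda.synchronize()
start = time.time()
pipe("A fantasy landscape, trending on artstation", num_inference_steps=num_steps)
torch.cuda.synchronize()
elapsed = time.time() - start
print(f"~{num_steps / elapsed:.1f} it/s")  # compare with and without the memory efficient attention path
```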
Thank you @MatthieuTPHR, super excited to see ideas on fast & memory-efficient attention having an impact!
Hey @fmassa, when running:

import torch, xformers.ops
print(torch.ops.xformers.efficient_attention_forward_generic)

I'm getting:

There already seems to be a problem, I guess? It would be really nice if we could somehow show the community that it's easy to install and use :-)
@patrickvonplaten this means that indeed xformers was compiled without the CUDA extensions.
Yes, I totally agree, and we are working on that :-) If you are not compiling xformers on a machine with CUDA (i.e., if you …
We are using the default parameters from the CompVis repo; I believe the parameters are as follows:
The sequence length could also be higher if we use a 1024x1024 or 2048x2048 input. The downscale factor between the input and the latent space is 8.
Does it require a GPU with tensor cores (RTX 20 series and above)? Then:
@TheLastBen what is your GPU model? xformers supports architectures above sm60 (P100+), and possibly above sm50 (untested). The most important speedups are achieved on GPUs with tensor cores (sm70+, i.e. V100 and later), but that's not a requirement.
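A quick way to check which of the architecture thresholds mentioned above your GPU falls under (a sketch; requires a CUDA-enabled PyTorch install):

```python
# Print the compute capability (sm version) of the current GPU.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"sm{major}{minor}", torch.cuda.get_device_name())
# sm70+ (V100 and later) has tensor cores and sees the largest speedups;
# sm60 (P100) is supported; anything older is untested per the comment above.
```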
@danthe3rd I have a GTX 1070 Ti; the message is coming from Triton, so I don't think it's the main cause of the crash.
These are two different topics:
@NouamaneTazi the CI test errors seem unrelated to the code on this branch:

Do you know if this is common to other PRs?
@MatthieuTPHR yes, the failing tests are unrelated, merging this now! Thanks a lot @MatthieuTPHR @NouamaneTazi and the …
@MatthieuTPHR I'd certainly be interested in seeing the AITemplate; I couldn't get it going locally due to OOM during building. https://github.com/microsoft/DeepSpeed-MII - txt2img example here: https://github.com/microsoft/DeepSpeed-MII/tree/main/examples/local. Note for xformers: triton seems to give another 0.5 it/s locally on 512x512 images; unfortunately I can't get it compiled for Windows conda locally, I only got it working on WSL2 Ubuntu. For xformers you need to use a pinned version, since they have shuffled the location of some functions.
xformers now has dev conda packages for some combinations of Python, CUDA and PyTorch. See the README.
@MatthieuTPHR This PR for DeepSpeed MII is a nice tutorial for SD using DeepSpeed MII and lays out some of the optimizations they do.
I have the following issues with xformers:
… (#532)

* 2x speedup using memory efficient attention
* remove einops dependency
* Swap K, M in op instantiation
* Simplify code, remove unnecessary maybe_init call and function, remove unused self.scale parameter
* make xformers a soft dependency
* remove one-liner functions
* change one letter variable to appropriate names
* Remove Env variable dependency, remove MemoryEfficientCrossAttention class and use enable_xformers_memory_efficient_attention method
* Add memory efficient attention toggle to img2img and inpaint pipelines
* Clearer management of xformers' availability
* update optimizations markdown to add info about memory efficient attention
* add benchmarks for TITAN RTX
* More detailed explanation of how the mem eff benchmark were ran
* Removing autocast from optimization markdown
* import_utils: import torch only if is available

Co-authored-by: Nouamane Tazi <[email protected]>
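As merged, the toggle became a pipeline method rather than an environment variable (per the commit message above). A minimal usage sketch; the exact behavior depends on the diffusers and xformers versions installed, and it requires xformers built with its CUDA extensions:

```python
# Sketch: enable the xformers memory efficient attention path on a pipeline (merged API).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")
pipe.enable_xformers_memory_efficient_attention()  # requires a working xformers install

image = pipe("A fantasy landscape, trending on artstation").images[0]
```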
Why?

While Stable Diffusion democratized access to text-to-image generative models, generating an image can still take a relatively long time on consumer GPUs. The GPU memory requirements also hinder the use of diffusion models on small GPUs.
How?

Recent work on optimizing the bandwidth in the attention block has generated huge speedups and gains in GPU memory usage. The most recent is Flash Attention (from @tridao: code, paper).

In this PR we use the MemoryEfficientAttention implementation from xformers (cc @fmassa, @danthe3rd, @blefaudeux) to both speed up the cross-attention and decrease its GPU memory requirements.
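For context, a minimal sketch of calling the xformers op directly on Stable-Diffusion-like shapes (the shapes are illustrative assumptions, not taken from the diffusers code, and a CUDA GPU is required):

```python
# Sketch: drop-in memory efficient attention from xformers on (batch*heads, seq_len, head_dim) tensors.
import torch
import xformers.ops as xops

q = torch.randn(8, 4096, 40, device="cuda", dtype=torch.float16)  # (batch*heads, seq_len, head_dim)
k = torch.randn(8, 4096, 40, device="cuda", dtype=torch.float16)
v = torch.randn(8, 4096, 40, device="cuda", dtype=torch.float16)

# Computes softmax(q @ k^T / sqrt(head_dim)) @ v without materializing the full 4096x4096 score matrix.
out = xops.memory_efficient_attention(q, k, v)
print(out.shape)  # torch.Size([8, 4096, 40])
```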
The memory efficient attention can be activated by setting the environment variable USE_MEMORY_EFFICIENT_ATTENTION=1 and installing the xformers library. This installation is a known pain point; there are two ways to improve that.
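A sketch of how this toggle would be used from Python, assuming the flag is read when diffusers is imported (so it must be set before the import, or exported in the shell beforehand):

```python
# Sketch: enable the memory efficient attention path via the env-var toggle described in this PR.
import os
os.environ["USE_MEMORY_EFFICIENT_ATTENTION"] = "1"  # must be set before importing diffusers

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")
image = pipe("A fantasy landscape, trending on artstation").images[0]
```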
Thank you @tridao, @fmassa, @danthe3rd for the work on Flash Attention and its integration in xformers.
Would it be possible to add a more optimised kernel for head-dim=40, which is the value used in Stable Diffusion? @blefaudeux and I would be happy to contribute :)
Note:
Speedups on various GPUs with a 512x512 shape and running FP16:
How to test:
I use the following setup:
Then create a Python file, mine is named test.py, with the following code:

Then run it in the aforementioned Docker container.
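For reference, a minimal test.py sketch consistent with the reproduction script shared earlier in the thread (the author's exact script may differ); it reports latency and peak GPU memory for a single 512x512 fp16 generation:

```python
# test.py (sketch): time one generation and report peak GPU memory.
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
    use_auth_token=True,
).to("cuda")

torch.cuda.synchronize()
start = time.time()
image = pipe("A fantasy landscape, trending on artstation").images[0]
torch.cuda.synchronize()

print(f"{time.time() - start:.1f} s, peak memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
image.save("out.png")
```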