Up to 2x speedup on GPUs using memory efficient attention #532

Merged

MatthieuToulemont
Contributor

@MatthieuToulemont MatthieuToulemont commented Sep 16, 2022

Why?

While Stable Diffusion democratized access to text-to-image generative models, generating an image on consumer GPUs can still be relatively slow. GPU memory requirements also hinder the use of diffusion models on small GPUs.

How?

Recent work on optimizing the bandwidth in the attention block has produced large speedups and reductions in GPU memory usage. The most recent is Flash Attention (from @tridao, code, paper).

In this PR we use the MemoryEfficientAttention implementation from xformers (cc. @fmassa, @danthe3rd, @blefaudeux) to both speed up cross-attention and decrease its GPU memory requirements.
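
To make the memory argument concrete, here is a minimal pure-Python sketch of the general idea behind memory-efficient attention: keys and values are processed in chunks while a running softmax maximum and normalizer are maintained, so the full row of attention scores is never held at once. This illustrates the technique only; it is not the xformers or Flash Attention kernel, and the function names and chunk size are assumptions made for the example.

```python
import math


def naive_attention(q, keys, values):
    """Reference: compute every score for one query, then apply softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def chunked_attention(q, keys, values, chunk_size=2):
    """Same result, but only `chunk_size` scores are alive at any time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax normalizer relative to running_max
    acc = [0.0] * dim            # weighted sum of values, same reference point
    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_chunk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_chunk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same output up to floating-point error; the chunked version trades a little extra arithmetic for a much smaller working set, which is the property GPU kernels of this family exploit.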

The memory efficient attention can be activated by setting the environment variable USE_MEMORY_EFFICIENT_ATTENTION=1
and installing the xformers library:

pip install git+https://github.com/facebookresearch/xformers@51dd119#egg=xformers

This installation is a known pain point. There are two ways to improve that:

  • xformers ships wheels
  • this dependency could be made optional in this repository
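
As a rough illustration of the second option, the sketch below treats xformers as a soft dependency: it checks whether the package is importable without importing it, and only selects the memory efficient path when the environment variable from this PR is set and the package is present. The helper names and the returned strings are hypothetical, not code from this branch.

```python
import importlib.util
import os


def xformers_is_available() -> bool:
    """True if the xformers package can be found (it is not imported here)."""
    return importlib.util.find_spec("xformers") is not None


def select_attention_backend(environ=None, xformers_available=None) -> str:
    """Pick an attention implementation without making xformers mandatory."""
    if environ is None:
        environ = os.environ
    if xformers_available is None:
        xformers_available = xformers_is_available()
    requested = environ.get("USE_MEMORY_EFFICIENT_ATTENTION", "0") == "1"
    # Fall back to the default attention when xformers is missing.
    if requested and xformers_available:
        return "memory_efficient"
    return "default"
```

Falling back silently is one possible design; raising an informative error when the variable is set but xformers is missing would be an equally reasonable choice.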

Thank you @tridao, @fmassa, @danthe3rd for the work on Flash Attention and its integration in xformers.
Would it be possible to add a more optimised kernel for head-dim=40, which is the parameter used in Stable Diffusion? @blefaudeux and I would be happy to contribute :)

Note:

  • Masked cross-attention is not supported yet with this implementation but could be added in future work.
  • This PR does not solve the xformers dependency just yet, any help appreciated

Speedups on various GPUs with a 512x512 shape and running FP16:

| GPU | Base Attention FP16 | Memory Efficient Attention FP16 |
|---|---|---|
| NVIDIA Tesla T4 | 3.5 it/s | 5.5 it/s |
| NVIDIA 3060 RTX | 4.6 it/s | 7.8 it/s |
| NVIDIA A10G | 8.88 it/s | 15.6 it/s |
| NVIDIA RTX A6000 | 11.7 it/s | 21.09 it/s |
| A100-SXM-80GB | 18.7 it/s | 29.5 it/s |
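
For a quick read of the numbers above, the short sketch below computes the speedup ratio per GPU. The figures are copied from the benchmark table; the dictionary layout and variable names are only for illustration.

```python
# (base it/s, memory efficient it/s), copied from the benchmark table
benchmarks = {
    "NVIDIA Tesla T4": (3.5, 5.5),
    "NVIDIA 3060 RTX": (4.6, 7.8),
    "NVIDIA A10G": (8.88, 15.6),
    "NVIDIA RTX A6000": (11.7, 21.09),
    "A100-SXM-80GB": (18.7, 29.5),
}

# Speedup = iterations per second with the new attention / baseline
speedups = {gpu: fast / base for gpu, (base, fast) in benchmarks.items()}

for gpu, ratio in speedups.items():
    print(f"{gpu}: {ratio:.2f}x")
```

The ratios range from about 1.57x on the T4 to about 1.80x on the RTX A6000, which is consistent with the "up to 2x" wording in the title.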

How to test:

I use the following setup:

sudo docker run -it --gpus=all --ipc=host -v /home:/home nvcr.io/nvidia/pytorch:22.08-py3 bash

# Then 
pip install git+https://github.com/facebookresearch/xformers@51dd119#egg=xformers
pip install transformers ftfy scipy

# Followed by
cd PATH_TO_DIFFUSER_FOLDER
git checkout memory_efficient_attention
pip install -e . 

Then create a python file, mine is named test.py, with the following code:

import torch
from diffusers import StableDiffusionPipeline


pipe = StableDiffusionPipeline.from_pretrained(
   "CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16, use_auth_token=True
).to("cuda")

with torch.inference_mode(), torch.autocast("cuda"):
   image = pipe("a small cat")

Then run in the aforementioned docker container

# Test without Memory Efficient Attention: 
python test.py

# Test with Memory Efficient Attention: 
USE_MEMORY_EFFICIENT_ATTENTION=1 python test.py

@MatthieuToulemont MatthieuToulemont force-pushed the memory_efficient_attention branch 3 times, most recently from fb4e372 to a29c689 Compare September 16, 2022 15:04
@fmassa

fmassa commented Sep 16, 2022

Hi @MatthieuTPHR ,

Nice PR!

On xFormers side, we are working on improving the packaging so that it can be more easily installed by users, while shipping the pre-compiled binaries as well.

We are also continuing to optimize the kernel for some configurations, and we will keep K=40 in mind for the future.

@fmassa

fmassa commented Sep 16, 2022

And about more optimized kernels for K=40, @danthe3rd has been looking very closely at further optimizations and has some ideas for optimizing our current kernels for smaller K. I'll let him chime in, but contributions are more than welcome!

@MatthieuToulemont
Contributor Author

I will put this PR in draft until the dependency issues are solved.

@patrickvonplaten
Contributor

Hey @MatthieuTPHR,

Thanks a lot for opening the PR - it looks very cool! Trying it out now :-)

Generally we're quite careful with not adding new dependencies to diffusers, but I think we might be able to make it a soft-dependency if the speed-up is big enough!

@MatthieuToulemont MatthieuToulemont marked this pull request as draft September 16, 2022 15:32
@danthe3rd

Hi @MatthieuTPHR - this looks like a great improvement!

> Would it be possible to add a more optimised kernel for head-dim=40 which is the parameter used in stable diffusion. @blefaudeux and I would be happy to contribute :)

We've been improving the forward (including fairly recently, in facebookresearch/xformers#388 for instance). Do you mind sharing the other parameters you use (datatype, sequence length, number of heads) - so we can add them to our benchmarks?

@patrickvonplaten
Contributor

I've tried running the code in this PR, but I'm getting the following error:

AttributeError: module 'triton.language' has no attribute 'constexpr'

when installing xformers with pip install git+https://github.com/facebookresearch/xformers@51dd119#egg=xformers

Do I need a specific version of triton cc @MatthieuTPHR ?

@blefaudeux
Contributor

blefaudeux commented Sep 16, 2022

> Hi @MatthieuTPHR - this looks like a great improvement!
>
> Would it be possible to add a more optimised kernel for head-dim=40 which is the parameter used in stable diffusion. @blefaudeux and I would be happy to contribute :)
>
> We've been improving the forward (including fairly recently, in facebookresearch/xformers#388 for instance). Do you mind sharing the other parameters you use (datatype, sequence length, number of heads) - so we can add them to our benchmarks?

Typically, with Stable Diffusion, a 512x512 image yields a 64x64 latent space -> 4096 tokens (but higher res would be even better).
After that, the embedding is 320 over 8 heads, which yields a head dim of 40. For inference, the batch is typically 2 (conditioned and unconditioned diffusion), but this can depend on the method. So [16, 4096, 40] is a good baseline, and [16, 16384, 40] is a bonus (1024 rendering).

edit: folding the number of heads in the batch to better give a sense of the tensor sizes in practice
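
The arithmetic behind these shapes can be sketched as follows. The downscale factor of 8 between image and latent space is the one quoted elsewhere in this thread; the function names are hypothetical, and the memory estimate assumes a naive baseline that materializes the full tokens x tokens score matrix in FP16, which is an illustrative assumption rather than a measurement.

```python
def attention_shape(image_size, batch=2, heads=8, embed_dim=320, downscale=8):
    """Return [batch * heads, tokens, head_dim] for a square input image."""
    latent = image_size // downscale      # 512 -> 64
    tokens = latent * latent              # 64 * 64 -> 4096
    head_dim = embed_dim // heads         # 320 / 8 -> 40
    return [batch * heads, tokens, head_dim]


def score_matrix_bytes(image_size, bytes_per_element=2, **kwargs):
    """Bytes needed to hold every attention score at once (2 bytes = FP16)."""
    folded_batch, tokens, _ = attention_shape(image_size, **kwargs)
    return folded_batch * tokens * tokens * bytes_per_element


for size in (512, 1024):
    shape = attention_shape(size)
    mib = score_matrix_bytes(size) / 2**20
    print(f"{size}x{size}: shape={shape}, full score matrix = {mib:.0f} MiB")
```

Under these assumptions the score matrix alone takes 512 MiB at 512x512 and 8 GiB at 1024x1024, because it grows with the square of the token count.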

@MatthieuToulemont
Contributor Author

MatthieuToulemont commented Sep 16, 2022

> Hi @MatthieuTPHR - this looks like a great improvement!
>
> Would it be possible to add a more optimised kernel for head-dim=40 which is the parameter used in stable diffusion. @blefaudeux and I would be happy to contribute :)
>
> We've been improving the forward (including fairly recently, in facebookresearch/xformers#388 for instance). Do you mind sharing the other parameters you use (datatype, sequence length, number of heads) - so we can add them to our benchmarks?
>
> typically with stable diffusion 512x512 yields a 64x64 latent space -> 4096 tokens (but higher res would be even better). After that the embedding is 320 over 8 heads, which yields a head dim of 40. For inference batch is typically 2 (conditioned and unconditioned diffusion), but can depend on the methods. So [2, 4096, 40] is a good baseline, [2, 16384, 40] is a bonus (1024 rendering)

I use triton==2.0.0.dev20220701.

For 1024x1024 on the A6000, I get 4 iterations per second.

@blefaudeux
Contributor

blefaudeux commented Sep 16, 2022

> I've tried running the code in this PR, but I'm getting the following error:
>
> AttributeError: module 'triton.language' has no attribute 'constexpr'
>
> when installing xformers with pip install git+https://github.com/facebookresearch/xformers@51dd119#egg=xformers
>
> Do I need a specific version of triton cc @MatthieuTPHR ?

Specifically, triton==2.0.0.dev20220701 will work. The pin needs to be updated on the xformers side, but that's WIP (the newer versions break a couple of kernels).

edit: but no triton installed should also work actually
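
A small sketch of how one might check the local triton situation against what is reported in this thread: no triton at all should work, the pinned dev build should work, and other versions may break some kernels. The helper names and status strings are hypothetical; only the version number comes from the discussion.

```python
from importlib import metadata

PINNED_TRITON = "2.0.0.dev20220701"  # version reported to work in this thread


def installed_triton_version():
    """Return the installed triton version string, or None if it is absent."""
    try:
        return metadata.version("triton")
    except metadata.PackageNotFoundError:
        return None


def triton_status(version):
    """Map a version string (or None) to the expectation stated in the thread."""
    if version is None:
        return "triton not installed: expected to work"
    if version == PINNED_TRITON:
        return "pinned triton version: expected to work"
    return f"triton {version}: not the pinned build, may break some kernels"


print(triton_status(installed_triton_version()))
```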

@patrickvonplaten
Contributor

I'm testing on a NVIDIA TITAN RTX on this branch with the following package dependencies:

- CUDA Version: 11.6
- torch: 1.12.1+cu102
- xformers: 0.0.13.dev (installed with `pip install git+https://github.com/facebookresearch/xformers@51dd119#egg=xformers`)
- triton: 2.0.0 (installed with `pip install triton==2.0.0.dev20220701`)

Note:

When I import xformers I'm getting:

libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
WARNING:root:WARNING: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
Need to compile C++ extensions to get sparse attention suport. Please run python setup.py build develop
0.0.13.dev

With this setup I'm running:

from diffusers import StableDiffusionPipeline
import numpy as np
import torch

model_id = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    use_auth_token=True,
)
pipe.to("cuda")

prompt = "A fantasy landscape, trending on artstation"
generator = torch.Generator(device="cuda").manual_seed(0)

with torch.autocast("cuda"):
    output = pipe(prompt=prompt, guidance_scale=7.5, generator=generator, output_type="np")

print(np.sum(np.abs(output.images[:3, :3, :3, :3])))
mem_bytes = torch.cuda.max_memory_allocated()
print(mem_bytes)

  1. Without xformers, simply on the current main branch, I'm getting 9.15 iterations/sec.
  2. On this branch, I'm running with:

export USE_MEMORY_EFFICIENT_ATTENTION=1

And I'm getting the exact same speed.

Any ideas what could be the problem here?

@fmassa

fmassa commented Sep 16, 2022

Hi @patrickvonplaten

It looks like the CUDA extensions might not be compiled when installing xformers.

Can you try running

import torch, xformers.ops

print(torch.ops.xformers.efficient_attention_forward_generic)

and see if it prints something like <OpOverloadPacket(op='xformers.efficient_attention_forward_generic')>?

If it doesn't print that, there are a few possible reasons why the extensions aren't being compiled. A few questions:

  • Do you have CUDA set up on the machine where you installed xformers?
  • Do you have nvcc on the machine you used to install xformers?
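
The checks above can be gathered into one diagnostic helper, sketched below. The op name `efficient_attention_forward_generic` is the one quoted in this thread and is specific to the pinned xformers commit; the helper name and report keys are hypothetical. The function degrades gracefully when torch or xformers is not installed.

```python
import shutil


def xformers_cuda_diagnostics():
    """Collect the facts asked about above into a dict of booleans."""
    report = {
        "nvcc_on_path": shutil.which("nvcc") is not None,
        "torch_installed": False,
        "cuda_available": False,
        "xformers_installed": False,
        "efficient_attention_op": False,
    }
    try:
        import torch
    except ImportError:
        return report
    report["torch_installed"] = True
    report["cuda_available"] = bool(torch.cuda.is_available())
    try:
        import xformers.ops  # noqa: F401  (registers the custom ops, if built)
    except (ImportError, OSError):
        return report
    report["xformers_installed"] = True
    try:
        # Attribute lookup fails when the CUDA extensions were not compiled.
        torch.ops.xformers.efficient_attention_forward_generic
        report["efficient_attention_op"] = True
    except (AttributeError, RuntimeError):
        pass
    return report


print(xformers_cuda_diagnostics())
```

If `efficient_attention_op` is False while the other entries are True, the extensions were most likely skipped at build time.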

@MatthieuToulemont
Contributor Author

MatthieuToulemont commented Sep 16, 2022

Hi @patrickvonplaten ,

Here is my full setup:

sudo docker run -it --gpus=all --ipc=host -v /home:/home nvcr.io/nvidia/pytorch:22.08-py3 bash

# Then 
pip install git+https://github.com/facebookresearch/xformers@51dd119#egg=xformers
pip install transformers ftfy scipy

# Followed by
cd PATH_TO_DIFFUSER_FOLDER
git checkout memory_efficient_attention
pip install -e . 

Once this is done, I see the following speedup: 10 iterations per second without and 21 with, both at 512x512 in fp16.

@MatthieuToulemont MatthieuToulemont force-pushed the memory_efficient_attention branch from 23155d9 to e515599 Compare September 16, 2022 17:00
@tridao

tridao commented Sep 16, 2022

Thank you @MatthieuTPHR, super excited to see ideas on fast & memory-efficient attention having an impact!

@patrickvonplaten
Contributor

patrickvonplaten commented Sep 16, 2022

Hey @fmassa,

When running:

import torch, xformers.ops

print(torch.ops.xformers.efficient_attention_forward_generic)

I'm getting:

'_OpNamespace' object has no attribute 'efficient_attention_forward_generic'

There already seems to be a problem I guess?

It would be really nice if we could somehow show the community that it's easy to install and use :-)

@fmassa

fmassa commented Sep 16, 2022

@patrickvonplaten this means that indeed xformers was compiled without the CUDA extensions.

> It would be really nice if we could somehow show the community that it's easy to install and use :-)

Yes, I totally agree, and we are working on that :-)

If you are not compiling xformers on a machine with CUDA (i.e., if you pip install on a machine without GPUs and then ssh to a machine with GPUs), you could try using FORCE_CUDA=1 pip install git+https://github.com/facebookresearch/xformers@51dd119#egg=xformers to compile xformers with its CUDA extensions. That compilation should take a while, so it might be useful to pass --verbose to pip to see what is being done.

@MatthieuToulemont
Contributor Author

> Hi @MatthieuTPHR - this looks like a great improvement!
>
> Would it be possible to add a more optimised kernel for head-dim=40 which is the parameter used in stable diffusion. @blefaudeux and I would be happy to contribute :)
>
> We've been improving the forward (including fairly recently, in facebookresearch/xformers#388 for instance). Do you mind sharing the other parameters you use (datatype, sequence length, number of heads) - so we can add them to our benchmarks?

We are using the default parameters from the CompVis repo, I believe the parameters are as follows:

  • sequence_length for a 512x512 input: q_len = 64x64 = 4096
  • number of heads = 8
  • datatype: by default the HF repo allows for FP32, FP16, BF16

The sequence length could also be higher if we use a 1024x1024 or 2048x2048 input. The downscale factor between the input and the latent space is 8.

@TheLastBen

Does it require a GPU with tensor cores (RTX 20 Series and above)?
I'm getting: WARNING:root:Blocksparse is not available: the current GPU does not expose Tensor cores

then:

Could not run 'xformers::efficient_attention_forward_generic' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build)

@danthe3rd

@TheLastBen what is your GPU model? xformers supports architectures above sm60 (P100+) - and possibly above sm50 (untested). The most important speedups are achieved on GPUs with tensor cores (sm70+ aka V100 and later), but it's not a requirement
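
The support tiers described here can be written down as a small lookup on the CUDA compute capability. This is only a restatement of the comment above as code; the function name and the returned strings are hypothetical.

```python
def describe_gpu_support(major, minor):
    """Classify a CUDA compute capability according to the tiers above."""
    sm = major * 10 + minor
    if sm >= 70:
        return "supported, with tensor cores (largest speedups)"
    if sm >= 60:
        return "supported, without tensor cores"
    if sm >= 50:
        return "possibly supported (untested)"
    return "not supported"


# Examples: V100 is sm70, P100 is sm60, GTX 1070 Ti is sm61.
for name, capability in [("V100", (7, 0)), ("P100", (6, 0)), ("GTX 1070 Ti", (6, 1))]:
    print(name, "->", describe_gpu_support(*capability))
```

With PyTorch installed and a GPU present, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair that this function expects.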

@TheLastBen

@danthe3rd I have a GTX 1070 Ti. The message is coming from Triton, so I don't think it's the main cause of the crash.

@blefaudeux
Contributor

> Does it require a GPU with tensor cores (RTX 20 Series and above) ? getting : WARNING:root:Blocksparse is not available: the current GPU does not expose Tensor cores
>
> then :
>
> Could not run 'xformers::efficient_attention_forward_generic' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build)

these are two different topics:

  • The first message is a warning that blocksparse attention will not be available. That's fine for you because you're not targeting this use case anyway.
  • The second message is your issue: the xformers setup could not build the CUDA kernels, so it does not find them at runtime. You can refer to this message to get more info on that. I would recommend a conda env or Docker image where all the CUDA and torch components align (hard to get right outside of that).

@MatthieuToulemont MatthieuToulemont force-pushed the memory_efficient_attention branch from be50082 to 0f75d57 Compare November 1, 2022 18:40
@MatthieuToulemont
Contributor Author

@NouamaneTazi the CI test errors seem unrelated to the code on this branch:

=========================== short test summary info ============================
FAILED tests/test_config.py::ConfigTester::test_load_ddim_from_euler - OSError: runwayml/stable-diffusion-v1-5 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login`.
FAILED tests/test_config.py::ConfigTester::test_load_ddim_from_euler_ancestral - OSError: runwayml/stable-diffusion-v1-5 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login`.
FAILED tests/test_config.py::ConfigTester::test_load_ddim_from_pndm - OSError: runwayml/stable-diffusion-v1-5 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login`.
FAILED tests/test_config.py::ConfigTester::test_load_pndm - OSError: runwayml/stable-diffusion-v1-5 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login`.
========== 4 failed, 233 passed, 136 skipped, 131 warnings in 50.61s ===========

Do you know if this is common to other PRs?

@patil-suraj
Contributor

patil-suraj commented Nov 2, 2022

@MatthieuTPHR yes, the failing tests are unrelated, merging this now!

Thanks a lot @MatthieuTPHR @NouamaneTazi and the xformers team!

@Thomas-MMJ

Thomas-MMJ commented Nov 3, 2022

@MatthieuTPHR I'd certainly be interested in seeing the AITemplate; I couldn't get it going locally due to OOM during building.
Another thing that might be of interest is DeepSpeed MII:

https://github.com/microsoft/DeepSpeed-MII

txt2img example here

https://github.com/microsoft/DeepSpeed-MII/tree/main/examples/local

Note for xformers: triton seems to give another 0.5 it/s locally on 512x512 images. Unfortunately, I can't get it compiled for Windows conda locally; I only got it working on WSL2 Ubuntu. For xformers you need to use a pinned version, since they have shuffled the location of some functions.

@bottler

bottler commented Nov 3, 2022

> @fmassa, @danthe3rd and the xformers team, thanks a lot for this amazing efficient implementation! It would be amazing if xformers is pip installable, this will make it very easy for users to use the new attention. Is there any ETA for when xformers will be pip installable ?

xformers now has dev conda packages for some combinations of Python, CUDA and PyTorch. See the README.

@Thomas-MMJ

Thomas-MMJ commented Nov 4, 2022

@MatthieuTPHR

This PR for DeepSpeed MII is a nice tutorial for SD using DeepSpeed MII and lays out some of the optimizations they do:

microsoft/DeepSpeed-MII#90

https://github.com/microsoft/DeepSpeed-MII/blob/7ce86af278bc4da1479c17309d87539ccf78f994/examples/benchmark/txt2img/README.md

@0xdevalias
Contributor

0xdevalias commented Nov 10, 2022

AITemplate + xformers combination just dropped:

Done: facebookincubator/AITemplate#74

Originally posted by @antinucleon in facebookincubator/AITemplate#13 (comment)


Sync to v0.1.1 version

Impact on current examples:

  • Stable Diffusion: A100-40GB / CUDA 11.6, 50 steps (ms)

Batch 1

| Module | AIT v0.1 | AIT v0.1.1 | v0.1.1 Speedup |
|---|---|---|---|
| CLIP | 0.87 | 0.87 | 1X |
| UNet | 22.47 | 18.11 | 1.24X |
| VAE | 37.43 | 20.14 | 1.85X |
| Sum of Three | 1161.8 | 926.51 | 1.25X |
| Pipeline | 1282.98 | 1013 | 1.26X |

v0.1: 42.45 it/s, v0.1.1: 53.30 it/s

Batch 16

| Module | v0.1 | v0.1.1 | Speedup |
|---|---|---|---|
| Pipeline | 14931.95 | 11064.81 | 1.34X |

  • BERT
    CUDA long sequence performance will be significantly boosted by using new mem_eff_attention codegen
  • VIT
    CUDA large resolution performance will be significantly boosted by using new mem_eff_attention codegen

Originally posted by @0xdevalias in AUTOMATIC1111/stable-diffusion-webui#1625 (comment)

@spider853

I have the following issues with xformers:

  1. custom embeddings are almost not taken into consideration
  2. ghosting artifacts and low-res output appear in the generated images

yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
…e#532)

* 2x speedup using memory efficient attention

* remove einops dependency

* Swap K, M in op instantiation

* Simplify code, remove unnecessary maybe_init call and function, remove unused self.scale parameter

* make xformers a soft dependency

* remove one-liner functions

* change one letter variable to appropriate names

* Remove Env variable dependency, remove MemoryEfficientCrossAttention class and use enable_xformers_memory_efficient_attention method

* Add memory efficient attention toggle to img2img and inpaint pipelines

* Clearer management of xformers' availability

* update optimizations markdown to add info about memory efficient attention

* add benchmarks for TITAN RTX

* More detailed explanation of how the mem eff benchmark were ran

* Removing autocast from optimization markdown

* import_utils: import torch only if is available

Co-authored-by: Nouamane Tazi <[email protected]>
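
The commit list above replaces the environment variable with an `enable_xformers_memory_efficient_attention` method on the pipeline. A minimal usage sketch, reusing the model id and FP16 setting from the test script in the PR description, might look like the following. The wrapper function is hypothetical, and the call requires a CUDA GPU plus an xformers build with its CUDA extensions; nothing runs until the function is called.

```python
def generate_with_xformers(prompt, model_id="CompVis/stable-diffusion-v1-4"):
    """Generate one image with memory efficient attention enabled."""
    # Imported lazily so this sketch can be loaded without the GPU stack.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16
    ).to("cuda")
    # Method named in the commit list; it is opt-in and needs xformers.
    pipe.enable_xformers_memory_efficient_attention()
    with torch.inference_mode():
        return pipe(prompt).images[0]


# Example (requires a GPU and model access):
# image = generate_with_xformers("a small cat")
```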