minimal stable diffusion GPU memory usage with accelerate hooks #850
Conversation
The documentation is not available anymore as the PR was closed or merged.
Just for reference, this is how I solved the same problem - https://github.com/hafriedlander/stable-diffusion-grpcserver/blob/b75deaa743c77415f277fedd37494a661a0cbaf2/sdgrpcserver/pipeline/unified_pipeline.py#L364. This gets used as a base class for a pipeline rather than DiffusionPipeline, and moves one model at a time onto the GPU. The advantage compared to this PR is that all the models run on the GPU; the disadvantage is that it depends on garbage collection to free the GPU memory.
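For readers skimming the thread, that pattern looks roughly like this (a hypothetical helper, not the linked implementation): each model is moved onto the GPU only for its own call, and reclaiming the GPU copy afterwards depends on garbage collection, which is the disadvantage mentioned.

```python
import gc
import torch

def run_on_gpu(model: torch.nn.Module, *args, **kwargs):
    """Move one model to the GPU just for this call, then push it back to CPU."""
    model.to("cuda")
    try:
        return model(*args, **kwargs)
    finally:
        model.to("cpu")
        gc.collect()                # GPU memory is only reclaimed once Python
        torch.cuda.empty_cache()    # garbage-collects the stale CUDA tensors
```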
@hafriedlander thank you for the suggestion. I found out that we can do that with accelerate: if we add this hook to each model, it is moved onto the GPU only when performing inference, keeping the memory usage minimal. I tried to avoid losing time with I/O and thus did not add everything to be loaded onto the GPU. Nevertheless, I agree with you this is a possible add-on to the PR.
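To illustrate the mechanism (a plain-PyTorch caricature of what such a hook does; accelerate's real implementation lives in `accelerate.hooks` and is used later in the thread via `cpu_offload`):

```python
import torch

def attach_naive_offload_hooks(model: torch.nn.Module, device: torch.device):
    """Move weights to `device` right before forward and back to CPU right after."""

    def pre_hook(module, args):
        module.to(device)
        # Also move the incoming positional tensors so forward runs on one device.
        return tuple(a.to(device) if torch.is_tensor(a) else a for a in args)

    def post_hook(module, args, output):
        module.to("cpu")
        return output

    model.register_forward_pre_hook(pre_hook)
    model.register_forward_hook(post_hook)
```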
Hey @piEsposito, generally I think I'm fine with this PR! Just one question: in our experiments it's better and faster to run the model in pure fp16 and to not use autocast at all - see #371. Should we maybe try this here as well? E.g. just removing the `autocast`?

Otherwise happy to add this to the pipeline :-)
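For context, "pure fp16 without autocast" would look roughly like this (a sketch; the model id and the `torch_dtype` argument are the standard diffusers usage of the time, not code from this PR):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the weights directly in half precision and keep everything on the GPU;
# no torch.autocast context manager is used anywhere.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

image = pipe("a photo of an astronaut riding a horse").images[0]
```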
@patrickvonplaten thank you for your feedback. About your concern: I think we can't run the model in pure fp16 because some layers (e.g. layernorm) are implemented only for fp32 on CPU. Because of that we can only keep the unet in fp16 (as it will run on the GPU), and the other models need to be in fp32 to be able to run on CPU. I know this is not optimal, but it would let people with smaller GPUs have access to Stable Diffusion at the price of a decrease in performance. When I put everything under autocast, it works: the autocast acts as a way to put the tensors in the right precision without having to intrude on the pipeline code.

Please let me know if there is a way to keep them in fp16 on CPU and be able to run those layers; if that's the case I can add it to the PR. I tried other approaches, but for now the best I could do is explicitly keeping the unet in fp16 on GPU and everything else in fp32 on CPU. It is not the best scenario, but at least people with less powerful setups (like my mother) will be able to run it. What do you think? Thanks!
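To make the setup concrete, here is a rough sketch of the precision split described above (attribute names as in `StableDiffusionPipeline`; whether a plain pipeline call works with this split relies on this PR's device handling):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

pipe.unet.to(device="cuda", dtype=torch.float16)    # only the unet lives on the GPU
for model in [pipe.text_encoder, pipe.vae, pipe.safety_checker]:
    model.to(device="cpu", dtype=torch.float32)     # CPU kernels (e.g. layernorm) need fp32

# autocast casts activations to the right dtype on the GPU, so the pipeline
# code itself does not need to know about the precision split.
with torch.autocast("cuda"):
    image = pipe("a photo of an astronaut riding a horse").images[0]
```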
Update: I've opened huggingface/accelerate#768 to solve the issue with `cpu_offload`. If that accelerate PR is merged, we can change this PR to get as little as 640mb of GPU memory usage: we would just have to change `src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py` (lines 122 to 137 at 0b601e6)
to

```python
def cuda_with_minimal_gpu_usage(self):
    if is_accelerate_available():
        from accelerate import cpu_offload
    else:
        raise ImportError("Please install accelerate via `pip install accelerate`")

    device = torch.device("cuda")
    self.enable_attention_slicing(1)

    for cpu_offloaded_model in [self.unet, self.text_encoder, self.vae, self.safety_checker]:
        cpu_offload(cpu_offloaded_model, device)
```

and reduce memory usage even further thanks to CPU offload.
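For a sense of the end-user flow, usage would look roughly like this (a sketch assuming the method above is merged and accelerate is installed from main):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe.cuda_with_minimal_gpu_usage()  # attention slicing + CPU-offload hooks

# Weights stay on the CPU and are streamed to the GPU one module at a time,
# so peak GPU memory stays in the hundreds of megabytes.
image = pipe("a photo of an astronaut riding a horse").images[0]
```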
@patrickvonplaten I will wait for the fix to be merged on accelerate, then. Would that be ok?
@sgugger, can I ask for a new patch release of accelerate with huggingface/accelerate@5e8ab12 (from huggingface/accelerate#768) merged? It would unblock this PR.
Not sure we will do a patch release for that; it's not a regression introduced by anything, just a bug fix. The big model inference in Accelerate is getting improved every day, so Diffusers should test against the main branch of Accelerate for now (that's what we do in Transformers).
Hey @piEsposito, cool to see that the change has been merged to accelerate's main branch! Then we can just ask people to install accelerate from GitHub until the next release.

Does this sound good to you? :-)
@patrickvonplaten just did it. The problem is the tests won't pass because they install the released version of accelerate, which doesn't include the fix yet. From my commits below this comment you will notice that, if it is possible to test against accelerate's master branch, I'm having a hard time figuring it out haha.
Ah I see regarding the tests, let me and @anton-l take care of it :-) Don't worry about the failing test, we'll make a change to the testing files.
.github/workflows/pr_tests.yml (outdated)

```yaml
run: |
  python -m pip install --upgrade pip
  python -m pip install torch --extra-index-url https://download.pytorch.org/whl/cpu
  python -m pip install -e .[quality,test]
```
Suggested change:

```diff
-python -m pip install -e .[quality,test]
+python -m pip install git+https://github.com/huggingface/accelerate
+python -m pip install -e .[quality,test]
```
This should work I think :-)
I tried putting the accelerate-from-master install after `.[quality,test]` to ensure this is the version we will keep for the tests.
.github/workflows/push_tests.yml (outdated)

```yaml
run: |
  python -m pip install --upgrade pip
  python -m pip uninstall -y torch torchvision torchtext
  python -m pip install torch --extra-index-url https://download.pytorch.org/whl/cu116
  python -m pip install -e .[quality,test]
```
Suggested change:

```diff
-python -m pip install -e .[quality,test]
+python -m pip install git+https://github.com/huggingface/accelerate
+python -m pip install -e .[quality,test]
```
I tried putting the accelerate-from-master install after `.[quality,test]` to ensure this is the version we will keep for the tests. How does that sound?
Actually you seem to have figured out already how the GitHub actions work :-) Think you'll just need to install accelerate before `pip install -e .[quality,test]`.
Actually, I think you were right. We can install it before, because the git install will already comply with the `accelerate>=0.11` requirement.
@patrickvonplaten it worked, only the MPS tests on Apple M1 are not passing, but I think that's happening on the other PRs too, right?
Now I think we are good to go.
@patrickvonplaten, can we add this functionality to the other Stable Diffusion pipelines in follow-up PRs?
@patrickvonplaten, sorry to bother, but can I do anything to help move this PR along?
Great PR @piEsposito - sorry for replying so late! I was a bit swamped with issues 😅
@anton-l FYI regarding the tests -> until the next accelerate release we'll need to install it from main.
@patrickvonplaten anything I can do to help with the issues? Thank you for merging that one. Can I do the same for the other stable diffusion pipelines?
Yes I think adding this to the img2img and also inpaint pipeline makes a lot of sense :-) Also would you be interested in adding a section about this feature and the max memory usage to: https://huggingface.co/docs/diffusers/optimization/fp16 maybe?
All right, I will do that in the next few days. About the documentation: how can I do that?
@piEsposito regarding the documentation you simply need to change the corresponding files here: https://github.com/huggingface/diffusers/tree/main/docs/source as well as add a docstring to your method.
I think as soon as we have docs up as mentioned here: #850 (comment) we can promote this super cool new reduction to less than 2GB RAM, what do you think? :-)
@patrickvonplaten great idea, I will try to do that by the end of the week. Should I add it to https://github.com/huggingface/diffusers/blob/main/docs/source/optimization/fp16.mdx? If we join this with attention slicing, it gets as small as 700mb.
minimal stable diffusion GPU memory usage with accelerate hooks (huggingface#850)

* add method to enable cuda with minimal gpu usage to stable diffusion
* add test to minimal cuda memory usage
* ensure all models but unet are on torch.float32
* move to cpu_offload along with minor internal changes to make it work
* make it test against accelerate master branch
* coming back, it's official: I don't know how to make it test against the master branch from accelerate
* make it install accelerate from master on tests
* go back to accelerate>=0.11
* undo prettier formatting on yml files
* undo prettier formatting on yml files again
Attempts to solve #540 in a more readable, less intrusive, less verbose way than #537.
@patil-suraj I made another attempt to solve this, now in a non-intrusive way, using `accelerate.cpu_offload` to keep on the GPU only the parts of the models that are being used in operations, keeping the GPU memory footprint as small as 800 MB. Would you be so kind as to take a look and tell me if I'm going in the right direction? Thanks!
Changes:

* If the model is on `torch.device("meta")`, the `.device` getter returns `torch.device("cpu")`; otherwise we won't be able to move tensors to `self.device` when using accelerate (a sketch of such a getter follows below).
* Added a `StableDiffusionPipeline.cuda_with_minimal_gpu_usage` method to `accelerate.cpu_offload` all models and reduce the GPU memory footprint.
* Tests install `accelerate` from master, as per @sgugger's suggestion.

We need to install `accelerate` from master to use it, as this PR depends on huggingface/accelerate#768.
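For reference, a minimal sketch of what such a meta-aware `.device` getter could look like inside the pipeline class (the property name comes from the PR description; the body here is assumed). It relies on diffusers/transformers models exposing a `.device` attribute, and on accelerate's offload hooks leaving parameters on the meta device:

```python
import torch

class StableDiffusionPipelineSketch:
    # ... unet, text_encoder and vae are set in __init__ ...

    @property
    def device(self) -> torch.device:
        for module in (self.unet, self.text_encoder, self.vae):
            if isinstance(module, torch.nn.Module):
                # Offloaded modules report the meta device; pretend they are on
                # CPU so callers can still move tensors to `self.device`.
                if module.device == torch.device("meta"):
                    return torch.device("cpu")
                return module.device
        return torch.device("cpu")
```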