how to avoid model offloading/reloading #2311
Comments
holy smokes, Total VRAM 81051 MB, total RAM 1814211 MB! You can disable offloading with
@mashb1t we are specifying those options, but the weird thing we noticed was that, at every subsequent request, the LCM LoRA would get loaded again and again (preparing time is never 0 seconds, always >=400ms). This is something we'd want to somehow optimize as it is super wasteful in the current state. I have a feeling that this is related to how Fooocus maintains "unet" copies and how it fuses loras, but we weren't able to figure out a way to always maintain an LCM-LoRA-fused model in memory to get rid of this. Any ideas/suggestions would be super amazing!
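One way to avoid re-patching on every request, as described above, is to memoize the patched model keyed by the LoRA set. This is a minimal sketch of the idea, not Fooocus's actual model-management code; `load_base_unet` and `apply_lora` are hypothetical placeholders for whatever loading/patching functions the app uses:

```python
# Hypothetical sketch: cache the LoRA-patched UNet so repeated requests
# with the same LoRA set skip the ~400-600 ms re-patching step.
# load_base_unet / apply_lora are placeholders, not the real Fooocus API.

_patched_cache = {}

def get_patched_unet(lora_specs, load_base_unet, apply_lora):
    """Return a UNet with the given LoRAs fused, reusing a cached copy.

    lora_specs: tuple of (lora_name, weight) pairs -- must be hashable
    so it can serve as the cache key.
    """
    key = tuple(lora_specs)
    if key not in _patched_cache:
        unet = load_base_unet()
        for name, weight in lora_specs:
            unet = apply_lora(unet, name, weight)
        _patched_cache[key] = unet
    return _patched_cache[key]
```

The trade-off, as discussed later in the thread, is VRAM/RAM: every distinct LoRA combination kept in the cache is another full copy of the patched weights.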
@isidentical I'm sadly not that deep into the model loading and patching myself and I'm afraid I can't be of much help here.
@lllyasviel are you able to shed some light on the situation?
@mashb1t just to update everyone here: we were able to fuse the LCM LoRA into the original JuggernautXL model and then removed this line (Fooocus/modules/async_worker.py, line 183 at commit 1c999be). This allowed us to get the generation timings from ~2 seconds down to ~1.4-1.6 seconds (400-600ms shaved) for our own workflows. Not sure how much it is worth upstreaming this since it has its own pros/cons (e.g. you need a lot of VRAM, and Fooocus is generally a consumer-oriented app, so users might choose the flexibility over a couple hundred milliseconds). So for our part, I think the issue can be closed.
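For context on why fusing the LoRA removes the per-request cost: a LoRA contributes a low-rank delta `scale * B @ A` to a weight matrix, and that delta can be baked into the weights once instead of being re-applied on every request. A toy numpy illustration (shapes and names are purely illustrative, not tied to any real model):

```python
# Toy illustration: fusing a LoRA delta into the base weights gives the
# same output as applying the LoRA path on every forward pass.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 8, 2

W = rng.standard_normal((d_out, d_in))   # base weight matrix
A = rng.standard_normal((rank, d_in))    # LoRA down-projection
B = rng.standard_normal((d_out, rank))   # LoRA up-projection
scale = 0.8
x = rng.standard_normal(d_in)

# Per-request patching: base path plus LoRA path on every forward pass.
y_patched = W @ x + scale * (B @ (A @ x))

# One-time fusion: fold the delta into the weights, then forward as usual.
W_fused = W + scale * (B @ A)
y_fused = W_fused @ x

assert np.allclose(y_patched, y_fused)
```

The downside is that the fused checkpoint is specific to that LoRA and scale, which is exactly the flexibility trade-off discussed in the replies below.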
@isidentical thank you for the update and for sharing your insights, much appreciated. As you've also already hinted at, the goal of Fooocus is to lower entry barriers and allow as many users as possible to generate images, even with low hardware specs/knowledge/internet bandwidth. I assume the average Fooocus user doesn't mind waiting an additional few hundred milliseconds or even a second when the alternative is a sacrifice in flexibility.
@mashb1t fooocus is an application we really like, so thank you both for maintaining it!! I don't see any sponsor button on the repo or on any of the maintainers, but we'd love to at least contribute financially if not as code/knowledge!
@isidentical @mashb1t Interesting material on optimization of SDXL has just been released, perhaps something might be useful to you. https://www.felixsanz.dev/articles/ultimate-guide-to-optimizing-stable-diffusion-xl
@poor7 great article! Fooocus already uses almost all of the mentioned optimisation mechanisms except pre-compilation, to keep ease of use high. This is why the VRAM footprint is only 4GB (even lower than the min. 6GB mentioned in the article), but a bit of swap is needed for offloading the models for fast access.
I have a brand new laptop with 32GB RAM and an RTX 4070. No matter what I do with switches, I continue to get a reload of the model on every image created in a batch (like badayvedat). I have tinkered with --disable-offload-from-vram, --always-gpu, and --always-high-vram. Nothing seems to prevent the reloading on each image generation. Is there something I might be missing?
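For reference, the offloading-related switches mentioned in this thread are launch arguments, so they need to be passed when starting Fooocus rather than toggled in the UI. A sketch, assuming a standard Fooocus checkout with `entry_with_update.py` as the entry script (adjust to your setup):

```shell
# Keep models on the GPU and skip offloading (needs plenty of VRAM):
python entry_with_update.py --always-gpu --disable-offload-from-vram

# Alternatively, prefer the high-VRAM path without forcing everything to GPU:
python entry_with_update.py --always-high-vram
```

Note that even with these flags, per-image "Moving model(s)" log lines can still appear when the LoRA patching path runs, which is the separate issue discussed above.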
Read Troubleshoot
Describe the problem
Fooocus reloads the model and offloads the clones on every inference request (even for an identical request). Notice the
[Fooocus Model Management] Moving model(s) has taken <x> seconds
on the second and third inference logs.

Full Console Log
cc: @isidentical