how to avoid model offloading/reloading #2311
Comments
holy smokes, Total VRAM 81051 MB, total RAM 1814211 MB! You can disable offloading with
@mashb1t we are specifying those options, but the weird thing we noticed was that, at every subsequent request, the LCM LoRA would get loaded again and again (preparing time is never 0 seconds, always >=400ms). This is something we'd want to somehow optimize as it is super wasteful in the current state. I have a feeling that this is related to how Fooocus maintains "unet" copies and how it fuses loras, but we weren't able to figure out a way to always maintain an LCM-LoRA-fused model in memory to get rid of this. Any ideas/suggestions would be super amazing!
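One way to avoid re-patching on every request, as described above, is to memoize the patched model keyed by the LoRA set. This is a minimal sketch of the idea, not Fooocus's actual model-management code; `load_base_unet` and `apply_lora` are hypothetical placeholders for whatever loading/patching functions the app uses:

```python
# Hypothetical sketch: cache the LoRA-patched UNet so repeated requests
# with the same LoRA set skip the ~400-600 ms re-patching step.
# load_base_unet / apply_lora are placeholders, not the real Fooocus API.

_patched_cache = {}

def get_patched_unet(lora_specs, load_base_unet, apply_lora):
    """Return a UNet with the given LoRAs fused, reusing a cached copy.

    lora_specs: tuple of (lora_name, weight) pairs -- must be hashable
    so it can serve as the cache key.
    """
    key = tuple(lora_specs)
    if key not in _patched_cache:
        unet = load_base_unet()
        for name, weight in lora_specs:
            unet = apply_lora(unet, name, weight)
        _patched_cache[key] = unet
    return _patched_cache[key]
```

The trade-off, as discussed later in the thread, is VRAM/RAM: every distinct LoRA combination kept in the cache is another full copy of the patched weights.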
@isidentical I'm sadly not that deep into the model loading and patching myself and I'm afraid I can't be of much help here.
@lllyasviel are you able to shed some light on the situation?
@mashb1t just to update everyone here: we were able to fuse the LCM LoRA into the original JuggernautXL model and then removed this line (Fooocus/modules/async_worker.py, line 183 at commit 1c999be). This allowed us to get the generation timings from ~2 seconds down to ~1.4-1.6 seconds (400-600ms shaved) for our own workflows. Not sure how much it is worth upstreaming this since it has its own pros/cons (e.g. you need a lot of VRAM, and Fooocus is generally a consumer-oriented app, so users might choose the flexibility over a couple hundred milliseconds). So for our part, I think the issue can be closed.
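For context on why fusing the LoRA removes the per-request cost: a LoRA contributes a low-rank delta `scale * B @ A` to a weight matrix, and that delta can be baked into the weights once instead of being re-applied on every request. A toy numpy illustration (shapes and names are purely illustrative, not tied to any real model):

```python
# Toy illustration: fusing a LoRA delta into the base weights gives the
# same output as applying the LoRA path on every forward pass.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 8, 2

W = rng.standard_normal((d_out, d_in))   # base weight matrix
A = rng.standard_normal((rank, d_in))    # LoRA down-projection
B = rng.standard_normal((d_out, rank))   # LoRA up-projection
scale = 0.8
x = rng.standard_normal(d_in)

# Per-request patching: base path plus LoRA path on every forward pass.
y_patched = W @ x + scale * (B @ (A @ x))

# One-time fusion: fold the delta into the weights, then forward as usual.
W_fused = W + scale * (B @ A)
y_fused = W_fused @ x

assert np.allclose(y_patched, y_fused)
```

The downside is that the fused checkpoint is specific to that LoRA and scale, which is exactly the flexibility trade-off discussed in the replies below.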
@isidentical thank you for the update and for sharing your insights, much appreciated. As you've also already hinted at, the goal of Fooocus is to lower entry barriers and allow as many users as possible to generate images, even with low hardware specs/knowledge/internet bandwidth. I assume the average Fooocus user doesn't mind waiting an additional few hundred milliseconds or even a second when the alternative is a sacrifice in flexibility.
@mashb1t fooocus is an application we really like, so thank you both for maintaining it!! I don't see any sponsor button on the repo or on any of the maintainers, but we'd love to at least contribute financially if not as code/knowledge!
@isidentical @mashb1t Interesting material on optimization of SDXL has just been released, perhaps something might be useful to you. https://www.felixsanz.dev/articles/ultimate-guide-to-optimizing-stable-diffusion-xl
@poor7 great article! Fooocus already uses almost all of the mentioned optimisation mechanisms except pre-compilation, to keep ease of use high. This is why the VRAM footprint is only 4GB (even lower than the min. 6GB mentioned in the article), but a bit of swap is needed for offloading the models for fast access.
I have a brand new laptop with 32GB RAM and an RTX 4070. No matter what I do with switches, I continue to get a reload of the model on every image created in a batch (like badayvedat). I have tinkered with --disable-offload-from-vram, --always-gpu, and --always-high-vram. Nothing seems to prevent the reloading on each image generation. Is there something I might be missing?
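For reference, the offloading-related switches mentioned in this thread are launch arguments, so they need to be passed when starting Fooocus rather than toggled in the UI. A sketch, assuming a standard Fooocus checkout with `entry_with_update.py` as the entry script (adjust to your setup):

```shell
# Keep models on the GPU and skip offloading (needs plenty of VRAM):
python entry_with_update.py --always-gpu --disable-offload-from-vram

# Alternatively, prefer the high-VRAM path without forcing everything to GPU:
python entry_with_update.py --always-high-vram
```

Note that even with these flags, per-image "Moving model(s)" log lines can still appear when the LoRA patching path runs, which is the separate issue discussed above.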
Read Troubleshoot
Describe the problem
Fooocus reloads the model and offloads the clones on every inference request (even for an identical request). Notice the
[Fooocus Model Management] Moving model(s) has taken <x> seconds
on the second and third inference logs.

Full Console Log
cc: @isidentical