Increase of model moving time with each flux generation #2652

Open
R2008KK opened this issue Feb 13, 2025 · 0 comments

R2008KK commented Feb 13, 2025

I've noticed that when using Flux models, the model moving time keeps getting longer with each new generation.

```
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: f2.0.1v1.10.1-previous-644-gde1670a4
Commit hash: de1670a
Launching Web UI with arguments:
Total VRAM 8191 MB, total RAM 16335 MB
pytorch version: 2.4.0+cu124
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3050 : native
Hint: your device supports --cuda-malloc for potential speed improvements.
VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16
CUDA Using Stream: False
D:\AI\Forge\system\python\lib\site-packages\transformers\utils\hub.py:128: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
Using pytorch cross attention
Using pytorch attention for VAE
ControlNet preprocessor location: D:\AI\Forge\webui\models\ControlNetPreprocessor
2025-02-13 20:27:02,860 - ControlNet - INFO - ControlNet UI callback registered.
Model selected: {'checkpoint_info': {'filename': 'D:\AI\Forge\webui\models\Stable-diffusion\flux1-schnell-bnb-nf4.safetensors', 'hash': '7d3d1873'}, 'additional_modules': ['D:\AI\Forge\webui\models\text_encoder\t5xxl_fp8_e4m3fn.safetensors', 'D:\AI\Forge\webui\models\text_encoder\clip_l.safetensors', 'D:\AI\Forge\webui\models\VAE\ae.safetensors'], 'unet_storage_dtype': None}
Using online LoRAs in FP16: False
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
Startup time: 78.8s (prepare environment: 19.2s, launcher: 1.8s, import torch: 37.3s, initialize shared: 1.8s, other imports: 2.0s, setup gfpgan: 0.2s, list SD models: 0.6s, load scripts: 9.0s, create ui: 3.5s, gradio launch: 3.9s).
Model selected: {'checkpoint_info': {'filename': 'D:\AI\Forge\webui\models\Stable-diffusion\flux1-dev-bnb-nf4-v2.safetensors', 'hash': 'f0770152'}, 'additional_modules': ['D:\AI\Forge\webui\models\text_encoder\t5xxl_fp8_e4m3fn.safetensors', 'D:\AI\Forge\webui\models\text_encoder\clip_l.safetensors', 'D:\AI\Forge\webui\models\VAE\ae.safetensors'], 'unet_storage_dtype': None}
Using online LoRAs in FP16: False
Loading Model: {'checkpoint_info': {'filename': 'D:\AI\Forge\webui\models\Stable-diffusion\flux1-dev-bnb-nf4-v2.safetensors', 'hash': 'f0770152'}, 'additional_modules': ['D:\AI\Forge\webui\models\text_encoder\t5xxl_fp8_e4m3fn.safetensors', 'D:\AI\Forge\webui\models\text_encoder\clip_l.safetensors', 'D:\AI\Forge\webui\models\VAE\ae.safetensors'], 'unet_storage_dtype': None}
[Unload] Trying to free all memory for cuda:0 with 0 models keep loaded ... Done.
StateDict Keys: {'transformer': 1722, 'vae': 244, 'text_encoder': 196, 'text_encoder_2': 220, 'ignore': 0}
Using Detected T5 Data Type: torch.float8_e4m3fn
Using Detected UNet Type: nf4
Using pre-quant state dict!
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
K-Model Created: {'storage_dtype': 'nf4', 'computation_dtype': torch.bfloat16}
Model loaded in 22.3s (unload existing model: 0.2s, forge model load: 22.1s).
[LORA] Loaded D:\AI\Forge\webui\models\Lora\Anime_Furry_Style_Flux.safetensors for KModel-UNet with 304 keys at weight 0.7 (skipped 0 keys) with on_the_fly = False
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
[Unload] Trying to free 7723.54 MB for cuda:0 with 0 models keep loaded ... Done.
[Memory Management] Target: JointTextEncoder, Free GPU: 7184.00 MB, Model Require: 5153.49 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: 1006.51 MB, All loaded to GPU.
Moving model(s) has taken 24.04 seconds
Distilled CFG Scale: 3.5
[Unload] Trying to free 9411.13 MB for cuda:0 with 0 models keep loaded ... Current free memory is 1911.42 MB ... Unload model JointTextEncoder Done.
[Memory Management] Target: KModel, Free GPU: 7144.03 MB, Model Require: 6246.84 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: -126.81 MB, CPU Swap Loaded (blocked method): 1435.50 MB, GPU Loaded: 4811.34 MB
Moving model(s) has taken 148.36 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [01:12<00:00, 7.28s/it]
[Unload] Trying to free 4495.77 MB for cuda:0 with 0 models keep loaded ... Current free memory is 2125.69 MB ... Unload model KModel Done.
[Memory Management] Target: IntegratedAutoencoderKL, Free GPU: 7134.06 MB, Model Require: 159.87 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: 5950.19 MB, All loaded to GPU.
Moving model(s) has taken 54.55 seconds
Total progress: 100%|██████████████████████████████████████████████████████████████████| 10/10 [02:20<00:00, 14.03s/it]
Environment vars changed: {'stream': True, 'inference_memory': 1024.0, 'pin_shared_memory': False}
[GPU Setting] You will use 87.50% GPU memory (7167.00 MB) to load weights, and use 12.50% GPU memory (1024.00 MB) to do matrix computation.
Environment vars changed: {'stream': False, 'inference_memory': 1024.0, 'pin_shared_memory': False}
[GPU Setting] You will use 87.50% GPU memory (7167.00 MB) to load weights, and use 12.50% GPU memory (1024.00 MB) to do matrix computation.
[Unload] Trying to free all memory for cuda:0 with 0 models keep loaded ... Current free memory is 6974.41 MB ... Unload model IntegratedAutoencoderKL Done.
[LORA] Loaded D:\AI\Forge\webui\models\Lora\Anime_Furry_Style_Flux.safetensors for KModel-UNet with 304 keys at weight 0.9 (skipped 0 keys) with on_the_fly = False
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
[Unload] Trying to free 7817.77 MB for cuda:0 with 0 models keep loaded ... Done.
[Memory Management] Target: JointTextEncoder, Free GPU: 7135.05 MB, Model Require: 5225.98 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: 885.07 MB, All loaded to GPU.
Moving model(s) has taken 244.25 seconds
Distilled CFG Scale: 3.5
[Unload] Trying to free 9411.08 MB for cuda:0 with 0 models keep loaded ... Current free memory is 1900.58 MB ... Unload model JointTextEncoder Done.
[Memory Management] Target: KModel, Free GPU: 7130.06 MB, Model Require: 6246.80 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: -140.74 MB, CPU Swap Loaded (blocked method): 1435.50 MB, GPU Loaded: 4811.30 MB
Moving model(s) has taken 687.08 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [01:06<00:00, 6.69s/it]
[Unload] Trying to free 4495.77 MB for cuda:0 with 0 models keep loaded ... Current free memory is 2127.72 MB ... Unload model KModel Done.
[Memory Management] Target: IntegratedAutoencoderKL, Free GPU: 7128.09 MB, Model Require: 159.87 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: 5944.22 MB, All loaded to GPU.
Moving model(s) has taken 426.07 seconds
Total progress: 100%|██████████████████████████████████████████████████████████████████| 10/10 [09:14<00:00, 55.42s/it]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 10/10 [09:13<00:00, 5.14s/it]
```

As the log shows, the "Moving model(s)" times grow from 24.04 s / 148.36 s / 54.55 s during the first generation to 244.25 s / 687.08 s / 426.07 s during the second, even though sampling speed stays around 7 s/it. I had a similar problem a few months ago (very long model moving times) and solved it with the "GPU_For_T5" extension (https://github.com/Juqowel/GPU_For_T5) by setting the T5 device to CPU. After a while, though, the problem resolved itself and the extension no longer made any difference.
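For context, my understanding of what that extension does: it keeps the T5 text encoder resident on the CPU so the ~5 GB encoder never takes part in the CPU↔GPU swap, and only the small prompt-embedding tensor is moved to the GPU. A rough sketch of the idea in plain PyTorch/transformers (the model name and prompt are placeholders, not what Forge actually loads):

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Placeholder Hub name for illustration; Forge instead loads t5xxl from a
# local safetensors file.
name = "google/t5-v1_1-xxl"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = T5EncoderModel.from_pretrained(name, torch_dtype=torch.float32)
# Note: no encoder.to("cuda") -- the encoder stays in system RAM.

tokens = tokenizer("an example prompt", return_tensors="pt")
with torch.no_grad():
    emb = encoder(**tokens).last_hidden_state  # computed on the CPU

# Only this small embedding tensor crosses the PCIe bus per generation.
emb = emb.to("cuda")
```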

I have now tried the extension again and it helped. However, I want to find out what causes such a big difference in speed, as I didn't find a direct answer (or didn't understand it) in #1591.
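One thing that might help narrow it down (a diagnostic sketch, not Forge code): since the log shows pin_shared_memory: False and the weights are swapped with the "blocked method", it would be interesting to check whether raw host-to-GPU copy time itself degrades between generations, e.g. because system RAM fills up and Windows starts paging to disk. Something like this times repeated 1 GiB copies with pageable vs. pinned memory (sizes and iteration counts are arbitrary):

```python
import time
import torch

def time_h2d(pinned: bool, size_mb: int = 1024, iters: int = 5) -> float:
    """Average time in seconds for one CPU -> GPU copy of size_mb MiB."""
    n = size_mb * 1024 * 1024 // 4  # number of float32 elements
    src = torch.empty(n, dtype=torch.float32, pin_memory=pinned)
    dst = torch.empty(n, dtype=torch.float32, device="cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src, non_blocking=pinned)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

for pinned in (False, True):
    print(f"pinned={pinned}: {time_h2d(pinned):.3f} s per 1 GiB copy")
```

If the per-copy time grows from run to run, the slowdown would be in the transfer path (pageable RAM, pagefile) rather than in Forge's model-management logic.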
