I've noticed that when using Flux models, the model transfer time keeps getting longer with each new generation.
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: f2.0.1v1.10.1-previous-644-gde1670a4
Commit hash: de1670a
Launching Web UI with arguments:
Total VRAM 8191 MB, total RAM 16335 MB
pytorch version: 2.4.0+cu124
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3050 : native
Hint: your device supports --cuda-malloc for potential speed improvements.
VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16
CUDA Using Stream: False
D:\AI\Forge\system\python\lib\site-packages\transformers\utils\hub.py:128: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
Using pytorch cross attention
Using pytorch attention for VAE
ControlNet preprocessor location: D:\AI\Forge\webui\models\ControlNetPreprocessor
2025-02-13 20:27:02,860 - ControlNet - INFO - ControlNet UI callback registered.
Model selected: {'checkpoint_info': {'filename': 'D:\AI\Forge\webui\models\Stable-diffusion\flux1-schnell-bnb-nf4.safetensors', 'hash': '7d3d1873'}, 'additional_modules': ['D:\AI\Forge\webui\models\text_encoder\t5xxl_fp8_e4m3fn.safetensors', 'D:\AI\Forge\webui\models\text_encoder\clip_l.safetensors', 'D:\AI\Forge\webui\models\VAE\ae.safetensors'], 'unet_storage_dtype': None}
Using online LoRAs in FP16: False
Running on local URL: http://127.0.0.1:7860
To create a public link, set share=True in launch().
Startup time: 78.8s (prepare environment: 19.2s, launcher: 1.8s, import torch: 37.3s, initialize shared: 1.8s, other imports: 2.0s, setup gfpgan: 0.2s, list SD models: 0.6s, load scripts: 9.0s, create ui: 3.5s, gradio launch: 3.9s).
Model selected: {'checkpoint_info': {'filename': 'D:\AI\Forge\webui\models\Stable-diffusion\flux1-dev-bnb-nf4-v2.safetensors', 'hash': 'f0770152'}, 'additional_modules': ['D:\AI\Forge\webui\models\text_encoder\t5xxl_fp8_e4m3fn.safetensors', 'D:\AI\Forge\webui\models\text_encoder\clip_l.safetensors', 'D:\AI\Forge\webui\models\VAE\ae.safetensors'], 'unet_storage_dtype': None}
Using online LoRAs in FP16: False
Loading Model: {'checkpoint_info': {'filename': 'D:\AI\Forge\webui\models\Stable-diffusion\flux1-dev-bnb-nf4-v2.safetensors', 'hash': 'f0770152'}, 'additional_modules': ['D:\AI\Forge\webui\models\text_encoder\t5xxl_fp8_e4m3fn.safetensors', 'D:\AI\Forge\webui\models\text_encoder\clip_l.safetensors', 'D:\AI\Forge\webui\models\VAE\ae.safetensors'], 'unet_storage_dtype': None}
[Unload] Trying to free all memory for cuda:0 with 0 models keep loaded ... Done.
StateDict Keys: {'transformer': 1722, 'vae': 244, 'text_encoder': 196, 'text_encoder_2': 220, 'ignore': 0}
Using Detected T5 Data Type: torch.float8_e4m3fn
Using Detected UNet Type: nf4
Using pre-quant state dict!
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
K-Model Created: {'storage_dtype': 'nf4', 'computation_dtype': torch.bfloat16}
Model loaded in 22.3s (unload existing model: 0.2s, forge model load: 22.1s).
[LORA] Loaded D:\AI\Forge\webui\models\Lora\Anime_Furry_Style_Flux.safetensors for KModel-UNet with 304 keys at weight 0.7 (skipped 0 keys) with on_the_fly = False
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
[Unload] Trying to free 7723.54 MB for cuda:0 with 0 models keep loaded ... Done.
[Memory Management] Target: JointTextEncoder, Free GPU: 7184.00 MB, Model Require: 5153.49 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: 1006.51 MB, All loaded to GPU.
Moving model(s) has taken 24.04 seconds
Distilled CFG Scale: 3.5
[Unload] Trying to free 9411.13 MB for cuda:0 with 0 models keep loaded ... Current free memory is 1911.42 MB ... Unload model JointTextEncoder Done.
[Memory Management] Target: KModel, Free GPU: 7144.03 MB, Model Require: 6246.84 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: -126.81 MB, CPU Swap Loaded (blocked method): 1435.50 MB, GPU Loaded: 4811.34 MB
Moving model(s) has taken 148.36 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [01:12<00:00, 7.28s/it]
[Unload] Trying to free 4495.77 MB for cuda:0 with 0 models keep loaded ... Current free memory is 2125.69 MB ... Unload model KModel Done.
[Memory Management] Target: IntegratedAutoencoderKL, Free GPU: 7134.06 MB, Model Require: 159.87 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: 5950.19 MB, All loaded to GPU.
Moving model(s) has taken 54.55 seconds
Total progress: 100%|██████████████████████████████████████████████████████████████████| 10/10 [02:20<00:00, 14.03s/it]
Environment vars changed: {'stream': True, 'inference_memory': 1024.0, 'pin_shared_memory': False}
[GPU Setting] You will use 87.50% GPU memory (7167.00 MB) to load weights, and use 12.50% GPU memory (1024.00 MB) to do matrix computation.
Environment vars changed: {'stream': False, 'inference_memory': 1024.0, 'pin_shared_memory': False}
[GPU Setting] You will use 87.50% GPU memory (7167.00 MB) to load weights, and use 12.50% GPU memory (1024.00 MB) to do matrix computation.
[Unload] Trying to free all memory for cuda:0 with 0 models keep loaded ... Current free memory is 6974.41 MB ... Unload model IntegratedAutoencoderKL Done.
[LORA] Loaded D:\AI\Forge\webui\models\Lora\Anime_Furry_Style_Flux.safetensors for KModel-UNet with 304 keys at weight 0.9 (skipped 0 keys) with on_the_fly = False
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
[Unload] Trying to free 7817.77 MB for cuda:0 with 0 models keep loaded ... Done.
[Memory Management] Target: JointTextEncoder, Free GPU: 7135.05 MB, Model Require: 5225.98 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: 885.07 MB, All loaded to GPU.
Moving model(s) has taken 244.25 seconds
Distilled CFG Scale: 3.5
[Unload] Trying to free 9411.08 MB for cuda:0 with 0 models keep loaded ... Current free memory is 1900.58 MB ... Unload model JointTextEncoder Done.
[Memory Management] Target: KModel, Free GPU: 7130.06 MB, Model Require: 6246.80 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: -140.74 MB, CPU Swap Loaded (blocked method): 1435.50 MB, GPU Loaded: 4811.30 MB
Moving model(s) has taken 687.08 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [01:06<00:00, 6.69s/it]
[Unload] Trying to free 4495.77 MB for cuda:0 with 0 models keep loaded ... Current free memory is 2127.72 MB ... Unload model KModel Done.
[Memory Management] Target: IntegratedAutoencoderKL, Free GPU: 7128.09 MB, Model Require: 159.87 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: 5944.22 MB, All loaded to GPU.
Moving model(s) has taken 426.07 seconds
Total progress: 100%|██████████████████████████████████████████████████████████████████| 10/10 [09:14<00:00, 55.42s/it]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 10/10 [09:13<00:00, 5.14s/it]
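Looking at the timings, sampling itself stays stable (7.28 s/it on the first run, 6.69 s/it on the second), but "Moving model(s)" grows from 24 s and 148 s on the first generation to 244 s, 687 s and 426 s on the second. One guess: with only 16 GB of RAM and pin_shared_memory: False, every swap copies weights through pageable host memory, and pageable copies can degrade badly once system RAM is under pressure. A minimal sketch of the pageable-vs-pinned difference (plain PyTorch, arbitrary sizes, not Forge code):

```python
# Micro-benchmark: pageable vs. pinned host->GPU copy.
# Plain PyTorch sketch, not Forge code; the ~1 GB size is arbitrary.
import time
import torch

def bench_copy(pin_memory: bool, mb: int = 1024) -> float:
    # Allocate ~`mb` MB of fp16 on the host (2 bytes per element).
    host = torch.empty(mb * 1024 * 1024 // 2, dtype=torch.float16,
                       pin_memory=pin_memory)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    host.to("cuda", non_blocking=True)  # non_blocking only helps when pinned
    torch.cuda.synchronize()
    return time.perf_counter() - t0

print(f"pageable: {bench_copy(False):.3f}s  pinned: {bench_copy(True):.3f}s")
```

This doesn't reproduce the progressive slowdown by itself, but it shows the transfer path the "Moving model(s)" step depends on.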
I had a similar problem a few months ago (models taking very long to move) and solved it with the "gpu_for_t5" extension (https://github.com/Juqowel/GPU_For_T5) by setting T5 to CPU. After a while, though, the problem resolved itself, and the extension no longer made any difference.
I have now tried the extension again and it helped. However, I want to find out what causes such a big difference in speed, as I didn't find a direct answer (or didn't understand it) in #1591.
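For context, what that extension changes is where the text encoder lives, so the ~5 GB JointTextEncoder stops being shuffled on and off the GPU before every generation. A rough sketch of the same idea in plain transformers (this is NOT the extension's code, and the checkpoint name is only an example):

```python
# Sketch of the "keep T5 on CPU" idea behind GPU_For_T5 (illustration only;
# the checkpoint name is an example, not what Forge loads).
import torch
from transformers import T5EncoderModel, T5Tokenizer

tok = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
enc = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")  # fp32 on CPU
enc.to("cpu").eval()  # parked on the CPU once; never swapped again

with torch.no_grad():
    ids = tok("a prompt", return_tensors="pt")
    emb = enc(**ids).last_hidden_state  # encoding runs on the CPU
emb = emb.to("cuda")  # only the small embedding tensor crosses the bus
```

Encoding on CPU is slower per call, but it avoids the repeated multi-gigabyte encoder moves that dominate the log above.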