-
You can do this the "Forge" way: just create a folder and put in a script like this:

```python
import torch
import gradio as gr

from modules import scripts


class T5onOtherDevice(scripts.Script):
    def title(self):
        return "T5 on Other Device"

    def show(self, is_img2img):
        return scripts.AlwaysVisible

    def ui(self, *args, **kwargs):
        with gr.Accordion(open=False, label=self.title()):
            enabled = gr.Checkbox(label='Enabled', value=False)
        return enabled

    def process(self, p, *script_args, **kwargs):
        self.enabled = script_args[0]  # unpack the checkbox value from the args tuple
        if not self.enabled:
            # TODO: add some code to revert the change
            return
        p.sd_model.forge_objects.clip.patcher.load_device = torch.device('cuda:1')
        return
```

But this code may need a bit more testing.
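For anyone trying this: assuming the usual Forge/A1111 extension layout (the folder and file names below are just examples, not prescribed by the source), the script above would be saved under an extension's `scripts` directory so Forge discovers it:

```shell
# Hypothetical layout; Forge scans extensions/*/scripts/ for Script subclasses
mkdir -p extensions/t5-on-other-device/scripts
# then save the code above as:
#   extensions/t5-on-other-device/scripts/t5_on_other_device.py
```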
-
Really useful extension. On dev nf4 (RTX 4070, with max VRAM for the model set to 9500) there is a great speedup if the model doesn't change between generations.

On dev Q8 GGUF (same GPU) it works too (before/after screenshots omitted), but it seems to have some issues with memory management. When I change models, like in an X/Y/Z plot, and run some generations/tests, my memory explodes. VRAM is constant, no issue there, capped at 9000 on my 4070, but RAM is another story. I've set up some virtual memory to be sure I can handle the model in q8_0: I have 32 GB of physical RAM and 60 GB of virtual memory on an NVMe SSD. Without the extension there are no issues; sometimes RAM usage goes up to 55 GB but no more. With the extension it very often goes up to 90 GB and crashes Forge.
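A hedged guess at the cause: if each model swap leaves a lingering reference to the previous checkpoint (in a cache, a script attribute, etc.), every swap accumulates another copy in RAM. A minimal, framework-free sketch of the reference hygiene involved (`FakeModel` and `load_model` are invented for illustration, not Forge's actual classes):

```python
import gc
import weakref


class FakeModel:
    """Stand-in for a large checkpoint held in RAM (hypothetical)."""
    def __init__(self):
        self.weights = bytearray(1024)  # pretend this is gigabytes


cache = {}

def load_model(name):
    # Drop the cached reference to the old model before loading the new one;
    # if the cache kept both, every swap would accumulate another checkpoint.
    cache.pop("current", None)
    cache["current"] = FakeModel()
    return cache["current"]


model = load_model("q8_0")
probe = weakref.ref(model)   # watch whether the first model gets freed
model = load_model("nf4")    # swap models, as an X/Y/Z plot does
gc.collect()
assert probe() is None       # the old checkpoint was actually reclaimed
```

If something (an undo list, a stale attribute) kept `probe()` alive here, RAM usage would grow with every swap, which matches the symptom described above.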
-
I already did a simple but VERY BAD implementation for myself (for a 4060 Ti (16 GB) and a 3060 (12 GB)). It saves a lot of time. It looks like this:
First load, single GPU:
![first - single](https://private-user-images.githubusercontent.com/130811765/362989130-cecedb43-9e65-499e-a33e-bf36c1ecef5e.PNG?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk5Nzg1MTQsIm5iZiI6MTczOTk3ODIxNCwicGF0aCI6Ii8xMzA4MTE3NjUvMzYyOTg5MTMwLWNlY2VkYjQzLTllNjUtNDk5ZS1hMzNlLWJmMzZjMWVjZWY1ZS5QTkc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjE5JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxOVQxNTE2NTRaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0zMDViOTRmZDM4MzllZDI0ZDI1MTVlYmNjYTE4OTQzODIzOTY3MDhkOWVkYTY1MTUwMGUxZjkwZTIxNzg1ZDk0JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.VFfPw2oI5Ld18OsAh2YV5KJLh1ZcFYQv1o3kACpnRiI)
Second+ load with prompt changed, single GPU:
![second - single](https://private-user-images.githubusercontent.com/130811765/362989244-10474c28-a0b8-49e5-a061-e2df57e37275.PNG?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk5Nzg1MTQsIm5iZiI6MTczOTk3ODIxNCwicGF0aCI6Ii8xMzA4MTE3NjUvMzYyOTg5MjQ0LTEwNDc0YzI4LWEwYjgtNDllNS1hMDYxLWUyZGY1N2UzNzI3NS5QTkc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjE5JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxOVQxNTE2NTRaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT04YjE2MDYwMmVhMjQ5NDI4YzYzYjcwYzYxYTRjZWFkZWMyNDNmYjhjMGNmZDYxZDMyOWNiOTU3ZWE4MGQ1YTlhJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.FMcsBL1GalCtEVp8k68a9dJXdprllHgCPL0p-m5YDrE)
First load, dual GPU:
![first - dual](https://private-user-images.githubusercontent.com/130811765/362990617-4b2ce90c-d5f7-477b-9656-a98624a9852b.PNG?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk5Nzg1MTQsIm5iZiI6MTczOTk3ODIxNCwicGF0aCI6Ii8xMzA4MTE3NjUvMzYyOTkwNjE3LTRiMmNlOTBjLWQ1ZjctNDc3Yi05NjU2LWE5ODYyNGE5ODUyYi5QTkc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjE5JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxOVQxNTE2NTRaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0xN2JjYTk5YjA1ZGNlMzhhNTg5YzZjZjIzMWEyZWMzNTE1OWVmZTYwZjdmMjM1NmNhNzVlZjE4ZTQ2YzA2ZWJmJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.ZCZsVDytIGhx28lwPHOEsA4mdhKKy04s793kJXkVc_s)
Second+ load with prompt changed, dual GPU:
![second - dual](https://private-user-images.githubusercontent.com/130811765/362990708-e601c1cc-ac90-4138-a946-d19df4ad5d98.PNG?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk5Nzg1MTQsIm5iZiI6MTczOTk3ODIxNCwicGF0aCI6Ii8xMzA4MTE3NjUvMzYyOTkwNzA4LWU2MDFjMWNjLWFjOTAtNDEzOC1hOTQ2LWQxOWRmNGFkNWQ5OC5QTkc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjE5JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxOVQxNTE2NTRaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0wNTBmNzcwOThmMjkwMmZjYTFhZTkyMmMzNmQ1OWM2OGIzODI5OWY2NjBlZDFmZGFjMjQ0YTVmYzg2YzVjMGNjJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.kK01sipPkUpWJb4lJaSENTPo4IYOdPdk83ad4Q-f9Nk)
I think it would be great if this feature were added for everyone, with a GOOD implementation.
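A GOOD implementation would presumably plan device placement instead of hard-coding `cuda:1`. As a rough, framework-free sketch (the `plan_placement` helper and the component sizes are invented for illustration, not Forge's API), a greedy planner could place the largest components first on the GPU with the most free memory:

```python
def plan_placement(components, free_mem):
    """components: {name: size in MB}; free_mem: {device: MB}. Returns {name: device}."""
    placement = {}
    budget = dict(free_mem)
    # Place the biggest components first so they get first pick of VRAM.
    for name, size in sorted(components.items(), key=lambda kv: -kv[1]):
        device = max(budget, key=budget.get)  # GPU with the most room left
        if budget[device] < size:
            device = "cpu"                    # spill to system RAM when nothing fits
        else:
            budget[device] -= size
        placement[name] = device
    return placement


# Example with made-up Flux-like component sizes on a 16 GB + 12 GB pair:
plan = plan_placement(
    {"unet": 11000, "t5": 9000, "clip_l": 250, "vae": 300},
    {"cuda:0": 14000, "cuda:1": 11000},
)
# unet lands on cuda:0, t5 on cuda:1, the small encoders fill the gaps
```

This mirrors what the manual patch above does by hand: the heavy T5 encoder ends up on the second GPU while the diffusion model keeps the first one.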