Got OOM message with GTX3060 #101

Closed
masahiro-999 opened this issue Dec 9, 2022 · 11 comments
Labels
note:discussion Details up for discussion

Comments

@masahiro-999

I've been trying to run Stable Diffusion on the GPU, but it failed with the OOM message below.

Is this error due to insufficient GPU memory? Is it possible to make it work by adjusting some parameters? Stable Diffusion 1.4 runs on this GPU in a TensorFlow environment, so it would be nice if it worked with Bumblebee too.

It works fine with :host. It's amazing how easy it is to use neural networks with Livebook!

OS: Ubuntu 22.04 on WSL2
GPU: RTX 3060 (12 GB)
Livebook: v0.8.0
Elixir: v1.14.2
XLA_TARGET=cuda111
CUDA Version: 11.7

05:32:56.019 [info] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.

05:32:56.023 [info] XLA service 0x7fb39437dac0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

05:32:56.023 [info]   StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6

05:32:56.023 [info] Using BFC allocator.

05:32:56.023 [info] XLA backend allocating 10641368678 bytes on device 0 for BFCAllocator.

05:32:58.662 [info] Start cannot spawn child process: No such file or directory
05:34:00.234 [info] total_region_allocated_bytes_: 10641368576 memory_limit_: 10641368678 available bytes: 102 curr_region_allocation_bytes_: 21282737664

05:34:00.234 [info] Stats: 
Limit:                     10641368678
InUse:                      5530766592
MaxInUse:                   7566778624
NumAllocs:                        3199
MaxAllocSize:                399769600
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

05:34:00.234 [warn] **********___***********************************************************____________________________

05:34:00.234 [error] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 3546709984 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
             parameter allocation:    3.84GiB
              constant allocation:       144B
        maybe_live_out allocation:   768.0KiB
     preallocated temp allocation:    3.30GiB
  preallocated temp fragmentation:       304B (0.00%)
                 total allocation:    7.15GiB
              total fragmentation:   821.0KiB (0.01%)

The whole log is attached: oommessage.log

@seanmor5
Contributor

seanmor5 commented Dec 9, 2022

We are likely being more inefficient than TensorFlow somewhere. This might be related: elixir-nx/nx#1003

One thing you can try is mixed precision in all of the models:

policy = Axon.MixedPrecision.create_policy(compute: :f16)

# do this for every model
{:ok, %{model: clip_model} = clip} = Bumblebee.load_model({:hf, repository_id, subdir: "text_encoder"})
clip = %{clip | model: Axon.MixedPrecision.apply_policy(clip_model, policy)}

Note that I haven't tested whether this affects the image outputs.

@masahiro-999
Author

I tried code like this, but it didn't help; I got the same OOM message.

policy = Axon.MixedPrecision.create_policy(compute: :f16)

{:ok, clip} =
  Bumblebee.load_model({:hf, repository_id, subdir: "text_encoder"},
    log_params_diff: false
  )
clip = %{clip | model: Axon.MixedPrecision.apply_policy(clip.model, policy)}

{:ok, unet} =
  Bumblebee.load_model({:hf, repository_id, subdir: "unet"},
    params_filename: "diffusion_pytorch_model.bin",
    log_params_diff: false
  )
unet = %{unet | model: Axon.MixedPrecision.apply_policy(unet.model, policy)}

{:ok, vae} =
  Bumblebee.load_model({:hf, repository_id, subdir: "vae"},
    architecture: :decoder,
    params_filename: "diffusion_pytorch_model.bin",
    log_params_diff: false
  )
vae = %{vae | model: Axon.MixedPrecision.apply_policy(vae.model, policy)}

{:ok, safety_checker} =
  Bumblebee.load_model({:hf, repository_id, subdir: "safety_checker"},
    log_params_diff: false
  )
safety_checker = %{safety_checker | model: Axon.MixedPrecision.apply_policy(safety_checker.model, policy)}

@xrd

xrd commented Dec 10, 2022

I see this as well, which is probably expected given that I have only 6 GB.

I will note that I can run things like InvokeAI and do text2img with only 6 GB (and I believe InvokeAI is using the same type of lowered precision to achieve that).

My specs:

 nvidia-smi 
Sat Dec 10 00:00:46 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   38C    P8     6W / 120W |     15MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     14189      G   ...xorg-server-1.20.14/bin/X        9MiB |
|    0   N/A  N/A     14217      G   ...hell-43.1/bin/gnome-shell        2MiB |
+-----------------------------------------------------------------------------+

@masahiro-999
Author

I set the following policy and confirmed that an image can be generated with the host client (not CUDA).

policy =
  Axon.MixedPrecision.create_policy(
    params: {:f, 16},
    compute: {:f, 32},
    output: {:f, 16}
  )

clip = %{clip | model: Axon.MixedPrecision.apply_policy(clip.model, policy)}
unet = %{unet | model: Axon.MixedPrecision.apply_policy(unet.model, policy)}
vae = %{vae | model: Axon.MixedPrecision.apply_policy(vae.model, policy)}
safety_checker = %{
  safety_checker
  | model: Axon.MixedPrecision.apply_policy(safety_checker.model, policy)
}

serving =
  Bumblebee.Diffusion.StableDiffusion.text_to_image(clip, unet, vae, tokenizer, scheduler,
    num_steps: 10,
    num_images_per_prompt: 1,
    safety_checker: safety_checker,
    safety_checker_featurizer: featurizer,
    compile: [batch_size: 1, sequence_length: 50],
    defn_options: [compiler: EXLA]
  )
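
For reference, the serving is then invoked like this (a minimal sketch; the prompt string is just an example):

# the :results list has one entry per generated image
prompt = "numbat, forest, high quality, detailed, digital art"
%{results: [%{image: image}]} = Nx.Serving.run(serving, prompt)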

The OOM still occurs when running on CUDA.

Looking at the peak buffers included in the OOM message, the shapes are f32. Does the policy have no effect, or is this a memory problem unrelated to the policy?

Peak buffers:
	Buffer 1:
		Size: 1.00GiB
		XLA Label: custom-call
		Shape: f32[2,8,4096,4096]
		==========================

	Buffer 2:
		Size: 144.75MiB
		Entry Parameter Subshape: f32[49408,768]
		==========================
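
For comparison, a policy that also lowers the compute type (a sketch of the suggestion above, which I have not verified on this GPU) should make intermediates such as the buffer above materialize in f16 rather than f32:

# Sketch: with compute in {:f, 16}, the f32[2,8,4096,4096] attention buffer
# above would likely be computed in f16, roughly halving its size
# (~512 MiB instead of ~1 GiB).
policy =
  Axon.MixedPrecision.create_policy(
    params: {:f, 16},
    compute: {:f, 16},
    output: {:f, 16}
  )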

@josevalim
Contributor

Yes, it can also be that there are places where we could improve the model's efficiency. There are some PRs in the diffusers repo and some Twitter threads about this.

@seanmor5, do you know what we need to do to generate graphs such as this one? huggingface/diffusers#371

@krainboltgreene

krainboltgreene commented Dec 16, 2022

Forwarded here from the issue above. Is there any way for me to give Bumblebee more of my memory? Do I need to simply increase the amount of memory I have?
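
For instance, would tuning EXLA's client-level allocator options help? A minimal sketch, assuming these options apply to my setup (I have not verified the effect on Stable Diffusion):

# Must run before the CUDA client is first used (e.g. at the top of the notebook).
# :memory_fraction controls how much of the device memory XLA's BFC allocator
# may claim; preallocate: false makes it allocate on demand instead of up front.
Application.put_env(:exla, :clients,
  cuda: [platform: :cuda, memory_fraction: 0.95, preallocate: false],
  host: [platform: :host]
)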

@josevalim
Contributor

You have 4 GB, right? That's currently not enough for SD.

@krainboltgreene

No, the VM I run this on has 8 GB and the GPU has 6 GB.

@jonatanklosko added the note:discussion label on Jan 3, 2023
@josevalim
Contributor

@krainboltgreene, we have some experiments that have brought it down to 5 GB for a single image. We will be publishing them in the coming weeks.

@krainboltgreene

That is incredible. I have been wanting to dive much deeper into how bumblebee/nx work because I would love to contribute even more to the various APIs. Excited to see the source and learn more.

@josevalim
Contributor

Opened #147 with a more principled approach.
