Got OOM message with GTX3060 #101

Closed
masahiro-999 opened this issue Dec 9, 2022 · 11 comments
Labels
note:discussion Details up for discussion

Comments

@masahiro-999

I've been trying to run Stable Diffusion on the GPU, but it failed with the OOM message below.

Is this error due to insufficient GPU memory? Is it possible to make it work by adjusting some parameters? Stable Diffusion 1.4 runs on this GPU in a TensorFlow environment, so it would be nice if it worked with Bumblebee too.

It works fine with :host. It's amazing how easy it is to use neural networks with Livebook!

OS: Ubuntu 22.04 on WSL2
GPU: RTX 3060 (12 GB)
Livebook: v0.8.0
Elixir: v1.14.2
XLA_TARGET=cuda111
CUDA Version: 11.7

05:32:56.019 [info] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.

05:32:56.023 [info] XLA service 0x7fb39437dac0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

05:32:56.023 [info]   StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6

05:32:56.023 [info] Using BFC allocator.

05:32:56.023 [info] XLA backend allocating 10641368678 bytes on device 0 for BFCAllocator.

05:32:58.662 [info] Start cannot spawn child process: No such file or directory
05:34:00.234 [info] total_region_allocated_bytes_: 10641368576 memory_limit_: 10641368678 available bytes: 102 curr_region_allocation_bytes_: 21282737664

05:34:00.234 [info] Stats: 
Limit:                     10641368678
InUse:                      5530766592
MaxInUse:                   7566778624
NumAllocs:                        3199
MaxAllocSize:                399769600
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

05:34:00.234 [warn] **********___***********************************************************____________________________

05:34:00.234 [error] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 3546709984 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
             parameter allocation:    3.84GiB
              constant allocation:       144B
        maybe_live_out allocation:   768.0KiB
     preallocated temp allocation:    3.30GiB
  preallocated temp fragmentation:       304B (0.00%)
                 total allocation:    7.15GiB
              total fragmentation:   821.0KiB (0.01%)

The whole log is attached: oommessage.log

@seanmor5
Contributor

seanmor5 commented Dec 9, 2022

We are likely being more inefficient than TensorFlow somewhere. This might be related: elixir-nx/nx#1003

One thing you can try is mixed precision in all of the models:

policy = Axon.MixedPrecision.create_policy(compute: :f16)

# do this for every model
{:ok, %{model: clip_model} = clip} = Bumblebee.load_model({:hf, repository_id, subdir: "text_encoder"})
clip = %{clip | model: Axon.MixedPrecision.apply_policy(clip_model, policy)}

Note that I haven't tested whether this affects the image outputs.

@masahiro-999
Author

I tried code like this, but it didn't help; I got the same OOM message.

policy = Axon.MixedPrecision.create_policy(compute: :f16)

{:ok, clip} =
  Bumblebee.load_model({:hf, repository_id, subdir: "text_encoder"},
    log_params_diff: false
  )
clip = %{clip | model: Axon.MixedPrecision.apply_policy(clip.model, policy)}

{:ok, unet} =
  Bumblebee.load_model({:hf, repository_id, subdir: "unet"},
    params_filename: "diffusion_pytorch_model.bin",
    log_params_diff: false
  )
unet = %{unet | model: Axon.MixedPrecision.apply_policy(unet.model, policy)}

{:ok, vae} =
  Bumblebee.load_model({:hf, repository_id, subdir: "vae"},
    architecture: :decoder,
    params_filename: "diffusion_pytorch_model.bin",
    log_params_diff: false
  )
vae = %{vae | model: Axon.MixedPrecision.apply_policy(vae.model, policy)}

{:ok, safety_checker} =
  Bumblebee.load_model({:hf, repository_id, subdir: "safety_checker"},
    log_params_diff: false
  )
safety_checker = %{safety_checker | model: Axon.MixedPrecision.apply_policy(safety_checker.model, policy)}

@xrd

xrd commented Dec 10, 2022

I see this as well, which is probably expected given that I have only 6 GB.

I will note that I can run things like InvokeAI and do text2img with only 6 GB (and I believe InvokeAI is using the same type of lowered precision to achieve that).

My specs:

 nvidia-smi 
Sat Dec 10 00:00:46 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   38C    P8     6W / 120W |     15MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     14189      G   ...xorg-server-1.20.14/bin/X        9MiB |
|    0   N/A  N/A     14217      G   ...hell-43.1/bin/gnome-shell        2MiB |
+-----------------------------------------------------------------------------+

@masahiro-999
Author

I set the following policy and confirmed that an image can be generated with the host client (not CUDA).

policy =
  Axon.MixedPrecision.create_policy(
    params: {:f, 16},
    compute: {:f, 32},
    output: {:f, 16}
  )

clip = %{clip | model: Axon.MixedPrecision.apply_policy(clip.model, policy)}
unet = %{unet | model: Axon.MixedPrecision.apply_policy(unet.model, policy)}
vae = %{vae | model: Axon.MixedPrecision.apply_policy(vae.model, policy)}
safety_checker = %{
  safety_checker
  | model: Axon.MixedPrecision.apply_policy(safety_checker.model, policy)
}

serving =
  Bumblebee.Diffusion.StableDiffusion.text_to_image(clip, unet, vae, tokenizer, scheduler,
    num_steps: 10,
    num_images_per_prompt: 1,
    safety_checker: safety_checker,
    safety_checker_featurizer: featurizer,
    compile: [batch_size: 1, sequence_length: 50],
    defn_options: [compiler: EXLA]
  )
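
For reference, the serving is then invoked like this (a minimal sketch; the prompt string is just an example):

# the :results list has one entry per generated image
prompt = "numbat, forest, high quality, detailed, digital art"
%{results: [%{image: image}]} = Nx.Serving.run(serving, prompt)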

The OOM still occurs when running on CUDA.

Looking at the peak buffers included in the OOM message, the shapes are f32. Does the policy have no effect, or is this a memory problem unrelated to the policy?

Peak buffers:
	Buffer 1:
		Size: 1.00GiB
		XLA Label: custom-call
		Shape: f32[2,8,4096,4096]
		==========================

	Buffer 2:
		Size: 144.75MiB
		Entry Parameter Subshape: f32[49408,768]
		==========================
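
For comparison, a policy that also lowers the compute type (a sketch of the suggestion above, which I have not verified on this GPU) should make intermediates such as the buffer above materialize in f16 rather than f32:

# Sketch: with compute in {:f, 16}, the f32[2,8,4096,4096] attention buffer
# above would likely be computed in f16, roughly halving its size
# (~512 MiB instead of ~1 GiB).
policy =
  Axon.MixedPrecision.create_policy(
    params: {:f, 16},
    compute: {:f, 16},
    output: {:f, 16}
  )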

@josevalim
Contributor

Yes, it can also be that there are places where we could improve the model's efficiency. There are some PRs in the diffusers repo and some Twitter threads about this.

@seanmor5, do you know what we need to do to generate graphs such as this one? huggingface/diffusers#371

@krainboltgreene

krainboltgreene commented Dec 16, 2022

Forwarded here from the issue above. Is there any way for me to give Bumblebee more of my memory? Do I need to simply increase the amount of memory I have?
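
For instance, would tuning EXLA's client-level allocator options help? A minimal sketch, assuming these options apply to my setup (I have not verified the effect on Stable Diffusion):

# Must run before the CUDA client is first used (e.g. at the top of the notebook).
# :memory_fraction controls how much of the device memory XLA's BFC allocator
# may claim; preallocate: false makes it allocate on demand instead of up front.
Application.put_env(:exla, :clients,
  cuda: [platform: :cuda, memory_fraction: 0.95, preallocate: false],
  host: [platform: :host]
)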

@josevalim
Contributor

You have 4 GB, right? That's currently not enough for SD.

@krainboltgreene

No, the VM I run this on has 8 GB and the GPU has 6 GB.

@jonatanklosko added the note:discussion label on Jan 3, 2023
@josevalim
Contributor

@krainboltgreene, we have some experiments that have brought it down to 5 GB for a single image. We will be publishing them in the coming weeks.

@krainboltgreene

That is incredible. I have been wanting to dive much deeper into how bumblebee/nx work because I would love to contribute even more to the various APIs. Excited to see the source and learn more.

@josevalim
Contributor

Opened #147 with a more principled approach.
