
Gracefully recover from VRAM out of memory errors (next branch version) #5794

Merged
merged 4 commits into next from bugfix/model-manager2/out-of-memory-handling on Feb 26, 2024

Conversation

lstein
Collaborator

@lstein lstein commented Feb 24, 2024

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update
  • Community Node Submission

Have you discussed this change with the InvokeAI team?

  • Yes
  • No, because: straightforward fix

Have you updated all relevant documentation?

  • Yes
  • No

Description

At least on my system, if the model manager runs out of VRAM while moving a model into the GPU, the partially loaded model gets stuck in VRAM and can't easily be removed. This leaves the model unusable and ties up precious VRAM.

I encountered this while experimenting with large language models on the same system, but I suspect it will also happen if a video game is running. I tried various approaches to recover from this state, including clearing the VRAM cache, deleting the model object, and running garbage collection, but without success. The attempts amount to roughly the calls sketched below.
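
For reference, this is a rough sketch of those recovery attempts, not the actual model manager code; `model` stands in for whatever reference the cache still holds to the partially loaded model:

```python
import gc

import torch

# "model" is a stand-in for the partially loaded model object held by the cache.
model = torch.nn.Linear(8, 8)

# Recovery steps that were tried after a failed load into VRAM; on the
# affected system none of them released the memory held by the partial model.
del model                  # drop the Python reference to the model object
gc.collect()               # force Python garbage collection
torch.cuda.empty_cache()   # return cached CUDA allocations to the driver
```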

This PR avoids the issue by checking for sufficient available VRAM before trying to move a model onto a CUDA GPU. If there is not enough room, it raises a torch.cuda.OutOfMemoryError, and the message is propagated to the front end. If more VRAM becomes available later, invocations begin to work again.
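
A minimal sketch of this kind of pre-flight check, assuming a simple parameter-size estimate of the model's VRAM needs (the function name and size estimate are illustrative, not the actual model manager code):

```python
import torch


def move_model_to_gpu(model: torch.nn.Module,
                      device: torch.device = torch.device("cuda")) -> torch.nn.Module:
    """Move a model onto a CUDA device only if enough free VRAM is available."""
    # Rough estimate of the VRAM the model will need (illustrative only).
    model_size = sum(p.numel() * p.element_size() for p in model.parameters())

    free_vram, _total_vram = torch.cuda.mem_get_info(device)
    if model_size > free_vram:
        # Raise the same error class PyTorch uses for a real CUDA OOM, so
        # callers and the front end handle it the same way.
        raise torch.cuda.OutOfMemoryError(
            f"Insufficient VRAM: model needs ~{model_size} bytes, "
            f"but only {free_vram} bytes are free on {device}."
        )
    return model.to(device)
```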

Note: This pull request is against next. The model manager code has changed a bit, so I'm making a separate PR for main.

Related Tickets & Documents

  • Related Issue #
  • Closes #

QA Instructions, Screenshots, Recordings

Launch the InvokeAI web service alongside another application that uses a lot of GPU VRAM. For my testing, I used ollama with a large model loaded. Run a generation and confirm that it produces an out-of-memory error. Try this repeatedly; you should get the same error each time. Now kill the other application to free up VRAM and generate an image again. It should work!

Merge Plan

Can merge when approved.

Added/updated tests?

  • Yes
  • No: please replace this line with details on why tests have not been included

[optional] Are there any post deployment tasks we need to perform?

@github-actions github-actions bot added the python (PRs that change python files) and backend (PRs that change backend files) labels Feb 24, 2024
@lstein lstein changed the title Bugfix/model manager2/out of memory handling Gracefully recover from VRAM out of memory errors (next branch version) Feb 24, 2024
@psychedelicious psychedelicious merged commit 3ccb4e6 into next Feb 26, 2024
7 of 8 checks passed
@psychedelicious psychedelicious deleted the bugfix/model-manager2/out-of-memory-handling branch February 26, 2024 06:38