Model Cache API Flaw #7513
Labels: bug (Something isn't working)
RyanJDick added a commit that referenced this issue on Jan 3, 2025: "…e the door was left open for this to happen."
RyanJDick added a commit that referenced this issue on Jan 7, 2025: "…#7522)"

## Summary

This is an unplanned fix between PR3 and PR4 in the sequence of partial loading (i.e. low-VRAM) PRs. This PR restores the 'Current Workaround' documented in #7513. In other words, to work around a flaw in the model cache API, this fix allows models to be loaded into VRAM _even if_ they have been dropped from the RAM cache. This PR also adds an info log each time that this workaround is hit. In a future PR (#7509), we will eliminate the places in the application code that are capable of triggering this condition.

## Related Issues / Discussions

- #7492
- #7494
- #7500
- #7513

## QA Instructions

- Set the RAM cache limit to a small value, e.g. `ram: 4`.
- Run FLUX text-to-image with the full T5 encoder, which exceeds 4GB. This will trigger the error condition.
- Before the fix, this test configuration would cause a `KeyError`. After the fix, we should see an info-level log explaining that the condition was hit, but generation should continue successfully.

## Merge Plan

No special instructions.

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
RyanJDick added a commit that referenced this issue on Jan 7, 2025: "…e the door was left open for this to happen."
RyanJDick added a commit that referenced this issue on Jan 7, 2025:

## Summary

This PR enables RAM/VRAM cache size limits to be determined dynamically based on availability.

**Config Changes**

This PR modifies the app configs in the following ways:

- A new `device_working_mem_gb` config was added. This is the amount of non-model working memory to keep available on the execution device (i.e. GPU) when using dynamic cache limits. It defaults to 3GB.
- The `ram` and `vram` configs now default to `None`. If these configs are set, they will take precedence over the dynamic limits.

**Note: Some users may have previously overridden the `ram` and `vram` values in their `invokeai.yaml`. They will need to remove these configs to enable the new dynamic limit feature.**

**Working Memory**

In addition to the new `device_working_mem_gb` config described above, memory-intensive operations can estimate the amount of working memory that they will need and request it from the model cache. This is currently applied to the VAE decoding step for all models. In the future, we may apply this to other operations as we work out which ops tend to exceed the default working memory reservation.

**Mitigations for #7513**

This PR includes some mitigations for the issue described in #7513. Without these mitigations, it would occur with higher frequency when dynamic RAM limits are used and the RAM is close to maxed out.

## Limitations / Future Work

- Only _models_ can be offloaded to RAM to conserve VRAM. I.e. if VAE decoding requires more working VRAM than is available, the best we can do is keep the full model on the CPU, but we will still hit an OOM error. In the future, we could detect this ahead of time and switch to running inference on the CPU for those ops.
- There is often a non-negligible amount of VRAM 'reserved' by the torch CUDA allocator, but not used by any allocated tensors. We may be able to tune the torch CUDA allocator to work better for our use case. Reference: https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf
- There may be some ops that require high working memory that haven't been updated to request extra memory yet. We will update these as we uncover them.
- If a model is 'locked' in VRAM, it won't be partially unloaded if a later model load requests extra working memory. This should be uncommon, but I can think of cases where it would matter.

## Related Issues / Discussions

- #7492
- #7494
- #7500
- #7505

## QA Instructions

Run a variety of models near the cache limits to ensure that model switching works properly for the following configurations:

- [x] CUDA, `enable_partial_loading=true`, all other configs default (i.e. dynamic memory limits)
- [x] CUDA, `enable_partial_loading=true`, CPU and CUDA memory reserved in another process so there is limited RAM/VRAM remaining, all other configs default (i.e. dynamic memory limits)
- [x] CUDA, `enable_partial_loading=false`, all other configs default (i.e. dynamic memory limits)
- [x] CUDA, ram/vram limits set (these should take precedence over the dynamic limits)
- [x] MPS, all other configs default (i.e. dynamic memory limits)
- [x] CPU, all other configs default (i.e. dynamic memory limits)

## Merge Plan

- [x] Merge #7505 first and change target branch to main

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
This issue is accurate as of v5.5.0.
There is a longstanding flaw in the design of the model cache API. It is the source of several issues and design quirks in the ModelCache, so I am taking the time to document it here so that it can be linked from the source code and referenced by devs.
This issue became particularly evident during the development of the partial model loading feature.
Typical Usage
Typical usage of the model cache API looks like this:
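The snippet below is a minimal sketch of that pattern, assuming the invocation-context API described in this issue (`context.models.load(...)` returns a handle that locks the model into VRAM when used as a context manager); the function, identifier, and helper names are placeholders, not the exact InvokeAI code.

```python
# Minimal sketch of typical model cache usage (identifiers and helpers are
# illustrative placeholders, not the exact InvokeAI API).
def example_invocation(context, model_identifier):
    # load() pulls the model into the RAM cache (if it is not already there)
    # and returns a handle to it.
    loaded_model = context.models.load(model_identifier)

    # Entering the context manager locks the model into VRAM for the duration
    # of the `with` block.
    with loaded_model as model:
        return run_inference(model)
```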
Flaw
The main problem is that after calling `context.models.load(...)`, the ModelCache has no way of knowing whether the caller still intends to lock the returned model into VRAM, or whether the model can safely be ejected from the RAM cache to make room for other models.
As a result of this flaw, the following reasonable-looking code could fail:
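The sketch below is a hypothetical example of such an access pattern; the identifiers and helper functions are invented for illustration.

```python
# Hypothetical illustration of the failing access pattern (identifiers are
# invented for this example).
def example_invocation(context, model_a_id, model_b_id):
    # Load both models up front. Each load() may evict other models from the
    # RAM cache to make room.
    loaded_a = context.models.load(model_a_id)
    loaded_b = context.models.load(model_b_id)  # may eject model A from the RAM cache

    # By the time model A is locked into VRAM, its cache entry may already be
    # gone, and entering the context manager raises a KeyError.
    with loaded_a as model_a:
        intermediate = run_stage_one(model_a)

    with loaded_b as model_b:
        return run_stage_two(model_b, intermediate)
```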
If this flaw is not handled carefully, locking the model into VRAM fails with a `KeyError` raised from the model cache.
Current Workaround
The current workaround for this issue is to allow models to be locked in VRAM even if they have been dropped from the RAM cache.
While this works, it is problematic in the context of partial model loading, because the model cache does not have the ability to partially unload this model if it needs more VRAM.
Other Workarounds
Increase
ram
cache sizeIncreasing the
ram
config ininvokeai.yaml
increases the size of the RAM cache and decreases the probability that a model will get ejected between loading to RAM and loading to VRAM.Avoid the offending model access pattern
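For example, a hypothetical `invokeai.yaml` fragment that raises the limit might look like the following (the value shown is illustrative, not a recommendation):

```yaml
# invokeai.yaml (fragment) — the value below is an illustrative example only.
ram: 12  # RAM cache size in GB; a larger cache makes eviction between load and lock less likely
```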
Avoid the offending model access pattern
Of course, avoiding the offending model access pattern in invocation code will prevent this issue from being encountered.
In practice, it is sometimes inconvenient and less performant to do this, i.e. it may require accessing the model info from the DB multiple times or loading the model onto the CPU multiple times.
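A hypothetical sketch of what this safer pattern can look like (identifiers and helpers are invented; the trade-off is the extra load if a model is needed again later):

```python
# Hypothetical sketch of the safer access pattern: finish using one model
# before loading the next (identifiers are invented for this example).
def example_invocation(context, model_a_id, model_b_id):
    loaded_a = context.models.load(model_a_id)
    with loaded_a as model_a:
        intermediate = run_stage_one(model_a)
    del loaded_a  # drop our handle before loading the next model

    # Model B is loaded only after we are done with model A, so model A cannot
    # be ejected between its load() and its lock. The cost: if model A is
    # needed again later, it has to be loaded a second time.
    loaded_b = context.models.load(model_b_id)
    with loaded_b as model_b:
        return run_stage_two(model_b, intermediate)
```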
Reference Counting
In the past, we tried to work around this flaw in the API design by doing manual reference counting on the CPU-loaded models to determine which could be safely dropped from the RAM cache. This approach was very error-prone and the source of several bugs. It also placed some burden on the invocation author to understand how this reference counting was happening under the hood if they wanted to achieve optimal memory behaviour.
Handle in `ModelCache.lock()` or `LoadedModelWithoutConfig`
It is tempting to work around this issue by catching the `KeyError` and re-inserting the model into the cache (since we still have a reference to it). This feels risky: you would then have multiple `CacheRecord`s floating around that reference the same underlying model.
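A hypothetical sketch of that tempting workaround, written at the call-site level; the cache method names (`lock`, `put`) are illustrative and may not match the real `ModelCache` API:

```python
# Hypothetical sketch of the "catch and re-insert" workaround (cache method
# names are illustrative, not necessarily the real ModelCache API).
try:
    cache.lock(cache_key)
except KeyError:
    # The model was dropped from the RAM cache, but the caller still holds a
    # reference to it, so it can be put back...
    cache.put(cache_key, model)
    # ...which creates a second CacheRecord for the same underlying model, so
    # two records can now disagree about its size and device placement.
    cache.lock(cache_key)
```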
Solution
A proper solution to this problem should achieve the following: