
Model Cache API Flaw #7513

Open
RyanJDick opened this issue Jan 3, 2025 · 0 comments
Labels: bug Something isn't working

RyanJDick commented Jan 3, 2025

This issue is accurate as of v5.5.0.

There is a longstanding flaw in the design of the model cache API. It is the source of several issues and design quirks in the ModelCache, so I am taking the time to document it here so that it can be linked from the source code and referenced by devs.

This issue became particularly evident during the development of the partial model loading feature.

Typical Usage

Typical usage of the model cache API looks like this:

class SampleInvocation(BaseInvocation):
    model: ModelField = InputField(...)

    def invoke(self, context: InvocationContext):
        # context.models.load(...) does the following:
        # 1. Looks up the model info from the DB.
        # 2. Loads the model onto the CPU, using the ModelCache.
        # 3. Returns a `LoadedModel` containing both the model metadata and the CPU-loaded model.
        model_info = context.models.load(self.model.model_id)

        # A context manager is used to load the model into VRAM using the
        # model cache and 'lock' it there for the lifetime of the context
        # manager.
        with model_info.model_on_device() as (cached_weights, model):
        # Or using the deprecated API:
        # with model_info as model:
            y = model.forward(...)

Flaw

The main problem is that, after calling context.models.load(...), the ModelCache has no way of knowing whether:

  1. this model needs to be kept in the RAM cache because it will soon be loaded into VRAM, or
  2. this model is only used on the CPU and can be freed from the RAM cache to make room for other models.

As a result of this flaw, the following reasonable-looking code could fail:

class SampleInvocation(BaseInvocation):
    model_1: ModelField = InputField(...)
    model_2: ModelField = InputField(...)

    def invoke(self, context: InvocationContext):
        model_1_info = context.models.load(self.model_1.model_id)
        # Loading model_2 into the CPU could cause model_1 to be ejected
        # from the RAM cache to make room for model_2.
        model_2_info = context.models.load(self.model_2.model_id)

        # If model_1 was ejected from the RAM cache, then attempting to
        # load it into VRAM will fail.
        with model_1_info as model_1, model_2_info as model_2:
            # Do something with model_1 and model_2...
            ...

If this flaw is not handled carefully, the following error will occur:

[2025-01-03 10:25:25,390]::[InvokeAI]::ERROR --> Error while invoking session 93e35ea6-4ed8-4535-b7ff-f4d6d3918afd, invocation 36faba26-ce91-483f-b7fe-59bcda83e07f (compel): 'a67e6e91-6b2a-4d38-b69a-db8c27ee98a8:tokenizer'
[2025-01-03 10:25:25,390]::[InvokeAI]::ERROR --> Traceback (most recent call last):
  File "/home/ryan/src/InvokeAI/invokeai/app/services/session_processor/session_processor_default.py", line 129, in run_node
    output = invocation.invoke_internal(context=context, services=self._services)
  File "/home/ryan/src/InvokeAI/invokeai/app/invocations/baseinvocation.py", line 300, in invoke_internal
    output = self.invoke(context)
  File "/home/ryan/.pyenv/versions/3.10.12/envs/InvokeAI_3.10.12/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/ryan/src/InvokeAI/invokeai/app/invocations/compel.py", line 81, in invoke
    with (
  File "/home/ryan/src/InvokeAI/invokeai/backend/model_manager/load/load_base.py", line 60, in __enter__
    self._cache.lock(self._cache_record.key, None)
  File "/home/ryan/src/InvokeAI/invokeai/backend/model_manager/load/model_cache/model_cache.py", line 205, in lock
    cache_entry = self._cached_models[key]
KeyError: 'a67e6e91-6b2a-4d38-b69a-db8c27ee98a8:tokenizer'

Current Workaround

The current workaround for this issue is to allow models to be locked in VRAM even if they have been dropped from the RAM cache.

While this works, it is problematic in the context of partial model loading, because the model cache does not have the ability to partially unload such a model if it later needs more VRAM.
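
As a rough illustration, the workaround amounts to something like the following (a simplified sketch, not the actual InvokeAI ModelCache code; the CacheRecord fields and the lock() signature are assumptions):

```python
# Simplified sketch of the workaround -- illustrative only.
import logging
from dataclasses import dataclass
from typing import Any, Dict

logger = logging.getLogger(__name__)


@dataclass
class CacheRecord:
    key: str
    model: Any
    is_locked: bool = False


class ModelCache:
    def __init__(self) -> None:
        self._cached_models: Dict[str, CacheRecord] = {}

    def lock(self, cache_record: CacheRecord) -> None:
        if cache_record.key not in self._cached_models:
            # The model was dropped from the RAM cache between load() and
            # lock(). The caller still holds a reference to it, so we lock it
            # anyway -- but the cache can no longer partially unload this
            # model if it later needs to reclaim VRAM.
            logger.info(
                "Locking model '%s' that is no longer tracked by the RAM cache.",
                cache_record.key,
            )
        cache_record.is_locked = True
        # ...move the model onto the execution device here...
```

The important detail is that lock() has to work from the record the caller still holds rather than only from a key lookup, since the key lookup is exactly what raises the KeyError in the traceback above.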

Other Workarounds

Increase the RAM cache size

Increasing the `ram` config in `invokeai.yaml` increases the size of the RAM cache and decreases the probability that a model will be ejected between being loaded into RAM and being loaded into VRAM.

Avoid the offending model access pattern

Of course, avoiding the offending model access pattern in invocation code will prevent this issue from being encountered.

In practice, this is sometimes inconvenient and less performant; for example, it may require looking up the model info in the DB multiple times or loading the model onto the CPU multiple times.
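
For illustration, here is one way the earlier two-model example could be restructured so that each model is locked immediately after it is loaded (a sketch only, assuming that a model which is currently locked will not be dropped by a subsequent load()):

```python
class SampleInvocation(BaseInvocation):
    model_1: ModelField = InputField(...)
    model_2: ModelField = InputField(...)

    def invoke(self, context: InvocationContext):
        # Lock model_1 in VRAM immediately after loading it, so that loading
        # model_2 cannot eject model_1 from the RAM cache before it is locked.
        with context.models.load(self.model_1.model_id).model_on_device() as (_, model_1):
            with context.models.load(self.model_2.model_id).model_on_device() as (_, model_2):
                # Do something with model_1 and model_2...
                ...
```

The downside is that this couples the invocation code to cache behaviour and is not always convenient to express.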

Reference Counting

In the past, we tried to work around this flaw in the API design by doing manual reference counting on the CPU-loaded models to determine which could safely be dropped from the RAM cache. This approach was very error-prone and was the source of several bugs. It also placed some burden on the invocation author to understand how this reference counting was happening under the hood if they wanted to achieve optimal memory behaviour.

Handle in ModelCache.lock() or LoadedModelWithoutConfig

It is tempting to work around this issue by catching the KeyError and re-inserting the model into the cache (since we still have a reference to it). This feels risky: you would then have multiple CacheRecords floating around that reference the same underlying model.
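
For concreteness, the rejected variant would look roughly like this (reusing the simplified ModelCache sketch from the 'Current Workaround' section above; illustrative only):

```python
# A lock() variant that re-inserts a dropped record -- the rejected approach.
def lock(self, cache_record: CacheRecord) -> None:
    try:
        entry = self._cached_models[cache_record.key]
    except KeyError:
        # Re-insert the record the caller still holds. The risk: if the same
        # model is loaded again later, the cache may create a second
        # CacheRecord for the same underlying model, and the two records'
        # lock state and size accounting can drift out of sync.
        self._cached_models[cache_record.key] = cache_record
        entry = cache_record
    entry.is_locked = True
```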

Solution

A proper solution to this problem should achieve the following:

  • Separate the model info DB lookup from model loading.
  • Enable the user to 'lock' the model on any device.
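
A sketch of what an API with those properties might look like (the method names `get_config` and `load_and_lock` below are hypothetical, not a committed design):

```python
class SampleInvocation(BaseInvocation):
    model: ModelField = InputField(...)

    def invoke(self, context: InvocationContext):
        # 1. DB lookup only -- no weights are loaded yet, so nothing has to
        #    be pinned in the RAM cache at this point.
        model_config = context.models.get_config(self.model.model_id)

        # 2. Load and lock in a single step, on an explicit device. The cache
        #    only needs to keep the model resident between entering and
        #    exiting the context manager, so the load/lock gap described
        #    above disappears.
        with context.models.load_and_lock(model_config, device="cuda") as model:
            y = model.forward(...)
```
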
RyanJDick added the bug (Something isn't working) label Jan 3, 2025
RyanJDick added a commit that referenced this issue Jan 3, 2025
…e the door was left open for this to happen.
RyanJDick added a commit that referenced this issue Jan 6, 2025
RyanJDick added a commit that referenced this issue Jan 6, 2025
RyanJDick added a commit that referenced this issue Jan 7, 2025
…#7522)

## Summary

This is an unplanned fix between PR3 and PR4 in the sequence of partial
loading (i.e. low-VRAM) PRs. This PR restores the 'Current Workaround'
documented in #7513. In
other words, to work around a flaw in the model cache API, this fix
allows models to be loaded into VRAM _even if_ they have been dropped
from the RAM cache.

This PR also adds an info log each time that this workaround is hit. In
a future PR (#7509), we will eliminate the places in the application
code that are capable of triggering this condition.

## Related Issues / Discussions

- #7492 
- #7494
- #7500 
- #7513

## QA Instructions

- Set RAM cache limit to a small value. E.g. `ram: 4`
- Run FLUX text-to-image with the full T5 encoder, which exceeds 4GB.
This will trigger the error condition.
- Before the fix, this test configuration would cause a `KeyError`. After
the fix, we should see an info-level log explaining that the condition was
hit, and generation should continue successfully.

## Merge Plan

No special instructions.

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
RyanJDick added a commit that referenced this issue Jan 7, 2025
…e the door was left open for this to happen.
RyanJDick added a commit that referenced this issue Jan 7, 2025
## Summary

This PR enables RAM/VRAM cache size limits to be determined dynamically
based on availability.

**Config Changes**

This PR modifies the app configs in the following ways:
- A new `device_working_mem_gb` config was added. This is the amount of
non-model working memory to keep available on the execution device (i.e.
GPU) when using dynamic cache limits. It defaults to 3GB.
- The `ram` and `vram` configs now default to `None`. If these configs
are set, they will take precedence over the dynamic limits. **Note: Some
users may have previously overridden the `ram` and `vram` values in their
`invokeai.yaml`. They will need to remove these configs to enable the
new dynamic limit feature.**

**Working Memory**

In addition to the new `device_working_mem_gb` config described above,
memory-intensive operations can estimate the amount of working memory
that they will need and request it from the model cache. This is
currently applied to the VAE decoding step for all models. In the
future, we may apply this to other operations as we work out which ops
tend to exceed the default working memory reservation.
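
As a sketch of the idea (the `working_mem_bytes` parameter and the estimate below are illustrative assumptions, not necessarily the exact API or formula used):

```python
import torch


def estimate_vae_decode_working_memory(latents: torch.Tensor) -> int:
    # Very rough illustrative estimate: the decoded image tensor (8x spatial
    # upscale, 3 channels, float32) plus headroom for intermediate activations.
    b, _, h, w = latents.shape
    output_bytes = b * 3 * (h * 8) * (w * 8) * 4
    return output_bytes * 4  # ~4x headroom for activations


def decode_latents(vae_info, latents: torch.Tensor) -> torch.Tensor:
    working_mem = estimate_vae_decode_working_memory(latents)
    # Pass the estimate to the model cache so that it can free enough VRAM
    # (e.g. by partially offloading other models) before locking the VAE.
    with vae_info.model_on_device(working_mem_bytes=working_mem) as (_, vae):
        return vae.decode(latents)
```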

**Mitigations for #7513**

This PR includes some mitigations for the issue described in
#7513. Without these
mitigations, the issue would occur more frequently when dynamic RAM
limits are used and RAM is close to maxed out.

## Limitations / Future Work

- Only _models_ can be offloaded to RAM to conserve VRAM. That is, if VAE
decoding requires more working VRAM than is available, the best we can do
is keep the full model on the CPU, but we will still hit an OOM error.
In the future, we could detect this ahead of time and switch to running
inference on the CPU for those ops.
- There is often a non-negligible amount of VRAM 'reserved' by the torch
CUDA allocator, but not used by any allocated tensors. We may be able to
tune the torch CUDA allocator to work better for our use case.
Reference:
https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf
- There may be some ops that require high working memory that haven't
been updated to request extra memory yet. We will update these as we
uncover them.
- If a model is 'locked' in VRAM, it won't be partially unloaded if a
later model load requests extra working memory. This should be uncommon,
but I can think of cases where it would matter.

## Related Issues / Discussions

- #7492 
- #7494 
- #7500 
- #7505 

## QA Instructions

Run a variety of models near the cache limits to ensure that model
switching works properly for the following configurations:
- [x] CUDA, `enable_partial_loading=true`, all other configs default
(i.e. dynamic memory limits)
- [x] CUDA, `enable_partial_loading=true`, CPU and CUDA memory reserved
in another process so there is limited RAM/VRAM remaining, all other
configs default (i.e. dynamic memory limits)
- [x] CUDA, `enable_partial_loading=false`, all other configs default
(i.e. dynamic memory limits)
- [x] CUDA, ram/vram limits set (these should take precedence over the
dynamic limits)
- [x] MPS, all other configs default (i.e. dynamic memory limits)
- [x] CPU, all other configs default (i.e. dynamic memory limits)

## Merge Plan

- [x] Merge #7505 first and change target branch to main

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_