
Model Cache API Flaw #7513

Open
RyanJDick opened this issue Jan 3, 2025 · 0 comments
Labels: bug Something isn't working

RyanJDick commented Jan 3, 2025

This issue is accurate as of v5.5.0.

There is a longstanding flaw in the design of the model cache API. It is the source of several issues and design quirks in the ModelCache, so I am taking the time to document it here so that it can be linked from the source code and referenced by devs.

This issue became particularly evident during the development of the partial model loading feature.

Typical Usage

Typical usage of the model cache API looks like this:

class SampleInvocation(BaseInvocation):
    model: ModelField = InputField(...)

    def invoke(self, context: InvocationContext):
        # context.models.load(...) does the following:
        # 1. Looks up the model info from the DB.
        # 2. Loads the model onto the CPU, using the ModelCache.
        # 3. Returns a `LoadedModel` containing both the model metadata and the CPU-loaded model.
        model_info = context.models.load(self.model.model_id)

        # A context manager is used to load the model into VRAM using the
        # model cache and 'lock' it there for the lifetime of the context
        # manager.
        with model_info.model_on_device() as (cached_weights, model):
        # Or using the deprecated API:
        # with model_info as model:
            y = model.forward(...)

Flaw

The main problem is that, after calling context.models.load(...), the ModelCache has no way of knowing whether:

  1. this model needs to be kept in the RAM cache because it will soon be loaded into VRAM, or
  2. this model is only used on the CPU and can be freed from the RAM cache to make room for other models.

As a result of this flaw, the following reasonable-looking code could fail:

class SampleInvocation(BaseInvocation):
    model_1: ModelField = InputField(...)
    model_2: ModelField = InputField(...)

    def invoke(self, context: InvocationContext):
        model_1_info = context.models.load(self.model_1.model_id)
        # Loading model_2 into the CPU could cause model_1 to be ejected
        # from the RAM cache to make room for model_2.
        model_2_info = context.models.load(self.model_2.model_id)

        # If model_1 was ejected from the RAM cache, then attempting to
        # load it into VRAM will fail.
        with model_1_info as model_1, model_2_info as model_2:
            # Do something with model_1 and model_2...
            ...

If this flaw is not handled carefully, the following error will occur:

[2025-01-03 10:25:25,390]::[InvokeAI]::ERROR --> Error while invoking session 93e35ea6-4ed8-4535-b7ff-f4d6d3918afd, invocation 36faba26-ce91-483f-b7fe-59bcda83e07f (compel): 'a67e6e91-6b2a-4d38-b69a-db8c27ee98a8:tokenizer'
[2025-01-03 10:25:25,390]::[InvokeAI]::ERROR --> Traceback (most recent call last):
  File "/home/ryan/src/InvokeAI/invokeai/app/services/session_processor/session_processor_default.py", line 129, in run_node
    output = invocation.invoke_internal(context=context, services=self._services)
  File "/home/ryan/src/InvokeAI/invokeai/app/invocations/baseinvocation.py", line 300, in invoke_internal
    output = self.invoke(context)
  File "/home/ryan/.pyenv/versions/3.10.12/envs/InvokeAI_3.10.12/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/ryan/src/InvokeAI/invokeai/app/invocations/compel.py", line 81, in invoke
    with (
  File "/home/ryan/src/InvokeAI/invokeai/backend/model_manager/load/load_base.py", line 60, in __enter__
    self._cache.lock(self._cache_record.key, None)
  File "/home/ryan/src/InvokeAI/invokeai/backend/model_manager/load/model_cache/model_cache.py", line 205, in lock
    cache_entry = self._cached_models[key]
KeyError: 'a67e6e91-6b2a-4d38-b69a-db8c27ee98a8:tokenizer'

Current Workaround

The current workaround for this issue is to allow models to be locked in VRAM even if they have been dropped from the RAM cache.

While this works, it is problematic in the context of partial model loading, because the model cache does not have the ability to partially unload such a model if it later needs more VRAM.
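
As a rough illustration, the workaround amounts to something like the following (a simplified sketch, not the actual InvokeAI ModelCache code; the CacheRecord fields and the lock() signature are assumptions):

```python
# Simplified sketch of the workaround -- illustrative only.
import logging
from dataclasses import dataclass
from typing import Any, Dict

logger = logging.getLogger(__name__)


@dataclass
class CacheRecord:
    key: str
    model: Any
    is_locked: bool = False


class ModelCache:
    def __init__(self) -> None:
        self._cached_models: Dict[str, CacheRecord] = {}

    def lock(self, cache_record: CacheRecord) -> None:
        if cache_record.key not in self._cached_models:
            # The model was dropped from the RAM cache between load() and
            # lock(). The caller still holds a reference to it, so we lock it
            # anyway -- but the cache can no longer partially unload this
            # model if it later needs to reclaim VRAM.
            logger.info(
                "Locking model '%s' that is no longer tracked by the RAM cache.",
                cache_record.key,
            )
        cache_record.is_locked = True
        # ...move the model onto the execution device here...
```

The important detail is that lock() has to work from the record the caller still holds rather than only from a key lookup, since the key lookup is exactly what raises the KeyError in the traceback above.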

Other Workarounds

Increase the RAM cache size

Increasing the `ram` config in `invokeai.yaml` increases the size of the RAM cache and decreases the probability that a model will be ejected between being loaded into RAM and being loaded into VRAM.

Avoid the offending model access pattern

Of course, avoiding the offending model access pattern in invocation code will prevent this issue from being encountered.

In practice, this is sometimes inconvenient and less performant; for example, it may require looking up the model info in the DB multiple times or loading the model onto the CPU multiple times.
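
For illustration, here is one way the earlier two-model example could be restructured so that each model is locked immediately after it is loaded (a sketch only, assuming that a model which is currently locked will not be dropped by a subsequent load()):

```python
class SampleInvocation(BaseInvocation):
    model_1: ModelField = InputField(...)
    model_2: ModelField = InputField(...)

    def invoke(self, context: InvocationContext):
        # Lock model_1 in VRAM immediately after loading it, so that loading
        # model_2 cannot eject model_1 from the RAM cache before it is locked.
        with context.models.load(self.model_1.model_id).model_on_device() as (_, model_1):
            with context.models.load(self.model_2.model_id).model_on_device() as (_, model_2):
                # Do something with model_1 and model_2...
                ...
```

The downside is that this couples the invocation code to cache behaviour and is not always convenient to express.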

Reference Counting

In the past, we tried to work around this flaw in the API design by doing manual reference counting on the CPU-loaded models to determine which could safely be dropped from the RAM cache. This approach was very error-prone and was the source of several bugs. It also placed some burden on the invocation author to understand how this reference counting was happening under the hood if they wanted to achieve optimal memory behaviour.

Handle in ModelCache.lock() or LoadedModelWithoutConfig

It is tempting to work around this issue by catching the KeyError and re-inserting the model into the cache (since we still have a reference to it). This feels risky: you would then have multiple CacheRecords floating around that reference the same underlying model.
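
For concreteness, the rejected variant would look roughly like this (reusing the simplified ModelCache sketch from the 'Current Workaround' section above; illustrative only):

```python
# A lock() variant that re-inserts a dropped record -- the rejected approach.
def lock(self, cache_record: CacheRecord) -> None:
    try:
        entry = self._cached_models[cache_record.key]
    except KeyError:
        # Re-insert the record the caller still holds. The risk: if the same
        # model is loaded again later, the cache may create a second
        # CacheRecord for the same underlying model, and the two records'
        # lock state and size accounting can drift out of sync.
        self._cached_models[cache_record.key] = cache_record
        entry = cache_record
    entry.is_locked = True
```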

Solution

A proper solution to this problem should achieve the following:

  • Separate the model info DB lookup from model loading.
  • Enable the user to 'lock' the model on any device.
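
A sketch of what an API with those properties might look like (the method names `get_config` and `load_and_lock` below are hypothetical, not a committed design):

```python
class SampleInvocation(BaseInvocation):
    model: ModelField = InputField(...)

    def invoke(self, context: InvocationContext):
        # 1. DB lookup only -- no weights are loaded yet, so nothing has to
        #    be pinned in the RAM cache at this point.
        model_config = context.models.get_config(self.model.model_id)

        # 2. Load and lock in a single step, on an explicit device. The cache
        #    only needs to keep the model resident between entering and
        #    exiting the context manager, so the load/lock gap described
        #    above disappears.
        with context.models.load_and_lock(model_config, device="cuda") as model:
            y = model.forward(...)
```
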
RyanJDick added the bug (Something isn't working) label Jan 3, 2025
RyanJDick added a commit that referenced this issue Jan 3, 2025
…e the door was left open for this to happen.
RyanJDick added a commit that referenced this issue Jan 6, 2025
RyanJDick added a commit that referenced this issue Jan 6, 2025
RyanJDick added a commit that referenced this issue Jan 7, 2025
…#7522)

## Summary

This is an unplanned fix between PR3 and PR4 in the sequence of partial
loading (i.e. low-VRAM) PRs. This PR restores the 'Current Workaround'
documented in #7513. In
other words, to work around a flaw in the model cache API, this fix
allows models to be loaded into VRAM _even if_ they have been dropped
from the RAM cache.

This PR also adds an info log each time that this workaround is hit. In
a future PR (#7509), we will eliminate the places in the application
code that are capable of triggering this condition.

## Related Issues / Discussions

- #7492 
- #7494
- #7500 
- #7513

## QA Instructions

- Set RAM cache limit to a small value. E.g. `ram: 4`
- Run FLUX text-to-image with the full T5 encoder, which exceeds 4GB.
This will trigger the error condition.
- Before the fix, this test configuration would cause a `KeyError`. After
the fix, we should see an info-level log explaining that the condition was
hit, and generation should continue successfully.

## Merge Plan

No special instructions.

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
RyanJDick added a commit that referenced this issue Jan 7, 2025
…e the door was left open for this to happen.
RyanJDick added a commit that referenced this issue Jan 7, 2025
## Summary

This PR enables RAM/VRAM cache size limits to be determined dynamically
based on availability.

**Config Changes**

This PR modifies the app configs in the following ways:
- A new `device_working_mem_gb` config was added. This is the amount of
non-model working memory to keep available on the execution device (i.e.
GPU) when using dynamic cache limits. It defaults to 3GB.
- The `ram` and `vram` configs now default to `None`. If these configs
are set, they will take precedence over the dynamic limits. **Note: Some
users may have previously overridden the `ram` and `vram` values in their
`invokeai.yaml`. They will need to remove these configs to enable the
new dynamic limit feature.**

**Working Memory**

In addition to the new `device_working_mem_gb` config described above,
memory-intensive operations can estimate the amount of working memory
that they will need and request it from the model cache. This is
currently applied to the VAE decoding step for all models. In the
future, we may apply this to other operations as we work out which ops
tend to exceed the default working memory reservation.
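
As a sketch of the idea (the `working_mem_bytes` parameter and the estimate below are illustrative assumptions, not necessarily the exact API or formula used):

```python
import torch


def estimate_vae_decode_working_memory(latents: torch.Tensor) -> int:
    # Very rough illustrative estimate: the decoded image tensor (8x spatial
    # upscale, 3 channels, float32) plus headroom for intermediate activations.
    b, _, h, w = latents.shape
    output_bytes = b * 3 * (h * 8) * (w * 8) * 4
    return output_bytes * 4  # ~4x headroom for activations


def decode_latents(vae_info, latents: torch.Tensor) -> torch.Tensor:
    working_mem = estimate_vae_decode_working_memory(latents)
    # Pass the estimate to the model cache so that it can free enough VRAM
    # (e.g. by partially offloading other models) before locking the VAE.
    with vae_info.model_on_device(working_mem_bytes=working_mem) as (_, vae):
        return vae.decode(latents)
```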

**Mitigations for #7513**

This PR includes some mitigations for the issue described in
#7513. Without these
mitigations, the issue would occur more frequently when dynamic RAM
limits are used and RAM is close to maxed out.

## Limitations / Future Work

- Only _models_ can be offloaded to RAM to conserve VRAM. That is, if VAE
decoding requires more working VRAM than is available, the best we can do
is keep the full model on the CPU, but we will still hit an OOM error.
In the future, we could detect this ahead of time and switch to running
inference on the CPU for those ops.
- There is often a non-negligible amount of VRAM 'reserved' by the torch
CUDA allocator, but not used by any allocated tensors. We may be able to
tune the torch CUDA allocator to work better for our use case.
Reference:
https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf
- There may be some ops that require high working memory that haven't
been updated to request extra memory yet. We will update these as we
uncover them.
- If a model is 'locked' in VRAM, it won't be partially unloaded if a
later model load requests extra working memory. This should be uncommon,
but I can think of cases where it would matter.

## Related Issues / Discussions

- #7492 
- #7494 
- #7500 
- #7505 

## QA Instructions

Run a variety of models near the cache limits to ensure that model
switching works properly for the following configurations:
- [x] CUDA, `enable_partial_loading=true`, all other configs default
(i.e. dynamic memory limits)
- [x] CUDA, `enable_partial_loading=true`, CPU and CUDA memory reserved
in another process so there is limited RAM/VRAM remaining, all other
configs default (i.e. dynamic memory limits)
- [x] CUDA, `enable_partial_loading=false`, all other configs default
(i.e. dynamic memory limits)
- [x] CUDA, ram/vram limits set (these should take precedence over the
dynamic limits)
- [x] MPS, all other configs default (i.e. dynamic memory limits)
- [x] CPU, all other configs default (i.e. dynamic memory limits)

## Merge Plan

- [x] Merge #7505 first and change target branch to main

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_