Partial Loading PR2: Add utils to support partial loading of models from CPU to GPU #7494

RyanJDick · 2024-12-23T19:32:40Z

Summary

This PR adds utilities to support partial loading of models from CPU to GPU. The new utilities are not yet being used by the ModelCache, so there should be no functional behavior changes in this PR.

Detailed changes:

- Add autocast modules that are designed to wrap common torch.nn.Modules and enable them to run with automatic device casting. E.g. a linear layer on the CPU can be executed with an input tensor on the GPU by streaming the weights to the GPU at runtime.
- Add unit tests for the aforementioned autocast modules to verify that they work for all supported quantization formats (GGUF, BnB NF4, BnB LLM.int8()).
- Add CachedModelWithPartialLoad and CachedModelOnlyFullLoad classes to manage partial loading at the model level.
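
To make the autocast idea concrete, here is a minimal sketch of a linear layer that keeps its weights on the CPU and streams them to the device of the input at call time. The names `AutocastLinear` and `wrap_linear` are hypothetical and are not the classes added by this PR; the sketch also ignores quantized layers entirely.

```python
import torch


class AutocastLinear(torch.nn.Linear):
    """Hypothetical linear layer that runs on the device of its input."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Stream the parameters to the input's device for this call only.
        # .to() returns the same tensor if it is already on that device.
        weight = self.weight.to(x.device)
        bias = self.bias.to(x.device) if self.bias is not None else None
        return torch.nn.functional.linear(x, weight, bias)


def wrap_linear(layer: torch.nn.Linear) -> torch.nn.Linear:
    """Convert an existing nn.Linear in place by swapping its class (assumed approach)."""
    layer.__class__ = AutocastLinear
    return layer


# Falls back to the CPU when no GPU is present, in which case the casts are no-ops.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
layer = wrap_linear(torch.nn.Linear(4, 2))  # the weights stay on the CPU
x = torch.randn(3, 4, device=device)  # the input lives on the compute device
y = layer(x)
```

The stored parameters are never moved, so the layer can be left on the CPU indefinitely while still taking part in GPU inference, at the cost of one host-to-device copy per call.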

Alternative Implementations

Several options were explored for supporting inference on partially-loaded models. The pros/cons of the explored options are summarized here for reference. In the end, wrapper modules were selected as the best overall solution for our use case.

Option 1: Re-implement the .forward() methods of modules to add support for device conversions

This is the option implemented in this PR.
This approach is the most manual of the three, but as a result offers the broadest compatibility with unusual model types. It is manual in that we have to explicitly add support for every module type that we wish to support. Fortunately, the list of foundational module types is relatively small (e.g. the current set of implemented layers covers all but 0.04 MB of the full FLUX model).

Option 2: Implement a custom Tensor type that casts tensors to a target_device each time the tensor is used

This approach has the nice property that it is injected at the tensor level, and the model does not need to be modified in any way.
One challenge with this approach is handling interactions with other custom tensor types (e.g. GGMLTensor). This problem is solvable, but definitely introduces a layer of complexity. (There are likely to be similar issues with the BnB quantization, but I didn't get as far as testing BnB.)

Option 3: Override the __torch_function__ dispatch calls globally and cast all params to the execution device.

This approach is nice and simple: just apply a global context manager and all operations will happen on the compute device regardless of the device of the participating tensors.
Challenges:
- Overriding the __torch_function__ dispatch calls introduces some overhead even if the tensors are already on the correct device.
- It is difficult to manage the autocasting context manager. E.g. it is tempting to apply it to the model's .forward(...) method, but we use some models with non-standard entrypoints. And we don't want to end up with nested autocasting context managers.
- BnB applies quantization side effects when a param is moved to the GPU - this interacts in unexpected ways with a global context manager.
- CPU tensors that get used in more than one operation would be cast to the device multiple times rather than casting once and re-using the on-device tensor.
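
For reference, Option 3 could look roughly like the sketch below, which uses `torch.overrides.TorchFunctionMode` as the global context manager. This is an illustration of the rejected approach under simplifying assumptions, not code from this PR: the class name `CastToDevice` is hypothetical, and only top-level tensor arguments are cast (nested lists, such as the inputs to `torch.cat`, are not handled).

```python
import torch
from torch.overrides import TorchFunctionMode


class CastToDevice(TorchFunctionMode):
    """Hypothetical global mode: move tensor arguments to one device before each op."""

    def __init__(self, device: str):
        super().__init__()
        self.device = torch.device(device)
        self.intercepted = 0  # counts intercepted calls, i.e. the dispatch overhead

    def _cast(self, value):
        # Non-tensor arguments are passed through untouched.
        return value.to(self.device) if isinstance(value, torch.Tensor) else value

    def __torch_function__(self, func, types, args=(), kwargs=None):
        self.intercepted += 1
        args = tuple(self._cast(a) for a in args)
        kwargs = {k: self._cast(v) for k, v in (kwargs or {}).items()}
        return func(*args, **kwargs)


# Falls back to the CPU when no GPU is present, in which case the casts are no-ops.
device = "cuda" if torch.cuda.is_available() else "cpu"
weight = torch.randn(2, 4)  # stays on the CPU
x = torch.randn(3, 4, device=device)
mode = CastToDevice(device)
with mode:
    y = torch.matmul(x, weight.t())
```

Every call inside the `with` block passes through `__torch_function__`, including calls whose tensors are already on the right device, and `weight` would be copied again on each later use. Both effects correspond to challenges listed above.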

QA Instructions

Most of the changes in this PR should not impact active code, and thus should not cause any changes to behavior. The main risks come from bumping the bitsandbytes dependency and some minor modifications to the bitsandbytes quantization code.

- Regression test bitsandbytes NF4 quantization
- Regression test bitsandbytes LLM.int8() quantization
- Regression test on MacOS (to ensure that there are no lingering bitsandbytes import errors)

I also tested the new utilities for inference on full models in another branch to validate that there were no major issues. This functionality will be tested more thoroughly in a future PR.

Merge Plan

Partial Loading PR1: Tidy ModelCache #7492 should be merged first so that the target branch can be updated to main.

Checklist

- The PR has a short but descriptive title, suitable for a changelog
- Tests added / updated (if applicable)
- Documentation added / updated (if applicable)
- Updated What's New copy (if doing a release after this PR)


…s, 3) model patching (#7500)

## Summary

This PR is the third in a sequence of PRs working towards support for partial loading of models onto the compute device (for low-VRAM operation). This PR updates the LoRA patching code so that the following features can cooperate fully:

- Partial loading of weights onto the GPU
- Quantized layers / weights
- Model patches (e.g. LoRA)

Note that this PR does not yet enable partial loading. It adds support in the model patching code so that partial loading can be enabled in a future PR.

## Technical Design Decisions

The layer patching logic has been integrated into the custom layers (via `CustomModuleMixin`) rather than keeping it in a separate set of wrapper layers, as before. This has the following advantages:

- It makes it easier to calculate the modified weights on the fly and then reuse the normal forward() logic.
- In the future, it makes it possible to pass original parameters that have been cast to the device down to the LoRA calculation without having to re-cast (but the current implementation hasn't fully taken advantage of this yet).

## Known Limitations

1. I haven't fully solved device management for patch types that require the original layer value to calculate the patch. These aren't very common, and are not compatible with some quantized layers, so leaving this for the future if there's demand.
2. There is a small speed regression for models that have CPU bottlenecks. This seems to be caused by slightly slower method resolution on the custom layer subclasses. The regression does not show up on larger models, like FLUX, that are almost entirely GPU-limited. I think this small regression is tolerable, but if we decide that it's not, then the slowdown can easily be reclaimed by optimizing other CPU operations (e.g. if we only sent every 2nd progress image, we'd see a much more significant speedup).

## Related Issues / Discussions

- #7492
- #7494

## QA Instructions

Speed tests:

- Vanilla SD1 speed regression
  - Before: 3.156s (8.78 it/s)
  - After: 3.54s (8.35 it/s)
- Vanilla SDXL speed regression
  - Before: 6.23s (4.46 it/s)
  - After: 6.45s (4.31 it/s)
- Vanilla FLUX speed regression
  - Before: 12.02s (2.27 it/s)
  - After: 11.91s (2.29 it/s)

LoRA tests with default configuration:

- [x] SD1: A handful of LoRA variants
- [x] SDXL: A handful of LoRA variants
- [x] flux non-quantized: multiple lora variants
- [x] flux bnb-quantized: multiple lora variants
- [x] flux ggml-quantized: multiple lora variants
- [x] flux non-quantized: FLUX control LoRA
- [x] flux bnb-quantized: FLUX control LoRA
- [x] flux ggml-quantized: FLUX control LoRA

LoRA tests with sidecar patching forced:

- [x] SD1: A handful of LoRA variants
- [x] SDXL: A handful of LoRA variants
- [x] flux non-quantized: multiple lora variants
- [x] flux bnb-quantized: multiple lora variants
- [x] flux ggml-quantized: multiple lora variants
- [x] flux non-quantized: FLUX control LoRA
- [x] flux bnb-quantized: FLUX control LoRA
- [x] flux ggml-quantized: FLUX control LoRA

Other:

- [x] Smoke testing of IP-Adapter, ControlNet

All tests repeated on:

- [x] cuda
- [x] cpu (only test SD1, because larger models are prohibitively slow)
- [x] mps (skipped FLUX tests, because my Mac doesn't have enough memory to run them in a reasonable amount of time)

## Merge Plan

No special instructions.

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
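
The design decision described for #7500 (compute the patched weight on the fly, then reuse the normal forward logic) can be illustrated with a small sketch. The class `LinearWithPatches` and its `patches` attribute are hypothetical; this is not the `CustomModuleMixin` implementation, and it assumes a plain LoRA patch of the form `scale * (up @ down)` on a non-quantized layer.

```python
import torch
import torch.nn.functional as F


class LinearWithPatches(torch.nn.Linear):
    """Hypothetical linear layer that applies LoRA-style patches at call time."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__(in_features, out_features)
        # Each patch is a (down, up, scale) tuple; the stored weight is never modified.
        self.patches = []

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Cast the original parameters to the input's device once...
        weight = self.weight.to(x.device)
        bias = self.bias.to(x.device) if self.bias is not None else None
        # ...then add each patch's low-rank delta to the on-device copy.
        for down, up, scale in self.patches:
            weight = weight + scale * (up.to(x.device) @ down.to(x.device))
        # Reuse the ordinary linear forward with the patched weight.
        return F.linear(x, weight, bias)


torch.manual_seed(0)
layer = LinearWithPatches(4, 2)
down = torch.randn(1, 4)  # rank-1 "down" matrix, shape (rank, in_features)
up = torch.randn(2, 1)  # rank-1 "up" matrix, shape (out_features, rank)
layer.patches.append((down, up, 0.5))
x = torch.randn(3, 4)
y = layer(x)
```

Because the delta is added to a temporary copy, removing the tuple from `patches` restores the original behavior without having to restore any weights.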

…#7522)

## Summary

This is an unplanned fix between PR3 and PR4 in the sequence of partial loading (i.e. low-VRAM) PRs. This PR restores the 'Current Workaround' documented in #7513. In other words, to work around a flaw in the model cache API, this fix allows models to be loaded into VRAM _even if_ they have been dropped from the RAM cache.

This PR also adds an info log each time that this workaround is hit. In a future PR (#7509), we will eliminate the places in the application code that are capable of triggering this condition.

## Related Issues / Discussions

- #7492
- #7494
- #7500
- #7513

## QA Instructions

- Set RAM cache limit to a small value. E.g. `ram: 4`
- Run FLUX text-to-image with the full T5 encoder, which exceeds 4GB. This will trigger the error condition.
- Before the fix, this test configuration would cause a `KeyError`. After the fix, we should see an info-level log explaining that the condition was hit, and generation should continue successfully.

## Merge Plan

No special instructions.

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_

## Summary

This PR adds support for partial loading of models onto the GPU. This enables models to run with much lower peak VRAM requirements (e.g. full FLUX dev with 8GB of VRAM).

The partial loading feature is enabled behind a new config flag: `enable_partial_loading=True`. This flag defaults to `False`.

**Note about performance:** The `ram` and `vram` config limits are still applied when `enable_partial_loading=True` is set. This can result in significant slowdowns compared to the 'old' behaviour. Consider the case where the VRAM limit is set to `vram=0.75` (GB) and we are trying to run an 8GB model. When `enable_partial_loading=False`, we attempt to load the entire model into VRAM, and if it fits (no OOM error) then it will run at full speed. When `enable_partial_loading=True`, since we have the option to partially load the model, we will only load 0.75 GB into VRAM and leave the remaining 7.25 GB in RAM. This will cause inference to be much slower than before. To work around this, it is important that your `ram` and `vram` configs are carefully tuned.

In a future PR, we will add the ability to dynamically set the RAM/VRAM limits based on the available memory / VRAM.

## Related Issues / Discussions

- #7492
- #7494
- #7500

## QA Instructions

Tests with `enable_partial_loading=True`, `vram=2`, on CUDA device:

For all tests, we expect model memory to stay below 2 GB. Peak working memory will be higher.

- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests

Tests with `enable_partial_loading=True`, and hack to force all models to load 10%, on CUDA device:

- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests

Tests with `enable_partial_loading=False`, `vram=30`:

We expect no change in behaviour when `enable_partial_loading=False`.

- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests

Other platforms:

- [x] No change in behavior on MPS, even if `enable_partial_loading=True`.
- [x] No change in behavior on CPU-only systems, even if `enable_partial_loading=True`.

## Merge Plan

- [x] Merge #7500 first, and change the target branch to main

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
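
The arithmetic in the performance note above (an 8 GB model with `vram=0.75`) can be written out as a short sketch. The function `split_model` is hypothetical and only models the sizes involved; it is not how the model cache decides which weights to move.

```python
def split_model(model_gb: float, vram_limit_gb: float, enable_partial_loading: bool) -> tuple:
    """Return (GB placed in VRAM, GB left in RAM) for one model. Illustrative only."""
    if enable_partial_loading:
        # Respect the VRAM limit and leave the remainder in RAM.
        in_vram = min(model_gb, vram_limit_gb)
    else:
        # 'Old' behaviour: attempt to load the whole model, regardless of the limit.
        in_vram = model_gb
    return in_vram, model_gb - in_vram


partial = split_model(8.0, 0.75, enable_partial_loading=True)
full = split_model(8.0, 0.75, enable_partial_loading=False)
print(partial)  # (0.75, 7.25)
print(full)  # (8.0, 0.0)
```

With the flag enabled, most of the model's weights have to be streamed from RAM during inference, which is why an overly small `vram` value is so costly.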

## Summary

This PR enables RAM/VRAM cache size limits to be determined dynamically based on availability.

**Config Changes**

This PR modifies the app configs in the following ways:

- A new `device_working_mem_gb` config was added. This is the amount of non-model working memory to keep available on the execution device (i.e. GPU) when using dynamic cache limits. It defaults to 3GB.
- The `ram` and `vram` configs now default to `None`. If these configs are set, they will take precedence over the dynamic limits. **Note: Some users may have previously overridden the `ram` and `vram` values in their `invokeai.yaml`. They will need to remove these configs to enable the new dynamic limit feature.**

**Working Memory**

In addition to the new `device_working_mem_gb` config described above, memory-intensive operations can estimate the amount of working memory that they will need and request it from the model cache. This is currently applied to the VAE decoding step for all models. In the future, we may apply this to other operations as we work out which ops tend to exceed the default working memory reservation.

**Mitigations for #7513**

This PR includes some mitigations for the issue described in #7513. Without these mitigations, it would occur with higher frequency when dynamic RAM limits are used and the RAM is close to maxed-out.

## Limitations / Future Work

- Only _models_ can be offloaded to RAM to conserve VRAM. I.e. if VAE decoding requires more working VRAM than available, the best we can do is keep the full model on the CPU, but we will still hit an OOM error. In the future, we could detect this ahead of time and switch to running inference on the CPU for those ops.
- There is often a non-negligible amount of VRAM 'reserved' by the torch CUDA allocator, but not used by any allocated tensors. We may be able to tune the torch CUDA allocator to work better for our use case. Reference: https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf
- There may be some ops that require high working memory that haven't been updated to request extra memory yet. We will update these as we uncover them.
- If a model is 'locked' in VRAM, it won't be partially unloaded if a later model load requests extra working memory. This should be uncommon, but I can think of cases where it would matter.

## Related Issues / Discussions

- #7492
- #7494
- #7500
- #7505

## QA Instructions

Run a variety of models near the cache limits to ensure that model switching works properly for the following configurations:

- [x] CUDA, `enable_partial_loading=true`, all other configs default (i.e. dynamic memory limits)
- [x] CUDA, `enable_partial_loading=true`, CPU and CUDA memory reserved in another process so there is limited RAM/VRAM remaining, all other configs default (i.e. dynamic memory limits)
- [x] CUDA, `enable_partial_loading=false`, all other configs default (i.e. dynamic memory limits)
- [x] CUDA, ram/vram limits set (these should take precedence over the dynamic limits)
- [x] MPS, all other default (i.e. dynamic memory limits)
- [x] CPU, all other default (i.e. dynamic memory limits)

## Merge Plan

- [x] Merge #7505 first and change target branch to main

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
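
The precedence rule described in the Config Changes above (an explicit `vram` value wins, otherwise a dynamic limit is derived while keeping `device_working_mem_gb` free) might be sketched as follows. The function name and the exact formula, available VRAM minus the working-memory reservation, are assumptions made for illustration; the description does not spell out the real calculation.

```python
from typing import Optional


def resolve_vram_limit_gb(
    configured_vram_gb: Optional[float],
    available_vram_gb: float,
    device_working_mem_gb: float = 3.0,  # default stated in the PR description
) -> float:
    """Hypothetical resolution of the VRAM cache limit for model weights."""
    if configured_vram_gb is not None:
        # An explicit `vram` config takes precedence over the dynamic limit.
        return configured_vram_gb
    # Assumed dynamic rule: leave the working-memory reservation free for non-model use.
    return max(0.0, available_vram_gb - device_working_mem_gb)


print(resolve_vram_limit_gb(None, 24.0))  # 21.0 (dynamic limit)
print(resolve_vram_limit_gb(2.0, 24.0))  # 2.0 (explicit config wins)
```

This also shows why users who previously set `ram` or `vram` need to remove those values: as long as they are present, the dynamic branch is never reached.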

github-actions bot added the python, Root, backend, python-tests, and python-deps labels Dec 23, 2024

RyanJDick force-pushed the ryan/model-offload-2-partial-load-utils branch from 9a4e61f to c207408 Compare December 23, 2024 23:30

RyanJDick force-pushed the ryan/model-offload-1-tidy branch from 6b2a3b2 to 55b13c1 Compare December 24, 2024 14:23

Base automatically changed from ryan/model-offload-1-tidy to main December 24, 2024 14:30

RyanJDick added 13 commits December 24, 2024 14:32

Bump bitsandbytes. The new version contains improvements to state_dict…

65fcbf5

… loading/saving for LLM.int8 and promises improved speed on some HW.

Add torch module autocast utilities.

fe0ef2c

Add torch module autocast unit test for GGUF-quantized models.

97d56f7

Simplify the state management in InvokeLinear8bitLt and add unit test…

3f99039

…s. This is in preparation for wrapping it to support streaming of weights from cpu to gpu.

Add CustomInvokeLinear8bitLt layer for device streaming with InvokeLi…

1b56020

…near8bitLt layers.

Add CustomInvokeLinearNF4 to enable CPU -> GPU streaming for InvokeLi…

dc54e87

…nearNF4 layers.

Add CachedModelWithPartialLoad to manage partially-loaded models usin…

0a8fc74

…g the new autocast modules.

Make CachedModelWithPartialLoad work with models that have non-persis…

c6795a1

…tent buffers.

Add CachedModelOnlyFullLoad to mirror the CachedModelWithPartialLoad …

f8ab414

…for models that cannot or should not be partially loaded.

Fix bitsandbytes imports to avoid ImportErrors on MacOS.

f8a6acc

Reduce peak memory used for unit tests.

a83a999

Workaround a weird quirk of QuantState.to() and add a unit test to ex…

7214d49

…ercise it.

Skip flaky test when running on Github Actions, and further reduce pe…

0fc5387

…ak unit test memory.

RyanJDick force-pushed the ryan/model-offload-2-partial-load-utils branch from c207408 to 0fc5387 Compare December 24, 2024 14:32

RyanJDick marked this pull request as ready for review December 24, 2024 15:51

RyanJDick requested review from lstein, blessedcoolant, brandonrising and hipsterusername as code owners December 24, 2024 15:51

hipsterusername approved these changes Dec 24, 2024

RyanJDick merged commit 6bf5b74 into main Dec 27, 2024
29 checks passed

RyanJDick deleted the ryan/model-offload-2-partial-load-utils branch December 27, 2024 14:20

This was referenced Dec 29, 2024

Partial Loading PR3: Integrate 1) partial loading, 2) quantized models, 3) model patching #7500

Merged

Partial Loading PR4: Enable partial loading (behind config flag) #7505

Merged

This was referenced Jan 2, 2025

Partial Loading PR5: Dynamic cache ram/vram limits #7509

Merged

Partial Loading PR 3.5: Fix pre-mature model drops from the RAM cache #7522

Merged
