Static KV cache status: How to use it? Does it work for all models? #33270
Comments
Some useful resources from the docs:
@oobabooga the links @zucchini-nlp shared should enable what you want with respect to having static memory, but I'm not 100% sure. Also, on the topic of memory: we've merged two PRs recently that should lower memory requirements, regardless of the cache type. No action is required from a user's point of view:
I actually see quite a big difference in peak memory usage between 4.42.3 and 4.44.2, and 4.44.2 is also 20-30 percent faster.
The documentation is not clear. The first link recommends doing `model.generation_config.cache_implementation = "static"`. The second one recommends passing a kwarg to `generate()`:

```python
# simply pass cache_implementation="static"
out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="static")
```

The third one defines a `StaticCache` object explicitly. How do I explicitly specify the maximum sequence length for the static cache before calling `generate()`?
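For context, here is a runnable sketch of the first approach, which also sets up the `model`/`inputs` objects that the later snippets in this thread assume; the checkpoint name is only illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any decoder-only model with static-cache support should behave the same.
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)

# First approach: set the cache implementation once on the model's generation config.
# Later generate() calls pick it up without any extra kwargs.
model.generation_config.cache_implementation = "static"
out = model.generate(**inputs, do_sample=False, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The per-call kwarg shown in the comment above reaches the same setting, but only for that single `generate()` call.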
For complete flexibility, you should instantiate a `StaticCache` yourself. Regarding the difficulty in parsing the information: what do you think would help? Perhaps separating the docs into basic and advanced usage?
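If it helps, here is a rough sketch of the explicit route, which also answers the max-length question above. It assumes the `model`/`inputs` from the earlier sketch; the exact `StaticCache` constructor arguments have changed between releases, so treat the kwargs as illustrative and check the API reference for your version:

```python
from transformers import StaticCache

# Pre-allocate the cache once, with an explicit maximum sequence length.
# Constructor kwargs are illustrative and may differ slightly across transformers versions.
past_key_values = StaticCache(
    config=model.config,
    max_batch_size=1,
    max_cache_len=4096,   # maximum total sequence length (prompt + generated tokens)
    device=model.device,
    dtype=model.dtype,
)

# Pass the pre-built cache to generate(); it is filled in place, with no dynamic growth.
out = model.generate(**inputs, do_sample=False, max_new_tokens=20, past_key_values=past_key_values)
```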
So allocations only happen once per […]? I have tested the speed with/without static cache for a 7B model, and could not find a speed improvement from using it.
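For reference, a crude way to compare the two paths, assuming the `model`/`inputs` from the sketches above and a CUDA device; note that the advertised speedups for the static cache mostly come from combining it with `torch.compile`, so an uncompiled A/B test may well show no difference:

```python
import time
import torch

def benchmark(cache_implementation=None):
    # None -> default dynamic cache; "static" -> pre-allocated static cache
    model.generation_config.cache_implementation = cache_implementation
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, do_sample=False, max_new_tokens=256)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    return elapsed, peak_gib

for impl in (None, "static"):
    elapsed, peak_gib = benchmark(impl)
    print(f"cache_implementation={impl}: {elapsed:.2f} s, peak {peak_gib:.2f} GiB")
```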
I think that the exact clarifications in your last comment would help if included in the documentation. My main interest in static cache is the fact that when ExLlama (v1) was introduced, I could fit a lot more context in my 3090 than with AutoGPTQ + transformers. But maybe the excess memory usage was unrelated to static cache, as I don't see a VRAM difference with or without it.
Yep, if you want to fit longer context I would recommend the […].
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@oobabooga yeah, […]
#34247 fixed this, so closing!
I see that there are many PRs about StaticCache, but I couldn't find clear documentation on how to use it.
What I want
- To not have Transformers allocate memory dynamically for the KV cache when using `model.generate()`, as that leads to increased memory usage (due to garbage collection not happening fast/often enough) and worse performance.
- To use that by default, always, for every model and for every supported quantization backend (AutoAWQ, AutoGPTQ, AQLM, bitsandbytes, etc.); see the sketch below.
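As far as I can tell there is no global, library-wide switch for this; the closest thing is a per-model default set on the loaded model's generation config, as in the earlier sketches (checkpoint name illustrative):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")
model.generation_config.cache_implementation = "static"
# Every subsequent model.generate(...) call on this model now uses the static KV cache.
```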
Who can help?
Maybe @gante