Better documentation around how unloading is triggered #511
Comments
For some additional context, I'm using the Python backend in Triton. An example model that triggers unloading has custom dependencies via conda pack and has a file tree like so:
I'm not sure whether using models in this way messes with how ModelMesh computes usage.
Hi @dsgibbons Maybe the
Since most of our models have the same size, setting it to the correct value eliminated the WARN log you are seeing and helped ModelMesh make better model allocation decisions.
Thank you for linking your reply @GolanLevy. I'd still love to see some formal documentation for this, as it seems like critical information that shouldn't require trawling through the issue tracker. I'll see how I go this week. I hope I'll eventually understand ModelMesh well enough to submit a PR to address this issue.
When loading some models, I receive the WARN log:
Memory over-allocation due to under-prediction of model size...
(which stems from here) followed by the INFO log:
Eviction triggered for model ...
(I couldn't find exactly where this comes from). This unloading happens despite it being the only model on a large machine with 64GB RAM, 40GB VRAM and all of the k8s resource limits being set to max.
I've tried to piece together how to avoid this from various GitHub issues (e.g., this one) but would really appreciate some clear documentation around how unloading is triggered in ModelMesh. Even variables such as MODELSIZE_MULTIPLIER as referenced by this reply aren't properly documented, and I can't find where they are used in either the modelmesh or the modelmesh-serving source code.
Could the documentation please be updated to formally describe how models are prioritized and subsequently unloaded, with more discussion around the various configurations that we can alter on a per runtime/per isvc basis? I'm happy to contribute by helping to update the documentation, but I don't fully understand the underlying design decisions.
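To make the question concrete, here is a toy sketch of how I imagine the two log lines could be related. It is only an assumption about the general mechanism (a size-weighted LRU cache that reserves space from a predicted size and then corrects it with a measured size), not ModelMesh's actual code. The class `WeightedLRUCache`, its methods, and the event strings are all hypothetical names made up for illustration.

```python
from collections import OrderedDict


class WeightedLRUCache:
    """Toy size-weighted LRU cache. NOT ModelMesh's implementation."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.sizes = OrderedDict()  # model_id -> bytes, least recently used first
        self.events = []            # hypothetical stand-in for log lines

    def used_bytes(self):
        return sum(self.sizes.values())

    def _evict_until_fits(self):
        # Drop least recently used entries until the total fits the capacity.
        while self.sizes and self.used_bytes() > self.capacity_bytes:
            model_id, _ = self.sizes.popitem(last=False)
            self.events.append(f"evict:{model_id}")

    def load(self, model_id, predicted_bytes, actual_bytes):
        # Step 1: reserve space using the predicted size.
        self.sizes[model_id] = predicted_bytes
        self.sizes.move_to_end(model_id)
        self._evict_until_fits()
        if model_id not in self.sizes:
            return  # the prediction alone already exceeded the capacity
        # Step 2: after loading, swap in the measured size.
        if actual_bytes > predicted_bytes:
            self.events.append(f"under-predicted:{model_id}")
        self.sizes[model_id] = actual_bytes
        self._evict_until_fits()


# Two models: the second is under-predicted and pushes out the first.
cache = WeightedLRUCache(capacity_bytes=100)
cache.load("a", predicted_bytes=40, actual_bytes=40)
cache.load("b", predicted_bytes=40, actual_bytes=80)

# One model only: if its measured size exceeds the capacity the cache
# believes it has, it is evicted even though nothing else is loaded.
solo = WeightedLRUCache(capacity_bytes=50)
solo.load("only", predicted_bytes=10, actual_bytes=60)
```

If something like this is what happens, then the capacity the cache believes it has and the way the model size is predicted would matter more than the physical RAM/VRAM on the machine, which is why I'd like both to be documented.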