Questions about functioning of ModelMesh #46
Hi @OvervCW, these are all great questions :)
In general it tries to fill the available capacity with models that are "most likely" to be used, and there's an assumption that there's a relatively high chance that newly-created Predictors will be used soon after. Thus they are assigned a "last used" timestamp of one hour in the past, meaning they won't take the place of any models which have been used more recently than that (those in "active" use); but if the LRU age of the cache as a whole is older than this, then loading of a single copy will be triggered. Of course if the Predictor isn't used and enough others are relative to the available capacity, its model copy will be unloaded at some point. There's also currently a rule that any "recently used" model will get at least two copies loaded (assuming there are at least two pods). This won't include the newly created Predictors, but if/when they actually receive an inference request, loading of a second copy will be triggered. However, this behaviour has turned out to cause more aggressive scaling than desired in some of our production deployments, so I am working on a change to make loading of this second copy dependent on more than one use over some smallish window of time rather than last-used time alone. In general the goal was to minimize the amount of configuration/tuning needed so that things work reasonably well for the most common usage scenarios, but there's definitely a lot of room to improve the current behaviour.
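To make the heuristic above concrete, here is a rough illustrative sketch in Python. This is not ModelMesh's actual code; it assumes a single cache that is already at capacity and ignores pod placement. The idea is just that a new Predictor is treated as if it was last used an hour ago, so it only displaces models idle for longer than that, and real usage is what makes a second copy eligible.

import time

ONE_HOUR = 3600.0  # seconds

class PlacementSketch:
    """Illustrative sketch of the LRU-based heuristic described above;
    not the real ModelMesh implementation."""

    def __init__(self):
        # model_id -> last-used timestamp of the loaded copy
        self.last_used = {}

    def register_predictor(self, model_id, now=None):
        """A newly created Predictor is treated as last used one hour ago."""
        now = now if now is not None else time.time()
        synthetic_ts = now - ONE_HOUR
        # Assuming the cache is at capacity: only trigger an initial load if
        # the least-recently-used entry is older than the synthetic timestamp,
        # i.e. never displace a model that was used within the last hour.
        lru_ts = min(self.last_used.values(), default=float("-inf"))
        if lru_ts <= synthetic_ts:
            self.last_used[model_id] = synthetic_ts  # load a first copy

    def on_inference(self, model_id, now=None):
        """A real request refreshes the timestamp; under the current rule it
        also makes the model eligible for a second copy on another pod."""
        self.last_used[model_id] = now if now is not None else time.time()
        return "request-second-copy"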
In general it's up to the particular model server and/or its adapter to estimate/report this back to model-mesh once the model has been loaded. However, in many cases this isn't straightforward since the memory requirement could depend on a number of factors. For Triton in particular we haven't found a good approach, and it can vary a lot based on the back-end being used and the kind of usage. So it's currently quite a crude/conservative estimate (in this case apparently not conservative enough), based on the model's size on disk - just this size multiplied by a constant factor. The default value of this multiplier for Triton is 1.25, but it can be overridden via the MODELSIZE_MULTIPLIER environment variable in the runtime's builtInAdapter spec. Another parameter which may be useful is DEFAULT_MODELSIZE; both are shown in the example ServingRuntime below:

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-2.x
  annotations:
    maxLoadingConcurrency: "2"
spec:
  # ...
  builtInAdapter:
    serverType: triton
    runtimeManagementPort: 8001
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
    env:
      - name: MODELSIZE_MULTIPLIER
        value: "1.5"
      - name: DEFAULT_MODELSIZE
        value: "104857600"  # in bytes
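For a concrete sense of the arithmetic described above, here is a small sketch, not the adapter's actual code, of estimating a model's memory as its size on disk times the multiplier. Treating DEFAULT_MODELSIZE as the fallback when no on-disk size can be determined is an assumption of this sketch.

import os

def estimate_model_memory_bytes(model_dir: str,
                                multiplier: float = 1.25,
                                default_size: int = 104857600) -> int:
    """Crude estimate: total size on disk multiplied by a constant factor;
    fall back to the default size if nothing measurable is found on disk."""
    total = 0
    for root, _dirs, files in os.walk(model_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return int(total * multiplier) if total > 0 else default_size

For example, a 400 MiB model directory with the default Triton multiplier of 1.25 would be reported as roughly 500 MiB of required memory.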
Currently it does not do so automatically; you can set the number of replicas per runtime yourself. In practice this has not been a problem for us, since we have a large number of models and the number of replicas is generally set large enough to allow the small number that are very heavily used to scale out sufficiently (i.e. to have a copy in every one). So all of the autoscaling and model churn happens within a more static pool of resources.
Thank you for the answers! In our case the combined load on our models varies widely throughout the day and month, so we do have a need to scale the number of runtime instances accordingly. Since it is not possible to set up a regular horizontal pod autoscaler for this purpose, I expect that we will be implementing our own scaling component that updates the number of runtime replicas.
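For illustration, a minimal sketch of what such a scaling component might do, assuming it adjusts the podsPerRuntime value in modelmesh-serving's model-serving-config ConfigMap. The ConfigMap and key names, and the choice of driving scaling through that ConfigMap at all, are assumptions to verify against your deployment.

import yaml
from kubernetes import client, config

def set_pods_per_runtime(namespace: str, desired: int) -> None:
    """Patch the assumed podsPerRuntime setting; the ModelMesh controller
    should then reconcile each runtime Deployment to the new replica count."""
    config.load_incluster_config()  # or config.load_kube_config() when run locally
    api = client.CoreV1Api()
    cm = api.read_namespaced_config_map("model-serving-config", namespace)
    cfg = yaml.safe_load((cm.data or {}).get("config.yaml", "")) or {}
    cfg["podsPerRuntime"] = desired
    cm.data = dict(cm.data or {}, **{"config.yaml": yaml.safe_dump(cfg)})
    api.replace_namespaced_config_map("model-serving-config", namespace, cm)

A loop that computes the desired count from whatever load metric matters to you and calls this periodically would be the core of such a component.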
The questions asked in this thread are very useful and help to answer some of our doubts as well.
Hi @OvervCW, why is it not possible to set up an HPA here? We are looking at autoscaling of the ServingRuntime as well. As it is a Deployment and ModelMesh itself exposes metrics, is it possible to use the native HPA to scale on custom metrics? @njhill
@lizzzcai The controller not only creates deployments for the ServingRuntimes, but it will also overwrite any changes made to those deployments afterwards, including the number of replicas. That means that when you create an HPA, it will try to update the number of replicas and fail.
ModelMesh exposes the right metrics if you want to scale on model pressure, but not if you want to scale on GPU utilization. I suppose it depends on your situation which (if not both) you'll want to scale on. The Triton runtime exposes its own Prometheus metrics with more GPU-specific data like memory consumption and utilization. Unfortunately it's not possible to create a ServiceMonitor for those since the port for these metrics is not included in the Pod/Service created by ModelMesh. This matters, because once metrics are in Prometheus, it's easy to scale on them using the Prometheus Metrics adapter. I've decided to extend the Triton runtime image with my own script that samples (more detailed) GPU utilization, and I'm writing my own controller that connects to the pods and collects this data to calculate the desired number of replicas.
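For reference, the replica calculation in a controller like that typically follows the same proportional rule the HPA uses. A minimal sketch follows; the 70% GPU-utilization target and the replica bounds are purely assumed example values.

import math

def desired_replicas(current_replicas: int,
                     avg_gpu_utilization: float,
                     target_utilization: float = 0.70,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Proportional scaling rule (same shape as the Kubernetes HPA formula):
    desired = ceil(current * observed / target), clamped to [min, max]."""
    desired = math.ceil(current_replicas * avg_gpu_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(2, 0.95))  # 2 replicas at 95% utilization with a 70% target -> 3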
@lizzzcai this isn't currently supported, but it would be fairly straightforward to add the functionality.
No, only LRU is supported and a fair amount of the logic exploits the LRU ordering for making approximations, etc.
@OvervCW PRs are very welcome :) I presume by your other statement that you've seen the Prometheus metrics exposed by model-mesh itself.
Is there any development regarding HPA in ModelMesh? @njhill
Any updates over here?
@Agarwal-Saurabh -- there is a PR in progress for kserve/modelmesh-serving#329
I'm currently trying out ModelMesh to see if it's a good solution for the following problems:
It seems like ModelMesh handles (1) through its local and global LRU caching. To test this behavior, I created about 20 Predictor resources, expecting the models to only be loaded once the first inference request comes in. However, ModelMesh tried to load all of the models into memory immediately after the Predictors were created. Is this expected behavior? Are there ways to configure the LRU parameters?
I was testing the 20 Predictors with 2 NVIDIA Triton runtime instances (using the CPU and 1 GiB of RAM) and noticed that the pods would continuously be OOMKilled. It seemed like ModelMesh was trying to load more models into memory than the available capacity. This happened even when changing the memBufferBytes to 950 MiB. While debugging this, I realized that I never had to specify the required memory for any of the Predictors, and now I wonder how ModelMesh determines whether a model will "fit". Does it try to estimate how much memory a model requires by simply loading it a first time and then measuring how much memory it consumes?

The third thing I'm wondering is how ModelMesh ensures that there are enough runtime instances to host all of the (scaled) models. I understand that it can auto-scale models based on their usage and distribute them across more runtime instances in that case, but how does it scale the runtime instances themselves? I couldn't find any references to "replicas" in the source code of ModelMesh.
So, to summarize my questions:
1. Is it expected that models are loaded as soon as their Predictors are created rather than on first use, and can the LRU behavior be configured?
2. How does ModelMesh estimate how much memory a model needs, and how does it decide whether a model will fit in a runtime instance?
3. How are the runtime instances themselves scaled so that there is enough capacity for all (scaled) models?