
Questions about functioning of ModelMesh #46

Closed
OvervCW opened this issue Jul 12, 2022 · 9 comments
Labels
question Further information is requested

Comments

@OvervCW

OvervCW commented Jul 12, 2022

I'm currently trying out ModelMesh to see if it's a good solution for the following problems:

  1. Only load models into GPUs and serve them when they are actually being used.
  2. Distribute the total set of loaded models across GPUs because they cannot all fit in the memory of a single GPU.
  3. Automatically scale up models and GPUs when the workload increases.

It seems like ModelMesh handles (1) through its local and global LRU caching. To test this behavior, I created about 20 Predictor resources, expecting the models to only be loaded once the first inference request comes in. However, ModelMesh tries to load all of the models into memory immediately after the Predictors are created. Is this expected behavior? Are there ways to configure the LRU parameters?
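For reference, each Predictor was a minimal resource roughly along these lines (the name, model type, path and storage key below are placeholders rather than my exact values):

apiVersion: serving.kserve.io/v1alpha1
kind: Predictor
metadata:
  name: example-model-01          # placeholder name
spec:
  modelType:
    name: onnx                    # placeholder; any Triton-supported type
  path: models/example-model-01   # path within the storage bucket
  storage:
    s3:
      secretKey: modelStorage     # key within the storage-config secret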

I was testing the 20 Predictors with 2 NVIDIA Triton runtime instances (using the CPU and 1 GiB of RAM) and noticed that the pods would continuously be OOMKilled. It seemed like ModelMesh was trying to load more models into memory than the available capacity. This happened even when changing memBufferBytes to 950 MiB. While debugging this, I realized that I never had to specify the required memory for any of the Predictors, and now I wonder how ModelMesh determines whether a model will "fit". Does it estimate how much memory a model requires by simply loading it once and measuring how much memory it consumes?

The third thing I'm wondering is how ModelMesh ensures that there are enough runtime instances to host all of the (scaled) models. I understand that it can auto-scale models based on their usage and distribute them across more runtime instances in that case, but how does it scale the runtime instances themselves? I couldn't find any references to "replicas" in the source code of ModelMesh.

So, to summarize my questions:

  • Why does ModelMesh try to load a model immediately after a Predictor has been defined, even if no inference requests have come in yet?
  • How does ModelMesh determine a model's memory usage and whether it will fit on a given runtime instance?
  • How does ModelMesh scale the number of runtime instances to fit all of the models and their multiple replicas?
@njhill njhill added the question Further information is requested label Jul 21, 2022
@njhill
Member

njhill commented Jul 22, 2022

Hi @OvervCW, these are all great questions :)

Why does ModelMesh try to load a model immediately after a Predictor has been defined, even if no inference requests have come in yet?

In general it tries to fill the available capacity with the models that are "most likely" to be used, and there's an assumption that newly-created Predictors have a relatively high chance of being used soon after creation. They are therefore assigned a "last used" timestamp of one hour in the past: they won't take the place of any models which have been used more recently than that (those in "active" use), but if the LRU age of the cache as a whole is older than an hour, loading of a single copy will be triggered. Of course, if the Predictor isn't used and enough others are (relative to the available capacity), its model copy will be unloaded at some point.

There's also currently a rule that any "recently used" model will get at least two copies loaded (assuming there are at least two pods). This won't include the newly created Predictors, but if/when they actually receive an inference request, loading of a second copy will be triggered. However, this behaviour has turned out to cause more aggressive scaling than desired in some of our production deployments, so I am working on a change to make loading of this second copy dependent on more than one use over some smallish window of time, rather than on last-used time alone.

In general the goal was to minimize the amount of configuration/tuning needed so that things work reasonably well for the most common usage scenarios, but there's definitely a lot of room to improve the current behaviour.

How does ModelMesh determine a model's memory usage and whether it will fit on a given runtime instance?

In general it's up to the particular model server and/or its adapter to estimate/report this back to ModelMesh once the model has been loaded. However, in many cases this isn't straightforward since the memory requirement can depend on a number of factors. For Triton in particular we haven't found a good approach, and it can vary a lot based on the back-end being used and the kind of usage. So it's currently quite a crude/conservative estimate (in this case apparently not conservative enough), based on the model's size on disk - just that size multiplied by a constant factor.

The default value of this multiplier for Triton is 1.25, but it can be overridden via the ServingRuntime spec (see below).

Another parameter which may be useful (in addition to memBufferBytes, which you already found) is the default model size. This is used to size unloading buffers and to estimate the amount of space to reserve for models before their size is known. I'd recommend setting it to somewhere between the average and maximum anticipated in-memory model sizes. Checking the code now, it appears its default value is ~1MB, which seems too low to me; I'll open a PR to increase this by at least an order of magnitude.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-2.x
  annotations:
    maxLoadingConcurrency: "2"
spec:
  # ...
  builtInAdapter:
    serverType: triton
    runtimeManagementPort: 8001
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
    env:
      MODELSIZE_MULTIPLIER: "1.5"
      DEFAULT_MODELSIZE: "104857600"   # in bytes

How does ModelMesh scale the number of runtime instances to fit all of the models and their multiple replicas?

Currently it does not do so automatically. You can set a number of replicas per ServingRuntime (and there's also a global default settable in the ConfigMap). The controller will scale each runtime's deployment to either 0 or N (N being the configured replicas), but only based on whether any Predictors exist that are supported by that runtime.
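As a rough sketch, the two places this can be set look like the following (values here are illustrative):

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-2.x
spec:
  replicas: 4          # per-runtime replica count
  # ...
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
data:
  config.yaml: |
    podsPerRuntime: 4  # global default replica count per runtime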

In practice this has not been a problem for us, since we have a large number of models and the number of replicas is generally set large enough to allow the small number that are very heavily used to scale out sufficiently (i.e. to have a copy in every runtime pod). So all of the autoscaling and model churn happens within a more static pool of resources.

@OvervCW
Author

OvervCW commented Jul 22, 2022

Thank you for the answers!

In our case the combined load on our models varies widely throughout the day and month, so we do have a need to scale the number of runtime instances accordingly. Since it is not possible to set up a regular horizontal pod autoscaler for this purpose, I expect that we will be implementing our own scaling component that updates the replicas count in the serving runtime resource based on average GPU utilization.

@lizzzcai
Member

The questions asked in this thread are very useful and help answer some of our doubts as well.

Thank you for the answers!

In our case the combined load on our models varies widely throughout the day and month, so we do have a need to scale the number of runtime instances accordingly. Since it is not possible to set up a regular horizontal pod autoscaler for this purpose, I expect that we will be implementing our own scaling component that updates the replicas count in the serving runtime resource based on average GPU utilization.

Hi @OvervCW, why is it not possible to set up an HPA here? We are looking into autoscaling the ServingRuntime as well. Since it is a Deployment and ModelMesh itself exposes metrics, is it possible to use a native HPA to scale on custom metrics? @njhill

  • For the model memory: in the previous KServe TrainedModel approach the user is able to specify the model memory. Is it possible to provide a similar option in ModelMesh, e.g. by adding an annotation?

  • For model management: is it possible to expose the cache eviction policy as a set of strategies that the user can specify in the ServingRuntime, e.g. LRU, LFU, etc.?

@OvervCW
Author

OvervCW commented Jul 28, 2022

why is it not possible to set up an HPA here?

@lizzzcai The controller not only creates the Deployments for the ServingRuntimes, it also overwrites any changes made to those Deployments afterwards, including the number of replicas. That means that when you create an HPA, it will try to update the number of replicas and fail.
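As an illustration, an HPA along these lines (the names here, including the Deployment name, are just my guesses for this sketch) would keep trying to set a replica count that the controller immediately reverts:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-runtime-hpa               # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: modelmesh-serving-triton-2.x   # guess at the controller-generated Deployment name
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70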

As it is a deployment and modelmesh itself is exposing metrics

ModelMesh exposes the right metrics if you want to scale on model pressure, but not if you want to scale on GPU utilization. I suppose it depends on your situation which (if not both) you'll want to scale on.

The Triton runtime exposes its own Prometheus metrics with more GPU-specific data like memory consumption and utilization. Unfortunately it's not possible to create a ServiceMonitor for those since the port for these metrics is not included in the Pod/Service created by ModelMesh. This matters, because once metrics are in Prometheus, it's easy to scale on them using the Prometheus Metrics adapter.
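For reference, a ServiceMonitor would need to look roughly like the sketch below, but it can only work if the Triton metrics port (8002 by default) were exposed as a named port on the Service, which it currently isn't (the label selector and port name here are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: triton-gpu-metrics                   # illustrative name
spec:
  selector:
    matchLabels:
      modelmesh-service: modelmesh-serving   # illustrative selector for the runtime Service
  endpoints:
    - port: triton-metrics                   # would need to exist as a named Service port; this is the missing piece
      interval: 15s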

I've decided to extend the Triton runtime image with my own script that samples (more detailed) GPU utilization, and I'm writing my own controller that connects to the pods and collects this data to calculate the desired number of replicas.

@lizzzcai
Member

@OvervCW Thanks for the clarification. @njhill, instead of users having to write their own controller for this, maybe the KServe team could consider giving users more flexibility to scale the ServingRuntime, or support it natively.

@njhill
Member

njhill commented Jul 29, 2022

For the model memory: in the previous KServe TrainedModel approach the user is able to specify the model memory. Is it possible to provide a similar option in ModelMesh, e.g. by adding an annotation?

@lizzzcai this isn't currently supported but it would be fairly straightforward to add the functionality.

For model management: is it possible to expose the cache eviction policy as a set of strategies that the user can specify in the ServingRuntime, e.g. LRU, LFU, etc.?

No, only LRU is supported and a fair amount of the logic exploits the LRU ordering for making approximations, etc.

Unfortunately it's not possible to create a ServiceMonitor for those since the port for these metrics is not included in the Pod/Service created by ModelMesh. This matters, because once metrics are in Prometheus, it's easy to scale on them using the Prometheus Metrics adapter.

@OvervCW PRs are very welcome :)

I presume from your other statement that you've seen the Prometheus metrics exposed by model-mesh itself.

@Agarwal-Saurabh

Is there any development regarding HPA in ModelMesh? @njhill

@Agarwal-Saurabh

Any updates here?

@ckadner
Member

ckadner commented Apr 13, 2023

Is there any development regarding HPA in ModelMesh? @njhill

@Agarwal-Saurabh -- there is a PR in progress: kserve/modelmesh-serving#329
