ModelMesh model-loading decision process #82

Closed
GolanLevy opened this issue Feb 2, 2023 · 10 comments

@GolanLevy

Hi,

We are currently evaluating ModelMesh to see whether it can serve many (similar) models, around 100k, each weighing about 500MB on disk.
We are running it on a K8s cluster with GPU pods (g4dn.xlarge) and using the model serving controller to orchestrate model registration and ServingRuntime creation.

The main issue we are currently experiencing is that the runtime pods are constantly being killed as they require more and more memory (classic OOMKilled) while we run predictions that require loading different models.
How can we troubleshoot that? We do see that the triton containers keep requesting more memory without bound, but we could not find the piece of code that is supposed to manage this, i.e., decide which model to load on which instance, and which model to unload in order to free memory for new models.

So basically my questions are:
(1) Where is the code that makes the loading/unloading decisions: the runtime adapter, the mm container, or somewhere else? Which class/function?
(2) How does it work? I understand that the model size is inferred via heuristics that try to predict it in advance, but I could not find anything definitive in the docs.

This is our servingRuntime configuration:

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  annotations:
    maxLoadingConcurrency: "2"
  labels:
    app.kubernetes.io/instance: modelmesh-controller
    app.kubernetes.io/managed-by: modelmesh-controller
    app.kubernetes.io/name: modelmesh-controller
    name: modelmesh-serving-triton-2.x-SR
  name: triton-2.x
spec:
  builtInAdapter:
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
    runtimeManagementPort: 8001
    serverType: triton
  containers:
    - args:
        - -c
        - 'mkdir -p /models/_triton_models; chmod 777 /models/_triton_models; exec tritonserver
          "--model-repository=/models/_triton_models" "--model-control-mode=explicit"
          "--strict-model-config=false" "--strict-readiness=false" "--allow-http=true"
          "--allow-sagemaker=false" "--log-verbose=1" '
      command:
        - /bin/sh
      image: nvcr.io/nvidia/tritonserver:22.09-py3
      livenessProbe:
        exec:
          command:
            - curl
            - --fail
            - --silent
            - --show-error
            - --max-time
            - "9"
            - http://localhost:8000/v2/health/live
        initialDelaySeconds: 5
        periodSeconds: 30
        timeoutSeconds: 10
      name: triton
      resources:
        limits:
          cpu: 2000m
          memory: 4Gi
        requests:
          cpu: 2000m
          memory: 4Gi
  grpcDataEndpoint: port:8001
  grpcEndpoint: port:8085
  multiModel: true
  protocolVersions:
    - grpc-v2
  supportedModelFormats:
    - name: triton
      version: "2"

Thanks!

@njhill
Member

njhill commented Feb 18, 2023

@GolanLevy apologies again for taking so long to respond here; I am a bit underwater with different projects at the moment.

The runtime adapter reports the capacity for loaded models during startup. This is based on the container's (triton in this case) allocated memory, with some fixed overhead value subtracted. This fixed overhead can be configured via the builtInAdapter.memBufferBytes field in the ServingRuntime spec: https://github.com/kserve/kserve/blob/master/pkg/apis/serving/v1alpha1/servingruntime_types.go#L179

The accounting is done based on this and the model sizes that are reported by the adapter when models are loaded. Since for Triton (like many runtimes) there's no way to determine how much space a model takes in memory, it's crudely estimated as the model size on disk multiplied by a constant factor... the default is 1.25 but it can be overridden.

Before the model is loaded, a prediction is made of the model size... this is based on the size of other models of the same type if there are enough of them loaded, otherwise on a default model size which can also be configured. This is used to "make room" for the model, and the accounting is adjusted once the (disk-based) measured size is reported post-loading.

To override each of these you can set MODELSIZE_MULTIPLIER and DEFAULT_MODELSIZE environment variables in the builtInAdapter.env field of the ServingRuntime spec.
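
For illustration, a minimal sketch of how this might look in the ServingRuntime spec (the values are placeholders only; DEFAULT_MODELSIZE is assumed to be in bytes):

spec:
  builtInAdapter:
    env:
      - name: MODELSIZE_MULTIPLIER
        value: "1.5"
      - name: DEFAULT_MODELSIZE
        value: "1073741824"  # assumed to be bytes, i.e. 1GiB here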

Also, some space is reserved as a buffer so that new models can begin to load while others are being evicted/unloaded; this reserved space is used only for that unloading headroom.

Part of the reason you may need to make these values more conservative to avoid OOMs is that modelmesh assumes that you account for the additional memory needed for inferencing (i.e. beyond the model weights) in these fixed overheads and multipliers.

I hope this helps.

@GolanLevy
Author

Thanks @njhill .
To deal with this issue, we increased triton's container memory to 12GB.
Then we encountered a similar issue, this time with the GPU's memory.
The models were loaded and filled up the GPU's memory to the point where only about 10MB was left, which was not enough for inferencing.
To deal with that, we increased the suggested values (DEFAULT_MODELSIZE = 1GiB, MODELSIZE_MULTIPLIER = 2), and it seemed to fix our issue, but we are not completely sure why.

Our questions:

  1. What is the CPU memory used for if the models are eventually loaded to the GPU?
  2. How do we reserve GPU memory for inferencing? Is there a way to specify a fixed amount (say, 2GiB) for that?
  3. Can you please elaborate on memBufferBytes? We are not sure which value this field is subtracted from, or how we should estimate it. If we want to be as conservative as possible, should we increase or decrease it?

@njhill
Member

njhill commented Mar 4, 2023

Hi @GolanLevy, ModelMesh's memory accounting is one-dimensional. Typically the location where your models are loaded and run will be the limiting factor and so you should have modelmesh track the mem usage corresponding to that.

So in the GPU case it should really be tracking GPU memory use w.r.t. model placement decisions, and you can hopefully assume that much less CPU memory will be needed and therefore doesn't need to be considered from a resource constraint pov.

  1. What is the CPU memory used for if the models are eventually loaded to the GPU?

Model-mesh only deals with the abstract accounting and doesn't know/care about CPU/GPU distinction. So this would be more a question about how Triton works. My assumption is that when using GPUs only nominal CPU is used.

  2. How do we reserve GPU memory for inferencing? Is there a way to specify a fixed amount (say, 2GiB) for that?

Yes, this is actually the purpose of memBufferBytes and MODELSIZE_MULTIPLIER. The former is a fixed overhead you can specify, which could include this fixed amount for inferencing. The latter is an additional per-model memory overhead (relative to the model size on disk).

  3. Can you please elaborate on memBufferBytes? We are not sure which value this field is subtracted from, or how we should estimate it. If we want to be as conservative as possible, should we increase or decrease it?

This is a good question - it's intended to be the runtime container's total allocated memory (the triton container in this case), and so is taken from its resources.memory value in the ServingRuntime spec. However, this was originally written with CPU in mind, so it won't be the value you want if using GPU. Fortunately you should be able to override it by setting the following builtInAdapter env var (in the same way you set MODELSIZE_MULTIPLIER): CONTAINER_MEM_REQ_BYTES.
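
For example (the numbers are illustrative only, assuming a 16GiB GPU), that override could look like:

spec:
  builtInAdapter:
    env:
      - name: CONTAINER_MEM_REQ_BYTES
        value: "17179869184"  # 16GiB of GPU memory, expressed in bytes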

Regarding "Cache churn threshold exceeded" errors, this just means that you don't have sufficient capacity to handle all of the models in active use at the same time (it shouldn't be the case that a loaded model would be evicted immediately due to other loaded models pushing it out), but this of course could be a symptom of incorrect accounting because of the config issues discussed above.

@GolanLevy
Author

GolanLevy commented Mar 9, 2023

Thanks, we will re-examine these parameters more carefully.

Regarding "Cache churn threshold exceeded" errors, this just means that you don't have sufficient capacity to handle all of the models in active use at the same time (it shouldn't be the case that a loaded model would be evicted immediately due to other loaded models pushing it out), but this of course could be a symptom of incorrect accounting because of the config issues discussed above.

Looking at the code, there is a time window, defined by minChurnAgeMs, during which the system does not allow more models than its total capacity to be loaded: an eviction is basically cancelled if the model being evicted is younger than this parameter.
We were able to overcome this error by changing the configmap:

InternalModelMeshEnvVars:
      - name: CUSTOM_JVM_ARGS
        value: "-Dtas.min_churn_age_ms=0"

which basically disables this feature.

@njhill
Member

njhill commented Mar 10, 2023

@GolanLevy yes, good investigating to find that, I should probably have mentioned it. However this limiting is there for a reason... you should really increase the effective capacity, because hitting it implies the same models are continuously getting evicted and reloaded, which would have a very negative impact on performance... kind of equivalent to memory paging/thrashing.

@GolanLevy
Author

@njhill Unfortunately, this is the common case in our application - many models that are mostly dormant and only need to be executed a few times every ~15 minutes. Since we cannot have enough machines to load them all at once, we hope that modelmesh will be able to swap them quickly enough.

@ericlu88

ericlu88 commented Jun 7, 2023

ModelMesh's memory accounting is one-dimensional. Typically the location where your models are loaded and run will be the limiting factor and so you should have modelmesh track the mem usage corresponding to that.

So in the GPU case it should really be tracking GPU memory use w.r.t. model placement decisions, and you can hopefully assume that much less CPU memory will be needed and therefore doesn't need to be considered from a resource constraint pov.

I'm curious how to set up a ModelMesh-serving environment with an MLServer runtime that tracks GPU memory instead of CPU memory. Could someone shed some light on this? Really appreciate it.

@GolanLevy
Author

@ericlu88 The solution we found is fairly straightforward:
Set the following environment variables:
CONTAINER_MEM_REQ_BYTES: the GPU memory (this is the CPU memory by default)
MODELSIZE_MULTIPLIER: the ratio between the model's size on GPU (examine the diff after loading a model) and its size on disk.
memBufferBytes: the GPU memory you would like to reserve (mostly for inferencing).

Hence, the number of models you can load on average is (GPU memory - memBufferBytes) / (avg(modelSize) * MODELSIZE_MULTIPLIER).
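
For example, with a 16GiB GPU, memBufferBytes of 2GiB, an average model size on disk of 500MB and a MODELSIZE_MULTIPLIER of 2, that gives roughly (16 - 2) / (0.5 * 2) = 14 models per runtime instance (illustrative numbers, not our actual setup).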

In our case, we see an increase in CPU memory as well (using tritonserver), and we are not sure why, so make sure you have enough CPU memory for your container; we need ~6GB of CPU memory to hold 15GiB of models loaded on the GPU.
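
Putting it together, here is a rough sketch of the relevant ServingRuntime fields (the numbers are illustrative only; take them from your own GPU and measurements):

spec:
  builtInAdapter:
    memBufferBytes: 2147483648  # ~2GiB of GPU memory reserved, mostly for inferencing
    env:
      - name: CONTAINER_MEM_REQ_BYTES
        value: "17179869184"  # total GPU memory (16GiB here) instead of the default CPU request
      - name: MODELSIZE_MULTIPLIER
        value: "2"  # measured ratio of GPU size to on-disk size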

@ericlu88

ericlu88 commented Jun 8, 2023

@GolanLevy Got it, originally I thought ModelMesh would provide some sort of observability into the GPU memory usage side, but it seems (based on the discussion in this thread) that it is pretty heuristic-based. I've tried your approach and it does seem to start honoring the settings (I've started to see it constantly unloading models to make room for others). Thanks a lot for your input. Though I wish ModelMesh were able to tap into the actual usage of the GPU card, possibly through a plugin to export metrics.

@WaterKnight1998

WaterKnight1998 commented Jul 24, 2023

@GolanLevy Got it, originally I thought ModelMesh would provide some sort of observability into the GPU memory usage side, but it seems (based on the discussion in this thread) that it is pretty heuristic-based. I've tried your approach and it does seem to start honoring the settings (I've started to see it constantly unloading models to make room for others). Thanks a lot for your input. Though I wish ModelMesh were able to tap into the actual usage of the GPU card, possibly through a plugin to export metrics.

This would be very helpful. I was looking at this service for loading and unloading models automatically based on GPU memory consumption, but the only thing I have seen so far is that it computes sizes based on the model size on disk.

Maybe it is worth trying to split an A100 80GB into 7 partitions using MIG, then run 7 replicas of Triton Inference Server and allow just one model per replica.
