Documentation about GPU memory #407
I saw that I was probably setting the configuration in the wrong place: kserve/modelmesh#46 (comment)
I get much better GPU utilization using:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    maxLoadingConcurrency: "2"
  labels:
    app.kubernetes.io/instance: modelmesh-controller
    app.kubernetes.io/managed-by: modelmesh-controller
    app.kubernetes.io/name: modelmesh-controller
    name: modelmesh-serving-triton-2.x-SR
  name: triton-2.x
  # namespace: inference-server
spec:
  builtInAdapter:
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
    runtimeManagementPort: 8001
    serverType: triton
    env:
      - name: CONTAINER_MEM_REQ_BYTES
        value: "12884901888" # Works for T4
      - name: MODELSIZE_MULTIPLIER
        value: "2"
  containers:
    - args:
        - -c
        - 'mkdir -p /models/_triton_models; chmod 777 /models/_triton_models; exec tritonserver
          "--model-repository=/models/_triton_models" "--model-control-mode=explicit"
          "--strict-model-config=false" "--strict-readiness=false" "--allow-http=true"
          "--allow-sagemaker=false" '
      command:
        - /bin/sh
      image: nvcr.io/nvidia/tritonserver:21.06.1-py3
      livenessProbe:
        exec:
          command:
            - curl
            - --fail
            - --silent
            - --show-error
            - --max-time
            - "9"
            - http://localhost:8000/v2/health/live
        initialDelaySeconds: 5
        periodSeconds: 30
        timeoutSeconds: 10
      name: triton
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          cpu: 500m
          memory: 1Gi
          nvidia.com/gpu: 1
```
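For reference, here is my rough reading of how those numbers interact, based on the explanation in kserve/modelmesh#82 (my understanding, not authoritative): ModelMesh takes CONTAINER_MEM_REQ_BYTES as the memory available for models, subtracts memBufferBytes, and books each model against that capacity at its on-disk size times MODELSIZE_MULTIPLIER:

```
capacity  = CONTAINER_MEM_REQ_BYTES - memBufferBytes
          = 12884901888 B - 134217728 B
          = 12 GiB - 128 MiB ≈ 11.9 GiB    # deliberately below the T4's ~15 GiB
per-model = size_on_disk × MODELSIZE_MULTIPLIER
          = size_on_disk × 2
# e.g. a hypothetical 1.5 GiB model is booked as 3 GiB,
# so roughly 3 such models fit before eviction starts
```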
Hi, it has been almost a year since your question, but I came across it today. First of all, thank you for the question; it showed me how to do model sizing. I hope you have solved your problem, but if not, then based on this issue you should place your env variables inside builtInAdapter, not inside containers :))
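If I read that issue correctly, the distinction looks like this (a trimmed sketch of the spec above, not a complete manifest):

```yaml
spec:
  builtInAdapter:
    serverType: triton
    env:                       # read by the ModelMesh adapter for sizing
      - name: CONTAINER_MEM_REQ_BYTES
        value: "12884901888"
      - name: MODELSIZE_MULTIPLIER
        value: "2"
  containers:
    - name: triton
      # env placed here is passed to the Triton process itself,
      # so the adapter never sees these sizing variables
```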
Thank you very much for the incredible project!
First of all, it would be very helpful to add documentation on how to manage GPU memory when using Triton.
I ran several tests but couldn't understand how the following env parameters work: CONTAINER_MEM_REQ_BYTES and MODELSIZE_MULTIPLIER. I read the following explanation: kserve/modelmesh#82 (comment). I applied the following configuration for T4:
However, I am seeing models being unloaded and loaded while GPU memory usage is 2522MiB / 15109MiB. I don't know why I can't get higher GPU utilization.
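If the sizing variables were indeed being ignored here (see the comment above about builtInAdapter placement), eviction at such low memory use would be consistent with this back-of-the-envelope check, assuming ModelMesh falls back to the container memory request when CONTAINER_MEM_REQ_BYTES is not picked up:

```
observed: 2522 MiB / 15109 MiB ≈ 17% of physical GPU memory
fallback capacity ≈ container memory request = 1 Gi
booked per model  ≈ size_on_disk × 2
# a model larger than ~512 MiB on disk already "fills" that capacity,
# triggering constant load/unload long before the GPU is actually full
```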