
Documentation about GPU memory #407

Open
WaterKnight1998 opened this issue Jul 25, 2023 · 3 comments
Labels
question Further information is requested

Comments

@WaterKnight1998

Thank you very much for the incredible project!

First of all, it would be very helpful if you added documentation on how to manage GPU memory while using Triton.

I ran several tests but couldn't understand how the following env parameters work: CONTAINER_MEM_REQ_BYTES and MODELSIZE_MULTIPLIER. I read the explanation in kserve/modelmesh#82 (comment).
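
Here is my rough understanding of the sizing math, sketched in Python. The capacity formula and the use of size-on-disk are my assumptions from reading that comment, not verified against the source, so please correct me if they are wrong:

# My (unverified) reading: the adapter advertises
# CONTAINER_MEM_REQ_BYTES - memBufferBytes as total capacity for models,
# and each model counts against it as size-on-disk * MODELSIZE_MULTIPLIER.
CONTAINER_MEM_REQ_BYTES = 12_884_901_888  # 12 GiB, as in the config below
MEM_BUFFER_BYTES = 134_217_728            # memBufferBytes (128 MiB)
MODELSIZE_MULTIPLIER = 2

capacity = CONTAINER_MEM_REQ_BYTES - MEM_BUFFER_BYTES

def estimated_footprint(size_on_disk: int) -> int:
    # Estimated loaded-memory cost of one model, in bytes.
    return size_on_disk * MODELSIZE_MULTIPLIER

# e.g. a 500 MiB model would count as ~1 GiB, so about 12 such models
# should fit in memory at once before anything needs to be evicted.
print(capacity // estimated_footprint(500 * 1024**2))  # -> 12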

I applied the following configuration for T4:

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    maxLoadingConcurrency: "2"
  labels:
    app.kubernetes.io/instance: modelmesh-controller
    app.kubernetes.io/managed-by: modelmesh-controller
    app.kubernetes.io/name: modelmesh-controller
    name: modelmesh-serving-triton-2.x-SR
  name: triton-2.x
  # namespace: inference-server
spec:
  builtInAdapter:
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
    runtimeManagementPort: 8001
    serverType: triton
  containers:
  - args:
    - -c
    - 'mkdir -p /models/_triton_models; chmod 777 /models/_triton_models; exec tritonserver
      "--model-repository=/models/_triton_models" "--model-control-mode=explicit"
      "--strict-model-config=false" "--strict-readiness=false" "--allow-http=true"
      "--allow-sagemaker=false" '
    command:
    - /bin/sh
    image: nvcr.io/nvidia/tritonserver:21.06.1-py3
    livenessProbe:
      exec:
        command:
        - curl
        - --fail
        - --silent
        - --show-error
        - --max-time
        - "9"
        - http://localhost:8000/v2/health/live
      initialDelaySeconds: 5
      periodSeconds: 30
      timeoutSeconds: 10
    name: triton
    env:
    - name: CONTAINER_MEM_REQ_BYTES
      value: "12884901888" 
    - name: MODELSIZE_MULTIPLIER
      value: "2"
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        cpu: 500m
        memory: 1Gi
        nvidia.com/gpu: 1

However, I am seeing models being unloaded and reloaded while GPU memory usage is only 2522MiB / 15109MiB. I don't understand why I can't get higher GPU utilization.
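
A quick sanity check of those numbers (plain arithmetic, assuming the env var were actually being applied):

# If the configured capacity were being picked up, evictions should not
# start at ~2.5 GiB of a ~12 GiB budget.
capacity_gib = 12_884_901_888 / 1024**3  # CONTAINER_MEM_REQ_BYTES
used_mib, total_mib = 2522, 15109        # from nvidia-smi
print(f"configured capacity: {capacity_gib:.1f} GiB")                     # 12.0 GiB
print(f"observed usage: {used_mib/1024:.1f} / {total_mib/1024:.1f} GiB")  # 2.5 / 14.8 GiB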

@WaterKnight1998
Author

It looks like I was setting the configuration in the wrong place: kserve/modelmesh#46 (comment)

@WaterKnight1998
Author

I get much better GPU utilization with the env variables moved under builtInAdapter:

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    maxLoadingConcurrency: "2"
  labels:
    app.kubernetes.io/instance: modelmesh-controller
    app.kubernetes.io/managed-by: modelmesh-controller
    app.kubernetes.io/name: modelmesh-controller
    name: modelmesh-serving-triton-2.x-SR
  name: triton-2.x
  # namespace: inference-server
spec:
  builtInAdapter:
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
    runtimeManagementPort: 8001
    serverType: triton
    env:
    - name: CONTAINER_MEM_REQ_BYTES
      value: "12884901888" # Works for T4
    - name: MODELSIZE_MULTIPLIER
      value: "2"
  containers:
  - args:
    - -c
    - 'mkdir -p /models/_triton_models; chmod 777 /models/_triton_models; exec tritonserver
      "--model-repository=/models/_triton_models" "--model-control-mode=explicit"
      "--strict-model-config=false" "--strict-readiness=false" "--allow-http=true"
      "--allow-sagemaker=false" '
    command:
    - /bin/sh
    image: nvcr.io/nvidia/tritonserver:21.06.1-py3
    livenessProbe:
      exec:
        command:
        - curl
        - --fail
        - --silent
        - --show-error
        - --max-time
        - "9"
        - http://localhost:8000/v2/health/live
      initialDelaySeconds: 5
      periodSeconds: 30
      timeoutSeconds: 10
    name: triton
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        cpu: 500m
        memory: 1Gi
        nvidia.com/gpu: 1
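
To confirm the new placement is actually taking effect, I watch GPU memory from inside the pod while models load and unload. A minimal sketch, assuming nvidia-smi is available in the Triton image:

import subprocess, time

# Poll nvidia-smi and print used/total GPU memory every 30 seconds.
while True:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(out)  # e.g. "2522 MiB, 15109 MiB"
    time.sleep(30)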

@rafvasq rafvasq added the question Further information is requested label Jul 26, 2023
@haiminh2001

> (quoting @WaterKnight1998's original comment in full)

Hi, it has been almost a year since you asked, but I came across this issue today. First of all, thank you for the question; it taught me how to do model sizing. I hope you have solved your problem by now, but if not: based on this issue, you should place your env variables inside builtInAdapter, not inside containers :))
