This repository shows how to deploy YOLOv5 as an optimized TensorRT engine to Triton Inference Server.
This project is based on isarsoft's yolov4-triton-tensorrt and Wang Xinyu's TensorRTx.
Run the following to get a running TensorRT container with our repo code:
cd tensorrt-triton-yolov5
bash launch_tensorrt.sh
To build the TensorRT image from tensorrt.Dockerfile yourself:
cd tensorrt-triton-yolov5
sudo docker build -t baohuynhbk/tensorrt-20.08-py3-opencv4:latest -f tensorrt.Dockerfile .
Docker will download the TensorRT base container. Choose the version tag (here 20.08) to match the version of Triton you plan to use later, so that the TensorRT versions line up; NGC containers with matching version tags ship the same TensorRT release.
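To double-check that the versions match, you can print the TensorRT version inside each container. This is just a quick sanity check and assumes the tensorrt Python bindings are installed in the image (they are in the NGC TensorRT containers):

```python
# Print the TensorRT version; run inside both the TensorRT build container and,
# if the bindings are available there, the Triton server container.
import tensorrt

print(tensorrt.__version__)
```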
Inside the container, run the following:
bash convert.sh
This will generate a file called yolov5.engine, which is our serialized TensorRT engine. Together with libmyplugins.so we can now deploy to Triton Inference Server.
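For illustration, the two build artifacts end up in a Triton model repository along the lines of the sketch below. The paths (triton-deploy/models/yolov5/1/model.plan and triton-deploy/plugins/) are assumptions borrowed from the upstream isarsoft layout, so adjust them to whatever run_triton.sh actually mounts:

```python
# Hypothetical sketch: stage the build outputs into a Triton model repository.
# Directory names are assumptions, not necessarily what this repo's scripts use.
import shutil
from pathlib import Path

model_dir = Path("triton-deploy/models/yolov5/1")
plugin_dir = Path("triton-deploy/plugins")
model_dir.mkdir(parents=True, exist_ok=True)
plugin_dir.mkdir(parents=True, exist_ok=True)

# Triton's TensorRT backend expects the serialized engine to be named model.plan.
shutil.copy("yolov5.engine", model_dir / "model.plan")
# The plugin library gets preloaded into the server process at startup.
shutil.copy("libmyplugins.so", plugin_dir / "libmyplugins.so")
```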
Open a terminal and start Triton Inference Server:
bash run_triton.sh
Install the Triton client libraries first:
sudo apt update
sudo apt install libb64-dev
pip install nvidia-pyindex
pip install tritonclient[all]
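Before running the client, it is worth checking that the server is reachable. Below is a minimal sketch with the gRPC client; the URL assumes Triton's default gRPC port 8001, so adjust it to the port mapping used by run_triton.sh:

```python
# Quick liveness/readiness check against a running Triton server.
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
print("Server live:", client.is_server_live())
print("Model ready:", client.is_model_ready("yolov5"))
```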
Open another terminal. This repo contains a Python client:
cd triton-deploy/clients/python
python client.py -o data/dog_result.jpg image data/dog.jpg
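If you want to send requests without the provided client, the call boils down to a single gRPC inference request. The sketch below is illustrative only: the tensor names ("data", "prob"), the input resolution, and the port are placeholders; the real values are defined in client.py and the model's config.pbtxt:

```python
# Illustrative raw inference request with tritonclient; tensor names, shape,
# and port are placeholders -- take the real ones from client.py / config.pbtxt.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Preprocessed image batch (NCHW, float32) -- zeros here as a stand-in.
image = np.zeros((1, 3, 640, 640), dtype=np.float32)

infer_input = grpcclient.InferInput("data", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)
requested_output = grpcclient.InferRequestedOutput("prob")

result = client.infer(model_name="yolov5",
                      inputs=[infer_input],
                      outputs=[requested_output])
detections = result.as_numpy("prob")
print(detections.shape)
```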
To benchmark the performance of the model, we can run Triton's Performance Client (perf_client).
To run perf_client, install the Triton Python client (tritonclient) as above; it ships with perf_client as a prebuilt binary.
# Example
perf_client -m yolov5 -u 127.0.0.1:8221 -i grpc --shared-memory system --concurrency-range 32
Alternatively, you can use the Triton Client SDK Docker container:
docker run -it --ipc=host --net=host nvcr.io/nvidia/tritonserver:21.03-py3-sdk /bin/bash
cd install/bin
# Example
./perf_client -m yolov5 -u 127.0.0.1:8221 -i grpc --shared-memory system --concurrency-range 4
The following benchmarks were taken on a system with an NVIDIA RTX 2080 Ti GPU.
Concurrency is the number of concurrent clients invoking inference on the Triton server via gRPC.
Results are the combined throughput of all clients in frames per second (FPS) and the average latency in milliseconds seen by each individual client (roughly, total FPS ≈ concurrency × 1000 / latency, since each client waits for a response before sending its next request).