This repository shows how to deploy YOLOv5 as an optimized TensorRT engine to Triton Inference Server.
This project is based on isarsoft's yolov4-triton-tensorrt and Wang Xinyu's TensorRTx.
Run the following to get a running TensorRT container with our repo code:
cd tensorrt-triton-yolov5
bash launch_tensorrt.sh
To build the TensorRT image from tensorrt.Dockerfile yourself:
cd tensorrt-triton-yolov5
sudo docker build -t baohuynhbk/tensorrt-20.08-py3-opencv4:latest -f tensorrt.Dockerfile .
Docker will download the TensorRT base container. Choose the version tag (here 20.08) to match the version of Triton you plan to use later, so that the TensorRT versions line up; NGC containers with matching version tags ship the same TensorRT release.
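To double-check that the versions match, you can print the TensorRT version inside each container. This is just a quick sanity check and assumes the tensorrt Python bindings are installed in the image (they are in the NGC TensorRT containers):

```python
# Print the TensorRT version; run inside both the TensorRT build container and,
# if the bindings are available there, the Triton server container.
import tensorrt

print(tensorrt.__version__)
```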
Inside the container, run the following:
bash convert.sh
This will generate a file called yolov5.engine, which is our serialized TensorRT engine. Together with libmyplugins.so we can now deploy to Triton Inference Server.
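For illustration, the two build artifacts end up in a Triton model repository along the lines of the sketch below. The paths (triton-deploy/models/yolov5/1/model.plan and triton-deploy/plugins/) are assumptions borrowed from the upstream isarsoft layout, so adjust them to whatever run_triton.sh actually mounts:

```python
# Hypothetical sketch: stage the build outputs into a Triton model repository.
# Directory names are assumptions, not necessarily what this repo's scripts use.
import shutil
from pathlib import Path

model_dir = Path("triton-deploy/models/yolov5/1")
plugin_dir = Path("triton-deploy/plugins")
model_dir.mkdir(parents=True, exist_ok=True)
plugin_dir.mkdir(parents=True, exist_ok=True)

# Triton's TensorRT backend expects the serialized engine to be named model.plan.
shutil.copy("yolov5.engine", model_dir / "model.plan")
# The plugin library gets preloaded into the server process at startup.
shutil.copy("libmyplugins.so", plugin_dir / "libmyplugins.so")
```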
Open a terminal and start Triton Inference Server:
bash run_triton.sh
Install the Triton client libraries first:
sudo apt update
sudo apt install libb64-dev
pip install nvidia-pyindex
pip install tritonclient[all]
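Before running the client, it is worth checking that the server is reachable. Below is a minimal sketch with the gRPC client; the URL assumes Triton's default gRPC port 8001, so adjust it to the port mapping used by run_triton.sh:

```python
# Quick liveness/readiness check against a running Triton server.
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
print("Server live:", client.is_server_live())
print("Model ready:", client.is_model_ready("yolov5"))
```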
Open another terminal. This repo contains a Python client:
cd triton-deploy/clients/python
python client.py -o data/dog_result.jpg image data/dog.jpg
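If you want to send requests without the provided client, the call boils down to a single gRPC inference request. The sketch below is illustrative only: the tensor names ("data", "prob"), the input resolution, and the port are placeholders; the real values are defined in client.py and the model's config.pbtxt:

```python
# Illustrative raw inference request with tritonclient; tensor names, shape,
# and port are placeholders -- take the real ones from client.py / config.pbtxt.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Preprocessed image batch (NCHW, float32) -- zeros here as a stand-in.
image = np.zeros((1, 3, 640, 640), dtype=np.float32)

infer_input = grpcclient.InferInput("data", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)
requested_output = grpcclient.InferRequestedOutput("prob")

result = client.infer(model_name="yolov5",
                      inputs=[infer_input],
                      outputs=[requested_output])
detections = result.as_numpy("prob")
print(detections.shape)
```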
To benchmark the performance of the model, we can run Triton's Performance Client (perf_client).
To run perf_client, install the Triton Python client (tritonclient) as above; it ships with perf_client as a prebuilt binary.
# Example
perf_client -m yolov5 -u 127.0.0.1:8221 -i grpc --shared-memory system --concurrency-range 32
Alternatively, you can use the Triton Client SDK Docker container:
docker run -it --ipc=host --net=host nvcr.io/nvidia/tritonserver:21.03-py3-sdk /bin/bash
cd install/bin
# Example
./perf_client -m yolov5 -u 127.0.0.1:8221 -i grpc --shared-memory system --concurrency-range 4
The following benchmarks were taken on a system with an NVIDIA RTX 2080 Ti GPU.
Concurrency is the number of concurrent clients invoking inference on the Triton server via gRPC.
Results are the combined throughput of all clients in frames per second (FPS) and the average latency in milliseconds seen by each individual client (roughly, total FPS ≈ concurrency × 1000 / latency, since each client waits for a response before sending its next request).