Export silero-vad ONNX to TensorRT #209
Replies: 8 comments 2 replies
-
Interesting, 95% of the network consists of either 1D convolutions or Linear layers.
But the main question is, why?
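For anyone who wants to verify that breakdown, a quick way to count op types in the graph (a sketch, assuming the `onnx` Python package is installed):

```python
import onnx
from collections import Counter

# Count how often each op type appears in the silero-vad ONNX graph.
model = onnx.load("files/silero_vad.onnx")
print(Counter(node.op_type for node in model.graph.node).most_common())
```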
-
I'm trying to use silero-vad in real time on an embedded system that already runs several TensorRT neural networks. Switching from a PyTorch model to a TensorRT model normally allows optimization at the inference level (https://developer.nvidia.com/blog/speeding-up-deep-learning-inference-using-tensorflow-onnx-and-tensorrt/).
-
I just tried onnx-simplifier (https://github.com/daquexian/onnx-simplifier). Simplifying the graph solves both the first and the second problem:

```
$ onnxsim files/silero_vad.onnx files/silero_vad_onnxsim.onnx
```

However, a new problem arises:

```
$ trtexec --onnx=files/silero_vad_onnxsim.onnx
...
[08/03/2022-16:00:23] [E] Error[4]: [graphShapeAnalyzer.cpp::processCheck::587] Error Code 4: Internal Error (Conv_81: spatial dimension of convolution output cannot be negative (build-time output dimension of axis 2 is -5))
[08/03/2022-16:00:23] [E] Error[2]: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed.)
...
```
-
Problem solved: simplify the ONNX file with onnxsim, then build the engine with the TensorRT Python API and an explicit optimization profile.

I couldn't find a way to solve this problem with the two tools alone, so I went back to the NVIDIA documentation (https://developer.nvidia.com/blog/speeding-up-deep-learning-inference-using-tensorflow-onnx-and-tensorrt/ and https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#work_dynamic_shapes).

**Solution**

First, simplify the ONNX file:

```
$ onnxsim files/silero_vad.onnx files/silero_vad_onnxsim.onnx
```

Then convert the file to TensorRT using Python (or C++):

```python
import tensorrt as trt
from onnx import ModelProto
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)

def build_engine(onnx_path, shape):
    """
    Create the TensorRT engine.

    Args:
        onnx_path: path to the ONNX file.
        shape: shape of the input of the ONNX file.
    """
    # 1 == the EXPLICIT_BATCH network creation flag
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(1) as network, \
         builder.create_builder_config() as config, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:
        config.max_workspace_size = 256 << 20  # 256 MiB
        profile = builder.create_optimization_profile()
        # min/opt/max shapes; use (512,), (1024,), (1536,) if we want something flexible
        profile.set_shape("input", (1536,), (1536,), (1536,))
        config.add_optimization_profile(profile)
        with open(onnx_path, 'rb') as model:
            parser.parse(model.read())
        network.get_input(0).shape = shape
        engine = builder.build_engine(network, config)
        return engine

def save_engine(engine, file_name):
    buf = engine.serialize()
    with open(file_name, 'wb') as f:
        f.write(buf)

onnx_path = "files/silero_vad_onnxsim.onnx"
model = ModelProto()
with open(onnx_path, "rb") as f:
    model.ParseFromString(f.read())

# The value here does not matter; it just has to be large enough
# to avoid the appearance of a negative dimension.
shape = [1536]
engine = build_engine(onnx_path, shape=shape)
save_engine(engine, "files/silero_vad.engine")
```

**Code to test with silero**

New class `TrtWrapper` in `utils_vad.py`:

```python
import os
from collections import OrderedDict
from typing import Optional

import numpy as np
import pycuda.autoinit  # you need this to init CUDA
import pycuda.driver as cuda
import tensorrt as trt
import torch

def swap_on_key(d: OrderedDict, key1, key2):
    d[key1], d[key2] = d[key2], d[key1]

class TrtWrapper:
    def __init__(self, path):
        trt_logger = trt.Logger(trt.ILogger.Severity.INFO)
        assert os.path.exists(path)
        print("Reading engine from file {}".format(path))
        with open(path, "rb") as f, trt.Runtime(trt_logger) as runtime:
            self.engine: Optional[trt.ICudaEngine] = runtime.deserialize_cuda_engine(f.read())
        self.context: Optional[trt.IExecutionContext] = None
        self.memories: Optional[OrderedDict] = None  # binding name -> device allocation
        self.stream: Optional[cuda.Stream] = None
        self.output_buffer = None

    def start(self, chunk_duration):
        self.context = self.engine.create_execution_context()
        self.context.set_binding_shape(self.engine.get_binding_index("input"), (chunk_duration,))
        self.memories = OrderedDict()
        for binding in self.engine:
            binding_idx = self.engine.get_binding_index(binding)
            size = trt.volume(self.context.get_binding_shape(binding_idx))
            dtype = trt.nptype(self.engine.get_binding_dtype(binding))
            memory = cuda.mem_alloc(np.zeros(size, dtype=dtype).nbytes)  # TODO better way to compute the size?
            self.memories[self.engine.get_binding_name(binding_idx)] = memory
            if self.engine.get_binding_name(binding_idx) == 'output':
                self.output_buffer = cuda.pagelocked_empty(size, dtype)
        assert all(self.memories.values())
        assert self.output_buffer is not None
        self.stream = cuda.Stream()
        self.reset_states()

    def swap_buffers(self):
        # After a forward pass, the output states hn/cn become the next input states h0/c0
        swap_on_key(self.memories, 'h0', 'hn')
        swap_on_key(self.memories, 'c0', 'cn')

    def reset_states(self):
        cuda.memcpy_htod_async(self.memories['h0'], np.zeros((2, 1, 64), dtype=np.float32), self.stream)
        cuda.memcpy_htod_async(self.memories['c0'], np.zeros((2, 1, 64), dtype=np.float32), self.stream)

    def __call__(self, x, sr: int):
        x = np.ascontiguousarray(x)
        # Transfer input data to the GPU
        cuda.memcpy_htod_async(self.memories['input'], x, self.stream)
        # Run inference
        self.context.execute_async_v2(bindings=list(self.memories.values()), stream_handle=self.stream.handle)
        # Swap input/output buffers h and c
        self.swap_buffers()
        # Transfer prediction output from the GPU
        cuda.memcpy_dtoh_async(self.output_buffer, self.memories['output'], self.stream)
        # Synchronize the stream
        self.stream.synchronize()
        out = torch.tensor(self.output_buffer)[1]  # index 1 = speech probability
        return out

    def close(self):
        self.context = None
        self.memories = None
        self.stream = None
        self.output_buffer = None
```

**Code for test**

```python
import time
from pprint import pprint

import numpy as np
import onnxruntime as ort  # only needed for the commented OnnxWrapper variants

from utils_vad import get_speech_timestamps, read_audio, TrtWrapper

SAMPLING_RATE = 16000
chunk_duration = 1536

# model = OnnxWrapper('files/silero_vad.onnx')
# model = OnnxWrapper('files/silero_vad_onnxsim.onnx')
model = TrtWrapper('files/silero_vad.engine')
model.start(chunk_duration) # Use start to init buffer and cuda
file = 'my_file.wav'
wav = read_audio(file, sampling_rate=SAMPLING_RATE)
# get speech timestamps from full audio file
t0 = time.time()
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE, window_size_samples=chunk_duration, min_silence_duration_ms=0)
print(time.time() - t0)
pprint(speech_timestamps)
```

**Execution time**

(results table from the original post: per the follow-up below, running the network 100 times over 4 minutes of audio comes out to about 0.5 ms per chunk)
-
One audio chunk should take about ~1 ms on one CPU thread. By a simple calculation (sketched below), a 4-minute audio file should take about 2.5 s.
The fact that this runs ~40 times slower than that is strange.
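A quick sanity check of that estimate (assuming 1536-sample chunks at 16 kHz, as in the test code above):

```python
# Expected CPU time for VAD over a 4-minute file,
# assuming ~1 ms of compute per 1536-sample chunk at 16 kHz.
audio_seconds = 4 * 60                 # 240 s of audio
chunks = audio_seconds * 16000 / 1536  # = 2500 chunks
expected_seconds = chunks * 1e-3       # ~2.5 s total
print(f"{chunks:.0f} chunks -> ~{expected_seconds:.1f} s")
```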
There is very little reason to run VAD on CUDA, because there is very little to gain.
-
I wasn't clear in my results, so I edited the post to correct that. The displayed results were for running the network 100 times over 4 minutes of audio, i.e. about 0.5 ms per chunk.
For my project, the audio data is already on the GPU, since it goes through other networks. According to the documentation, "Using batching or GPU can also improve performance considerably." Since the data is already on the GPU and the GPU should speed up processing a bit more, I wanted to see what it could give, even though silero's CPU performance is already remarkable. Since I managed to run silero on TensorRT, I consider the issue complete. Thanks for the answers and for the very good work done on silero-vad 😃
-
Seems in line with our benchmarks, although we did not test on GPU.
Well, batching would work better for multiple streams at the same time; you can find more details in the discussion via this link. Basically, each batch element is one separate stream (see the sketch below). In any case, many thanks for your input on this conversion.
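A rough sketch of that idea (shapes are illustrative; it assumes an engine built with a dynamic leading batch dimension, unlike the 1-D engine above):

```python
import numpy as np

# Hypothetical: 8 independent audio streams, one 1536-sample chunk each.
# Each row of the batch carries one stream; the recurrent h/c states
# likewise keep one slice per stream (the model uses (2, batch, 64)).
num_streams, chunk_size = 8, 1536
chunks = [np.zeros(chunk_size, dtype=np.float32) for _ in range(num_streams)]
batch = np.stack(chunks)  # shape (8, 1536): one forward pass scores all streams
state_h = np.zeros((2, num_streams, 64), dtype=np.float32)
state_c = np.zeros((2, num_streams, 64), dtype=np.float32)
print(batch.shape, state_h.shape)  # (8, 1536) (2, 8, 64)
```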
Another question is whether the model outputs are similar. I will create a copy of this ticket as a discussion.
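One way to check that would be to score the same chunk with both backends and compare (a sketch; it assumes the repo's `OnnxWrapper` class and that both wrappers return the speech probability for a chunk, as in the test code above):

```python
import numpy as np
from utils_vad import OnnxWrapper, TrtWrapper  # OnnxWrapper as in the repo's utils_vad

# Load both backends and feed them the same random chunk.
onnx_model = OnnxWrapper('files/silero_vad.onnx')
trt_model = TrtWrapper('files/silero_vad.engine')
trt_model.start(1536)

x = np.random.randn(1536).astype(np.float32)
diff = abs(float(onnx_model(x, 16000)) - float(trt_model(x, 16000)))
print(diff)  # should be small, e.g. < 1e-3
```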
-
I built a TensorRT engine for silero_vad_v4 using batch = 64.
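For reference, a hypothetical `trtexec` invocation for such a fixed-batch engine. The input/state names and shapes here are assumptions (inspect the actual v4 graph first, e.g. with Netron), and every dynamic input needs an entry in the shape profiles:

```
$ trtexec --onnx=files/silero_vad_v4.onnx \
    --minShapes=input:64x1536,h:2x64x64,c:2x64x64 \
    --optShapes=input:64x1536,h:2x64x64,c:2x64x64 \
    --maxShapes=input:64x1536,h:2x64x64,c:2x64x64 \
    --saveEngine=files/silero_vad_v4_b64.engine
```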
-
Export silero-vad ONNX to TensorRT
I am trying to translate the supplied ONNX network (`files/silero_vad.onnx`) to TensorRT (.trt). I tried two tools:

- `trtexec` from NVIDIA (https://github.com/NVIDIA/TensorRT/tree/main/samples/trtexec)
- `onnx2trt` (https://github.com/onnx/onnx-tensorrt)

With both tools, I get the following error:
From what I could find, the problem comes from the fact that TensorRT does not currently handle 2D shape tensors (https://forums.developer.nvidia.com/t/ishufflelayer-applied-to-shape-tensor-must-have-0-or-1-reshape-dimensions-dimensions-were-1-2/200183). A solution proposed in response is to use `polygraphy` with surgeon:
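A plausible invocation, assuming the standard `polygraphy surgeon sanitize` subcommand (`--fold-constants` folds the shape subgraphs that trip TensorRT; the output path is illustrative):

```
$ polygraphy surgeon sanitize files/silero_vad.onnx --fold-constants -o files/silero_vad_folded.onnx
```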
Now if I apply one of my two tools, the problem seems to be solved, but another problem arises further on:
I haven't found a solution to this problem yet. Does anyone have an idea how to solve it?