Export silero-vad ONNX to TensorRT #209

EnlNovius · 2022-08-03T12:52:34Z

EnlNovius
Aug 3, 2022

Export silero-vad ONNX to TensorRT

I am trying to translate the supplied ONXX network (files/silero_vad.onnx) to TensorRT (.trt). I tried two tools:

trtexec from nvidia (https://github.com/NVIDIA/TensorRT/tree/main/samples/trtexec)
onnx2trté (https://github.com/onnx/onnx-tensorrt)

With both tools, I get the following error:

$ trtexec --onnx=files/silero_vad.onnx
...
[08/03/2022-14:44:43] [E] Error[4]: [shuffleNode.cpp::symbolicExecute::392] Error Code 4: Internal Error (Reshape_17: IShuffleLayer applied to shape tensor must have 0 or 1 reshape dimensions: dimensions were [-1,2])
[08/03/2022-14:44:43] [E] [TRT] ModelImporter.cpp:773: While parsing node number 27 [Pad -> "152"]:
[08/03/2022-14:44:43] [E] [TRT] ModelImporter.cpp:774: --- Begin node ---
[08/03/2022-14:44:43] [E] [TRT] ModelImporter.cpp:775: input: "129"
...
[08/03/2022-14:44:43] [E] [TRT] ModelImporter.cpp:776: --- End node ---
[08/03/2022-14:44:43] [E] [TRT] ModelImporter.cpp:779: ERROR: ModelImporter.cpp:180 In function parseGraph:
[6] Invalid Node - Pad_27
[shuffleNode.cpp::symbolicExecute::392] Error Code 4: Internal Error (Reshape_17: IShuffleLayer applied to shape tensor must have 0 or 1 reshape dimensions: dimensions were [-1,2])
[08/03/2022-14:44:43] [E] Failed to parse onnx file

From what I could find, the problem would come from the fact that TensorRt does not currently manage tensor2D (https://forums.developer.nvidia.com/t/ishufflelayer-applied-to-shape-tensor-must-have-0-or-1-reshape-dimensions-dimensions-were-1-2/200183). A solution proposed in response is to use polygraphy with surgeon:

$  polygraphy surgeon sanitize --fold-constants files/silero_vad.onnx -o files/silero_vad_surgeon.onnx

Now if I apply one of my two tools, the problem seems to be solved, but another problem arises further on:

$ trtexec --onnx=files/silero_vad_surgeon.onnx
...
[08/03/2022-14:46:22] [E] Error[4]: If_33_OutputLayer: IIfConditionalOutputLayer inputs must have the same shape.
[08/03/2022-14:46:22] [E] [TRT] ModelImporter.cpp:773: While parsing node number 10 [If -> "158"]:
[08/03/2022-14:46:22] [E] [TRT] ModelImporter.cpp:774: --- Begin node ---
[08/03/2022-14:46:22] [E] [TRT] ModelImporter.cpp:775: input: "157"
...
[08/03/2022-14:46:22] [E] [TRT] ModelImporter.cpp:776: --- End node ---
[08/03/2022-14:46:22] [E] [TRT] ModelImporter.cpp:779: ERROR: ModelImporter.cpp:180 In function parseGraph:
[6] Invalid Node - If_33
If_33_OutputLayer: IIfConditionalOutputLayer inputs must have the same shape.
[08/03/2022-14:46:22] [E] Failed to parse onnx file

I haven't found a solution to this problem yet, does anyone have any idea how to solve this problem?

snakers4 · 2022-08-03T13:04:00Z

snakers4
Aug 3, 2022
Maintainer

TensorRt does not currently manage tensor2D

Interesting, 95% of the network consists of either 1D convolutions or Linear layers.
The network contains an internal normalization layer (with a padding), which most likely is the cause of the problem, since it always gave us grief during exports.
This is one of the reasons we decided to keep PyTorch and ONNX formats.

I am trying to translate the supplied ONXX network (files/silero_vad.onnx) to TensorRT (.trt).

But the main question is, why?

0 replies

EnlNovius · 2022-08-03T13:32:56Z

EnlNovius
Aug 3, 2022
Author

But the main question is, why?

I'm trying to use silero-vad in real time on an embedded system already using several TensorRt neural networks. Switching from a pytorch model to a TensorRt model normally allows optimization at the inference level (https://developer.nvidia.com/blog/speeding-up-deep-learning-inference-using-tensorflow-onnx-and-tensorrt/).

0 replies

EnlNovius · 2022-08-03T14:10:59Z

EnlNovius
Aug 3, 2022
Author

I just tried onnx_simplifier (https://github.com/daquexian/onnx-simplifier). The simplification of the tree solves both the first and the second problem.

$ onnxsim files/silero_vad.onnx files/silero_vad_onnxsim.onnx

However, a new problem arises:

$ trtexec --onnx=files/silero_vad_onnxsim.onnx
...
08/03/2022-16:00:23] [E] Error[4]: [graphShapeAnalyzer.cpp::processCheck::587] Error Code 4: Internal Error (Conv_81: spatial dimension of convolution output cannot be negative (build-time output dimension of axis 2 is -5))
[08/03/2022-16:00:23] [E] Error[2]: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
...

0 replies

EnlNovius · 2022-08-04T15:02:19Z

EnlNovius
Aug 4, 2022
Author

Problem solved:

Using onnx2trt I got the following error:

[2022-08-04 09:00:13   ERROR] 4: [network.cpp::validate::2965] Error Code 4: Internal Error (Network has dynamic or shape inputs, but no optimization profile has been defined.)

I couldn't find a solution to solve this problem with the two tools, so I went back to the nvidia documentation (https://developer.nvidia.com/blog/speeding-up-deep-learning-inference-using-tensorflow-onnx-and-tensorrt/ and https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#work_dynamic_shapes)

Solution

First, simplify the ONNX file:

$ onnxsim files/silero_vad.onnx files/silero_vad_onnxsim.onnx

Convert using python (or c++) the file to TensorRt:

Click here to expand

import tensorrt as trt
from onnx import ModelProto
import tensorrt as trt 

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)

def build_engine(onnx_path, shape):
    """
    This is the function to create the TensorRT engine
    Args:
      onnx_path : Path to onnx_file. 
      shape : Shape of the input of the ONNX file. 
    """
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network(1) as network, builder.create_builder_config() as config, trt.OnnxParser(network, TRT_LOGGER) as parser:
        config.max_workspace_size = (256 << 20)
        
        profile = builder.create_optimization_profile();
        profile.set_shape("input", (1536,), (1536,), (1536,))  # Or (512,), (1024,), (1536,) if we want something flexible

        config.add_optimization_profile(profile)

        with open(onnx_path, 'rb') as model:
            parser.parse(model.read())
        network.get_input(0).shape = shape
        engine = builder.build_engine(network, config)
        return engine

def save_engine(engine, file_name):
    buf = engine.serialize()
    with open(file_name, 'wb') as f:
        f.write(buf)
 
onnx_path = "files/silero_vad_onnxsim.onnx"
 
model = ModelProto()
with open(onnx_path, "rb") as f:
    model.ParseFromString(f.read())

shape = [1536] # The value here does not matter, it just has to be large enough to avoid the appearance of a negative dimension. 
engine = build_engine(onnx_path, shape=shape)
save_engine(engine, "files/silero_vad.engine")

Code to test with silero:

New class TrtWrapper in utils_vad.py

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit # You need this to init cuda
from collections import OrderedDict

def swap_on_key(d: OrderedDict, key1, key2):
    tmp = d[key1]
    d[key1] = d[key2]
    d[key2] = tmp


class TrtWrapper:
    def __init__(self, path):
        trt_logger = trt.Logger(trt.ILogger.Severity.INFO)
        assert os.path.exists(path)
        print("Reading engine from file {}".format(path))
        with open(path, "rb") as f, trt.Runtime(trt_logger) as runtime:
            self.engine: Optional[trt.ICudaEngine] = runtime.deserialize_cuda_engine(f.read())

        self.context: Optional[trt.IExecutionContext] = None
        self.memories: Optional[OrderedDict[str, Optional[int]]] = None
        self.stream: Optional[cuda.Stream] = None
        self.output_buffer = None

    def start(self, chunk_duration):
        self.context = self.engine.create_execution_context()
        self.context.set_binding_shape(self.engine.get_binding_index("input"), (chunk_duration, ))

        self.memories = OrderedDict()
        for binding in self.engine:
            binding_idx = self.engine.get_binding_index(binding)
            size = trt.volume(self.context.get_binding_shape(binding_idx))
            dtype = trt.nptype(self.engine.get_binding_dtype(binding))

            memory = cuda.mem_alloc(np.zeros(size, dtype=dtype).nbytes)  # TODO better solution to compute size ?
            self.memories[self.engine.get_binding_name(binding_idx)] = memory

            if self.engine.get_binding_name(binding_idx) == 'output':
                self.output_buffer = cuda.pagelocked_empty(size, dtype)

        assert all(self.memories.values())
        assert self.output_buffer is not None

        self.stream = cuda.Stream()

        self.reset_states()

    def swap_buffers(self):
        swap_on_key(self.memories, 'h0', 'hn')
        swap_on_key(self.memories, 'c0', 'cn')

    def reset_states(self):
        cuda.memcpy_htod_async(self.memories['h0'], np.zeros((2, 1, 64), dtype=np.float32), self.stream)
        cuda.memcpy_htod_async(self.memories['c0'], np.zeros((2, 1, 64), dtype=np.float32), self.stream)

    def __call__(self, x, sr: int):
        x = np.ascontiguousarray(x)

        # Transfer input data to the GPU.
        cuda.memcpy_htod_async(self.memories['input'], x, self.stream)

        # Run inference
        self.context.execute_async_v2(bindings=list(self.memories.values()), stream_handle=self.stream.handle)

        # swap input/output buffers h and c
        self.swap_buffers()

        # Transfer prediction output from the GPU.
        cuda.memcpy_dtoh_async(self.output_buffer, self.memories['output'], self.stream)

        # Synchronize the stream
        self.stream.synchronize()

        out = torch.tensor(self.output_buffer)[1]

        return out

    def close(self):
        self.context = None
        self.memories = None
        self.stream = None
        self.output_buffer = None

Code for test

SAMPLING_RATE = 16000
chunk_duration = 1536

from utils_vad import get_speech_timestamps, read_audio, TrtWrapper
import time
from pprint import pprint

import onnxruntime as ort
import numpy as np

# model = OnnxWrapper('files/silero_vad.onnx')
# model = OnnxWrapper('files/silero_vad_onnxsim.onnx')

model = TrtWrapper('files/silero_vad.engine')
model.start(chunk_duration) # Use start to init buffer and cuda

file = 'my_file.wav'

wav = read_audio(file, sampling_rate=SAMPLING_RATE)

# get speech timestamps from full audio file
t0 = time.time()
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE, window_size_samples=chunk_duration, min_silence_duration_ms=0)
print(time.time() - t0)
pprint(speech_timestamps)

Execution time

Using the Code for test on a 4 minutes audio file, average time of get_speech_timestamps function (average on 100 calls):

Model	Average time to process an audio file of 4min
torch.jit.load('files/silero_vad.jit')	4.34s
OnnxWrapper('files/silero_vad.onnx')	1.11s
OnnxWrapper('files/silero_vad_onnxsim.onnx')	0.93s
TrtWrapper('files/silero_vad.engine')	1.03s

1 reply

EnlNovius Aug 5, 2022
Author

By the way, not really related to the subject but it may be useful to someone. Here is a small piece of code to generate a subtitle file (.srt) to display voice detection on a video from the output of get_speech_timestamps:

Click here to expand

import datetime as dt

def time_to_str(t: dt.time):
    return t.strftime('%H:%M:%S,%f')[:-3]

tmp_dt = dt.datetime.today().replace(hour=0, minute=0, second=0, microsecond=0)
string_fmt = "{}\n{} --> {}\n{}\n\n"

string = ""
for i, vd in enumerate(speech_timestamps):
    tbeg = (tmp_dt + dt.timedelta(seconds=vd['start']/SAMPLING_RATE)).time()
    tend = (tmp_dt + dt.timedelta(seconds=vd['end']/SAMPLING_RATE)).time()
    text = 'voice detected ' + str(i)
    
    string += string_fmt.format(i, time_to_str(tbeg), time_to_str(tend), text)

print(string)
    
with open('my_output_name.srt', 'w') as f:
    f.write(string)

Output:

0
00:00:00,450 --> 00:00:01,470
voice detected 0

1
00:00:01,698 --> 00:00:03,390
voice detected 1

2
00:00:03,522 --> 00:00:06,078
voice detected 2

...

snakers4 · 2022-08-05T04:40:54Z

snakers4
Aug 5, 2022
Maintainer

on a 4 minutes audio file

One audio chunk should take about ~1ms on one CPU thread.
ONNX was similar.

By a simple calculation, a 4 minute audio file should take about:

4 mins * 60 seconds per minute * 16000 samples per second / 1536 samples per chunk * 1 ms per chunk / 1000 ms per second = 2.5 seconds

The fact that this takes ~40 times slower is strange.

import pycuda.autoinit # You need this to init cuda

There is very little reason to run VAD on CUDA, because there is very little to gain.
The model is very fast, most likely it will just incur an overhead to copy to and from the GPU.

0 replies

EnlNovius · 2022-08-05T12:19:33Z

EnlNovius
Aug 5, 2022
Author

The fact that this takes ~40 times slower is strange.

I wasn't clear in my results, I edited the post to correct that. The results that were displayed were those for running the network 100 times over 4min of video, so we were at 0.5ms / frame.

There is very little reason to run VAD on CUDA, because there is very little to gain.
The model is very fast, most likely it will just incur an overhead to copy to and from the GPU.

For my project, the audio data is already on the gpu, since it goes through other networks. According to the documentation "Using batching or GPU can also improve performance considerably.. Having already the data on GPU and the GPU being supposed to accelerate the processing times a bit more, I wanted to try to see what it could give, even if the performance on silero CPU is already remarkable.

Since I managed to run silero on TensorRT, I think the issue is complete. Thanks for the answers and for the very good work done on silero-vad 😃

0 replies

snakers4 · 2022-08-05T12:28:35Z

snakers4
Aug 5, 2022
Maintainer

so we were at 0.5ms / frame.

Seems in line with our bechmarks, albeit we did not test on GPU.

"Using batching or GPU can also improve performance considerably.. Having already the data on GPU and the GPU being supposed to accelerate the processing times a bit more,

Well, batching would work for multiple streams at the same time better. You can find more details in the discussion via this link. Basically each batch element in one separate stream.

In any case many thanks for your input on this conversion.
Hope someone finds it useful for their usecase.

Since I managed to run silero on TensorRT

Another questions is whether the model outputs are similar.

I will create a copy of this ticket as a discussion.

1 reply

EnlNovius Aug 5, 2022
Author

From what I could test on my short example, the results are the same for the ONNX, ONNXSIM and TensorRT Engine (the speech timestamps are identical and the value of the tensors at the output of the network are identical to the 4th decimal place, I did not display more precision).

It is interesting to note that going through onnxsim seems to have slightly improved the speed of the code without loss of results.

zxOnVacation · 2023-05-19T07:14:10Z

zxOnVacation
May 19, 2023

I build a trt engine for silero_vad_v4 using batch = 64
In T4 device, a 1 hour audio just cost 1.4s to infer.
batch input is awesome!

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Export silero-vad ONNX to TensorRT #209

{{title}}

Replies: 8 comments 2 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Export silero-vad ONNX to TensorRT #209

EnlNovius Aug 3, 2022