Improve the responsiveness of voices by trimming the leading silence #17614

gexgd0419 · 2025-01-12T16:32:00Z

Is your feature request related to a problem? Please describe.

This is related to #13284.

#17592 closed that issue by making SAPI5 voices output via WASAPI. This improved responsiveness, but we can improve it further by trimming the leading silence.

Take Microsoft Zira Desktop (SAPI5) as an example. When speaking at 1X speed, the leading silence is 100 ms long. At its maximum rate (3X speed), the leading silence is still about 30 ms long. If we remove the leading silence, the voice will respond even faster.

Other voices, such as OneCore voices, also have a few milliseconds of leading silence.

Describe the solution you'd like

We can detect and remove the silent portion of the audio in WavePlayer, either in the Python part or in the C++ part. Since eSpeak, OneCore and SAPI5 (plus MSSP) all use WavePlayer now, they would all benefit. The synthesizer may need to tell WavePlayer when the audio starts or ends, so that WavePlayer can locate the leading silence more easily.
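To make the detection step concrete, here is a minimal sketch in Python. It assumes 16-bit signed little-endian mono PCM, and both the function name `trim_leading_silence` and the `threshold` value are hypothetical, not taken from NVDA or WavePlayer. It only shows the idea of scanning for the first sample above an amplitude cutoff.

```python
import array
import sys


def trim_leading_silence(pcm: bytes, threshold: int = 64) -> bytes:
    """Return pcm with its leading near-silent samples removed.

    Assumption: pcm is 16-bit signed little-endian mono audio.
    threshold is an illustrative amplitude cutoff, not a tuned value.
    """
    samples = array.array("h")  # signed 16-bit integers
    # Ignore a trailing odd byte, if any, while scanning.
    samples.frombytes(pcm[: len(pcm) - len(pcm) % 2])
    if sys.byteorder == "big":
        samples.byteswap()  # the data is little-endian
    for index, sample in enumerate(samples):
        if abs(sample) > threshold:
            # Keep everything from the first audible sample onward.
            return pcm[index * 2 :]
    # No sample exceeded the threshold: the whole buffer is silence.
    return b""
```

A real implementation would also need to know the sample width and channel count of the stream, and the cutoff would have to be chosen so that quiet consonants at the start of a word are not clipped.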

Describe alternatives you've considered

Create a stand-alone module for detecting and removing the silent portion of the audio, either in Python or in C++. The synthesizers would pass the audio data through this module before feeding it to WavePlayer.

Additional context

I'm not sure what the best approach to implementing this is.


Adriani90 · 2025-01-12T18:48:27Z

Cc: @michaelDCurran

cary-rowen · 2025-01-14T03:01:00Z

cc @jcsteh
You might also be able to provide some implementation insights

gexgd0419 · 2025-01-14T08:46:25Z

Only the leading silence part should be removed. Silence in other parts, such as between sentences, shouldn't be touched.

So the question is how to determine the starting point of each utterance.

If we add another function to tell WavePlayer the starting point, all synthesizers have to be modified to take advantage of this feature. So is there a function that most synthesizers will call before speaking or after speaking is completed?

WavePlayer has a function called idle. I'm not quite sure how it should be used, but it seems that idle is usually called when speaking is completed. So maybe we can use idle to set the starting point: assume that the audio sent by the first feed after idle is the beginning of a new utterance, and perform leading silence removal on that.
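The idle-based idea could be sketched roughly as follows. This is an illustration under stated assumptions, not WavePlayer's actual code: `Player` is a stand-in interface with simplified `feed` and `idle` signatures, `LeadingSilenceTrimmer` is a hypothetical wrapper, and the audio is assumed to be 16-bit signed little-endian mono PCM delivered in chunks aligned to whole samples.

```python
import array
import sys
from typing import Optional, Protocol


class Player(Protocol):
    """Stand-in for a player exposing feed and idle (simplified signatures)."""

    def feed(self, data: bytes) -> None: ...
    def idle(self) -> None: ...


class LeadingSilenceTrimmer:
    """Hypothetical wrapper: trims silence only at the start of an utterance."""

    def __init__(self, player: Player, threshold: int = 64) -> None:
        self._player = player
        self._threshold = threshold  # illustrative amplitude cutoff
        self._at_start = True  # True until audible audio has been fed

    def _first_audible_offset(self, data: bytes) -> Optional[int]:
        """Byte offset of the first sample above the threshold, or None."""
        samples = array.array("h")
        samples.frombytes(data[: len(data) - len(data) % 2])
        if sys.byteorder == "big":
            samples.byteswap()
        for index, sample in enumerate(samples):
            if abs(sample) > self._threshold:
                return index * 2
        return None

    def feed(self, data: bytes) -> None:
        if self._at_start:
            offset = self._first_audible_offset(data)
            if offset is None:
                return  # chunk is entirely silent; keep waiting
            data = data[offset:]
            self._at_start = False
        # After the first audible sample, audio passes through untouched,
        # so pauses between sentences are preserved.
        self._player.feed(data)

    def idle(self) -> None:
        # The next feed after idle is treated as a new utterance.
        self._at_start = True
        self._player.idle()
```

With this shape, synthesizers that already call idle when speech completes would not need any changes, which is the main appeal of using idle as the marker. Synthesizers that do not call idle between utterances would simply get no trimming after the first one.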

Also, where should the silence removal logic go? Can audio-processing features be added to WavePlayer, or should they live in separate modules? Should the logic be written in Python or C++? (C++ is theoretically faster, but the leading silence is usually short, so Python may also be acceptable.)

gerald-hartig added the component/speech, p4 (https://github.com/nvaccess/nvda/blob/master/projectDocs/issues/triage.md#priority), triaged, and performance labels on Jan 13, 2025

gexgd0419 linked a pull request on Jan 24, 2025 that will close this issue: Add leading silence detection and removal logic #17648 (Open, 5 tasks)
