Improve the responsiveness of voices by trimming the leading silence #17614

Open
gexgd0419 opened this issue Jan 12, 2025 · 3 comments · May be fixed by #17648
Labels: component/speech, p4 (priority; see https://github.com/nvaccess/nvda/blob/master/projectDocs/issues/triage.md#priority), performance, triaged (has been triaged, issue is waiting for implementation)

Comments

@gexgd0419
Contributor

Is your feature request related to a problem? Please describe.

This is related to #13284.

#17592 closed that issue by making SAPI5 voices output via WASAPI. That improved responsiveness, but we can improve it further by removing the leading silence from the audio.

Take Microsoft Zira Desktop (SAPI5) as an example. When speaking at 1X speed, the leading silence is 100ms long. When speaking at its maximum rate (3X speed), the leading silence becomes about 30ms long. If we can remove the leading silence, it will respond even faster.

Other voices, such as OneCore voices, also have a few milliseconds of leading silence.

Describe the solution you'd like

We can detect and remove the silent part of the audio in WavePlayer, either in the Python part or in the C++ part. Since eSpeak, OneCore and SAPI5 (plus MSSP) all use WavePlayer now, they would all benefit from this. The synthesizer may need to tell WavePlayer when the audio starts or ends, so that WavePlayer can locate the leading silence more easily.
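
As a rough illustration of the detection step, here is a minimal sketch in Python. It assumes the audio is 16-bit signed little-endian PCM, and the function names, the `threshold` value and the `channels` parameter are all hypothetical; none of this is taken from WavePlayer itself.

```python
import array
import sys


def count_leading_silence_bytes(pcm: bytes, threshold: int = 64, channels: int = 1) -> int:
    """Return how many leading bytes of 16-bit PCM audio count as silence.

    A sample is treated as silent when its absolute value is at most
    `threshold` (an assumed value, not one taken from NVDA).
    """
    usable = len(pcm) - len(pcm) % 2  # ignore a trailing partial sample
    samples = array.array("h")
    samples.frombytes(pcm[:usable])
    if sys.byteorder == "big":
        samples.byteswap()  # the data is assumed to be little-endian
    for index, sample in enumerate(samples):
        if abs(sample) > threshold:
            # Round down to a frame boundary so channels stay aligned.
            frame_start = index - index % channels
            return frame_start * 2
    return usable  # the whole buffer is silent


def trim_leading_silence(pcm: bytes, threshold: int = 64, channels: int = 1) -> bytes:
    """Return `pcm` without its silent leading part."""
    return pcm[count_leading_silence_bytes(pcm, threshold, channels):]
```

A threshold above zero lets low-level noise count as silence; the right value would need to be tuned per synthesizer.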

Describe alternatives you've considered

Create a stand-alone module for detecting and removing the silent part of the audio, either in Python or in C++. Synthesizers would pass their audio data through this module before feeding it to WavePlayer.
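
One way such a stand-alone stage could look is a generator that sits between the synthesizer and the player. This is only a sketch: `strip_leading_silence` is a hypothetical name, and it assumes 16-bit signed little-endian mono PCM delivered in chunks that each contain whole samples.

```python
import struct
from typing import Iterable, Iterator


def strip_leading_silence(chunks: Iterable[bytes], threshold: int = 64) -> Iterator[bytes]:
    """Yield the chunks of one utterance with the leading silence dropped.

    Only silence before the first audible sample is removed; silence
    later in the utterance is passed through unchanged.
    """
    started = False
    for chunk in chunks:
        if started:
            yield chunk
            continue
        # Look for the first sample louder than the (assumed) threshold.
        for index, (sample,) in enumerate(struct.iter_unpack("<h", chunk)):
            if abs(sample) > threshold:
                started = True
                yield chunk[index * 2:]
                break
        # If no audible sample was found, the whole chunk is dropped.
```

A synthesizer driver would then feed the player with the output of `strip_leading_silence(...)` for each utterance instead of the raw chunks.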

Additional context

I'm not sure what the best approach to implementing this is.

@Adriani90
Collaborator

Cc: @michaelDCurran

@gerald-hartig added the component/speech, p4, triaged, and performance labels on Jan 13, 2025
@cary-rowen
Contributor

cc @jcsteh
You may also be able to offer some implementation insights.

@gexgd0419
Contributor Author

Only the leading silence part should be removed. Silence in other parts, such as between sentences, shouldn't be touched.

So the question is how to determine the starting point of each utterance.

If we add a new function to tell WavePlayer the starting point, every synthesizer has to be modified to take advantage of this feature. So is there an existing function that most synthesizers already call before speaking, or after speaking is completed?

WavePlayer has a function called idle. I'm not quite sure how it is meant to be used, but it seems that idle is usually called when speaking is completed. So maybe we can use idle to mark the starting point: assume that the audio sent by the first feed after idle is the beginning of a new utterance, and perform leading silence removal on that.
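
The idle-based idea can be sketched as a thin wrapper around a player. Everything here is an assumption for illustration: `TrimmingPlayer` and `RecordingPlayer` are hypothetical classes, the wrapped player is only assumed to expose `feed(data)` and `idle()` as described above (the real WavePlayer signatures may differ), and the audio is assumed to be 16-bit signed little-endian mono PCM.

```python
import struct


class RecordingPlayer:
    """Stand-in for a real player; it only records what it is given."""

    def __init__(self) -> None:
        self.fed: list[bytes] = []
        self.idle_calls = 0

    def feed(self, data: bytes) -> None:
        self.fed.append(data)

    def idle(self) -> None:
        self.idle_calls += 1


class TrimmingPlayer:
    """Trim leading silence from the first audible feed after each idle."""

    def __init__(self, player, threshold: int = 64) -> None:
        self._player = player
        self._threshold = threshold  # assumed value, would need tuning
        self._at_utterance_start = True

    def feed(self, data: bytes) -> None:
        if self._at_utterance_start:
            data = self._strip(data)
            if not data:
                return  # still inside the leading silence
            self._at_utterance_start = False
        self._player.feed(data)

    def idle(self) -> None:
        self._player.idle()
        # The next feed is assumed to begin a new utterance.
        self._at_utterance_start = True

    def _strip(self, data: bytes) -> bytes:
        for index, (sample,) in enumerate(struct.iter_unpack("<h", data)):
            if abs(sample) > self._threshold:
                return data[index * 2:]
        return b""
```

With this design, synthesizers that already call idle when speech finishes would need no changes, and silence between sentences within one utterance is left untouched because trimming stops at the first audible sample.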

Also, where should the silence removal logic live? Can audio-processing features be added to WavePlayer, or should they be in separate modules? Should the logic be written in Python or C++? (C++ is theoretically faster, but the leading silence usually isn't long, so Python may also be acceptable.)

@gexgd0419 linked a pull request on Jan 24, 2025 that will close this issue
4 participants