Add leading silence detection and removal logic #17648
base: master
@gexgd0419 |
In the current implementation, no. As long as the synth driver calls But if separating the logic out of |
I like this very much, I just ran a simple test without any metrics though. |
@gexgd0419 |
Hi, I would like to see this from a critical point, having previous experience in TTS stuff: Cheers. |
Hi @rmcpantoja |
Hi. This is amazing work, thank you so much. |
@mzanm Unfortunately, this is because I haven't found a way that can distinguish "leading silence" accurately. Audio data are sent in chunks to the In your case, the synthesizer call My goal is to require as little change to existing synthesizers as possible, but the NVDA speech synthesizer structure doesn't seem to have the right tool for me to check for "the beginning of an utterance". I might be able to monitor calls to So there are two general directions.
Which one do you prefer? Do you have a better idea for distinguishing the "leading silence"? |
@mzanm I think I found the cause here: |
@gexgd0419 Some SAPI5 synthesizers have longer trailing silences, for example the Slovenian Ebralec or the Finnish Mikropuhe. |
As I haven't found a way to check the command list in WavePlayer without some kind of modification to the synth driver code, I'm going to try a different method: if the leading silence is longer than, say 200ms, then let the silence pass instead of trimming it. Because longer silence is more likely to be introduced by a |
@gexgd0419 I am happy to test it. I am now running it from source. |
@zstanecic I haven't pushed the changes yet... |
@gexgd0419 I know. I am just notifying you. By the way... it would be good to have a configurable silence threshold in the advanced settings for experienced users... |
Yes, maybe a configurable threshold may be better. But if my code can have access to the command list to check whether there's a This is mainly because If the logic is put in
If the logic is put in another module:
I want to know what NV Access thinks about the idea of putting audio processing logic into |
I have an idea, but I'm not sure if it'll work. Is it possible to have some kind of flag that determines whether the next utterance should be trimmed? Then speech.speech.speak could iterate through the speechSequence, and if there's a text command (or rather a string) that has no BreakCommand before it, it would set that flag and therefore trim the leading silence of the next speech. This possibly wouldn't require changes to synthesizers. Or maybe you could register to the pre_speech extension point in nvwave and look at the sequence that way... but both methods probably wouldn't work if speech sequences are queued. |
@mzanm Thank you for the info. But as you said, pre_speech might be too early, as the speech might be queued. Maybe I can modify the SpeechManager and insert the logic before |
Link to issue number:
Closes #17614
Summary of the issue:
Some voices output a stretch of leading silence before the actual speech. Removing that silence shortens the delay between a keypress and the user hearing the audio, which makes the voices more responsive.
Description of user facing changes
Users may find the voices more responsive. All voices using NVDA's `WavePlayer` will be affected, including eSpeak-NG, OneCore, SAPI5, and some third-party voice add-ons.
This should only affect the leading silence parts. Silence between sentences or at punctuation marks is not changed, but this may depend on how the voice uses `WavePlayer`.
Description of development approach
I wrote a header-only library `silenceDetect.h` in `nvdaHelper/local`. It supports most wave formats (8/16/24/32-bit integer and 32/64-bit float wave), and uses a simple algorithm: check each sample to see if it's outside the threshold range (currently hard-coded to +/- 1/2^10 or 0.0009765625). It uses template-related code and requires the C++20 standard.
The `WasapiPlayer` in `wasapi.cpp` is updated to handle silence. A new member function, `startTrimmingLeadingSilence`, and the exported version, `wasPlay_startTrimmingLeadingSilence`, are added to set or clear the `isTrimmingLeadingSilence` flag. If `isTrimmingLeadingSilence` is true, the next chunk fed in will have its leading silence removed. When non-silence is detected, `isTrimmingLeadingSilence` will be reset to false. So every time a new utterance is about to be spoken, `startTrimmingLeadingSilence` should be called.
In `nvwave.py`, `wasPlay_startTrimmingLeadingSilence(self._player, True)` will be called when:
- `idle` is called;
- `_idleCheck` determines that the player is idle.
Usually voices will call `idle` when an utterance is completed, so that audio ducking can work correctly, so here `idle` is used to mark the starting point of the next utterance. If a voice doesn't use `idle` this way, then this logic might be messed up.
As long as the synthesizer uses `idle` as intended, the synthesizer's code doesn't need to be modified to benefit from this feature.
Other possible ways/things that may be worth considering (but haven't been implemented):
`WavePlayer`. The drawback is that every voice synthesizer module needs to be modified to utilize a separate module.
Testing strategy:
These are the delay results I got using 1X speed. "Improved delay" means the delay after applying this PR.
If the speech rate is higher than 1X, the original delay may be shorter, because the leading silence is also shortened during the speed-up process. When there's no leading silence, changing the rate does not affect the delay much.
Considering the margin of error, we can say that the delays of eSpeak NG, OneCore and SAPI5 voices are now at the same level.
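For anyone who wants to reproduce this kind of measurement, here is a minimal Python sketch of the per-sample threshold check and the trimming flag described under the development approach. It is an illustration only, not the code in `silenceDetect.h` or `wasapi.cpp`: the names `count_leading_silence`, `leading_silence_ms` and `LeadingSilenceTrimmer` are hypothetical, samples are assumed to be floats already normalized to [-1.0, 1.0], and a sample counts as non-silent only when it is strictly outside the +/- 1/2^10 range.

```python
# Threshold from the PR description, as a fraction of full scale (1/2^10).
THRESHOLD = 1 / 2**10


def count_leading_silence(samples, threshold=THRESHOLD):
    """Return how many samples at the start of `samples` are silent.

    Assumption: `samples` holds floats normalized to [-1.0, 1.0], and a
    sample is non-silent only when abs(value) is strictly above `threshold`.
    """
    for index, value in enumerate(samples):
        if abs(value) > threshold:
            return index
    return len(samples)


def leading_silence_ms(samples, sample_rate):
    """Convert the leading silence of a mono buffer to milliseconds."""
    return count_leading_silence(samples) * 1000 / sample_rate


class LeadingSilenceTrimmer:
    """Hypothetical stand-in for the player's trimming flag.

    Mirrors the described behaviour: while the flag is set, leading silence
    is dropped from each chunk; the flag clears once non-silence is seen.
    """

    def __init__(self):
        self.is_trimming = False

    def start_trimming(self, enable=True):
        # Would be called at the start of each utterance (e.g. after idle).
        self.is_trimming = enable

    def feed(self, samples):
        if not self.is_trimming:
            return list(samples)
        start = count_leading_silence(samples)
        if start < len(samples):
            # Non-silence found: stop trimming for the rest of the utterance.
            self.is_trimming = False
        return list(samples[start:])
```

With this sketch, a chunk that is entirely silent is dropped and the flag stays set, so trimming continues into the next chunk; once a chunk contains audible samples, later chunks pass through unchanged, which is why silence between sentences is kept. For a mono buffer at 22050 Hz that starts with 2205 silent samples, `leading_silence_ms` reports 100.0 ms.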
Known issues with pull request:
The silence introduced by `BreakCommand`, when at the beginning of an utterance, will also be trimmed.
Code Review Checklist:
@coderabbitai summary