improve the responsiveness of onecore voices and sapi voices #13284
Comments
Yes, windows Sapi5 is noticeably more responsive on some screen readers, e.g. ZDSR |
Yeah, if the silence could be trimmed, then sapi, or even one core would be
as responsive as eloquence or espeak
|
I agree, SAPI 5 and one core is somehow crazy fast on ZDSR. |
I don't know what the bleeps they did, but it's possible that they too are
trimming the silence from the beginning
|
While I'm an ESpeak user and not using Onecore very frequently, I find OneCore pretty responsive with NVDA. It would be helpful if findings about slow responsiveness are supported by measurable evidence. |
I'm focusing here on sapi5. I mentioned one core because I used a program
called sapi unifier to port the one core voices into sapi5
|
This is an audio recording from anyaudio; listen to it to get an idea of how well ZDSR supports the speed of the SAPI5 speech synthesizer. |
the link is broken
|
Thanks for the reminder; the link above has been re-edited. |
Yeah, it doesn't take me there for some reason. No matter, let's concentrate
on NVDA, because this is what we're working with, right?
|
yes |
I did some tests and found the following after a few minutes of use:
Taking eSpeak as the reference, the expected behavior is for all synths to perform at the same level. I tested with NVDA alpha-28179,345154a6 (2023.2.0.28179), WASAPI enabled, by using the arrow keys in browse mode in Google Chrome 112, which is very responsive. As you can see, even on this machine there is a noticeable performance difference, so on low-end machines the performance gap between synths might be much more obvious. cc: @jcsteh, @michaelDCurran |
While it's possible there is some silence at the start of the audio buffer returned by these voices, it's also possible (I'd guess more likely) that these voices just take longer to synthesise speech. In that case, there's really nothing that can be done; the performance optimisation would need to happen in the voice itself. For OneCore at least, if you already have a way to measure the time between key press and actual audio output, I'd suggest comparing with Narrator. That will give you an indication of whether this is something specific to NVDA or whether the voice itself is slow to respond. |
Narrator performance is worse than NVDA, |
Is that true for OneCore with ZDSR even with the latest responsiveness and WASAPI changes in alpha? SAPI5 is a different case, as NVDA uses SAPI5's own audio output rather than NVDA's audio output. It's possible that switching to nvwave + WASAPI for SAPI5 might improve responsiveness, but I'm not sure. |
Are there any responsiveness issues remaining now that NVDA uses WASAPI? |
Note that NVDA still doesn't use nvwave for SAPI5, so there won't be a change for SAPI5 now in terms of audio. However, the other responsiveness changes in the last few months might have some impact. |
Frankly, there are no noticeable changes. |
Given that there has been at least a measurable 10 to 30 ms improvement in responsiveness in NVDA in the last few months, not accounting for WASAPI, the fact that you're seeing "no noticeable changes" would suggest you're seeing a delay which is significantly larger than 30 ms with OneCore. That certainly doesn't match my experience, nor does it match #13284 (comment). That further suggests that there is a significant difference on your system as compared to mine and others. As it stands, this issue isn't actionable. To get any further here, we're going to need precise information about which OneCore voice you're using, the rate it's configured at, probably audio recordings demonstrating the performance issue you're seeing, etc. |
Hello. I can confirm that SAPI5 in NVDA is not as performant as elsewhere, and yes, this is because SAPI5 outputs the sound itself. I am sure this will improve if SAPI5 audio goes through NVDA itself. |
Regarding the performance improvements for the SAPI5 speech synthesizer, I've attempted a solution to directly obtain audio data:
|
Hi @jcsteh |
This project might help you measure the latency during each step when using an SAPI5 voice. The included TestTTSEngine can create voices that forward data to your installed SAPI5 voices and trim the leading silence part before outputting the audio. You can check how much this can improve the responsiveness. If you use the TestTTSClient.exe, then you can see the log generated during speaking, and check the latency of each step. The code I used to test the delay between keypress and audio output is not included yet. But I plan to include it later. |
In the documentation of ISpAudio, Microsoft says:
You can notice the "serialization" performed by SAPI if you open two TTS clients and make them speak at the same time: only one of them can speak, and the other has to wait. Cross-process serialization might increase the delay.
Below is what I found using that test program on my system.
I found that if you let SAPI output audio to a memory stream, the "serialization" is bypassed, and the delay between the client calling Speak and receiving the first chunk of audio data is usually less than 10 ms. But if you let SAPI output audio to the default device, the delay can increase to about 50~100 ms.
As for the leading silence duration, if you are using one of the built-in voices, at normal rate it's about 100 ms, and at the maximum rate it decreases to about 30~50 ms.
For example, this is a log I got when outputting to the default device.
Log (output to default device)
You can see that there's more than 100 ms delay before each If you output to a memory stream, the extra delay will be gone. Log (output to memory stream)
And that was fast. But yes, the audio has to be output to an audio device in order to be heard, and the output process introduces more delay, so the final delay won't be that good. We can only hope that WASAPI introduces less delay than WinMM which SAPI5 uses internally. EDIT: Tried outputting to the default device again after my computer fan started spinning, and the delay became smaller! So this can be affected by many things, including the active power plan and the resource usage of other applications. But there was still about 80 ms delay. |
To make SAPI 5 voices able to use NVDA's own wave player (which uses WASAPI), we can try the following steps.
First, write a class that implements the COM interface IStream:

    import comtypes.client
    from comtypes import COMObject
    from comtypes.hresult import E_NOTIMPL, S_OK
    from objidl import IStream

    import config
    import nvwave

    STREAM_SEEK_CUR = 1  # dwOrigin value meaning "relative to the current position"


    class AudioStream(COMObject):
        """Receives audio from SAPI through IStream and feeds it to NVDA's WavePlayer."""

        _com_interfaces_ = [IStream]

        def __init__(self, fmt):
            self._writtenBytes = 0
            wfx = fmt.GetWaveFormatEx()  # SpWaveFormatEx
            self._player = nvwave.WavePlayer(
                channels=wfx.Channels,
                samplesPerSec=wfx.SamplesPerSec,
                bitsPerSample=wfx.BitsPerSample,
                outputDevice=config.conf["speech"]["outputDevice"],
            )

        def ISequentialStream_RemoteWrite(self, this, pv, cb, pcbWritten):
            # SAPI hands us cb bytes of audio at pv.
            # audio processing...
            self._player.feed(pv, cb)
            self._writtenBytes += cb
            if pcbWritten:
                pcbWritten[0] = cb
            return S_OK

        def IStream_RemoteSeek(self, this, dlibMove, dwOrigin, plibNewPosition):
            if dwOrigin == STREAM_SEEK_CUR and dlibMove.QuadPart == 0:
                # SAPI is querying the current position.
                if plibNewPosition:
                    plibNewPosition[0].QuadPart = self._writtenBytes
                return S_OK
            return E_NOTIMPL  # E_NOTIMPL is returned in other cases

Other methods of IStream are left unimplemented here.
Then, during initialization, after the voice is set:

    # ... After setting the voice:
    self.tts.AudioOutput = self.tts.AudioOutput  # Reset the audio and its format parameters
    fmt = self.tts.AudioOutputStream.Format
    stream = comtypes.client.CreateObject("SAPI.SpCustomStream")  # might be different for MSSP voices
    stream.BaseStream = AudioStream(fmt)  # set the IStream being wrapped
    stream.Format = fmt
    self.tts.AudioOutputStream = stream  # Set the stream (wrapper) as the output target

Now you will be able to hear the voices. Not everything is processed properly in the code above, but I hope that you can get the idea.
One of the problems is that continuous reading will be broken, because the Bookmark events become out of sync with the audio stream. We will need to synchronize them ourselves. |
Now this latency tester project supports measuring the delay between keypress and audio output, so I did some tests. Used version: Modifications: Voice: Microsoft Huihui (Chinese, Simplified) Results:
|
Cool, it looks like @gexgd0419 has made some real progress on this and has given a test result. So far I'd be interested to hear what NV Access has to say about this or any pointers on the way forward. Also @jcsteh’s comments are valuable, can you talk about them? I'm excited about the improved responsiveness |
For example, here's the NVDA log when I pressed the S key, and the detected audio latency is 118.11ms, with the original SAPI5 implementation, but with leading silence trimmed.
From the timestamps in the log, we can get the following timeline:
There's a 40ms delay between receiving the keyboard input and issuing the Speak command, and there's a 20~30ms delay between writing audio data and outputting audio. So the minimum possible delay of NVDA on my system would be about 70ms, which could be achieved using the eSpeak NG voice, or the Huihui SAPI5 voice via WASAPI with leading silence trimmed. |
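The arithmetic behind that estimate can be written out as a small worked example. This is only a sketch using the figures quoted in the two comments above; the variable names are made up for illustration, and attributing the remainder to the stage between the Speak call and the first audio write is an inference from the timeline, not a measured value.

```python
# Figures quoted in the comments above, in milliseconds.
input_to_speak_ms = 40             # keyboard input received -> Speak issued
write_to_output_ms = (20, 30)      # audio data written -> audio output (low, high)
measured_sapi5_trimmed_ms = 118.11  # original SAPI5 output, leading silence trimmed

# Delay that remains even if synthesis itself were instantaneous.
floor_low_ms = input_to_speak_ms + write_to_output_ms[0]
floor_high_ms = input_to_speak_ms + write_to_output_ms[1]

# Whatever is left of the measured latency falls between issuing Speak and
# the first audio data being written (assumption: no other stage is involved).
remainder_ms = measured_sapi5_trimmed_ms - floor_high_ms

print(f"fixed overhead: {floor_low_ms}-{floor_high_ms} ms")
print(f"left for the Speak -> first audio stage: about {remainder_ms:.0f} ms")
```

Under these assumptions, the remainder is the part that a different output path could shrink.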
This is a little tangential, but 40 ms is unexpectedly high there. I would expect something more like 20 ms or less, though it may be worse if you're running on battery. Also, this raises another problem: handling typed characters seems to be pretty slow. If I do this in input help, I see 2 ms or less there. If I do this using the left or right arrow keys in the Run dialog edit field, I get 10 ms or less. This is a result of the optimisation work I did in #14928 and #14708. However, speaking of typed characters doesn't appear to benefit from this. This might be improved if we tweak eventHandler so that it always uses an immediate pump for typedCharacter events, just like we do for gainFocus events. |
I tried to implement WASAPI on SAPI5 (and maybe SAPI4) further, but I think I need some help. The problem is how I can synchronize the bookmark events with the audio stream.
Even worse, there's no guarantee that the bookmark event will happen right between audio for A and B, because audio and events are sent in different threads. But maybe this can be fixed by using
Or maybe there's another way.
As the current implementation of OneCore voices is already using WavePlayer (and WASAPI), I checked the code and it seemed that all wave data are retrieved at once, instead of being "streamed" in chunks.
Related OneCore speech C++ code

    winrt::fire_and_forget
    speak(
        void* originToken,
        winrt::hstring text,
        std::shared_ptr<winrtSynth> synth,
        std::function<ocSpeech_CallbackT> cb
    ) {
        try {
            co_await winrt::resume_background();
            SpeechSynthesisStream speechStream{ nullptr };
            try {
                // Wait for the stream to complete
                speechStream = co_await synth->SynthesizeSsmlToStreamAsync(text);
            }
            catch (winrt::hresult_error const& e) {
                LOG_ERROR(L"Error " << e.code() << L": " << e.message().c_str());
                protectedCallback_(originToken, std::optional<SpeakResult>(), cb);
                co_return;
            }
            const std::uint32_t size = static_cast<std::uint32_t>(speechStream.Size());
            std::optional<SpeakResult> result(SpeakResult{
                Buffer(size),
                createMarkersString_(speechStream.Markers()) // send all markers (bookmarks) in a string
                }
            );
            try {
                // Read all data and send it to callback function in one go
                co_await speechStream.ReadAsync(result->buffer, size, InputStreamOptions::None);
                protectedCallback_(originToken, result, cb);
                co_return;
            }
            catch (winrt::hresult_error const& e) {
                LOG_ERROR(L"Error " << e.code() << L": " << e.message().c_str());
                protectedCallback_(originToken, std::optional<SpeakResult>(), cb);
                co_return;
            }
        }
        // ... catch blocks ...
    }

Although asynchronous, the audio and all the markers for this entire utterance will be ready when the callback function is called. This is more like @shenguangrong's approach above, which uses
If the delay of OneCore voices is acceptable, then this approach is also feasible. |
Not currently, though it might be possible to add it. However, you should be able to manufacture this already. One way would be to have a dict which maps from chunk id to bookmark id. Chunk id could be a simple counter which you increment for every chunk you feed or it could be something you easily get from SAPI; e.g. a stream position. After you call feed, keep track of the last chunk id in an instance variable. When you get the bookmark event, set
Yeah, this does seem like a source of intermittent timing problems.
That's correct. OneCore doesn't provide a streaming interface, unfortunately.
I don't think it is. We just don't have another choice. It causes unnecessary latency. Segmenting the text could help that, but it's not a true fix, just a workaround. This should always be a last resort and would IMO be an unacceptable regression. |
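The chunk-id-to-bookmark mapping described in this reply could be sketched roughly as follows. The reply's sentence about what to do on the bookmark event is cut off in this thread, so that step is an assumption here: the bookmark is stored against the last fed chunk and fired once that chunk finishes playing. `FakePlayer` and `BookmarkSync` are hypothetical names for illustration only; the sketch also assumes the real player can invoke a callback when a fed chunk is done.

```python
import threading
from typing import Callable, Dict, List, Optional


class FakePlayer:
    """Hypothetical stand-in for a wave player; chunks "play" only on drain()."""

    def __init__(self) -> None:
        self._queue: List[Optional[Callable[[], None]]] = []

    def feed(self, data: bytes, onDone: Optional[Callable[[], None]] = None) -> None:
        self._queue.append(onDone)

    def drain(self) -> None:
        # Pretend every queued chunk has finished playing, in order.
        while self._queue:
            callback = self._queue.pop(0)
            if callback:
                callback()


class BookmarkSync:
    """Fires bookmarks only after the audio fed before them has been played."""

    def __init__(self, player: FakePlayer, onBookmark: Callable[[int], None]) -> None:
        self._player = player
        self._onBookmark = onBookmark
        self._lock = threading.Lock()  # audio and events arrive on different threads
        self._lastChunkId = 0          # simple counter, incremented per fed chunk
        self._lastDoneId = 0           # id of the most recent chunk that finished
        self._bookmarks: Dict[int, List[int]] = {}  # chunk id -> bookmark ids

    def feedAudio(self, data: bytes) -> None:
        with self._lock:
            self._lastChunkId += 1
            chunkId = self._lastChunkId
        self._player.feed(data, onDone=lambda: self._chunkDone(chunkId))

    def bookmarkReached(self, bookmarkId: int) -> None:
        with self._lock:
            if self._lastChunkId > self._lastDoneId:
                # Audio before this bookmark is still queued: defer the bookmark.
                self._bookmarks.setdefault(self._lastChunkId, []).append(bookmarkId)
                return
        # Nothing pending (or nothing fed yet): report it straight away.
        self._onBookmark(bookmarkId)

    def _chunkDone(self, chunkId: int) -> None:
        with self._lock:
            self._lastDoneId = chunkId
            due = self._bookmarks.pop(chunkId, [])
        for bookmarkId in due:
            self._onBookmark(bookmarkId)


fired: List[int] = []
player = FakePlayer()
sync = BookmarkSync(player, fired.append)
sync.feedAudio(b"audio for A")
sync.bookmarkReached(1)          # arrives before A has actually been played
sync.feedAudio(b"audio for B")
firedBeforePlayback = list(fired)
player.drain()                   # A finishes, then B
```

In this run, bookmark 1 is held back until the chunk for A is done, regardless of when the event itself arrived. A stream position reported by SAPI could replace the counter, as the reply suggests.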
Actually, you should be able to do this: |
I opened a pull request #17592 as my attempt to fix this. Here are the build artifact files, which include an installer exe file to install this alpha version. Can this improve the responsiveness of SAPI5 voices? Or does this introduce new bugs? Also, I need a way to test the audio ducking feature. But audio ducking requires the UIAccess privilege, which requires the program to be installed and signed. How can I test audio ducking using an alpha version which is not signed? |
Hi @gexgd0419 Glad to see this PR I will test it later. |
Note that this issue is described as covering both SAPI5 and OneCore, but I don't think #17592 does anything regarding OneCore. |
The author said:
OneCore voices are already using WASAPI, so their responsiveness cannot be improved using the same method. |
Closes nvaccess#13284
Summary of the issue:
Currently, SAPI5 and MSSP voices use their own audio output mechanisms, instead of using the WavePlayer (WASAPI) inside NVDA. This may make them less responsive compared to eSpeak and OneCore voices, which are using the WavePlayer, or compared to other screen readers using SAPI5 voices, according to my test result. This also gives NVDA less control of audio output. For example, audio ducking logic inside WavePlayer cannot be applied to SAPI5 voices, so additional code is required to compensate for this.
Description of user facing changes
SAPI5 and MSSP voices will be changed to use the WavePlayer, which may make them more responsive (have less delay). According to my test result, this can reduce the delay by at least 50ms. This does not trim the leading silence yet. If we do that as well, we can expect the delay to be even lower.
Description of development approach
Instead of setting self.tts.audioOutput to a real output device, do the following:
Create an implementation class SynthDriverAudioStream to implement the COM interface IStream, which can be used to stream in audio data from the voices.
Use an SpCustomStream object to wrap SynthDriverAudioStream and provide the wave format.
Assign the SpCustomStream object to self.tts.AudioOutputStream, so SAPI will output audio to this stream instead.
Each time an audio chunk needs to be streamed in, ISequentialStream_RemoteWrite will be called, and we just feed the audio to the player.
IStream_RemoteSeek can also be called when SAPI wants to know the current byte position of the stream (dlibMove should be zero and dwOrigin should be STREAM_SEEK_CUR in this case), but it is not used to actually "seek" to a new position.
IStream_Commit can be called by MSSP voices to "flush" the audio data, where we do nothing.
Other methods are left unimplemented, as they are not used when acting as an audio output stream.
Previously, comtypes.client.GetEvents was used to get the event notifications. But those notifications will be routed to the main thread via the main message loop. According to the documentation of ISpNotifySource:
Note that both variations of callbacks as well as the window message notification require a window message pump to run on the thread that initialized the notification source. Callback will only be called as the result of window message processing, and will always be called on the same thread that initialized the notify source. However, using Win32 events for SAPI event notification does not require a window message pump.
Because the audio data is generated and sent via IStream on a dedicated thread, receiving events on the main thread can make synchronizing events and audio difficult. So here SapiSink is changed to become an implementation of ISpNotifySink. Notifications received via ISpNotifySink are "free-threaded", sent on the original thread instead of being routed to the main thread.
To connect the sink, use ISpNotifySource::SetNotifySink.
To get the actual event that triggers the notification, use ISpEventSource::GetEvents. Events can contain pointers to objects or memory, so they need to be freed manually.
Finally, all audio ducking related code is removed. Now WavePlayer should be able to handle audio ducking when using SAPI5 and MSSP voices.
Is your feature request related to a problem? Please describe.
I'm always frustrated when SAPI voices and OneCore voices are slow and unresponsive.
Describe the solution you'd like
The voices should be responsive, so that they can be mixed with other languages without an undesirable lag, e.g. by using some hacks to unify OneCore in SAPI. Then they could be mixed, for example a Latin voice with a non-Latin voice, for optimal reading of both languages. Currently it's unnecessarily slow and unresponsive, which I kindly suggest that you fix.
Describe alternatives you've considered
"""Based on advice from a developer who has some experience with DSP""": Intercept the buffer in memory that holds the audio, trim the silence at the beginning with a script which analyses the amount of silence and trims it accordingly, then feed it back to the audio device.
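A minimal sketch of that trimming step, assuming the buffer holds 16-bit signed little-endian PCM and that "silence" means samples whose absolute value stays at or below a small threshold. The function name, the threshold value and the sample data are made up for illustration; a real implementation would take the format from the voice and would have to decide how to handle quiet speech onsets.

```python
import struct


def trim_leading_silence(pcm: bytes, threshold: int = 64, channels: int = 1) -> bytes:
    """Return pcm without the leading frames in which every sample is near zero."""
    frame_size = 2 * channels                      # 2 bytes per 16-bit sample
    usable = len(pcm) - len(pcm) % frame_size      # ignore a trailing partial frame
    for index, (sample,) in enumerate(struct.iter_unpack("<h", pcm[:usable])):
        if abs(sample) > threshold:
            # Cut at the start of the frame containing the first loud sample.
            frame_start = (index // channels) * frame_size
            return pcm[frame_start:]
    return b""                                     # the whole buffer was silent


rate = 16000                                       # samples per second, mono
silence = b"\x00\x00" * (rate // 10)               # 100 ms of digital silence
speech = struct.pack("<4h", 1200, -900, 400, -300)
trimmed = trim_leading_silence(silence + speech)
removed_ms = (len(silence + speech) - len(trimmed)) * 1000 // (2 * rate)
print(f"removed {removed_ms} ms of leading silence")
```

The trimmed buffer is what would then be handed to the audio device.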
Additional context
Nothing specific. Contact me if I can clarify some more. Please bear in mind that I'm not a programmer, I'm just a simple citizen. Thanks for your great help, NV Access! I'm sorry to say that I'm unable to support you monetarily. I hope this project keeps helping blind people around the world like it always has.