improve the responsiveness of onecore voices and sapi voices #13284

Closed
king-dahmanus opened this issue Jan 27, 2022 · 38 comments · Fixed by #17592
Labels
blocked/needs-info The issue can not be progressed until more information is provided. component/speech component/speech-synth-drivers enhancement p4 https://github.com/nvaccess/nvda/blob/master/projectDocs/issues/triage.md#priority performance triaged Has been triaged, issue is waiting for implementation.
Milestone

Comments

@king-dahmanus

Is your feature request related to a problem? Please describe.

I'm always frustrated when SAPI voices and OneCore voices are slow and unresponsive.

Describe the solution you'd like

The voices should be responsive, so that they can be mixed with voices for other languages without an undesirable lag, e.g. by using some hacks to unify OneCore voices into SAPI. Then a Latin voice and a non-Latin voice could be mixed for optimal reading of both languages. Currently it's unnecessarily slow and unresponsive, which I kindly suggest that you fix.

Describe alternatives you've considered

Based on advice from a developer who has some experience with DSP: intercept the in-memory buffer that holds the audio, trim the silence at the beginning with a script that analyses the amount of silence and trims it accordingly, then feed it back to the audio device.

Additional context

Nothing specific. Contact me if I can clarify some more. Please bear in mind that I'm not a programmer, I'm just a simple citizen. Thanks for your great help, NV Access! I'm sorry to say that I'm unable to support you monetarily. I wish that this project keeps helping blind people around the world like it always has.

@cary-rowen
Contributor

Yes, Windows SAPI5 is noticeably more responsive in some other screen readers, e.g. ZDSR.

@king-dahmanus
Author

king-dahmanus commented Jan 28, 2022 via email

@mzanm
Contributor

mzanm commented Jan 28, 2022

I agree, SAPI 5 and OneCore are somehow crazy fast on ZDSR.

@king-dahmanus
Author

king-dahmanus commented Jan 28, 2022 via email

@LeonarddeR
Collaborator

While I'm an ESpeak user and not using Onecore very frequently, I find OneCore pretty responsive with NVDA. It would be helpful if findings about slow responsiveness are supported by measurable evidence.

@king-dahmanus
Author

king-dahmanus commented Jan 29, 2022 via email

@dpy013
Contributor

dpy013 commented Jan 31, 2022

This is an audio recording from anyaudio. Listen to it to get an idea of how responsive the SAPI5 speech synthesizer is in ZDSR.

@king-dahmanus
Author

king-dahmanus commented Jan 31, 2022 via email

@dpy013
Contributor

dpy013 commented Jan 31, 2022

the link is broken

http://anyaudio.net/listen?audio=TWu4HZNSSH0NTk

Thanks for the reminder; the link above has been edited.

@king-dahmanus
Author

king-dahmanus commented Jan 31, 2022 via email

@dpy013
Contributor

dpy013 commented Jan 31, 2022

Yeah, it doesn't take me there for some reason. No matter, let's concentrate on NVDA, because this is what we're working with, right?


yes

@king-dahmanus

This comment was marked as resolved.

@seanbudd seanbudd added enhancement p5 https://github.com/nvaccess/nvda/blob/master/projectDocs/issues/triage.md#priority triaged Has been triaged, issue is waiting for implementation. p4 https://github.com/nvaccess/nvda/blob/master/projectDocs/issues/triage.md#priority and removed p5 https://github.com/nvaccess/nvda/blob/master/projectDocs/issues/triage.md#priority labels Jun 9, 2022
@Adriani90
Collaborator

I did some tests and found the following after a few minutes of use:

  • eSpeak: <1 ms to 5 ms between key press and reporting, very rarely 8 ms or more
  • OneCore: 5 to 10 ms between key press and reporting, usually 8 ms or more, rarely below 8 ms
  • SAPI5: at least 8 ms between key press and reporting, usually 10 to 15 ms, rarely less than 10 ms

Taking eSpeak as the reference, the expected behavior is for all synths to perform at the same level.

I tested with NVDA alpha-28179,345154a6 (2023.2.0.28179), WASAPI enabled, by using arrow keys in browse mode in Google Chrome 112, which is very responsive.
My 64-bit Asus ROG Strix machine has the following configuration:
Processor: 12th Gen Intel(R) Core(TM) i9-12900H, 2500 MHz, 14 core(s), 20 logical threads
Installed physical RAM: 32.0 GB
Intel(R) Iris graphics card: total capacity 16 GB, VRAM = 128 MB
NVIDIA GeForce RTX 3070 Ti graphics card: total capacity 24 GB, VRAM = 8 GB

As you can see, even on this machine there is a noticeable performance difference, so on low-end machines the performance gap between synths might be much more obvious.

cc: @jcsteh, @michaelDCurran

@jcsteh
Contributor

jcsteh commented May 4, 2023

While it's possible there is some silence at the start of the audio buffer returned by these voices, it's also possible (I'd guess more likely) that these voices just take longer to synthesise speech. In that case, there's really nothing that can be done; the performance optimisation would need to happen in the voice itself.

For OneCore at least, if you already have a way to measure the time between key press and actual audio output, I'd suggest comparing with Narrator. That will give you an indication of whether this is something specific to NVDA or whether the voice itself is slow to respond.

@cary-rowen
Contributor

Narrator's performance is worse than NVDA's.
I recommend comparing NVDA with ZDSR instead; ZDSR's response speed is significantly better than NVDA's, even when both use SAPI5.

@jcsteh
Contributor

jcsteh commented May 4, 2023

Is that true for OneCore with ZDSR even with the latest responsiveness and WASAPI changes in alpha?

SAPI5 is a different case, as NVDA uses SAPI5's own audio output rather than NVDA's audio output. It's possible that switching to nvwave + WASAPI for SAPI5 might improve responsiveness, but I'm not sure.

@seanbudd
Member

Are there any responsiveness issues remaining now that NVDA uses WASAPI?

@jcsteh
Contributor

jcsteh commented Nov 22, 2023

Note that NVDA still doesn't use nvwave for SAPI5, so there won't be a change for SAPI5 now in terms of audio. However, the other responsiveness changes in the last few months might have some impact.

@cary-rowen
Contributor

Frankly, there are no noticeable changes.
I do think there's a lot of room for improvement in NVDA's responsiveness.

@jcsteh
Contributor

jcsteh commented Nov 22, 2023

Given that there has been at least a measurable 10 to 30 ms improvement in responsiveness in NVDA in the last few months, not accounting for WASAPI, the fact that you're seeing "no noticeable changes" would suggest you're seeing a delay which is significantly larger than 30 ms with OneCore. That certainly doesn't match my experience, nor does it match #13284 (comment). That further suggests that there is a significant difference on your system as compared to mine and others.

As it stands, this issue isn't actionable. To get any further here, we're going to need precise information about which OneCore voice you're using, the rate it's configured at, probably audio recordings demonstrating the performance issue you're seeing, etc.

@seanbudd seanbudd added the blocked/needs-info The issue can not be progressed until more information is provided. label Nov 22, 2023
@beqabeqa473
Contributor

Hello. I can confirm that SAPI5 in NVDA is not as performant as in other places, and yes, this is because SAPI5 outputs the sound itself. I am sure this will improve if SAPI5 audio goes through NVDA itself.

@shenguangrong

Regarding the performance improvements for the SAPI5 speech synthesizer, I've attempted a solution to directly obtain audio data:

  1. Create necessary SAPI objects via COM interface:
    • Create SpVoice object for speech synthesis
    • Create SpMemoryStream object to capture audio stream
    • Create SpAudioFormat object to control audio format
  2. The core approach is to redirect TTS output to memory:
    • Configure SpAudioFormat audio parameters
    • Set SpMemoryStream as SpVoice's output destination
    • Obtain raw audio data directly from memory stream

It's important to note that this method retrieves the entire audio data at once, rather than streaming the output. This presents several challenges:

  • Need to consider appropriate text segmentation
  • May need to implement strategies for segmented synthesis and playback
  • Further research is required for optimization

This is just an initial implementation approach; more in-depth research and improvements will be needed.
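
The segmentation idea mentioned above could be sketched roughly as follows. This is only an illustration, not the code behind this comment: `split_into_segments`, `speak_segmented`, and the `synthesize` / `play` callables are hypothetical names, standing in for a memory-stream synthesis call and an audio player.

```python
import re
from typing import Callable, Iterator

# Split after Latin sentence punctuation followed by whitespace,
# or directly after CJK sentence punctuation (which has no trailing space).
_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def split_into_segments(text: str) -> Iterator[str]:
    """Yield sentence-like segments so each can be synthesized separately."""
    for part in _BOUNDARY.split(text):
        part = part.strip()
        if part:
            yield part


def speak_segmented(
    text: str,
    synthesize: Callable[[str], bytes],
    play: Callable[[bytes], None],
) -> int:
    """Synthesize and play one segment at a time; return the segment count.

    `synthesize` is assumed to return the whole audio for a segment at once,
    like a memory-stream based approach would.
    """
    count = 0
    for segment in split_into_segments(text):
        play(synthesize(segment))
        count += 1
    return count
```

With short segments, playback of the first segment can begin before the rest of the text has been synthesized, which is the point of segmenting in the first place.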

@cary-rowen
Contributor

Hi @jcsteh
Regarding performance measurements for different speech synthesizers, @gexgd0419 has conducted detailed tests in the following comment, which unfortunately is written in Chinese. @gexgd0419 might be able to share the test methods or code if needed.
gexgd0419/NaturalVoiceSAPIAdapter#1 (comment)

@gexgd0419
Contributor

This project might help you measure the latency during each step when using an SAPI5 voice.

The included TestTTSEngine can create voices that forward data to your installed SAPI5 voices and trim the leading silence part before outputting the audio. You can check how much this can improve the responsiveness.

If you use the TestTTSClient.exe, then you can see the log generated during speaking, and check the latency of each step.

The code I used to test the delay between keypress and audio output is not included yet. But I plan to include it later.
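
For readers who want to experiment, here is a minimal sketch of leading-silence trimming. It is not the TestTTSEngine code: the function names and the amplitude threshold are assumptions, and it only handles 16-bit mono little-endian PCM.

```python
import array
import sys
from typing import Tuple


def trim_leading_silence(pcm: bytes, threshold: int = 64) -> Tuple[bytes, int]:
    """Return (trimmed audio, number of bytes skipped) for 16-bit mono PCM.

    A sample is treated as silence while abs(sample) <= threshold.
    The threshold value is an arbitrary assumption for this sketch.
    """
    samples = array.array("h")
    samples.frombytes(pcm[: len(pcm) - len(pcm) % 2])  # ignore a dangling odd byte
    if sys.byteorder == "big":
        samples.byteswap()  # PCM data is little-endian
    for index, sample in enumerate(samples):
        if abs(sample) > threshold:
            # Cut on a sample boundary so the remaining audio stays aligned.
            return pcm[index * 2:], index * 2
    return b"", len(pcm)  # the whole buffer was silence


def bytes_to_ms(byte_count: int, samples_per_sec: int = 16000, bytes_per_sample: int = 2) -> float:
    """Convert a byte count of mono PCM into a duration in milliseconds."""
    return byte_count * 1000 / (samples_per_sec * bytes_per_sample)
```

At 16 kHz, 16 bits, mono, 640 skipped bytes correspond to 20 ms of audio.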

@gexgd0419
Contributor

gexgd0419 commented Dec 31, 2024

In the documentation of ISpAudio, Microsoft says:

In order to prevent multiple TTS voices or engines from speaking simultaneously, SAPI serializes output to objects which implement the ISpAudio interface. To disable serialization of outputs to an ISpAudio object, place an attribute called "NoSerializeAccess" in the Attributes folder of its object token.

You can notice the "serialization" performed by SAPI if you open two TTS clients and make them speak at the same time: only one of them can speak, and the other has to wait. Cross-process serialization might increase the delay.

Below is what I found using that test program on my system.

I found that if you let SAPI output audio to a memory stream, the "serialization" is bypassed, and the delay between the client calling Speak and receiving the first chunk of audio data is usually less than 10 ms. But if you let SAPI output audio to the default device, the delay can increase to about 50~100 ms.

As for the leading silence duration, if you are using one of the built-in voices, at normal rate it's about 100 ms, and at the maximum rate it decreases to about 30~50 ms.

For example, this is a log I got when outputting to the default device.

Log (output to default device)
  Total/ms  Delta/ms Step
     31.27     31.27 Engine output format set, target: (null), final: PCM 16 kHz 16 bits Mono
     31.70      0.44 Client Speak start
     46.49     14.79 Engine output format set, target: PCM 16 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
    857.38    810.89 Client StartStream event
    857.39      0.01 Engine Speak start
    860.38      2.99 Engine audio data written, 20.00 ms / 640 bytes silence skipped
    860.96      0.57 Engine Speak end
  1,273.37    412.41 Client EndStream event
  1,290.59     17.22 Client Speak end
  1,290.59      0.00 Client Speak start
  1,305.11     14.52 Engine output format set, target: PCM 16 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
  1,416.71    111.60 Client StartStream event
  1,416.72      0.01 Engine Speak start
  1,419.98      3.26 Engine audio data written, 31.62 ms / 1012 bytes silence skipped
  1,420.41      0.44 Engine Speak end
  1,773.15    352.74 Client EndStream event
  1,790.41     17.26 Client Speak end
  1,790.41      0.00 Client Speak start
  1,804.96     14.55 Engine output format set, target: PCM 16 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
  1,917.89    112.93 Client StartStream event
  1,917.90      0.01 Engine Speak start
  1,921.22      3.32 Engine audio data written, 31.69 ms / 1014 bytes silence skipped
  1,921.72      0.50 Engine Speak end
  2,325.35    403.63 Client EndStream event
  2,343.40     18.05 Client Speak end
  2,343.41      0.01 Client Speak start
  2,358.35     14.94 Engine output format set, target: PCM 16 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
  2,473.57    115.22 Client StartStream event
  2,473.58      0.01 Engine Speak start
  2,476.68      3.10 Engine audio data written, 31.69 ms / 1014 bytes silence skipped
  2,477.20      0.52 Engine Speak end
  2,900.30    423.10 Client EndStream event
  2,917.03     16.73 Client Speak end
  2,917.04      0.00 Client Speak start
  2,932.07     15.03 Engine output format set, target: PCM 16 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
  3,045.02    112.95 Client StartStream event
  3,045.02      0.01 Engine Speak start
  3,048.18      3.16 Engine audio data written, 31.44 ms / 1006 bytes silence skipped
  3,048.72      0.53 Engine Speak end
  3,482.50    433.78 Client EndStream event
  3,500.74     18.25 Client Speak end

You can see that there's more than 100 ms delay before each StartStream event, when the audio output hasn't even begun. The TTS engine starts synthesizing the voice after Engine Speak start happens.
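
To pull that number out of a log automatically, something like the following could be used. It is a sketch that assumes the three-column layout shown above (total, delta, step); `parse_log` and `start_stream_delays` are hypothetical helper names, not part of the test program.

```python
import re

# total/ms, delta/ms, then the step description
LOG_LINE = re.compile(r"^\s*([\d,]+\.\d+)\s+([\d,]+\.\d+)\s+(.+?)\s*$")


def parse_log(text):
    """Return a list of (total_ms, step) tuples; non-matching lines are skipped."""
    entries = []
    for line in text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            total = float(match.group(1).replace(",", ""))
            entries.append((total, match.group(3)))
    return entries


def start_stream_delays(entries):
    """Time from each 'Client Speak start' to the next 'Client StartStream event'."""
    delays = []
    speak_start = None
    for total, step in entries:
        if step == "Client Speak start":
            speak_start = total
        elif step == "Client StartStream event" and speak_start is not None:
            delays.append(round(total - speak_start, 2))
            speak_start = None
    return delays


sample = """\
  1,290.59      0.00 Client Speak start
  1,305.11     14.52 Engine output format set, target: PCM 16 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
  1,416.71    111.60 Client StartStream event
  1,416.72      0.01 Engine Speak start
  1,790.41      0.00 Client Speak start
  1,804.96     14.55 Engine output format set, target: PCM 16 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
  1,917.89    112.93 Client StartStream event
"""
print(start_stream_delays(parse_log(sample)))
```

On this excerpt the two utterances wait roughly 126 ms and 127 ms between Speak being called and the stream starting.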

If you output to a memory stream, the extra delay will be gone.

Log (output to memory stream)
  Total/ms  Delta/ms Step
      0.50      0.50 Engine output format set, target: (null), final: PCM 16 kHz 16 bits Mono
      1.11      0.61 Client Speak start
      1.13      0.02 Engine output format set, target: PCM 48 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
      1.14      0.01 Client StartStream event
      1.15      0.01 Engine Speak start
      4.08      2.93 Client audio data received
      4.09      0.00 Engine audio data written, 20.00 ms / 640 bytes silence skipped
      5.46      1.37 Engine Speak end
      5.46      0.00 Client EndStream event
      5.47      0.01 Client Speak end
      5.49      0.03 Client Speak start
      5.51      0.01 Engine output format set, target: PCM 48 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
      5.51      0.00 Client StartStream event
      5.52      0.01 Engine Speak start
      8.34      2.82 Client audio data received
      8.35      0.00 Engine audio data written, 31.62 ms / 1012 bytes silence skipped
      9.44      1.10 Engine Speak end
      9.45      0.00 Client EndStream event
      9.45      0.01 Client Speak end
      9.47      0.02 Client Speak start
      9.49      0.01 Engine output format set, target: PCM 48 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
      9.49      0.00 Client StartStream event
      9.50      0.01 Engine Speak start
     12.67      3.17 Client audio data received
     12.68      0.01 Engine audio data written, 31.69 ms / 1014 bytes silence skipped
     13.88      1.20 Engine Speak end
     13.88      0.00 Client EndStream event
     13.89      0.01 Client Speak end
     13.91      0.03 Client Speak start
     13.93      0.02 Engine output format set, target: PCM 48 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
     13.94      0.01 Client StartStream event
     13.95      0.01 Engine Speak start
     17.03      3.09 Client audio data received
     17.04      0.01 Engine audio data written, 31.69 ms / 1014 bytes silence skipped
     18.32      1.28 Engine Speak end
     18.32      0.00 Client EndStream event
     18.33      0.01 Client Speak end
     18.35      0.03 Client Speak start
     18.37      0.02 Engine output format set, target: PCM 48 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
     18.38      0.01 Client StartStream event
     18.38      0.01 Engine Speak start
     21.28      2.90 Client audio data received
     21.29      0.00 Engine audio data written, 31.44 ms / 1006 bytes silence skipped
     22.65      1.36 Engine Speak end
     22.65      0.00 Client EndStream event
     22.66      0.01 Client Speak end

And that was fast.

But yes, the audio has to be output to an audio device in order to be heard, and the output process introduces more delay, so the final delay won't be that good. We can only hope that WASAPI introduces less delay than WinMM which SAPI5 uses internally.

EDIT: Tried outputting to the default device again after my computer fan started spinning, and the delay became smaller! So this can be affected by many things, including the active power plan and the resource usage of other applications. But there was still about 80 ms delay.

@gexgd0419
Contributor

To make SAPI 5 voices able to use NVDA's own wave player (which uses WASAPI), we can try the following steps.

First, write a class that implements the COM interface IStream, which will receive the audio data. The audio data can be processed there. After that, feed the audio data to the player.

import config
import nvwave
from comtypes import COMObject
from objidl import IStream

S_OK = 0
E_NOTIMPL = 0x80004001
STREAM_SEEK_CUR = 1


class AudioStream(COMObject):
    """Receives audio data from SAPI and feeds it to NVDA's wave player."""

    _com_interfaces_ = [IStream]

    def __init__(self, fmt):
        self._writtenBytes = 0
        wfx = fmt.GetWaveFormatEx()  # SpWaveFormatEx
        self._player = nvwave.WavePlayer(
            channels=wfx.Channels,
            samplesPerSec=wfx.SamplesPerSec,
            bitsPerSample=wfx.BitsPerSample,
            outputDevice=config.conf["speech"]["outputDevice"],
        )

    def ISequentialStream_RemoteWrite(self, this, pv, cb, pcbWritten):
        # Called by SAPI for each chunk of synthesized audio.
        # audio processing...
        self._player.feed(pv, cb)
        self._writtenBytes += cb
        if pcbWritten:
            pcbWritten[0] = cb
        return S_OK

    def IStream_RemoteSeek(self, this, dlibMove, dwOrigin, plibNewPosition):
        if dwOrigin == STREAM_SEEK_CUR and dlibMove.QuadPart == 0:
            # SAPI is querying the current position.
            if plibNewPosition:
                plibNewPosition[0].QuadPart = self._writtenBytes
            # Succeed even if the caller passed no output pointer.
            return S_OK
        return E_NOTIMPL  # actual seeking is not supported

Other methods of IStream can be left unimplemented.

Then, when initializing the SpVoice object, create a SAPI.SpCustomStream object to wrap your IStream implementation and the wave format for the stream.

# ... After setting the voice:
self.tts.AudioOutput = self.tts.AudioOutput  # Reset the audio and its format parameters
fmt = self.tts.AudioOutputStream.Format
stream = comtypes.client.CreateObject("SAPI.SpCustomStream")  # might be different for MSSP voices
stream.BaseStream = AudioStream(fmt)  # set the IStream being wrapped
stream.Format = fmt
self.tts.AudioOutputStream = stream  # Set the stream (wrapper) as the output target

Now you will be able to hear the voices. Not everything is processed properly in the code above, but I hope that you can get the idea.

One of the problems is that continuous reading will be broken, because the Bookmark events become out of sync with the audio stream. We will need to synchronize them ourselves.

@gexgd0419
Contributor

gexgd0419 commented Jan 1, 2025

Now this latency tester project supports measuring the delay between keypress and audio output, so I did some tests.

Used version:
NVDA: Run from source at current master branch
Narrator: on Win 11 23H2
ZDSR (ZhengDu Screen Reader): Public Welfare version (公益版)

Modifications:
Trimmed: Used the "forwarded" voice created by TestTTSEngine, so the leading silence is removed.
WASAPI: Used my modified version of NVDA, which sends the audio data via NVDA's WavePlayer.

Voice: Microsoft Huihui (Chinese, Simplified)

Results:

Client          Voice                           Delay
NVDA            eSpeak NG                       73 ms
NVDA            Huihui OneCore                  97 ms
NVDA            Huihui SAPI5                    176 ms
NVDA            Huihui SAPI5 trimmed            139 ms
NVDA            Huihui SAPI5 WASAPI             114 ms
NVDA            Huihui SAPI5 WASAPI trimmed     77 ms
Narrator        Huihui OneCore                  76 ms
Narrator        Huihui SAPI5                    133 ms
Narrator        Huihui SAPI5 trimmed            118 ms
ZDSR 1.5.8.2    Huihui SAPI5                    131 ms
ZDSR 1.5.8.2    Huihui SAPI5 trimmed            94 ms
ZDSR 1.7.0.0    Huihui OneCore                  55 ms
ZDSR 1.7.0.0    Huihui SAPI5                    70 ms
ZDSR 1.7.0.0    Huihui SAPI5 trimmed            57 ms
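
As a quick reading aid, the NVDA SAPI5 rows can be compared against the unmodified SAPI5 output with a few lines of arithmetic. The numbers are copied from the results above; the dictionary keys are just labels chosen for this sketch.

```python
# Delays in ms for NVDA with the Huihui SAPI5 voice, from the results above.
nvda_sapi5 = {
    "original": 176,
    "trimmed": 139,
    "WASAPI": 114,
    "WASAPI trimmed": 77,
}

baseline = nvda_sapi5["original"]
# How many ms each modification saves compared with the original output path.
savings = {
    name: baseline - delay
    for name, delay in nvda_sapi5.items()
    if name != "original"
}
for name, saved in savings.items():
    print(f"{name}: {saved} ms faster than the original SAPI5 output")
```

In these measurements trimming saves 37 ms and WASAPI saves 62 ms, and the two add up to the 99 ms saved when both are applied, which puts SAPI5 within 4 ms of eSpeak NG's 73 ms.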

@cary-rowen
Contributor

Cool, it looks like @gexgd0419 has made some real progress on this and has given a test result.

So far I'd be interested to hear what NV Access has to say about this or any pointers on the way forward.

cc @gerald-hartig @seanbudd

Also @jcsteh’s comments are valuable, can you talk about them?

I'm excited about the improved responsiveness

@gexgd0419
Contributor

For example, here's the NVDA log when I pressed the S key, and the detected audio latency is 118.11ms, with the original SAPI5 implementation, but with leading silence trimmed.

IO - inputCore.InputManager.executeGesture (09:27:28.958) - winInputHook (30300):
Input: kb(desktop):s
IO - speech.speech.speak (09:27:28.998) - MainThread (30216):
Speaking [CharacterModeCommand(True), LangChangeCommand ('zh_CN'), 's', EndUtteranceCommand()]
DEBUG - synthDrivers.sapi5.SapiSink.EndStream (09:27:29.509) - MainThread (30216):
TestTTSEngine logged items:
0.00	NVDA Speak preparing
0.04	NVDA Speak start
14.17	Engine output format set, target: PCM 16 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
41.59	Engine Speak start
45.99	Engine audio data written, 39.38 ms / 1260 bytes silence skipped
46.72	Engine Speak end
49.67	NVDA StartStream
506.67	NVDA EndStream

From the timestamps in the log, we can get the following timeline:

Total     Delta     Step
0 ms      -         Received keyboard input
40 ms     40 ms     Issued Speak command
44 ms     4 ms      synth.speak() called
86 ms     42 ms     Engine Speak start
90 ms     4 ms      Engine audio data written
94 ms     4 ms      NVDA StartStream
118 ms    24 ms     Audio detected
551 ms    433 ms    NVDA EndStream

There's 40ms delay between receiving the keyboard input and issuing the Speak command, and there's 20~30ms delay between writing audio data and outputting audio. So the minimum possible delay of NVDA on my system would be about 70ms, which could be achieved using the eSpeak NG voice, or the Huihui SAPI5 voice via WASAPI with leading silence trimmed.
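
One way to reproduce this timeline from the two logs is sketched below. It assumes that the engine log's "NVDA EndStream" entry and NVDA's own EndStream log line mark the same instant, which is an assumption made for this example rather than something the logs state; `ms_between` is a hypothetical helper.

```python
from datetime import datetime


def ms_between(start: str, end: str) -> float:
    """Milliseconds between two HH:MM:SS.mmm timestamps on the same day."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() * 1000


# Wall-clock timestamps from the NVDA log.
key_press = "09:27:28.958"
speak_issued = "09:27:28.998"
end_stream_logged = "09:27:29.509"

# Engine log entries, in ms relative to "NVDA Speak preparing".
engine_log = {
    "NVDA Speak start": 0.04,
    "Engine Speak start": 41.59,
    "Engine audio data written": 45.99,
    "NVDA StartStream": 49.67,
    "NVDA EndStream": 506.67,
}

# Assumption: both logs record EndStream at the same moment, so the difference
# gives the offset of the engine log relative to the key press.
offset = ms_between(key_press, end_stream_logged) - engine_log["NVDA EndStream"]

timeline = {"Issued Speak command": ms_between(key_press, speak_issued)}
timeline.update({step: offset + t for step, t in engine_log.items()})
for step, total in timeline.items():
    print(f"{total:7.2f} ms  {step}")
```

The offset comes out at about 44 ms, which places Engine Speak start at about 86 ms after the key press.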

@jcsteh
Contributor

jcsteh commented Jan 3, 2025

There's 40ms delay between receiving the keyboard input and issuing the Speak command

This is a little tangential, but 40 ms is unexpectedly high there. I would expect something more like 20 ms or less, though it may be worse if you're running on battery.

Also, this raises another problem: handling typed characters seems to be pretty slow. If I do this in input help, I see 2 ms or less there. If I do this using the left or right arrow keys in the Run dialog edit field, I get 10 ms or less. This is a result of the optimisation work I did in #14928 and #14708. However, speaking of typed characters doesn't appear to benefit from this. This might be improved if we tweak eventHandler so that it always uses an immediate pump for typedCharacter events, just like we do for gainFocus events.

@gexgd0419
Contributor

I tried to implement WASAPI on SAPI5 (and maybe SAPI4) further, but I think I need some help.

The problem is how I can synchronize the bookmark events with the audio stream.

IStream receives "streamed" audio data in chunks, rather than waiting for synthesis to complete and receiving all audio data. This approach might reduce startup delay, but here's a problem. Let's say NVDA needs to speak two sentences, A and B, with a bookmark in between. SAPI framework will stream in the audio data for A first, then the bookmark event, then audio data for B. This seems normal, until I found that WavePlayer.feed calls the callback after this audio chunk is played. But the program cannot know that the bookmark exists before it receives the audio for A! So is there a way to tell WavePlayer that I want to insert a callback to be called after the last fed chunk, but without feeding actual audio data?

Even worse, there's no guarantee that the bookmark event will happen right between audio for A and B, because audio and events are sent in different threads. But maybe this can be fixed by using ISpEventSource directly to get the events rather than using the automation compatible event interface.

Or maybe there's another way.

As the current implementation of OneCore voices is already using WavePlayer (and WASAPI), I checked the code and it seemed that all wave data are retrieved at once, instead of being "streamed" in chunks.

Related OneCore speech C++ code
winrt::fire_and_forget
speak(
    void* originToken,
    winrt::hstring text,
    std::shared_ptr<winrtSynth> synth,
    std::function<ocSpeech_CallbackT> cb
) {
    try {
        co_await winrt::resume_background();

        SpeechSynthesisStream speechStream{ nullptr };
        try {
            // Wait for the stream to complete
            speechStream = co_await synth->SynthesizeSsmlToStreamAsync(text);
        }
        catch (winrt::hresult_error const& e) {
            LOG_ERROR(L"Error " << e.code() << L": " << e.message().c_str());
            protectedCallback_(originToken, std::optional<SpeakResult>(), cb);
            co_return;
        }
        const std::uint32_t size = static_cast<std::uint32_t>(speechStream.Size());
        std::optional<SpeakResult> result(SpeakResult{
            Buffer(size),
            createMarkersString_(speechStream.Markers())  // send all markers (bookmarks) in a string
            }
        );
        try {
            // Read all data and send it to callback function in one go
            co_await speechStream.ReadAsync(result->buffer, size, InputStreamOptions::None);
            protectedCallback_(originToken, result, cb);
            co_return;
        }
        catch (winrt::hresult_error const& e) {
            LOG_ERROR(L"Error " << e.code() << L": " << e.message().c_str());
            protectedCallback_(originToken, std::optional<SpeakResult>(), cb);
            co_return;
        }
    }
    // ... catch blocks ...
}

Although asynchronous, the audio and all the markers for this entire utterance will be ready when the callback function is called.

This is more like @shenguangrong 's approach above, which uses SpMemoryStream to store all the audio data first. This might add some delay, but if utterances are short, the extra delay can be very short, only a few milliseconds. You can check the log in this comment when outputting to a memory stream. Engine Speak end (or all audio written) happened only 1~2 ms after the first audio chunk was written.

If the delay of OneCore voices is acceptable, then this approach is also feasible.

@jcsteh
Contributor

jcsteh commented Jan 4, 2025

Let's say NVDA needs to speak two sentences, A and B, with a bookmark in between. SAPI framework will stream in the audio data for A first, then the bookmark event, then audio data for B. This seems normal, until I found that WavePlayer.feed calls the callback after this audio chunk is played. But the program cannot know that the bookmark exists before it receives the audio for A! So is there a way to tell WavePlayer that I want to insert a callback to be called after the last fed chunk, but without feeding actual audio data?

Not currently, though it might be possible to add it. However, you should be able to manufacture this already. One way would be to have a dict which maps from chunk id to bookmark id. Chunk id could be a simple counter which you increment for every chunk you feed or it could be something you easily get from SAPI; e.g. a stream position. After you call feed, keep track of the last chunk id in an instance variable. When you get the bookmark event, set map[lastChunkId] = bookmarkId.
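
A minimal sketch of that bookkeeping might look like this. `BookmarkTracker` and its method names are hypothetical; it uses a list per chunk so that several bookmarks can follow the same chunk, and it is not thread-safe, so real code would need locking if audio and events arrive on different threads.

```python
class BookmarkTracker:
    """Associates each bookmark with the most recently fed audio chunk."""

    def __init__(self, on_bookmark_reached):
        self._on_bookmark_reached = on_bookmark_reached
        self._last_chunk_id = 0  # id of the last chunk fed to the player
        self._last_done_id = 0  # id of the last chunk that finished playing
        self._bookmarks = {}  # chunk id -> list of bookmark ids

    def chunk_fed(self) -> int:
        """Call when a chunk is passed to the player; returns its id."""
        self._last_chunk_id += 1
        return self._last_chunk_id

    def bookmark_received(self, bookmark_id) -> None:
        """Call from the bookmark event handler."""
        if self._last_chunk_id <= self._last_done_id:
            # All audio fed so far has already played: report immediately.
            self._on_bookmark_reached(bookmark_id)
        else:
            self._bookmarks.setdefault(self._last_chunk_id, []).append(bookmark_id)

    def chunk_done(self, chunk_id: int) -> None:
        """Call when the player has finished playing the given chunk."""
        self._last_done_id = chunk_id
        for bookmark_id in self._bookmarks.pop(chunk_id, []):
            self._on_bookmark_reached(bookmark_id)
```

The intended wiring is to take the id from `chunk_fed()` when feeding a chunk and to call `chunk_done(chunk_id)` from that chunk's completion callback.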

Even worse, there's no guarantee that the bookmark event will happen right between audio for A and B, because audio and events are sent in different threads.

Yeah, this does seem like a source of intermittent timing problems.

As the current implementation of OneCore voices is already using WavePlayer (and WASAPI), I checked the code and it seemed that all wave data are retrieved at once, instead of being "streamed" in chunks.

That's correct. OneCore doesn't provide a streaming interface, unfortunately.

If the delay of OneCore voices is acceptable, then this approach is also feasible.

I don't think it is. We just don't have another choice. It causes unnecessary latency. Segmenting the text could help that, but it's not a true fix, just a workaround. This should always be a last resort and would IMO be an unacceptable regression.

@jcsteh
Contributor

jcsteh commented Jan 4, 2025

So is there a way to tell WavePlayer that I want to insert a callback to be called after the last fed chunk, but without feeding actual audio data?

Actually, you should be able to do this: player.feed(None, size=0, onDone=someCallback)
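
To illustrate the ordering this relies on, here is a toy model. `FakePlayer` is a stand-in written for this example, not NVDA's WavePlayer; it only mimics the `feed(data, size, onDone)` call shape quoted above and assumes callbacks run in feed order.

```python
from collections import deque


class FakePlayer:
    """Stand-in for a wave player: queues chunks and runs onDone in feed order."""

    def __init__(self):
        self._queue = deque()
        self.played = bytearray()

    def feed(self, data, size=None, onDone=None):
        # A None payload models a zero-length chunk that only carries a callback.
        chunk = b"" if data is None else bytes(data[:size])
        self._queue.append((chunk, onDone))

    def drain(self):
        """Pretend to play everything that has been queued so far."""
        while self._queue:
            chunk, on_done = self._queue.popleft()
            self.played += chunk
            if on_done:
                on_done()


events = []
player = FakePlayer()
player.feed(b"audio for A")
# The bookmark arrives after A was fed: queue an empty chunk carrying the callback.
player.feed(None, size=0, onDone=lambda: events.append("bookmark"))
player.feed(b"audio for B", onDone=lambda: events.append("B done"))
player.drain()
print(events)
```

In this model the bookmark callback fires after the audio for A and before the audio for B, without adding any audio data of its own.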

@gexgd0419
Contributor

I opened pull request #17592 as my attempt to fix this.

Here are the build artifact files, which include an installer exe for this alpha version. Does this improve the responsiveness of SAPI5 voices? Or does it introduce new bugs?

Also, I need a way to test the audio ducking feature. But audio ducking requires the UIAccess privilege, which requires the program to be installed and signed. How can I test audio ducking using an alpha version which is not signed?

@cary-rowen
Contributor

Hi @gexgd0419
Great work!

Glad to see this PR. I will test it later.
You can see the doc for creating a self-signed build here.

@github-actions github-actions bot added this to the 2025.1 milestone Jan 10, 2025
@jcsteh
Contributor

jcsteh commented Jan 10, 2025

Note that this issue is described as covering both SAPI5 and OneCore, but I don't think #17592 does anything regarding OneCore.

@gexgd0419
Contributor

The author said:

I'm focusing here on sapi5. I mentioned one core because I used a program called sapi unifier to port the one core voices into sapi5

OneCore voices are already using WASAPI, so their responsiveness cannot be improved using the same method.

cary-rowen pushed a commit to cary-rowen/nvda that referenced this issue Jan 11, 2025
Closes nvaccess#13284

Summary of the issue:
Currently, SAPI5 and MSSP voices use their own audio output mechanisms, instead of using the WavePlayer (WASAPI) inside NVDA.

According to my test results, this may make them less responsive than eSpeak and OneCore voices, which use the WavePlayer, and less responsive than other screen readers using SAPI5 voices.

This also gives NVDA less control of audio output. For example, audio ducking logic inside WavePlayer cannot be applied to SAPI5 voices, so additional code is required to compensate for this.

Description of user facing changes
SAPI5 and MSSP voices will be changed to use the WavePlayer, which may make them more responsive (have less delay).

According to my test results, this can reduce the delay by at least 50ms.

This change does not trim the leading silence yet. If we do that as well, we can expect the delay to be even lower.

Description of development approach
Instead of setting self.tts.audioOutput to a real output device, do the following:

Create an implementation class SynthDriverAudioStream that implements the COM interface IStream, which can be used to stream in audio data from the voices.
Use an SpCustomStream object to wrap SynthDriverAudioStream and provide the wave format.
Assign the SpCustomStream object to self.tts.AudioOutputStream, so SAPI will output audio to this stream instead.
Each time an audio chunk needs to be streamed in, ISequentialStream_RemoteWrite will be called, and we just feed the audio to the player. IStream_RemoteSeek can also be called when SAPI wants to know the current byte position of the stream (dlibMove should be zero and dwOrigin should be STREAM_SEEK_CUR in this case), but it is not used to actually "seek" to a new position. IStream_Commit can be called by MSSP voices to "flush" the audio data, where we do nothing. Other methods are left unimplemented, as they are not used when acting as an audio output stream.
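The stream behaviour described above can be sketched in plain Python. This is not the PR's SynthDriverAudioStream and involves no COM: `AudioStreamSketch` and its method names are hypothetical, and the `feed` argument stands in for whatever passes audio to the player. Only the `STREAM_SEEK_*` values come from the Windows SDK.

```python
from typing import Callable

# Seek origins as defined in the Windows SDK.
STREAM_SEEK_SET, STREAM_SEEK_CUR, STREAM_SEEK_END = 0, 1, 2


class AudioStreamSketch:
    """Plain-Python stand-in for the stream logic; no COM involved."""

    def __init__(self, feed: Callable[[bytes], None]) -> None:
        self._feed = feed  # callable that accepts one chunk of audio bytes
        self._position = 0  # total number of bytes written so far

    def write(self, data: bytes) -> int:
        # Plays the role of ISequentialStream_RemoteWrite:
        # forward the chunk to the player and report how much was taken.
        self._feed(data)
        self._position += len(data)
        return len(data)

    def seek(self, move: int, origin: int) -> int:
        # Plays the role of IStream_RemoteSeek: only "where am I?" is supported.
        if move != 0 or origin != STREAM_SEEK_CUR:
            raise NotImplementedError("seeking to a new position is not supported")
        return self._position

    def commit(self) -> None:
        # Plays the role of IStream_Commit: nothing is buffered, so do nothing.
        pass


chunks = []
stream = AudioStreamSketch(chunks.append)
written = stream.write(b"\x00" * 320)
stream.commit()
position = stream.seek(0, STREAM_SEEK_CUR)
```

Keeping a running byte count is what lets the stream answer the position query without storing any of the audio it has already handed over.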

Previously, comtypes.client.GetEvents was used to get the event notifications. But those notifications will be routed to the main thread via the main message loop. According to the documentation of ISpNotifySource:

Note that both variations of callbacks as well as the window message notification require a window message pump to run on the thread that initialized the notification source. Callback will only be called as the result of window message processing, and will always be called on the same thread that initialized the notify source. However, using Win32 events for SAPI event notification does not require a window message pump.

Because the audio data is generated and sent via IStream on a dedicated thread, receiving events on the main thread can make synchronizing events and audio difficult.

So here SapiSink is changed to become an implementation of ISpNotifySink. Notifications received via ISpNotifySink are "free-threaded", sent on the original thread instead of being routed to the main thread.

To connect the sink, use ISpNotifySource::SetNotifySink.
To get the actual event that triggers the notification, use ISpEventSource::GetEvents. Events can contain pointers to objects or memory, so they need to be freed manually.
Finally, all audio ducking related code is removed. Now WavePlayer should be able to handle audio ducking when using SAPI5 and MSSP voices.
@gexgd0419
Contributor

Note that this issue is described as covering both SAPI5 and OneCore, but I don't think #17592 does anything regarding OneCore.

@jcsteh Now this PR #17648 aims to improve the responsiveness of all voices that use the WavePlayer, including OneCore and SAPI5 voices, by trimming the leading silence.
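As a rough illustration of what trimming leading silence means, here is a sketch for 16-bit little-endian PCM. The thread does not describe how #17648 actually does it, so the function name, the amplitude threshold, and the frame alignment are all assumptions made for this example.

```python
import struct


def trim_leading_silence(pcm: bytes, threshold: int = 64, channels: int = 1) -> bytes:
    """Drop leading samples whose absolute amplitude is at or below threshold.

    Assumes 16-bit signed little-endian PCM. Hypothetical helper, not NVDA code.
    """
    usable = len(pcm) - len(pcm) % 2  # ignore a dangling odd byte while scanning
    for index, (sample,) in enumerate(struct.iter_unpack("<h", pcm[:usable])):
        if abs(sample) > threshold:
            # Align to the start of the frame so channels stay interleaved correctly.
            start = index - index % channels
            return pcm[start * 2:]
    # Every sample was at or below the threshold.
    return b""


silence = b"\x00\x00" * 100
tone = struct.pack("<4h", 1000, -1000, 1000, -1000)
trimmed = trim_leading_silence(silence + tone)
```

In a streaming player this check only needs to run on the first chunks of an utterance, until the first audible sample is found.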
