Performance improvements to sapi5 speech synthesizer #17524

cary-rowen · 2024-12-14T11:21:43Z

Is your feature request related to a problem? Please describe.

The SAPI5 synthesizer in NVDA has noticeable latency between keypress and speech feedback, primarily due to unnecessary silence at the beginning and end of speech segments. This significantly impacts user experience, especially during typing and rapid navigation.
cc @gexgd0419, who has previously measured the SAPI5 synthesizer's latency. Those measurements could provide valuable baseline metrics for this optimization effort, and including the measurement data and methodology would help quantify the improvements.

Describe the solution you'd like

Optimize the existing SAPI5 synthesizer by implementing audio stream preprocessing within the current driver. The solution will:

Modify speech output process to:
- Capture synthesized audio in memory stream before playback
- Process audio data to detect and remove silence
- Output the optimized audio stream
Add silence detection algorithm that:
- Analyzes audio frames for silence at start/end
- Uses configurable thresholds for detection
- Preserves the actual speech content
Integrate with existing SAPI5 driver:
- Keep current COM interface interaction
- Add preprocessing as internal step
- Maintain existing configuration options
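
A minimal sketch of the silence-trimming step, to make the idea concrete. It assumes the captured audio is 16-bit signed little-endian PCM and treats a frame as silent when every sample in it stays within a fixed amplitude threshold. The function name `trim_silence` and the default threshold are hypothetical and not part of the NVDA driver; a real implementation would likely need a tuned or configurable threshold per voice.

```python
import sys
from array import array


def trim_silence(pcm: bytes, threshold: int = 256, channels: int = 1) -> bytes:
    """Drop leading and trailing frames whose samples all lie within +/-threshold.

    Assumes 16-bit signed little-endian PCM. Quiet frames between the first
    and last loud frame are kept, so pauses inside the speech are preserved.
    """
    frame_bytes = 2 * channels
    usable = len(pcm) - len(pcm) % frame_bytes  # ignore a trailing partial frame
    samples = array("h")
    samples.frombytes(pcm[:usable])
    if sys.byteorder == "big":
        samples.byteswap()  # the PCM data is little-endian

    n_frames = len(samples) // channels

    def is_silent(frame_index: int) -> bool:
        start = frame_index * channels
        return all(abs(s) <= threshold for s in samples[start:start + channels])

    # Scan forward to the first non-silent frame.
    first = 0
    while first < n_frames and is_silent(first):
        first += 1
    if first == n_frames:
        return b""  # nothing but silence

    # Scan backward to the last non-silent frame.
    last = n_frames - 1
    while last > first and is_silent(last):
        last -= 1

    return pcm[first * frame_bytes:(last + 1) * frame_bytes]
```

The scan is linear in the length of the leading and trailing silence, so the added processing cost should be small compared with synthesis itself, though that would need to be measured.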

Describe alternatives you've considered

Existing implementation
- Pro: Simpler implementation
- Con: Less control over audio processing and limited optimization potential
- Con: Obvious latency and poor user experience
Optimized implementation
- Pro: No additional latency
- Con: More complex implementation

Additional context

Implementation requires careful COM interface handling
Need to test with various SAPI5 voices and settings
Performance impact should be measured
Must maintain compatibility with existing NVDA features

cary-rowen · 2024-12-14T13:21:22Z

My friend @shenguangrong is interested in contributing to this. We would love to hear feedback from the community.

gexgd0419 · 2024-12-14T13:21:58Z

I noticed that System.Speech.Synthesis.SpeechSynthesizer in C#, which looks like a simple wrapper for SAPI5 TTS APIs, is in fact using its own implementation.

Usually, clients should use ISpVoice or its automation version, SAPI.SpVoice. TTS engines that provide SAPI voices, on the other hand, have a COM class that implements ISpTTSEngine. Clients and TTS engines do not interact directly, although TTS engines are loaded in the same process as the client; instead, the SAPI framework handles the requests from clients, manages instances of TTS engines, and passes data and events back and forth.

If you launch multiple SAPI TTS client apps at the same time, you will notice that they cannot speak simultaneously. They are different processes, but when one of them is speaking, others must wait. So SAPI must have implemented some kind of cross-process synchronization, which might increase the delay.

System.Speech.Synthesis.SpeechSynthesizer in C# takes a different approach. Instead of using ISpVoice or SAPI.SpVoice, it interacts with the TTS engine directly through the ISpTTSEngine interface, bypassing the SAPI framework. You can notice the difference if you launch a C# app that uses SpeechSynthesizer alongside other apps that use the standard SAPI interface: the C# app can speak independently of the other apps.

So I think that this method of using the SAPI voices can be tried if you want to decrease latency.

Pros:

More direct access to the TTS engines, which might decrease latency
Ability to speak independently of other SAPI clients

Cons:

More things have to be implemented, such as:
- Creation and initialization of TTS engine objects (for example, ISpObjectWithToken::SetObjectToken should be called on created objects)
- The ISpTTSEngineSite object that should be provided to the engine to exchange data
- Conversion from SSML/XML text to a list of SPVTEXTFRAG
Possible incompatibility with some TTS engines

cary-rowen · 2024-12-14T21:57:20Z

@gexgd0419 Thank you for your valuable comment:

• Possible incompatibility with some TTS engines

I have some concerns about this, since it may break compatibility with some TTS engines.

cc @LeonarddeR, your suggestions would be helpful here.

shenguangrong · 2024-12-16T04:32:05Z

Regarding the performance improvements for the SAPI5 speech synthesizer, I've attempted a solution to directly obtain audio data:

Create necessary SAPI objects via COM interface:
• Create SpVoice object for speech synthesis
• Create SpMemoryStream object to capture audio stream
• Create SpAudioFormat object to control audio format
The core approach is to redirect TTS output to memory:
• Configure SpAudioFormat audio parameters
• Set SpMemoryStream as SpVoice's output destination
• Obtain raw audio data directly from memory stream
It's important to note that this method retrieves the entire audio data at once, rather than streaming output. This presents several challenges:
• Need to consider appropriate text segmentation
• May need to implement strategies for segmented synthesis and playback
• Further research is required for optimization
This is just an initial implementation approach, and more in-depth research and improvements will be needed.
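
For readers following along, here is a rough sketch of the memory-stream capture described above. It is not the code from this comment. It assumes the `comtypes` package, the SAPI automation objects `SAPI.SpVoice` and `SAPI.SpMemoryStream`, and that the COM layer returns the stream data as a sequence of byte values. It sets the format through the stream's own `Format` property rather than a separately created SpAudioFormat. The function names and the `create_object` parameter are hypothetical; the parameter only exists so the logic can be exercised with stand-in objects on machines without SAPI5.

```python
# Value of SAFT22kHz16BitMono in the SAPI SpeechAudioFormatType enumeration.
SAFT22kHz16BitMono = 22


def _create_sapi_object(prog_id):
    # Imported lazily so this module can be loaded without comtypes installed.
    import comtypes.client
    return comtypes.client.CreateObject(prog_id)


def synthesize_to_bytes(text, create_object=_create_sapi_object):
    """Speak `text` into an in-memory stream and return the raw PCM bytes.

    The whole utterance is synthesized before anything is returned, so this
    is not streaming output.
    """
    voice = create_object("SAPI.SpVoice")
    stream = create_object("SAPI.SpMemoryStream")

    # Ask for a fixed PCM format so later processing knows the sample layout.
    stream.Format.Type = SAFT22kHz16BitMono

    # Redirect the voice's output from the audio device to the memory stream.
    voice.AudioOutputStream = stream

    # Flags value 0 is the default: Speak returns once synthesis has finished.
    voice.Speak(text, 0)

    # Assumption: GetData() yields a sequence of integers in range 0..255.
    return bytes(stream.GetData())
```

Because the call blocks until the full utterance is available, long text would still need to be split into segments, as noted above.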

gexgd0419 · 2024-12-16T05:13:34Z

There's SpCustomStream, which accepts a custom IStream implementation, so if we can write our own implementation of the IStream COM interface in Python, streaming data may still be possible.

I also found that writing the voice to a wave file through SpFileStream does not need to wait for other SAPI clients to complete speaking, so maybe synchronization doesn't happen when outputting to a file/memory stream.
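
To illustrate the streaming idea only: the sketch below is not a COM IStream implementation. It is a plain Python stand-in for the write side of such a stream, showing how each chunk could be handed to a consumer (for example a playback thread) as soon as it is written instead of waiting for the whole utterance. The class name `ChunkSink` and its methods are hypothetical; a real version would have to expose the IStream methods through the COM layer.

```python
import queue


class ChunkSink:
    """Receives audio chunks as they are written and hands them to a consumer."""

    def __init__(self):
        self._chunks = queue.Queue()

    def write(self, data) -> int:
        # Copy the data so the caller may reuse its buffer after returning.
        self._chunks.put(bytes(data))
        return len(data)

    def close(self) -> None:
        # A None entry marks the end of the utterance.
        self._chunks.put(None)

    def __iter__(self):
        # Blocks until the next chunk arrives; stops at the end marker.
        while True:
            chunk = self._chunks.get()
            if chunk is None:
                return
            yield chunk
```

A consumer would iterate over the sink on its own thread and feed each chunk to the audio player, while the synthesizer side calls `write` and finally `close`.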

SaschaCowley · 2024-12-16T23:26:22Z

Duplicate of #13284

SaschaCowley marked this as a duplicate of #13284 Dec 16, 2024

SaschaCowley closed this as completed Dec 16, 2024

github-actions bot added this to the 2025.1 milestone Dec 16, 2024

gerald-hartig removed this from the 2025.1 milestone Dec 16, 2024
