Performance improvements to sapi5 speech synthesizer #17524

Closed
cary-rowen opened this issue Dec 14, 2024 · 6 comments

Comments

@cary-rowen
Contributor

Is your feature request related to a problem? Please describe.

The SAPI5 synthesizer in NVDA has noticeable latency between keypress and speech feedback, primarily due to unnecessary silence at the beginning and end of speech segments. This significantly impacts user experience, especially during typing and rapid navigation.
cc @gexgd0419, who has previously measured the SAPI5 synthesizer latency. Those measurements could provide valuable baseline metrics for this optimization effort, and including the measurement data and methodology would help quantify the improvements.

Describe the solution you'd like

Optimize the existing SAPI5 synthesizer by implementing audio stream preprocessing within the current driver. The solution will:

  1. Modify speech output process to:

    • Capture synthesized audio in memory stream before playback
    • Process audio data to detect and remove silence
    • Output the optimized audio stream
  2. Add silence detection algorithm that:

    • Analyzes audio frames for silence at start/end
    • Uses configurable thresholds for detection
    • Preserves the actual speech content
  3. Integrate with existing SAPI5 driver:

    • Keep current COM interface interaction
    • Add preprocessing as internal step
    • Maintain existing configuration options
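
As a rough illustration of the silence detection step, here is a minimal sketch that trims leading and trailing near-silent samples from a buffer. It assumes 16-bit mono little-endian PCM; the function name `trim_silence` and the default threshold of 256 are hypothetical choices for illustration, not values taken from NVDA or SAPI.

```python
import array
import sys


def trim_silence(pcm: bytes, threshold: int = 256) -> bytes:
    """Strip leading and trailing near-silent samples from 16-bit mono PCM.

    `threshold` is an assumed amplitude cutoff: samples whose absolute
    value is at or below it are treated as silence.
    """
    samples = array.array("h")
    # Ignore a dangling odd byte so the buffer holds whole 16-bit samples.
    samples.frombytes(pcm[: len(pcm) - len(pcm) % 2])
    if sys.byteorder == "big":
        samples.byteswap()  # the input is assumed to be little-endian

    start = 0
    end = len(samples)
    # Advance past silent samples at the start.
    while start < end and abs(samples[start]) <= threshold:
        start += 1
    # Walk back over silent samples at the end.
    while end > start and abs(samples[end - 1]) <= threshold:
        end -= 1
    # Slice the original bytes so quiet samples inside the speech are kept.
    return pcm[start * 2 : end * 2]
```

Only the edges are scanned, so pauses inside the utterance are left alone. A real implementation would probably want the threshold to be configurable per voice, as the list above suggests.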

Describe alternatives you've considered

  1. Existing implementation

    • Pro: Simpler implementation
    • Con: Less control over audio processing and limited optimization potential
    • Con: Obvious latency and poor user experience
  2. Optimization implementation

    • Pro: No additional latency
    • Con: More complex implementation

Additional context

  • Implementation requires careful COM interface handling
  • Need to test with various SAPI5 voices and settings
  • Performance impact should be measured
  • Must maintain compatibility with existing NVDA features
@cary-rowen
Contributor Author

My friend @shenguangrong (cc) is interested in contributing this. We would love to hear community feedback.

@gexgd0419
Contributor

I noticed that System.Speech.Synthesis.SpeechSynthesizer in C#, which looks like a simple wrapper for SAPI5 TTS APIs, is in fact using its own implementation.

Usually clients should use ISpVoice or the automation version, SAPI.SpVoice. The TTS engines that provide SAPI voices, on the other hand, each have a COM class that implements ISpTTSEngine. Clients and TTS engines do not interact directly, although TTS engines are loaded in the same process; instead, the SAPI framework handles the requests from clients, manages instances of TTS engines, and passes data and events back and forth.

If you launch multiple SAPI TTS client apps at the same time, you will notice that they cannot speak simultaneously. They are different processes, but when one of them is speaking, others must wait. So SAPI must have implemented some kind of cross-process synchronization, which might increase the delay.

System.Speech.Synthesis.SpeechSynthesizer in C# takes a different approach. Instead of using ISpVoice or SAPI.SpVoice, it interacts with the TTS engine directly through the ISpTTSEngine interface, bypassing the SAPI framework. You can notice the difference if you launch a C# app using SpeechSynthesizer together with some other apps using the standard SAPI interface: the C# app can speak independently of the other apps.

So I think that this method of using the SAPI voices can be tried if you want to decrease latency.

Pros:

  • More direct access to the TTS engines, which might decrease latency
  • Ability to speak independently of other SAPI clients

Cons:

  • More things have to be implemented, such as:
  • Possible incompatibility with some TTS engines

@cary-rowen
Contributor Author

@gexgd0419 Thank you for your valuable comment:

• Possible incompatibility with some TTS engines

I have some concerns about this: it may break compatibility with some TTS engines.

cc @LeonarddeR
Your suggestions would be helpful here.

@shenguangrong

Regarding the performance improvements for the SAPI5 speech synthesizer, I've attempted a solution to directly obtain audio data:

  1. Create necessary SAPI objects via COM interface:
    • Create SpVoice object for speech synthesis
    • Create SpMemoryStream object to capture audio stream
    • Create SpAudioFormat object to control audio format
  2. The core approach is to redirect TTS output to memory:
    • Configure SpAudioFormat audio parameters
    • Set SpMemoryStream as SpVoice's output destination
    • Obtain raw audio data directly from memory stream

It's important to note that this method retrieves the entire audio data at once, rather than streaming output. This presents several challenges:

  • Need to consider appropriate text segmentation
  • May need to implement strategies for segmented synthesis and playback
  • Further research is required for optimization

This is just an initial implementation approach, and more in-depth research and improvements will be needed.
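
A minimal sketch of the memory-capture approach described in this comment might look like the following. It assumes the `comtypes` package on Windows and the SAPI automation objects `SAPI.SpVoice` and `SAPI.SpMemoryStream`. The function name `synthesize_to_bytes` and the `create_object` parameter are hypothetical; the parameter exists only so the COM factory can be swapped out, for example in a test on a machine without SAPI. The choice of 22 kHz 16-bit mono is also just an assumption for illustration.

```python
# SpeechAudioFormatType value for 22 kHz, 16-bit, mono in the SAPI 5 automation API.
SAFT22kHz16BitMono = 22


def _create_com_object(prog_id):
    # Imported lazily so this module can still be loaded where comtypes
    # (Windows only) is unavailable.
    import comtypes.client

    return comtypes.client.CreateObject(prog_id)


def synthesize_to_bytes(text, create_object=_create_com_object):
    """Speak `text` into an SpMemoryStream and return the captured PCM bytes.

    The whole utterance is synthesized before anything is returned, which
    is the non-streaming limitation noted above.
    """
    voice = create_object("SAPI.SpVoice")
    stream = create_object("SAPI.SpMemoryStream")
    # Choose the output format through the stream's SpAudioFormat object.
    stream.Format.Type = SAFT22kHz16BitMono
    # Redirect the voice's output from the audio device to memory.
    voice.AudioOutputStream = stream
    # Flag 0 requests a synchronous call that returns once synthesis is done.
    voice.Speak(text, 0)
    # GetData() yields the buffered audio; bytes() normalizes whatever
    # sequence type the COM binding returns (an assumption about the binding).
    return bytes(stream.GetData())
```

On Windows, `pcm = synthesize_to_bytes("Hello")` would be the intended usage; the returned buffer could then be post-processed (for example, silence trimmed) before playback.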

@gexgd0419
Contributor

There's SpCustomStream, which accepts a custom IStream implementation, so if we can write our own implementation of the IStream COM interface in Python, streaming data may still be possible.

I also found that writing the voice to a wave file through SpFileStream does not need to wait for other SAPI clients to complete speaking, so maybe synchronization doesn't happen when outputting to a file/memory stream.
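
To illustrate why a custom stream would allow streaming, here is a plain-Python stand-in for the write side of such a stream. It is not a COM object and does not implement IStream; wiring it up behind SpCustomStream is not shown. The class name `StreamingSink`, the `play_chunk` callback, and the threshold are all hypothetical. The sketch assumes 16-bit mono little-endian PCM and forwards each chunk as soon as it arrives, dropping only the silence before the first audible sample.

```python
class StreamingSink:
    """Forward PCM chunks to a player as they arrive, skipping leading silence."""

    def __init__(self, play_chunk, threshold=256):
        self._play_chunk = play_chunk  # hypothetical callback feeding the audio device
        self._threshold = threshold  # assumed amplitude cutoff for silence
        self._started = False  # True once the first audible sample has been seen
        self._carry = b""  # odd trailing byte held until the next write

    def write(self, data: bytes) -> int:
        buf = self._carry + data
        # Keep only whole 16-bit samples; save any leftover byte for later.
        usable = len(buf) - len(buf) % 2
        self._carry = buf[usable:]
        buf = buf[:usable]

        if not self._started:
            offset = 0
            while offset < len(buf):
                sample = int.from_bytes(buf[offset : offset + 2], "little", signed=True)
                if abs(sample) > self._threshold:
                    self._started = True
                    break
                offset += 2
            buf = buf[offset:]  # empty if the whole chunk was silent

        if buf:
            self._play_chunk(buf)
        # Report every input byte as consumed, as a stream write would.
        return len(data)
```

Because playback can begin with the first audible chunk, latency would no longer depend on the length of the whole utterance. Trailing silence cannot be trimmed this way without buffering, so that part would need a different strategy.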

@SaschaCowley
Member

Duplicate of #13284

@SaschaCowley SaschaCowley marked this as a duplicate of #13284 Dec 16, 2024
@github-actions github-actions bot added this to the 2025.1 milestone Dec 16, 2024
@gerald-hartig gerald-hartig removed this from the 2025.1 milestone Dec 16, 2024
5 participants