
Add leading silence detection and removal logic #17648

Merged: 28 commits, Feb 11, 2025

Conversation

@gexgd0419 (Contributor) commented Jan 24, 2025

Link to issue number:

Closes #17614

Summary of the issue:

Some voices output leading silence before the actual speech audio. Removing it shortens the delay between a keypress and the user hearing the audio, making the voices more responsive.

Description of user facing changes

Users may find the voices more responsive. All voices using NVDA's WavePlayer will be affected, including eSpeak NG, OneCore, SAPI5, and some third-party voice add-ons.

This should only affect the leading silence. Silence between sentences or at punctuation marks is not changed, although this may depend on how the voice uses WavePlayer.

Description of development approach

I wrote a header-only library, silenceDetect.h, in nvdaHelper/local. It supports most wave formats (8/16/24/32-bit integer and 32/64-bit float) and uses a simple algorithm: check each sample to see whether it falls outside the threshold range (currently hard-coded to ±1/2^10, i.e. 0.0009765625 of full scale). It uses templates and requires the C++20 standard.
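For illustration, the detection idea can be sketched in Python (the actual implementation is C++20 template code in silenceDetect.h; the helper name below is mine, and 8-bit unsigned and 24-bit formats are omitted for brevity):

```python
import struct

THRESHOLD = 1 / 2 ** 10  # 0.0009765625, relative to full scale

def leading_silence_bytes(data: bytes, sample_fmt: str, full_scale: float) -> int:
    # Return how many bytes at the start of `data` are within the silence
    # threshold. `sample_fmt` is a struct format char ("h" = 16-bit int,
    # "f" = 32-bit float); `full_scale` is the maximum absolute sample value
    # (32768.0 for 16-bit int, 1.0 for float).
    sample_size = struct.calcsize(sample_fmt)
    limit = THRESHOLD * full_scale
    for offset in range(0, len(data) - sample_size + 1, sample_size):
        (sample,) = struct.unpack_from(sample_fmt, data, offset)
        if abs(sample) > limit:  # first sample outside the +/- threshold band
            return offset  # trim everything before this sample
    return len(data)  # the whole chunk is silence
```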

The WasapiPlayer in wasapi.cpp is updated to handle silence. A new member function, startTrimmingLeadingSilence, and its exported counterpart, wasPlay_startTrimmingLeadingSilence, are added to set or clear the isTrimmingLeadingSilence flag. If isTrimmingLeadingSilence is true, the next chunk fed in will have its leading silence removed; once non-silence is detected, isTrimmingLeadingSilence is reset to false. So startTrimmingLeadingSilence should be called every time a new utterance is about to be spoken.
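A minimal Python sketch of these flag semantics (the real implementation is C++ in wasapi.cpp; this class and its `_play` stub are hypothetical, and `leading_silence_bytes` is the helper sketched above):

```python
class TrimmingPlayerSketch:
    def __init__(self):
        self.isTrimmingLeadingSilence = False

    def startTrimmingLeadingSilence(self, enable: bool = True):
        # Exported to Python as wasPlay_startTrimmingLeadingSilence; called
        # whenever the next chunk should be treated as the start of an utterance.
        self.isTrimmingLeadingSilence = enable

    def feed(self, data: bytes):
        if self.isTrimmingLeadingSilence:
            skip = leading_silence_bytes(data, "h", 32768.0)  # assume 16-bit PCM
            if skip < len(data):
                # Non-silence found: stop trimming until the next utterance.
                self.isTrimmingLeadingSilence = False
            data = data[skip:]
        if data:
            self._play(data)

    def _play(self, data: bytes):
        pass  # hand the remaining audio to WASAPI (stub)
```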

In nvwave.py, startTrimmingLeadingSilence() will be called when:

  • the player is initialized;
  • the player is stopped;
  • idle is called;
  • _idleCheck determines that the player is idle.

Synthesizers usually call idle when an utterance is completed (so that audio ducking can work correctly), so idle is used here to mark the starting point of the next utterance. If a voice doesn't use idle this way, this logic may misbehave.

As long as the synthesizer uses idle as intended, its code doesn't need to be modified to benefit from this feature.

As leading silence can also be introduced by a BreakCommand at the beginning of the speech sequence, WavePlayer checks the speech sequence first; if there is a BreakCommand at the beginning, the leading silence is not trimmed for the current utterance. To check the exact speech sequence that is about to be spoken, a new extension point, pre_synthSpeak, is added to synthDriverHandler; it is invoked just before SpeechManager calls getSynth().speak(). The existing pre_speech is notified before SpeechManager processes and queues the speech sequence, so pre_synthSpeak is needed to provide the more accurate sequence.
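A hedged sketch of how a pre_synthSpeak handler could drive the trimming decision; `pre_synthSpeak` and `BreakCommand` are the names this PR uses, while the handler body, the `speechSequence` keyword, and the `player` reference are my assumptions:

```python
from speech.commands import BreakCommand
from synthDriverHandler import pre_synthSpeak

def _onPreSynthSpeak(speechSequence=None, **kwargs):
    # If the utterance deliberately starts with a pause, keep its silence.
    startsWithBreak = bool(speechSequence) and isinstance(speechSequence[0], BreakCommand)
    if not startsWithBreak:
        player.startTrimmingLeadingSilence()  # `player` is the active WavePlayer

pre_synthSpeak.register(_onPreSynthSpeak)
```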

When a WavePlayer's purpose is not SPEECH, it does not trim leading silence by default, because of the way playWaveFile works (it calls idle after every chunk). Users of WavePlayer can still enable or disable automatic trimming by calling enableTrimmingLeadingSilence, or initiate trimming manually for the next audio section by calling startTrimmingLeadingSilence.
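For example (hypothetical usage, with `player` an existing nvwave.WavePlayer whose purpose is not SPEECH):

```python
player.enableTrimmingLeadingSilence(True)  # opt in to automatic trimming
player.startTrimmingLeadingSilence()       # or trim only the next audio section
```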

Other approaches that might be worth considering (but have not been implemented):

  • Put the silence detection/removal logic in a separate module instead of in WavePlayer. The drawback is that every voice synthesizer module would need to be modified to use it.
  • Use another audio library, such as PyDub, to detect/remove silence.
  • Add a setting item to turn this feature on or off.
  • Add a public API function to allow a synthesizer to opt out of this feature.

Testing strategy:

These are the delay results I got using 1X speed. "Improved delay" means the delay after applying this PR.

| Voice | Original delay | Improved delay |
| --- | --- | --- |
| eSpeak NG | 89 ms | 69 ms |
| Huihui OneCore | 192 ms | 74 ms |
| Huihui SAPI5 | 166 ms | 57 ms |

If the speech rate is higher than 1X, the original delay may be shorter, because the leading silence is also shortened during the speed-up. When there is no leading silence, changing the rate does not affect the delay much.

Considering the margin of error, we can say that the delays of eSpeak NG, OneCore, and SAPI5 voices are now at the same level.

Known issues with pull request:

None

Code Review Checklist:

  • Documentation:
    • Change log entry
    • User Documentation
    • Developer / Technical Documentation
    • Context sensitive help for GUI changes
  • Testing:
    • Unit tests
    • System (end to end) tests
    • Manual testing
  • UX of all users considered:
    • Speech
    • Braille
    • Low Vision
    • Different web browsers
    • Localization in other languages / culture than English
  • API is compatible with existing add-ons.
  • Security precautions taken.


@zstanecic (Contributor)

@gexgd0419
I am slightly confused... does your change require modifications to synth drivers?

@gexgd0419 (Contributor, Author)

> Does your change require modifications to synth drivers?

In the current implementation, no. As long as the synth driver calls idle after each utterance, it will work, because the logic is integrated into WavePlayer.

But if separating the logic out of WavePlayer is preferred, then the synth driver will have to be aware of the separate module.

@cary-rowen (Contributor)

I like this very much; I just ran a simple test, without any metrics though.
To put it simply: using the Natural voice adapter and SAPI5, NVDA used to give no voice feedback when I pressed keys quickly. Now this situation has been significantly improved.

@zstanecic (Contributor)

@gexgd0419
@cary-rowen
I am not so patient... Yes, I wanted to test it from source, and I did, so here are my findings:
Compared to the current alpha and stable versions of NVDA, I see a drastic improvement when using various speech synths.
So, to conclude: it is very nice work!
It works with any TTS, even with RHVoice in the NVDA add-on which I am using now.
There was a time when typing echo slowed me down a bit. Now I can leave typing echo turned on and listen to speech while typing.

@gexgd0419 gexgd0419 marked this pull request as ready for review January 24, 2025 15:15
@gexgd0419 gexgd0419 requested a review from a team as a code owner January 24, 2025 15:15
@rmcpantoja (Contributor)

Hi, I would like to look at this from a critical point of view, having previous experience with TTS:
Some of the minor delay in these voices doesn't come from the voices themselves; it comes from preprocessing steps, for example text preprocessing, a text-to-phoneme backend, language analyzers, or other tools applied before speech is synthesized from the provided text. Fortunately, these steps only cost a few milliseconds, so that shouldn't be a problem.
Also, most voices are built from already preprocessed audio, if we are talking about unit-selection or hidden Markov model voices, whether free or commercial. That means silences are technically trimmed before these voices are built from their databases. However, one particular case where voices do come with small silences is the traditional Microsoft hidden Markov model voices, which are no longer supported. For those voices, your changes are useful and can improve performance during navigation.
But anyway, let me try it over the next few days, and I may retract my comment.

Cheers.

@zstanecic (Contributor)

Hi @rmcpantoja
That statement in your comment is not quite true. RHVoice is also hidden-Markov-model based, and we don't do anything with silences when building the voices; we need them to detect pauses in the utterances during the training process. We also sometimes use Google's VAD library to detect silence.
Ivona TTS is a concatenative TTS, and Microsoft natural TTS is concatenative as well. When using such voices, I hear noticeable differences in speed and responsiveness.
To conclude: all voices benefit from this feature, and there is no impact on the resulting speech, which means there are no drawbacks.

@mzanm (Contributor) commented Jan 24, 2025

Hi. This is amazing work, thank you so much.
One issue is that some synthesizers implement the break command by adding silence (I believe), for example when enabling "delayed descriptions of characters on cursor movement" in the speech panel while using eSpeak, and possibly other synths. When moving between characters, the phonetic description is spoken almost immediately, which is unexpected, as it's supposed to come after 1000 milliseconds. The delay is usually well below that anyway, since it gets shortened by fast speech rates in some synthesizers, but here it's near-instant regardless of the speech rate on affected synths. Interestingly, this issue does not occur with OneCore or SAPI5 with the default David voice, but it does additionally occur with the Tiflotecnia voices synthesizer.

@gexgd0419 (Contributor, Author)

@mzanm Unfortunately, this is because I haven't found a way to distinguish "leading silence" accurately.

Audio data is sent to the WavePlayer in chunks. My algorithm only checks the beginning of each chunk for silence, so silence in the middle or at the end of a chunk won't be trimmed. I need to know whether a chunk is the first chunk of an utterance, so when WavePlayer.idle is called, the next chunk is marked as the first chunk of the next utterance, and only that chunk is trimmed.

In your case, the synthesizer calls WavePlayer.idle before the chunk that contains the silence, so the next chunk starts with silence and that silence is removed. Synthesizers that don't call WavePlayer.idle this way don't have this issue.

My goal is to require as few changes to existing synthesizers as possible, but NVDA's speech synthesizer structure doesn't seem to offer the right tool for detecting "the beginning of an utterance". I might be able to monitor calls to SynthDriver.speak, or insert a bookmark at the beginning of each utterance, but that would require changes to the synthesizers nonetheless, because the current implementation lives at the WavePlayer level, which has no access to the parent synthesizer object.

So there are two general directions.

  • Ways that require no changes to some synthesizers to work, but might break other synthesizers.
  • Ways that break no synthesizers, but require changes to all existing synthesizers to let them opt-in.

Which one do you prefer? Do you have a better idea for distinguishing the "leading silence"?

@gexgd0419 (Contributor, Author)

@mzanm I think I found the cause: a BreakCommand at the beginning of an utterance. The "delayed descriptions of characters on cursor movement" feature splits the character and its description into two utterances and puts a BreakCommand at the beginning of the second one. As a result, the silence introduced by the BreakCommand is trimmed, because my code cannot tell whether the leading silence comes from a BreakCommand; it has no access to the original command sequence. So in fact all synthesizers are affected by this issue; OneCore and SAPI5 voices may just have longer trailing silence at the end of the first utterance, so they seem normal...

@gexgd0419 gexgd0419 marked this pull request as draft January 25, 2025 02:30
@zstanecic (Contributor)

@gexgd0419 Some SAPI5 synthesizers have longer trailing silences, for example the Slovenian Ebralec or the Finnish Mikropuhe.
This is normal; this feature even works like the one in the ZDSR screen reader.

@gexgd0419 (Contributor, Author)

As I haven't found a way to check the command list in WavePlayer without some kind of modification to the synth driver code, I'm going to try a different method: if the leading silence is longer than, say, 200 ms, let it pass instead of trimming it, because longer silence is more likely to be introduced by a BreakCommand. Even then, silence shorter than 200 ms would still be trimmed. What would be an appropriate default value for the time threshold?
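A sketch of that idea, treating the 200 ms figure as a placeholder default and reusing the `leading_silence_bytes` helper sketched earlier:

```python
MAX_TRIM_MS = 200  # proposed default; longer silence is more likely a BreakCommand pause

def trim_leading_silence(data: bytes, samples_per_sec: int, block_align: int) -> bytes:
    skip = leading_silence_bytes(data, "h", 32768.0)
    max_trim_bytes = (samples_per_sec * MAX_TRIM_MS // 1000) * block_align
    if skip > max_trim_bytes:
        return data  # silence too long: probably deliberate, let it pass
    return data[skip:]
```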

@zstanecic (Contributor)

@gexgd0419 I am happy to test it. I am now running it from source.

@gexgd0419 (Contributor, Author)

@zstanecic I haven't pushed the changes yet...

@zstanecic (Contributor)

@gexgd0419 I know; I am just notifying you. By the way, it would be good to have a configurable silence threshold in advanced settings for experienced users...

@gexgd0419 (Contributor, Author)

Yes, a configurable threshold may be better.

But if my code had access to the command list to check whether there's a BreakCommand at the beginning, I wouldn't have to check the silence length at all.

This is mainly because WavePlayer is designed to play audio only, not to process audio, so everything given to WavePlayer is audio-related.

If the logic is put in WavePlayer:

  • It can affect many synthesizers without requiring changes to their code.
  • The code has no access to other information, such as the original speak command list, so the processing might not be accurate.
  • Future WavePlayer users might be confused about why outputting silence doesn't work.

If the logic is put in another module:

  • The synthesizers can provide whatever information is needed, so the silence detection/removal could be more accurate.
  • It requires changes to all existing synthesizers; without changes, synthesizers will still work as usual, but their leading silence won't be trimmed.

I want to know what NV Access thinks about the idea of putting audio processing logic into WavePlayer.

@mzanm (Contributor) commented Jan 25, 2025

I have an idea, but I'm not sure if it'll work. Is it possible to have some kind of flag that determines whether the next utterance should be trimmed? Then, in speech.speech.speak, you could iterate through the speechSequence, and if there's a text command (or rather a string) with no BreakCommand before it, set that flag so the leading silence of the next speech is trimmed. This possibly wouldn't require changes to synthesizers. Or maybe you could register to the pre_speech extension point in nvwave and look at the sequence that way... but both methods probably wouldn't work if speech sequences are queued.

@gexgd0419 (Contributor, Author)

@mzanm Thank you for the info. But as you said, pre_speech might be too early, as the speech might be queued. Maybe I can modify the SpeechManager and insert the logic before getSynth().speak(seq).

> This will be called by SpeechManager just before calling getSynth().speak().
>
> The existing extension point `pre_speech` is called before SpeechManager processes the speech sequence, so a new one is added to get the exact speech sequence that will be spoken immediately by the synth driver.
@gexgd0419 (Contributor, Author)

To check the exact speech sequence that is about to be spoken, I chose to add a new extension point, pre_synthSpeak, in synthDriverHandler, which is invoked just before SpeechManager calls getSynth().speak().

@mzanm The removal of silence when using "delayed descriptions of characters on cursor movement" should now be fixed.

However, this makes WavePlayer depend on speech-related modules.

@SaschaCowley @seanbudd What do you think of the idea of adding audio-processing features to WavePlayer? It might not be what it was designed for, but modifying WavePlayer is one of the easiest ways to "fix" things for all synthesizers, especially third-party ones.

@zstanecic (Contributor)

@gexgd0419,
@mzanm
@SaschaCowley
I can confirm that the problem with the break command is fixed for synthesizers that use it.
I think this is a good approach, because drivers do not need to be modified.

@SaschaCowley (Member) left a comment:

Thanks @gexgd0419, nice work!

@Qchristensen (Member) left a comment:

Great work, thank you.

@wmhn1872265132 (Contributor)

Since this feature can now be disabled, can it be merged during the 2025.1 development cycle?

@zstanecic (Contributor)

I also agree with @wmhn1872265132.
This feature is a great improvement, and if we have more time for testing in alpha (apparently we do), it will be good to have it there.
I am running this from source and I see a massive improvement with various synths.

@SaschaCowley SaschaCowley modified the milestones: 2025.2, 2025.1 Feb 7, 2025
@gexgd0419 (Contributor, Author)

I removed WaveFormat::convertFrom because it is not used in the code. Currently the threshold is hard-coded to 1/2^10, or 0.0009765625, which can be calculated at compile time, so there is no need to convert a custom threshold value at the moment.

Still, I think that I might be over-engineering this a little bit...

@gexgd0419 gexgd0419 requested a review from seanbudd February 11, 2025 03:19
@seanbudd seanbudd merged commit 1dc0f76 into nvaccess:master Feb 11, 2025
5 checks passed
@LeonarddeR (Collaborator)

It looks like this feature breaks the letter "p" when it is at the beginning of a speech sequence.

@gexgd0419 (Contributor, Author)

@LeonarddeR Which voice and speech rate are you using?

@LeonarddeR (Collaborator)

I'm using eSpeak with rate boost on; the rate is 30, but I've also seen it at lower rates. I'm using the default max voice with the Dutch language.

@beqabeqa473 (Contributor) commented Feb 11, 2025

When using some synths, like the Russian Newfon, the first letter is cut off every time if it is a consonant.

@zstanecic (Contributor)

@beqabeqa473 @LeonarddeR @gexgd0419
The problem is not happening with RHVoice.

@gexgd0419 (Contributor, Author)

The threshold is hard-coded to 1/2^10, or 0.0009765625. Maybe it's not small enough, so some extra audio is trimmed.

Should I make the threshold adjustable via settings, or just try another hard-coded value?

@zstanecic (Contributor)

Hi @gexgd0419!
Make it adjustable; that's the best option, so that users can tune it themselves.

@gexgd0419 (Contributor, Author)

@LeonarddeR @beqabeqa473 I think this may also be related to the volume level in NVDA (not the system volume level), because the speech audio is first attenuated by the NVDA volume level, then passes through my silence detector, and is then attenuated by the system volume level. If the volume is low enough before the audio reaches the silence detector, more of it may fall below the threshold and be trimmed.

Could you test whether it gets better with a higher volume level in NVDA?

@gexgd0419 gexgd0419 deleted the trim-leading-silence branch February 11, 2025 12:51
@Danstiv (Contributor) commented Feb 11, 2025

This feature affects the behavior of tones.beep, even when it is turned off in advanced settings.
STR:

  1. Play a short beep of 30-40 milliseconds and remember exactly how it sounds.

  2. Wait a while; perhaps one minute is enough.

  3. Play the beep again: the very first beep will be cut off, while subsequent beeps play without intervention.

@gexgd0419 (Contributor, Author)

@Danstiv Confirmed: the first beep wave is chopped off a little bit, and this does not happen without this feature.

@Danstiv (Contributor) commented Feb 11, 2025

It seems to affect all players, including the eSpeak NG player, if several tens of seconds have passed since the last use.

Labels: conceptApproved (similar to 'triaged' for issues; PR accepted in theory, implementation needs review), merge-early (merge early in a developer cycle)

Successfully merging this pull request may close these issues:

  • Improve the responsiveness of voices by trimming the leading silence