Integrate a High-Quality Text-to-Speech Engine for Natural Speech Output #17933

AlexMelw · 2025-04-10T08:38:05Z

AlexMelw
Apr 10, 2025

Hi NVDA Developers and Community,

I'd like to propose exploring the potential integration of modern, high-quality, open-source AI-powered Text-to-Speech (TTS) technology into the NVDA screen reader.

Motivation:

Currently, NVDA utilizes various TTS engines. While functional, many available synthesizers sound robotic and lack the natural intonation of human speech. This can lead to listening fatigue, especially during extended use.

Recent advancements in AI have produced TTS models capable of generating remarkably human-like audio. Leveraging such open-source models could significantly enhance the user experience for NVDA users by providing more natural and pleasant voice output.

Proposal:

Investigate the feasibility of integrating a suitable open-source, high-quality, AI-based TTS engine into NVDA as an alternative to, or potentially even a future replacement for, some existing options. The primary goal is to offer users a much more natural, expressive, and less fatiguing voice.

Considerations / Potential Challenges:

Model Size: Because voice models can be large, it may be more efficient to offer them as optional modules, downloaded from online storage post-installation, rather than bundling them directly with NVDA.
Offline Capability: NVDA requires reliable offline functionality. The chosen TTS solution must run entirely locally without needing online access.
Model Selection: Identifying the right open-source TTS model is key. Factors include voice quality, performance, resource usage, language support, ease of integration, and active maintenance. Examples of modern architectures/models to potentially investigate include VITS, Piper, Bark, Tortoise TTS, etc., focusing on those optimized for efficient local execution.
Licensing: The license of the chosen TTS model and its dependencies must be compatible with NVDA's project license (GPL). Many AI models use permissive licenses like MIT or Apache 2.0, but this needs careful verification.

This is an initial idea to spark discussion. Would the community find value in having a more natural, AI-powered voice option? What are the perceived technical hurdles, and are there specific open-source TTS models that the community thinks might be suitable candidates for investigation?

Looking forward to hearing your thoughts!

seanbudd · 2025-04-14T01:23:51Z

seanbudd
Apr 14, 2025
Maintainer

Are there any you could specifically suggest?
Unless a concrete model is proposed that can run on local hardware and has a compatible license, there's no real action we can take.

valiant8086 · 2025-04-15T23:45:55Z

valiant8086
Apr 15, 2025

I know nothing about licenses; it looks like the base project might be MIT? But I think Kokoro might be somewhat useful in this project. Here's an audio file I generated with it, which of course shows us nothing at all about the responsiveness or the CPU load.
rrraaarrrooowww.zip

Here's a link to a GitHub page about it.

It's not immediately apparent how to use the backend directly, but I think this project has the right idea. Plus, the am adam voice is my favorite, and is what I used in the linked audio clip.

1 reply

seanbudd Apr 22, 2025
Maintainer

Kokoro doesn't seem appropriate. It doesn't support many languages, and doesn't have much activity on its GitHub.
I fear it will become abandoned.
Piper seems more appropriate by comparison, with more voices and more GitHub activity.

AlexMelw · 2025-04-18T12:33:44Z

AlexMelw
Apr 18, 2025
Author

Piper TTS looks promising for this project. However, I've heard that in order for it to sound better, one needs to clean/normalize the text first.
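
To make "clean/normalize" a bit more concrete, here is a rough sketch of what such a pre-processing step might look like. It is only an illustration under my own assumptions: the function name `normalize_for_tts` and the tiny abbreviation table are hypothetical and are not part of Piper or NVDA, and a real normalizer would need to be language-specific.

```python
import re
import unicodedata

# Hypothetical, English-only abbreviation table (an assumption for illustration).
ABBREVIATIONS = {
    "e.g.": "for example",
    "Dr.": "Doctor",
}


def normalize_for_tts(text: str) -> str:
    """Return a cleaned-up copy of `text` that is friendlier to a TTS engine."""
    # Fold compatibility characters (e.g. non-breaking spaces) into plain forms.
    text = unicodedata.normalize("NFKC", text)
    # Drop control and format characters (Unicode categories starting with "C"),
    # but keep ordinary whitespace such as tabs and newlines for the next steps.
    text = "".join(
        ch
        for ch in text
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )
    # Expand the few abbreviations we know about.
    for abbreviation, spoken in ABBREVIATIONS.items():
        text = text.replace(abbreviation, spoken)
    # Collapse any run of whitespace into a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize_for_tts("Dr.  Smith\tsaid\u200b hi"))
```

Whether a step like this actually improves how Piper sounds is something that would need to be tested with real voices.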

1 reply

gerald-hartig Apr 22, 2025
Maintainer

Have you tried the Piper NVDA add-on for performance and quality?

audiogamer22 · 2025-04-24T22:41:58Z

audiogamer22
Apr 24, 2025

I think the more choices we have, the better.

Personally I'm not a huge fan of Piper, but that could be just because I'm used to Eloquence.
