🗣️ Open TTS Tracker

A one-stop shop to track all open-access/open-source TTS models as they come out. Feel free to open a PR for any that aren't listed here.

This is intended as a resource to increase awareness of these models and to make it easier for researchers, developers, and enthusiasts to stay informed about the latest advancements in the field.

Note

This repo only tracks TTS models with an open-source/open-access codebase. More motivation for everyone to open-source! 🤗

| Name | GitHub | Weights | License | Fine-tune | Languages | Paper | Demo | Issues |
|---|---|---|---|---|---|---|---|---|
| Amphion | Repo | 🤗 Hub | MIT | No | Multilingual | Paper | 🤗 Space | |
| AI4Bharat | Repo | 🤗 Hub | MIT | Yes | Indic | Paper | Demo | |
| Bark | Repo | 🤗 Hub | MIT | No | Multilingual | Paper | 🤗 Space | |
| EmotiVoice | Repo | GDrive | Apache 2.0 | Yes | ZH + EN | Not Available | Not Available | Separate GUI agreement |
| Glow-TTS | Repo | GDrive | MIT | Yes | English | Paper | GH Pages | |
| GPT-SoVITS | Repo | 🤗 Hub | MIT | Yes | Multilingual | Not Available | Not Available | |
| HierSpeech++ | Repo | GDrive | MIT | No | KR + EN | Paper | 🤗 Space | |
| IMS-Toucan | Repo | GH release | Apache 2.0 | Yes | Multilingual | Paper | 🤗 Space | |
| MahaTTS | Repo | 🤗 Hub | Apache 2.0 | No | English + Indic | Not Available | Recordings, Colab | |
| Matcha-TTS | Repo | GDrive | MIT | Yes | English | Paper | 🤗 Space | GPL-licensed phonemizer |
| MetaVoice-1B | Repo | 🤗 Hub | Apache 2.0 | Yes | Multilingual | Not Available | 🤗 Space | |
| Neural-HMM TTS | Repo | GitHub | MIT | Yes | English | Paper | GH Pages | |
| OpenVoice | Repo | 🤗 Hub | CC-BY-NC 4.0 | No | ZH + EN | Paper | 🤗 Space | Non Commercial |
| OverFlow TTS | Repo | GitHub | MIT | Yes | English | Paper | GH Pages | |
| Parler TTS | Repo | 🤗 Hub | Apache 2.0 | Yes | English | Not Available | Not Available | |
| pflowTTS | Unofficial Repo | GDrive | MIT | Yes | English | Paper | Not Available | GPL-licensed phonemizer |
| Piper | Repo | 🤗 Hub | MIT | Yes | Multilingual | Not Available | Not Available | GPL-licensed phonemizer |
| Pheme | Repo | 🤗 Hub | CC-BY | Yes | English | Paper | 🤗 Space | |
| RAD-MMM | Repo | GDrive | MIT | Yes | Multilingual | Paper | Jupyter Notebook, Webpage | |
| RAD-TTS | Repo | GDrive | MIT | Yes | English | Paper | GH Pages | |
| Silero | Repo | GH links | CC BY-NC-SA | No | EN + DE + ES + EA | Not Available | Not Available | Non Commercial |
| StyleTTS 2 | Repo | 🤗 Hub | MIT | Yes | English | Paper | 🤗 Space | GPL-licensed phonemizer |
| Tacotron 2 | Unofficial Repo | GDrive | BSD-3 | Yes | English | Paper | Webpage | |
| TorToiSe TTS | Repo | 🤗 Hub | Apache 2.0 | Yes | English | Technical report | 🤗 Space | |
| TTTS | Repo | 🤗 Hub | MPL 2.0 | No | ZH | Not Available | Colab, 🤗 Space | |
| VALL-E | Unofficial Repo | Not Available | MIT | Yes | NA | Paper | Not Available | |
| VITS/ MMS-TTS | Repo | 🤗 Hub / MMS | Apache 2.0 | Yes | English | Paper | 🤗 Space | GPL-licensed phonemizer |
| WhisperSpeech | Repo | 🤗 Hub | MIT | No | English, Polish | Not Available | 🤗 Space, Recordings, Colab | |
| XTTS | Repo | 🤗 Hub | CPML | Yes | Multilingual | Paper | 🤗 Space | Non Commercial |
| xVASynth | Repo | 🤗 Hub | GPL-3.0 | Yes | Multilingual | Paper | 🤗 Space | Copyrighted materials used for training. |

Capability specifics

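| Name | Processor | Phonetic alphabet 🔤 | Insta-clone 👥 | Emotional control 🎭 | Prompting 📖 | Speech control 🎚 | Streaming support 🌊 | S2S support 🦜 | Longform synthesis |
|---|---|---|---|---|---|---|---|---|---|
| Amphion | CUDA | | 👥 | 🎭👥 | | | | | |
| Bark | CUDA | | | 🎭 tags | | | | | |
| EmotiVoice | | | | | | | | | |
| Glow-TTS | | | | | | | | | |
| GPT-SoVITS | | | | | | | | | |
| HierSpeech++ | | | 👥 | 🎭👥 | | speed / stability 🎚 | | 🦜 | |
| IMS-Toucan | CUDA | | | | | | | | |
| MahaTTS | | | | | | | | | |
| Matcha-TTS | | IPA | | | | speed / stability 🎚 | | | |
| MetaVoice-1B | CUDA | | 👥 | 🎭👥 | | stability / similarity 🎚 | | | Yes |
| Neural-HMM TTS | | | | | | | | | |
| OpenVoice | CUDA | | 👥 | 6-type 🎭 😡😃😭😯🤫😊 | | | | | |
| OverFlow TTS | | | | | | | | | |
| pflowTTS | | | | | | | | | |
| Piper | | | | | | | | | |
| Pheme | CUDA | | 👥 | 🎭👥 | | stability 🎚 | | | |
| RAD-TTS | | | | | | | | | |
| Silero | | | | | | | | | |
| StyleTTS 2 | CPU / CUDA | IPA | 👥 | 🎭👥 | | | 🌊 | | Yes |
| Tacotron 2 | | | | | | | | | |
| TorToiSe TTS | | | | | 📖 | | 🌊 | | |
| TTTS | CPU/CUDA | | 👥 | | | | | | |
| VALL-E | | | | | | | | | |
| VITS/ MMS-TTS | CUDA | | | | | speed 🎚 | | | |
| WhisperSpeech | CUDA | | 👥 | 🎭👥 | | speed 🎚 | | | |
| XTTS | CUDA | | 👥 | 🎭👥 | | speed / stability 🎚 | 🌊 | | |
| xVASynth | CPU / CUDA | ARPAbet+ | | 4-type 🎭 😡😃😭😯 per‑phoneme | | speed / pitch / energy / 🎭 🎚 per‑phoneme | | 🦜 | |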
  • Processor - CPU/CUDA/ROCm (single/multi-device inference; the real-time factor should be below 2.0 to qualify for CPU, though some leeway is given if the model supports audio streaming)
  • Phonetic alphabet - None/IPA/ARPAbet (phonetic transcription that allows controlling the pronunciation of specific words during inference)
  • Insta-clone - Yes/No (zero-shot model for quick voice cloning)
  • Emotional control - Yes 🎭/Strict (Strict means the model cannot go in between emotional states; 🎭👥 means emotion is switched via insta-clone)
  • Prompting - Yes/No (a side effect of narrator-based datasets and a way to affect the emotional state; see the ElevenLabs docs)
  • Streaming support - Yes/No (whether audio can be played back while it is still being generated)
  • Speech control - e.g. speed/pitch (ability to change the pitch, duration, energy and/or emotion of generated speech)
  • Speech-To-Speech support - Yes/No (streaming support implies real-time S2S; chaining S2T => T2S does not count)
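To make the Processor criterion concrete: the real-time factor (RTF) is usually computed as wall-clock synthesis time divided by the duration of the audio produced, so lower is faster and values under 1.0 mean faster than real time. Below is a minimal Python sketch, assuming a hypothetical `synthesize(text)` callable that returns mono samples at a known sample rate. It is not tied to any listed model's actual API, and the helper names are illustrative only.

```python
import time
from typing import Callable, Sequence

# Threshold taken from the Processor legend above: RTF below 2.0 qualifies for CPU.
CPU_RTF_THRESHOLD = 2.0


def real_time_factor(elapsed_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return elapsed_seconds / audio_seconds


def measure_rtf(
    synthesize: Callable[[str], Sequence[float]],  # hypothetical TTS callable
    text: str,
    sample_rate: int,
) -> float:
    """Time one synthesis call and return its RTF (assumes mono samples)."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


def qualifies_for_cpu(rtf: float, threshold: float = CPU_RTF_THRESHOLD) -> bool:
    """True if the measured RTF is strictly below the CPU threshold."""
    return rtf < threshold


if __name__ == "__main__":
    # 3 s of compute for 2 s of audio -> RTF 1.5, which is below 2.0.
    rtf = real_time_factor(3.0, 2.0)
    print(rtf, qualifies_for_cpu(rtf))
```

A single timed call is noisy; a fairer benchmark would exclude model-loading time and average over several sentences of varying length.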

How can you help?

Help make this list more complete. Create demos on the Hugging Face Hub and link them here :) Got any questions? Drop me a DM on Twitter @reach_vb.