This repo contains a workflow for processing Speech & Text data into a Gold Standard training dataset for the task of Speech Cloning / Speech Mimicry. I've used the workflow to fine-tune a Speech Cloning system using RTVC (an implementation of Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis, SV2TTS).
With high quality data, the model fine-tuning process gives higher quality synthetic speech output and shortens training time.
- 🛠️ The Data Processing Workflow
- 💾 Scripts to manipulate the interim data.
- 📊 Notebooks to visualise & diagnose anomalous data.
Accurately transcribed words, aligned by time and labelled with speaker name, paired with a high quality recording. These are hard to come by because labelling audio with text is a costly human process. Whisper AI uses a combination of Gold standard data, then later Silver standard data, to help its generalisation performance.
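For illustration only, one gold-standard record pairs a speaker label, a time window, and a verified transcript. The field names and timings below are hypothetical, not a format this workflow prescribes:

```python
# Hypothetical example of a single gold-standard record: words accurately
# transcribed, aligned to a time window, and attributed to a named speaker
# within a high quality recording. Field names and times are illustrative.
gold_record = {
    "speaker": "Speaker-A",
    "source_audio": "episode-21.wav",      # placeholder filename
    "start_s": 312.4,                       # utterance start (seconds)
    "end_s": 318.9,                         # utterance end (seconds)
    "text": "FIRST HE TRIED SMART PILLS",   # verified, human-made transcript
}
```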
But why bother? You could skip the whole pre-processing stage - there are open source tools available to transcribe audio.
- The tools are definitely faster, but not as accurate. Watch a YouTube video with auto-generated subtitles to see the difference.
- "Quality in, Quality out". Enevitably there are losses each time you try to represent a dataset with a model. Using a model (speeech-to-text) to synthesize training data compounds this effect.
The datasets I used were from BBC podcasts. The BBC has been in the business of high quality audio for over 100 years. Their sound engineers are highly experienced, and the BBC has always provided great accessibility - so they have great experience making transcriptions.
The system I used is made from 3 pre-trained models:
- Speaker Encoder (GE2E, green) - Encodes audio waveforms into a "Speech d-Vector" embedding, by maximising the cosine similarity between samples of the same speaker (see the toy similarity check after this list).
- Voice Synthesizer (Tacotron, orange, 30.87M parameters) - A Generative, Attentional, sequence-to-sequence model that converts Text to Mel-spectrograms.
- Vocoder (WaveRNN, blue, 4.481M parameters) - A sequence-to-sequence neural network that converts Mel-spectrograms back into Audio waveforms.
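To make the Speaker Encoder's objective concrete, here is a toy check (plain NumPy, not code from this repo or RTVC) of the property a trained encoder should have:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two d-vectors (speaker embeddings)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# After GE2E training, d-vectors of utterances from the SAME speaker should
# score close to 1.0, while d-vectors from DIFFERENT speakers score much lower:
#   cosine_similarity(speaker_a_utt1, speaker_a_utt2) > cosine_similarity(speaker_a_utt1, speaker_b_utt1)
```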
They're assembled into a single model that allows one-shot training ("Cloning") of a reference voice. (described in this paper, Implementation used)
The Speaker Encoder must be trained first, because the embeddings it creates are input data associated with the waveform-text pairs the Voice Synthesizer is trained on.
The system is designed to imitate any speaker's voice. It can do this because of the hundreds of speakers it's pre-trained on, and the architecture of the Synthesizer + Speaker Encoder combo.
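As a rough sketch of how the three components chain together at inference time, the snippet below follows the pattern of RTVC's demo script. Module paths and checkpoint filenames differ between RTVC versions, so treat the paths, the reference clip name, and the output filename as placeholders:

```python
# Minimal sketch of the three-stage SV2TTS pipeline (pattern of RTVC's demo_cli.py).
# Checkpoint paths below are placeholders - check the RTVC submodule for the real ones.
from pathlib import Path
import soundfile as sf

from encoder import inference as encoder          # Speaker Encoder (GE2E)
from synthesizer.inference import Synthesizer     # Voice Synthesizer (Tacotron)
from vocoder import inference as vocoder          # Vocoder (WaveRNN)

encoder.load_model(Path("saved_models/encoder.pt"))
synthesizer = Synthesizer(Path("saved_models/synthesizer.pt"))
vocoder.load_model(Path("saved_models/vocoder.pt"))

# 1. Encode a short reference clip of the target speaker into a d-vector embedding.
ref_wav = encoder.preprocess_wav(Path("reference_speaker.wav"))
embed = encoder.embed_utterance(ref_wav)

# 2. Condition the Synthesizer on that embedding to turn text into a Mel-spectrogram.
text = "First he tried smart pills, then he moved on to electrical brain stimulation."
[mel] = synthesizer.synthesize_spectrograms([text], [embed])

# 3. The Vocoder converts the Mel-spectrogram back into an audio waveform.
waveform = vocoder.infer_waveform(mel)
sf.write("cloned_voice.wav", waveform, synthesizer.sample_rate)
```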
A gold standard dataset, processed with the workflow below, is used to run a relatively small number of additional training iterations on data from only one speaker. The system doesn't lose its ability to clone any new speaker, but the quality of the cloned voice for the one speaker in the fine-tuning dataset improves greatly.
Training the entire model from scratch would take around 15 days for the Synthesizer alone. Fine-tuning on 22 minutes of single-speaker audio data produces the synthesised voice below. Only the Synthesizer needs fine-tuning, though the Vocoder could also be fine-tuned afterwards.
Ground truth audio;
Speaker-A-21-utterance2.mp4
Transcript;
BOOST|BRAIN|POWER|FIRST|HE|TRIED|SMART|PILLS|THEN|HE|MOVED|ON|TO|ELECTRICAL|BRAIN|STIMULATION|FOR|COGNITIVE|ENHANCEMENT
Pre-trained model, trained for 295,000 iterations. Conditioned with audio from the same speaker;
synthetic-pretrained-AIM-1418-Speaker-A-21-utterance2.mp4
🔉 At least the pre-trained model has imitated a voice that is female! But it's not fooling anyone...
Fine-tuned model, trained for a further 19,000 iterations. Conditioned with audio from the same speaker;
synthetic-314-AIM-1418-Speaker-A-21-utterance2.mp4
🔉 In contrast to the pre-trained model, the fine-tuned model has done an amazing job - with only about 36 hours of training on an NVIDIA GeForce RTX 3050 Ti Laptop GPU!!
🔉 Super interesting nuance here: the second half of the audio is spoken more quickly and flows naturally, while the first half is 'robotic', with words spoken independently of each other.
🔉 Listening to the original audio, there are moments when the speaker audibly breathes in, and the sound engineer has not cropped this out. In the fine-tuned model an audible breath occurs before "THEN HE". 🔉 I wonder whether this is a cue for recalling (memorising) this exact training example? Or, if it's not simply overfitting, perhaps it's something more impressive: the RNN has learnt that a deep breath in precedes a segment of more rapid speech?
Set up a conda environment to install the below - there are tools required outside of Python packages, such as CUDA for GPU support in PyTorch.
pip install -r requirements.txt
And follow the instructions to install PyTorch and PyTorch Audio (torchaudio)
pip install -qq https://github.com/pyannote/pyannote-audio/archive/refs/heads/develop.zip
Note: A few of the packages required by this repo come from the dependencies of pyannote.audio
Then, follow the other installation steps
Real Time Voice Cloning is a submodule of this repo. Follow its install steps. Its scripts do the model training / fine-tuning.