
Fine tuning a Speech cloning model with a "Gold Standard" 🥇 Speech Dataset

This repo contains a workflow for processing speech & text data into a Gold Standard training dataset for Speech Cloning / Speech Mimicry. I've used the workflow to fine-tune a Speech Cloning system using RTVC, an implementation of Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS).


With high quality data, the model fine-tuning process gives higher quality synthetic speech output and shortens training time.

What is 'Gold Standard' Speech data?

Accurately transcribed words, time-aligned and labelled by speaker name, in a high quality recording. These are hard to come by because labelling audio with text is a costly human process. Whisper AI uses a combination of gold standard data, then later silver standard data, to help its generalisation performance.
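As a concrete illustration, a single gold standard utterance record could look something like the following. The field names here are hypothetical, chosen for the example - they are not the schema this repo's workflow actually produces.

```python
# Illustrative only: a hypothetical record for one "gold standard" utterance.
# Field names are examples, not the schema produced by this repo's workflow.
gold_utterance = {
    "speaker": "Speaker-A",
    "audio_file": "episode-01.wav",
    "start_s": 132.48,             # utterance start time within the recording (seconds)
    "end_s": 134.20,               # utterance end time (seconds)
    "transcript": "FIRST HE TRIED SMART PILLS",
    "word_timestamps": [           # per-word alignment, verified by a human
        ("FIRST", 132.48, 132.79),
        ("HE",    132.79, 132.95),
        ("TRIED", 132.95, 133.30),
        ("SMART", 133.30, 133.71),
        ("PILLS", 133.71, 134.20),
    ],
}
```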

But why bother? You could skip the whole pre-processing stage - there are open source tools available to transcribe audio.

  • The tools are definitely faster, but not as accurate. Watch a YouTube video with auto-generated subtitles to see the difference.
  • "Quality in, Quality out". Enevitably there are losses each time you try to represent a dataset with a model. Using a model (speeech-to-text) to synthesize training data compounds this effect.

The datasets I used were from BBC podcasts. The BBC has been in the business of high quality audio for over 100 years. Their sound engineers are highly experienced, and the BBC has always provided great accessibility - so they have great experience making transcriptions.

Fine tuning a pre-trained Voice-Cloning model

The system I used is made from 3 pre-trained models:

  1. Speaker Encoder (GE2E, green) - Encodes audio waveforms to a "Speech d-Vector" embedding, by maximising the cosine similarity between samples of the same speaker.
  2. Voice Synthesizer (Tacotron, orange, 30.87M parameters) - A generative, attentional, sequence-to-sequence model that converts text to Mel-spectrograms.
  3. Vocoder (WaveRNN, blue, 4.481M parameters) - A sequence-to-sequence neural network that converts Mel-spectrograms to audio waveforms.

Model architecture

They're assembled into a single model that allows one-shot training ("Cloning") of a reference voice. (described in this paper, Implementation used)
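To make the assembly concrete, here is a minimal sketch of how the three pre-trained components chain together at inference time. It follows the layout of the RTVC submodule's demo scripts, but treat it as a sketch: exact module names and signatures can differ between versions, and the checkpoint paths are placeholders.

```python
# Sketch of the inference chain, based on the RTVC submodule's demo scripts.
# Module/function names may differ by version; checkpoint paths are placeholders.
from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

encoder.load_model("encoder/saved_models/pretrained.pt")             # 1. Speaker Encoder (GE2E)
synthesizer = Synthesizer("synthesizer/saved_models/pretrained.pt")  # 2. Synthesizer (Tacotron)
vocoder.load_model("vocoder/saved_models/pretrained.pt")             # 3. Vocoder (WaveRNN)

# "Clone" a reference voice: a single short utterance conditions the model.
reference_wav = encoder.preprocess_wav("Speaker-A-21-utterance2.wav")
embedding = encoder.embed_utterance(reference_wav)                   # the speaker d-vector

# Text + speaker embedding -> Mel-spectrogram -> waveform.
specs = synthesizer.synthesize_spectrograms(["boost brain power"], [embedding])
waveform = vocoder.infer_waveform(specs[0])
```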

The Speaker Encoder must be trained first, because the embeddings it creates are part of the Voice Synthesizer's training input: every waveform-text pair the Synthesizer trains on is conditioned on a speaker embedding.

Superior performance on a specific task, with minimal additional training time.

The system is designed to imitate any speaker's voice. It can do this because of the hundreds of speakers it's pre-trained on, and the architecture of the Synthesizer + Speaker Encoder combo.

A gold standard dataset, processed with the below workflow, is used to run a relatively small number of additional training iterations on data from only one speaker. The system doesn't lose its ability to clone any new speaker, but the quality of the cloned voice for the one speaker in the fine-tuning dataset improves greatly.
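Conceptually, fine-tuning just resumes the Synthesizer's training from its pre-trained checkpoint, on batches drawn only from the single speaker's gold standard data. The real training loop lives in the RTVC submodule's scripts; the toy sketch below (with a stand-in model and random tensors in place of real batches) only illustrates the "load pre-trained weights, then run a small number of extra iterations" idea.

```python
import torch
import torch.nn as nn

# Toy illustration of "fine-tuning = resume training on one speaker's data".
# ToySynthesizer stands in for the real Tacotron; the random tensors stand in
# for gold-standard (speaker-embedding, mel-frame) batches.
class ToySynthesizer(nn.Module):
    def __init__(self, embed_dim=256, mel_dim=80):
        super().__init__()
        self.proj = nn.Linear(embed_dim, mel_dim)

    def forward(self, speaker_embedding):
        return self.proj(speaker_embedding)

model = ToySynthesizer()
# In the real workflow you would load the 295k-iteration pre-trained checkpoint here,
# e.g. model.load_state_dict(torch.load("pretrained.pt")["model_state"])

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for step in range(100):                        # ~19k extra iterations in the real run
    speaker_embedding = torch.randn(16, 256)   # frozen Speaker Encoder output (placeholder)
    target_mel = torch.randn(16, 80)           # gold-standard mel frames (placeholder)
    loss = nn.functional.l1_loss(model(speaker_embedding), target_mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```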

The results

Training the system from scratch would take around 15 days for the Synthesizer alone. Fine-tuning on 22 minutes of single-speaker audio data produces the synthesised voice below. Only the Synthesizer needs fine-tuning, though the Vocoder could also be fine-tuned afterwards.

Ground truth audio;

Speaker-A-21-utterance2.mp4

Transcript;

BOOST|BRAIN|POWER|FIRST|HE|TRIED|SMART|PILLS|THEN|HE|MOVED|ON|TO|ELECTRICAL|BRAIN|STIMULATION|FOR|COGNITIVE|ENHANCEMENT

Pre-trained model, trained for 295,000 iterations. Conditioned with audio from the same speaker;

synthetic-pretrained-AIM-1418-Speaker-A-21-utterance2.mp4

🔉 At least the pre-trained model has imitated a voice that is female! But it's not fooling anyone...

Fine-tuned model, trained for a further 19,000 iterations. Conditioned with audio from the same speaker;

synthetic-314-AIM-1418-Speaker-A-21-utterance2.mp4

🔉 In contrast to the pre-trained model, the fine-tuned model has done an amazing job - with only about 36 hours of training on an NVIDIA GeForce RTX 3050 Ti Laptop GPU!!

🔉 Super interesting nuance here: the second half of the audio is spoken more quickly and flows naturally, while the first half is 'robotic', with words spoken independently of each other.

🔉 Listening to the original audio, there are moments when the speaker audibly breathes in, and the sound engineer has not cropped this out. In the fine-tuned model an audible breath occurs before "THEN HE".

🔉 I wonder whether this is a cue for recalling (memorisation of) this exact training example? Or, if it's not simply overfitting - perhaps it's something more impressive: the RNN has learnt that a deep breath in precedes a segment of more rapid speech?

Installation

Set up a conda environment to install the below - there are tools required beyond Python packages, such as CUDA for GPU support in PyTorch.

Command line tools for data preprocessing - this repo

pip install -r requirements.txt

Then follow the instructions to install PyTorch and PyTorch Audio.

Diarization tool: pyannote-audio

pip install -qq https://github.com/pyannote/pyannote-audio/archive/refs/heads/develop.zip

Note: a few of the packages required by this repo come from the dependencies of pyannote.audio

Then, follow the other installation steps
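Once installed, a minimal diarization run looks roughly like the sketch below. Depending on the pyannote.audio version, loading the pretrained pipeline may require a Hugging Face access token and the pipeline name may differ; the audio filename is a placeholder.

```python
from pyannote.audio import Pipeline

# Minimal diarization sketch. The pipeline name and any required Hugging Face
# access token depend on the installed pyannote.audio version; "episode-01.wav"
# is a placeholder file.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("episode-01.wav")

# Each track is (segment, track-id, speaker-label): the "who spoke when" info
# used to cut podcast audio into single-speaker utterances.
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {segment.start:.2f}s -> {segment.end:.2f}s")
```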

Speech Cloning repo

Real Time Voice Cloning is a submodule of this repo. Follow its install steps. Its scripts do the model training / fine-tuning.
