-
I was able to answer some of the questions myself: on my machine, any Max Samples value higher than 8 leads to horribly slow training, so I guess going higher than that is not advisable. I trained for 320 epochs, but unfortunately the resulting model doesn't perform any better than the base model in my opinion: no discernible difference in the generated audio files.
-
I've been getting great results with F5-TTS, but my first attempt at fine-tuning trained from scratch instead of using the pretrained model—the output started as noise and is only slowly becoming speech.
How do I correctly fine-tune instead of starting from scratch?
Do I need to set "Tokenizer File" and "Path to Pretrained Checkpoint" manually? If so, what should I put?
Does "Download corresponding dataset first, and fill in the path in scripts" (repo link) refer to this?
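To make the distinction concrete, here is a generic PyTorch sketch (not F5-TTS's actual API; the model and file names are placeholders): fine-tuning simply means the pretrained weights are loaded into the model before the training loop starts, whereas from scratch the weights keep their random initialization.

```python
# Generic PyTorch sketch -- NOT F5-TTS's actual API; the tiny model and
# the checkpoint layout are placeholders for illustration only.
import torch
import torch.nn as nn

model = nn.Linear(4, 4)            # stand-in for the TTS model

# From scratch: weights stay at their random initialization,
# which is why the output starts out as pure noise.
scratch_weights = model.weight.detach().clone()

# Fine-tuning: load the pretrained checkpoint into the model first.
ckpt = {"model_state_dict": {k: torch.ones_like(v)
                             for k, v in model.state_dict().items()}}
torch.save(ckpt, "pretrained.pt")  # stand-in for the downloaded checkpoint
state = torch.load("pretrained.pt")
model.load_state_dict(state["model_state_dict"])
```

If the checkpoint-loading step is skipped, the symptom is exactly what I'm seeing: output that starts as noise and only slowly becomes speech.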
Project Details:
I'm working on generating voices for characters from an old game. I have 10 to 60 minutes of clean audio samples per character. Language: English.
Hardware:
GPU: Nvidia 4080 Laptop (12GB VRAM)
I'm looking for advice on the best values to set for finetuning, given my hardware. Here’s what I’ve gathered so far, but I’d love some expert input:
Parameter Questions
Batch Size per GPU: I assume 6400 should work with 12GB VRAM, but would appreciate confirmation.
Max Samples: Not sure, but I read that 2 might be fine (reference).
Gradient Accumulation Steps & Max Gradient Norm: No idea—should I just leave them at 1?
Epochs: How many would be reasonable for my dataset size?
Warmup Updates: Not sure what value is appropriate.
Save per Updates: I assume a higher value is better, since frequent checkpoint saving would slow down training. Is that right?
Last per Updates: Not sure what value to use here either.
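To sanity-check how these numbers interact, here is some back-of-envelope arithmetic. It assumes the batch size is counted in mel-spectrogram frames, and that 24 kHz audio with hop length 256 gives about 93.75 frames per second (both are assumptions on my part, so check them against your config):

```python
# Back-of-envelope arithmetic, assuming Batch Size per GPU counts mel
# frames (~93.75 frames/sec at 24 kHz with hop 256 -- an assumption)
# and that one weight update happens every grad_accum batches.
def updates_per_epoch(dataset_seconds, frames_per_sec=93.75,
                      batch_frames=6400, grad_accum=1):
    total_frames = dataset_seconds * frames_per_sec
    batches = total_frames / batch_frames        # batches per epoch
    return batches / grad_accum                  # weight updates per epoch

# e.g. 30 minutes of audio per character:
print(round(updates_per_epoch(30 * 60)))         # ~26 updates per epoch
```

If that estimate is roughly right, a small per-character dataset goes through very few updates per epoch, which would explain why people train for hundreds of epochs.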
Other Options:
Use 8-bit Adam optimizer – Should I enable this?
Mixed Precision – Any recommendations based on my GPU?
Logger – Not sure what’s best here.
Finetuning Duration
How long should I expect finetuning to take per character? Just so I can compare against my actual training times and check if my machine is underperforming due to driver or config issues.
Any guidance would be highly appreciated! 🚀
Thanks in advance!