Audio speaks way too fast on long text generations #3

Open
pbosh opened this issue Nov 12, 2024 · 2 comments
pbosh commented Nov 12, 2024

Generations <10 seconds work really well. When I use longer text like:
My unique experiences create both enrichment and potential barriers. They allow me to relate more deeply to others who have similar backgrounds, fostering empathy. However, if someone's experience differs significantly from mine, I might need to consciously work on understanding their perspective. It’s a balancing act—embracing shared experiences while being open to differences. I often reassess my choices based on new information or insights that arise. This flexibility allows me to navigate complexities more effectively. For instance, if my initial decision no longer aligns with my values or goals as circumstances change, I don’t hesitate to adjust my approach. It’s about being open to the evolution of a situation and recognizing when it calls for a different response.

The voice speeds up a lot, way too fast. For example, it produces an audio file about 20 seconds long when it should be about 60: the full text is spoken, but crammed into roughly 20 seconds.

I spent 10+ hours fiddling with it but can't effect any change in speed or duration.


import SwiftUI
import F5TTS
import MLX
import AVFoundation

struct ContentView: View {
    @State private var isGenerating = false
    @State private var progressText = ""

    var body: some View {
        VStack(spacing: 20) {
            ProgressView()
            Text(progressText)
        }
        .padding()
        .task {
            await generateAudio()
        }
    }

    private func generateAudio() async {
        isGenerating = true
        progressText = "Loading model..."

        do {
            let f5tts = try await F5TTS.fromPretrained(repoId: "lucasnewman/f5-tts-mlx") { progress in
                progressText = "Loading: \(progress.completedUnitCount) of \(progress.totalUnitCount)"
            }

            progressText = "Generating audio..."
            let text = "My unique experiences create both enrichment and potential barriers. They allow me to relate more deeply to others who have similar backgrounds, fostering empathy. However, if someone's experience differs significantly from mine, I might need to consciously work on understanding their perspective. It's a balancing act—embracing shared experiences while being open to differences. I often reassess my choices based on new information or insights that arise. This flexibility allows me to navigate complexities more effectively. For instance, if my initial decision no longer aligns with my values or goals as circumstances change, I don't hesitate to adjust my approach. It's about being open to the evolution of a situation and recognizing when it calls for a different response."
            let generatedAudio = try await f5tts.generate(
                text: text,
                duration: 50.0  // Set fixed duration of 50 seconds
            )

            // Save to Documents directory
            let documentsPath = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
            let outputURL = documentsPath.appendingPathComponent("output.wav")

            // Convert MLXArray to audio buffer and save
            let samples = Array(generatedAudio.asArray(Float32.self))
            let format = AVAudioFormat(standardFormatWithSampleRate: 24000, channels: 1)!
            let file = try AVAudioFile(forWriting: outputURL, settings: format.settings)
            let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: AVAudioFrameCount(samples.count))!

            // update(from:count:) replaces the deprecated assign(from:count:)
            buffer.floatChannelData!.pointee.update(from: samples, count: samples.count)
            buffer.frameLength = AVAudioFrameCount(samples.count)

            try file.write(from: buffer)

            progressText = "Saved audio to: \(outputURL.path)"
        } catch {
            progressText = "Error: \(error.localizedDescription)"
        }

        isGenerating = false
    }
}

#Preview {
    ContentView()
}

@lucasnewman
Owner

The max generation length is 30 seconds in the underlying model, so you need to decompose the sentences so you can generate individual snippets and then stitch them back together.
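A minimal sketch of that chunk-and-stitch approach, assuming the `F5TTS.generate` API shown above can be called without a `duration` argument so the model infers length per chunk; the period-based splitter and the 300-character threshold are rough illustrative guesses, not part of the library:

```swift
import Foundation
import MLX
import F5TTS

// Group sentences into chunks short enough to stay under the
// model's ~30-second generation cap (threshold is a rough guess).
func chunked(_ text: String, maxChars: Int = 300) -> [String] {
    let sentences = text
        .components(separatedBy: ". ")
        .map { $0.hasSuffix(".") ? $0 : $0 + "." }
    var chunks: [String] = []
    var current = ""
    for sentence in sentences {
        if current.count + sentence.count > maxChars, !current.isEmpty {
            chunks.append(current)
            current = ""
        }
        current += (current.isEmpty ? "" : " ") + sentence
    }
    if !current.isEmpty { chunks.append(current) }
    return chunks
}

// Generate each chunk separately and stitch the audio back together.
func generateLongForm(_ f5tts: F5TTS, text: String) async throws -> MLXArray {
    var pieces: [MLXArray] = []
    for chunk in chunked(text) {
        let audio = try await f5tts.generate(text: chunk)
        pieces.append(audio)
    }
    // Concatenate the 1-D sample arrays end to end.
    return concatenated(pieces, axis: 0)
}
```

The concatenated result can then be written out with the same `AVAudioFile` code as in the original snippet. Inserting a short silence between chunks may sound more natural at sentence boundaries.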

@drewocarr

> The max generation length is 30 seconds in the underlying model, so you need to decompose the sentences so you can generate individual snippets and then stitch them back together.

This took me days to figure out lol
