Audio speaks way too fast on long text generations #3

Open
pbosh opened this issue Nov 12, 2024 · 2 comments
pbosh commented Nov 12, 2024

Generations <10 seconds work really well. When I use longer text like:
My unique experiences create both enrichment and potential barriers. They allow me to relate more deeply to others who have similar backgrounds, fostering empathy. However, if someone's experience differs significantly from mine, I might need to consciously work on understanding their perspective. It’s a balancing act—embracing shared experiences while being open to differences. I often reassess my choices based on new information or insights that arise. This flexibility allows me to navigate complexities more effectively. For instance, if my initial decision no longer aligns with my values or goals as circumstances change, I don’t hesitate to adjust my approach. It’s about being open to the evolution of a situation and recognizing when it calls for a different response.

The voice speeds up a lot, way too fast. For example, it produces an audio file about 20 seconds long when it should be about 60: the full text is spoken, but crammed into roughly 20 seconds.

I spent 10+ hours fiddling with it but can't effect any change in speed or duration.


import SwiftUI
import F5TTS
import MLX
import AVFoundation

struct ContentView: View {
    @State private var isGenerating = false
    @State private var progressText = ""

    var body: some View {
        VStack(spacing: 20) {
            ProgressView()
            Text(progressText)
        }
        .padding()
        .task {
            await generateAudio()
        }
    }

    private func generateAudio() async {
        isGenerating = true
        progressText = "Loading model..."

        do {
            let f5tts = try await F5TTS.fromPretrained(repoId: "lucasnewman/f5-tts-mlx") { progress in
                progressText = "Loading: \(progress.completedUnitCount) of \(progress.totalUnitCount)"
            }

            progressText = "Generating audio..."
            let text = "My unique experiences create both enrichment and potential barriers. They allow me to relate more deeply to others who have similar backgrounds, fostering empathy. However, if someone's experience differs significantly from mine, I might need to consciously work on understanding their perspective. It's a balancing act—embracing shared experiences while being open to differences. I often reassess my choices based on new information or insights that arise. This flexibility allows me to navigate complexities more effectively. For instance, if my initial decision no longer aligns with my values or goals as circumstances change, I don't hesitate to adjust my approach. It's about being open to the evolution of a situation and recognizing when it calls for a different response."
            let generatedAudio = try await f5tts.generate(
                text: text,
                duration: 50.0  // Set fixed duration of 50 seconds
            )

            // Save to Documents directory
            let documentsPath = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
            let outputURL = documentsPath.appendingPathComponent("output.wav")

            // Convert MLXArray to audio buffer and save
            let samples = Array(generatedAudio.asArray(Float32.self))
            let format = AVAudioFormat(standardFormatWithSampleRate: 24000, channels: 1)!
            let file = try AVAudioFile(forWriting: outputURL, settings: format.settings)
            let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: AVAudioFrameCount(samples.count))!

            // update(from:count:) replaces the deprecated assign(from:count:)
            buffer.floatChannelData!.pointee.update(from: samples, count: samples.count)
            buffer.frameLength = AVAudioFrameCount(samples.count)

            try file.write(from: buffer)

            progressText = "Saved audio to: \(outputURL.path)"
        } catch {
            progressText = "Error: \(error.localizedDescription)"
        }

        isGenerating = false
    }
}

#Preview {
    ContentView()
}

@lucasnewman
Owner

The max generation length is 30 seconds in the underlying model, so you need to decompose the sentences so you can generate individual snippets and then stitch them back together.
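A minimal sketch of that chunk-and-stitch approach, assuming the `F5TTS.generate` API shown above can be called without a `duration` argument so the model infers length per chunk; the period-based splitter and the 300-character threshold are rough illustrative guesses, not part of the library:

```swift
import Foundation
import MLX
import F5TTS

// Group sentences into chunks short enough to stay under the
// model's ~30-second generation cap (threshold is a rough guess).
func chunked(_ text: String, maxChars: Int = 300) -> [String] {
    let sentences = text
        .components(separatedBy: ". ")
        .map { $0.hasSuffix(".") ? $0 : $0 + "." }
    var chunks: [String] = []
    var current = ""
    for sentence in sentences {
        if current.count + sentence.count > maxChars, !current.isEmpty {
            chunks.append(current)
            current = ""
        }
        current += (current.isEmpty ? "" : " ") + sentence
    }
    if !current.isEmpty { chunks.append(current) }
    return chunks
}

// Generate each chunk separately and stitch the audio back together.
func generateLongForm(_ f5tts: F5TTS, text: String) async throws -> MLXArray {
    var pieces: [MLXArray] = []
    for chunk in chunked(text) {
        let audio = try await f5tts.generate(text: chunk)
        pieces.append(audio)
    }
    // Concatenate the 1-D sample arrays end to end.
    return concatenated(pieces, axis: 0)
}
```

The concatenated result can then be written out with the same `AVAudioFile` code as in the original snippet. Inserting a short silence between chunks may sound more natural at sentence boundaries.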

@drewocarr

> The max generation length is 30 seconds in the underlying model, so you need to decompose the sentences so you can generate individual snippets and then stitch them back together.

This took me days to figure out lol
