Generations <10 seconds work really well. When I use longer text like:
My unique experiences create both enrichment and potential barriers. They allow me to relate more deeply to others who have similar backgrounds, fostering empathy. However, if someone's experience differs significantly from mine, I might need to consciously work on understanding their perspective. It’s a balancing act—embracing shared experiences while being open to differences. I often reassess my choices based on new information or insights that arise. This flexibility allows me to navigate complexities more effectively. For instance, if my initial decision no longer aligns with my values or goals as circumstances change, I don’t hesitate to adjust my approach. It’s about being open to the evolution of a situation and recognizing when it calls for a different response.
The voice speeds up a lot, way too fast. E.g., it produces an audio file about 20 seconds long when it should be about 60 seconds. The audio speaks the full text but crams it into about 20 seconds.
I spent 10+ hours fiddling with it but can't effect any change in speed or duration.
import SwiftUI
import F5TTS
import MLX
import AVFoundation

struct ContentView: View {
    @State private var isGenerating = false
    @State private var progressText = ""

    var body: some View {
        VStack(spacing: 20) {
            ProgressView()
            Text(progressText)
        }
        .padding()
        .task {
            await generateAudio()
        }
    }

    private func generateAudio() async {
        isGenerating = true
        progressText = "Loading model..."
        do {
            let f5tts = try await F5TTS.fromPretrained(repoId: "lucasnewman/f5-tts-mlx") { progress in
                progressText = "Loading: \(progress.completedUnitCount) of \(progress.totalUnitCount)"
            }
            progressText = "Generating audio..."
            let text = "My unique experiences create both enrichment and potential barriers. They allow me to relate more deeply to others who have similar backgrounds, fostering empathy. However, if someone's experience differs significantly from mine, I might need to consciously work on understanding their perspective. It's a balancing act—embracing shared experiences while being open to differences. I often reassess my choices based on new information or insights that arise. This flexibility allows me to navigate complexities more effectively. For instance, if my initial decision no longer aligns with my values or goals as circumstances change, I don't hesitate to adjust my approach. It's about being open to the evolution of a situation and recognizing when it calls for a different response."
            let generatedAudio = try await f5tts.generate(
                text: text,
                duration: 50.0 // Set fixed duration of 50 seconds
            )
            // Save to Documents directory
            let documentsPath = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
            let outputURL = documentsPath.appendingPathComponent("output.wav")
            // Convert MLXArray to audio buffer and save
            let samples = Array(generatedAudio.asArray(Float32.self))
            let format = AVAudioFormat(standardFormatWithSampleRate: 24000, channels: 1)!
            let file = try AVAudioFile(forWriting: outputURL, settings: format.settings)
            let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: AVAudioFrameCount(samples.count))!
            buffer.floatChannelData!.pointee.assign(from: samples, count: samples.count)
            buffer.frameLength = AVAudioFrameCount(samples.count)
            try file.write(from: buffer)
            progressText = "Saved audio to: \(outputURL.path)"
        } catch {
            progressText = "Error: \(error.localizedDescription)"
        }
        isGenerating = false
    }
}

#Preview {
    ContentView()
}
The max generation length is 30 seconds in the underlying model, so you need to decompose the sentences so you can generate individual snippets and then stitch them back together.
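A rough sketch of that split-and-stitch approach, in case it helps. The helper names `chunkSentences` and `stitch` are made up for illustration and are not part of F5TTS; the sentence splitter is deliberately naive (it breaks after `.`, `!` or `?` followed by whitespace), and the character budget per chunk is something you would have to tune so each snippet stays well under the 30-second limit.

```swift
import Foundation

/// Hypothetical helper: groups whole sentences into chunks of at most
/// `maxCharacters` characters. A single sentence longer than the budget
/// is kept intact as its own chunk.
func chunkSentences(_ text: String, maxCharacters: Int) -> [String] {
    // Naive split: end a sentence at whitespace that follows '.', '!' or '?'.
    var sentences: [String] = []
    var current = ""
    var previous: Character? = nil
    for ch in text {
        if ch.isWhitespace, let p = previous, ".!?".contains(p) {
            sentences.append(current)
            current = ""
        } else {
            current.append(ch)
        }
        previous = ch
    }
    sentences.append(current)
    sentences = sentences
        .map { $0.trimmingCharacters(in: .whitespacesAndNewlines) }
        .filter { !$0.isEmpty }

    // Greedily pack consecutive sentences into chunks within the budget.
    var chunks: [String] = []
    var chunk = ""
    for sentence in sentences {
        if chunk.isEmpty {
            chunk = sentence
        } else if chunk.count + 1 + sentence.count <= maxCharacters {
            chunk += " " + sentence
        } else {
            chunks.append(chunk)
            chunk = sentence
        }
    }
    if !chunk.isEmpty { chunks.append(chunk) }
    return chunks
}

/// Hypothetical helper: concatenates sample arrays, inserting a short
/// stretch of silence between consecutive clips.
func stitch(_ clips: [[Float]], sampleRate: Int, gapSeconds: Double) -> [Float] {
    let gap = [Float](repeating: 0, count: Int(Double(sampleRate) * gapSeconds))
    var output: [Float] = []
    for (index, clip) in clips.enumerated() {
        if index > 0 { output += gap }
        output += clip
    }
    return output
}
```

In the code from the issue, you would loop over the chunks, call `f5tts.generate` once per chunk, convert each result with `asArray(Float32.self)` as the original code already does, and pass the resulting arrays to `stitch` with `sampleRate: 24000` before writing the file. Whether to pass a `duration` per chunk or leave it out is something to experiment with; I have not verified either option here.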