Hardware power for synthesizing speech #83
-
Hi everyone! First, a huge thanks to the developers behind this amazing framework and to the community for all the helpful discussions both here and in the main branch; they've really helped me dive into the TTS world.

If this is the right thread, I have a question related to my current project: I'm training a VITS model to generate speech for an LLM that will be integrated into a robot. While I can rely on cloud services like OpenAI's API for the LLM, I believe the speech synthesis needs to run locally, both because of latency requirements and because I want to use my own model. I'm aiming for real-time synthesis, or at least minimal latency.

My question is: how powerful does the robot's hardware need to be? A Raspberry Pi 5 seems a bit underpowered. Would a mini-PC be a better fit? Is CUDA acceleration essential for this task? I tested my current model (~370k steps; I'm planning to go to ~2M) on an i9-12900K without CUDA, and `tts` generated an output file in about 6 seconds, which is acceptable for me.

Thanks in advance for your insights!
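In case it helps anyone reproduce that kind of timing, here is a minimal sketch of measuring CPU-only synthesis latency with Coqui's Python API instead of the `tts` CLI. The checkpoint and config paths are placeholders for your own files, and the loading API may differ slightly between TTS versions:

```python
import time

from TTS.api import TTS

# Load a trained VITS checkpoint on CPU only (no CUDA).
# Both paths below are placeholders for your own model files.
tts = TTS(
    model_path="path/to/checkpoint.pth",
    config_path="path/to/config.json",
).to("cpu")

# Time a single end-to-end synthesis call, as a rough latency check.
start = time.perf_counter()
tts.tts_to_file(
    text="Hello, I am the robot's voice.",
    file_path="output.wav",
)
print(f"Synthesis took {time.perf_counter() - start:.2f} s")
```

Running this with a few sentences of different lengths should give a better picture than a single measurement, since latency scales with the length of the input text.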
-
You can convert Coqui's VITS models to ONNX format and then run them with sherpa-onnx; see coqui-ai#2602 (comment). I'm not sure what the minimal hardware requirements would be in that case, though; you'd need to run some tests yourself.
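For illustration, a rough sketch of that flow is below. All file names are placeholders, the `tokens.txt`/`lexicon.txt` files have to be produced separately as described in the linked comment, and the exact config fields may vary across sherpa-onnx versions:

```python
# --- Step 1: export the trained Coqui VITS checkpoint to ONNX ---
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.models.vits import Vits

config = VitsConfig()
config.load_json("config.json")          # your training config
vits = Vits.init_from_config(config)
vits.load_checkpoint(config, "checkpoint.pth")
vits.export_onnx(output_path="model.onnx")

# --- Step 2: run the exported model with sherpa-onnx on CPU ---
import sherpa_onnx
import soundfile as sf

tts = sherpa_onnx.OfflineTts(
    sherpa_onnx.OfflineTtsConfig(
        model=sherpa_onnx.OfflineTtsModelConfig(
            vits=sherpa_onnx.OfflineTtsVitsModelConfig(
                model="model.onnx",
                tokens="tokens.txt",     # token table for the model
                lexicon="lexicon.txt",   # word-to-token mapping
            ),
            num_threads=2,               # tune for the target CPU
        )
    )
)

audio = tts.generate("Hello from the robot.")
sf.write("out.wav", audio.samples, samplerate=audio.sample_rate)
```

Timing `tts.generate()` on the target board (e.g. a Pi 5 or a mini-PC) would then answer the hardware question directly, since the ONNX runtime path is typically much lighter than running the full PyTorch stack.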