Keep a model loaded to reduce subsequent generation time #185
-
Hello, I am pretty new to programming in general, so this question might sound a bit silly. I am making subsequent generation calls (`tts_to_file`), and each call takes around 14 seconds. I believe the model is loaded each time a generation call is made; is there a way to keep the model loaded at all times? For example, I am using this to test.
Replies: 2 comments 6 replies
-
This is exactly what the `tts = TTS(...)` line does. Using:

```python
import logging

from TTS.api import TTS

logging.basicConfig(level=logging.INFO)

# Load the TTS model once; it stays in memory for all later calls
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

def test_function(text):
    tts.tts_to_file(
        text=text,
        file_path="tts_output.wav",
        speaker_wav="reference.wav",
        language="en",
    )

test_function('test 1')
test_function('test 2')
test_function('test 2')
test_function('test 3')
```

I get the following output on an RTX 3090:

```
INFO:TTS.utils.manage:tts_models/multilingual/multi-dataset/xtts_v2 is already downloaded.
INFO:TTS.tts.models:Using model: xtts
INFO:TTS.utils.synthesizer:Text split into sentences.
INFO:TTS.utils.synthesizer:Input: ['test 1']
INFO:TTS.utils.synthesizer:Processing time: 1.834
INFO:TTS.utils.synthesizer:Real-time factor: 1.012
INFO:TTS.utils.synthesizer:Text split into sentences.
INFO:TTS.utils.synthesizer:Input: ['test 2']
INFO:TTS.utils.synthesizer:Processing time: 0.342
INFO:TTS.utils.synthesizer:Real-time factor: 0.252
INFO:TTS.utils.synthesizer:Text split into sentences.
INFO:TTS.utils.synthesizer:Input: ['test 2']
INFO:TTS.utils.synthesizer:Processing time: 0.391
INFO:TTS.utils.synthesizer:Real-time factor: 0.251
INFO:TTS.utils.synthesizer:Text split into sentences.
INFO:TTS.utils.synthesizer:Input: ['test 3']
INFO:TTS.utils.synthesizer:Processing time: 0.349
INFO:TTS.utils.synthesizer:Real-time factor: 0.248
```

Only the first call pays a warm-up cost; the later ones finish in well under a second. You could save a little more time by precomputing the speaker embedding and reusing it (see the docs). But if every call takes 14 seconds for you, the issue must come from somewhere else; even on a CPU it's faster than that.
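Precomputing the speaker embedding is essentially a caching pattern: do the expensive reference-audio processing once and reuse the result across calls. Here is a minimal, model-free sketch of the idea; `compute_speaker_embedding` is a hypothetical stand-in for the real work (with XTTS, that work is the model's `get_conditioning_latents()` on the reference WAV):

```python
from functools import lru_cache

def compute_speaker_embedding(reference_wav: str) -> tuple:
    # Stand-in for the expensive step: in a real script this would load the
    # reference audio and run the speaker encoder. Hypothetical placeholder.
    return (reference_wav, "embedding")

@lru_cache(maxsize=8)
def cached_speaker_embedding(reference_wav: str) -> tuple:
    # Only the first call per reference file does the expensive computation;
    # repeats return the cached result immediately.
    return compute_speaker_embedding(reference_wav)

emb1 = cached_speaker_embedding("reference.wav")
emb2 = cached_speaker_embedding("reference.wav")
assert emb1 is emb2  # same cached object, no recomputation
```

In a real script you would keep the precomputed latents in a variable and pass them into the model's inference call instead of passing the `speaker_wav` path every time.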
-
I had used much longer test texts (around 100 words each) on an Nvidia T4 (16 GB) with CUDA enabled, and all calls took about the same time (~14 seconds). That's why I was curious. I see, that's all we can do :(