Make it compatible as a voice assistant #64
Comments
Both work (now that I restarted). However, the Atom Echo still doesn't work. I'm guessing it has something to do with the TTS taking longer than most other options and timing out, so probably not an issue with your integration. I can't find anything in the Echo's code about timeouts, though.
If you take a look at the debug logs, this actually has to do with the output format that ESPHome requests from the ElevenLabs integration, which the integration doesn't seem to support: ESPHome wants a raw output format (it literally requests "raw"), but the ElevenLabs integration here can only output MP3.
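For context, a rough sketch of the mismatch (`ATTR_AUDIO_OUTPUT` is the same constant the integration code below uses; its value in Home Assistant's `tts` component is `"audio_output"`, and the exact option payload here is an assumption for illustration):

```python
# What an ESPHome voice assistant pipeline effectively asks a TTS
# platform for (assumed shape, illustration only):
options = {"audio_output": "raw"}  # raw PCM samples, not a container format

# The integration ignored this and always returned ("mp3", mp3_bytes),
# which the Atom Echo's playback pipeline can't handle.
```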
Okay, so after doing some testing, it looks like this could work:

`manifest.json`:

```json
...
"requirements": ["pydub==0.25.1"],
...
```

`elevenlabs.py`:

```python
...
from pydub import AudioSegment
import io
...
```
```python
async def get_tts_audio(
    self, message: str, options: dict | None = None
) -> tuple[str, bytes]:
    """Get text-to-speech audio for the given message."""
    tts_options = await self.get_tts_options(options)
    voice_id, stability, similarity, model, optimize_latency, api_key = tts_options[
        :6
    ]
    endpoint = f"text-to-speech/{voice_id}"
    data = {
        "text": message,
        "model_id": model,
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity,
        },
    }
    if model.endswith("_v2"):
        style, use_speaker_boost = tts_options[6:]
        data["voice_settings"]["style"] = style
        data["voice_settings"]["use_speaker_boost"] = use_speaker_boost
    params = {"optimize_streaming_latency": optimize_latency}
    _LOGGER.debug("Requesting TTS from %s", endpoint)
    _LOGGER.debug("Request data: %s", data)
    _LOGGER.debug("Request params: %s", params)
    resp = await self.post(endpoint, data, params, api_key=api_key)
    out_mp3 = resp.content
    # options may be None here, so guard the lookup
    if (options or {}).get(ATTR_AUDIO_OUTPUT, "mp3") == "raw":
        mp3_io = io.BytesIO(out_mp3)
        audio = AudioSegment.from_file(mp3_io, format="mp3")
        # Decode the MP3 to 16 kHz, 16-bit, mono PCM for ESPHome
        return (
            "raw",
            audio.set_frame_rate(16000)
            .set_sample_width(2)
            .set_channels(1)
            .raw_data,
        )
        # - OR - if the audio is choppy, try 32 kHz, 8-bit, mono instead:
        # return (
        #     "raw",
        #     audio.set_frame_rate(32000)
        #     .set_sample_width(1)
        #     .set_channels(1)
        #     .raw_data,
        # )
    return "mp3", out_mp3
```
```python
async def get_tts_options(
    self, options: dict | None
) -> tuple[str, float, float, str, int, str] | tuple[
    str, float, float, str, int, str, float, bool
]:
    """Get the text-to-speech options for generating TTS audio."""
    # If options is None, use an empty dictionary instead
    if not options:
        options = {}
    output_opts = options.get(ATTR_AUDIO_OUTPUT, "mp3")
    if output_opts not in ["mp3", "raw"]:
        raise ValueError("Only MP3 or raw output is supported: " + output_opts)
    # Get the voice from options, or fall back to the configured default voice
    voice_opt = (
        options.get(ATTR_VOICE)
        or self.config_entry.options.get(ATTR_VOICE)
        or DEFAULT_VOICE
    )
    # Get the stability, similarity, model, and optimize latency from options,
    # or fall back to the configured default values
    stability = (
        options.get(CONF_STABILITY)
        or self.config_entry.options.get(CONF_STABILITY)
        or DEFAULT_STABILITY
    )
    similarity = (
        options.get(CONF_SIMILARITY)
        or self.config_entry.options.get(CONF_SIMILARITY)
        or DEFAULT_SIMILARITY
    )
    model = (
        options.get(CONF_MODEL)
        or self.config_entry.options.get(CONF_MODEL)
        or DEFAULT_MODEL
    )
    optimize_latency = (
        options.get(CONF_OPTIMIZE_LATENCY)
        or self.config_entry.options.get(CONF_OPTIMIZE_LATENCY)
        or DEFAULT_OPTIMIZE_LATENCY
    )
    api_key = (
        options.get(CONF_API_KEY)
        or self.config_entry.options.get(CONF_API_KEY)
        or self._api_key
    )
    # Convert optimize_latency to an integer
    optimize_latency = int(optimize_latency)
    # Get the voice ID by name from the TTS service
    voice = await self.get_voice_by_name_or_id(voice_opt)
    voice_id = voice.get("voice_id", None)
    # If voice_id is not found, refresh the list of voices and try again
    if not voice_id:
        _LOGGER.debug("Could not find voice, refreshing voices")
        await self.get_voices()
        voice = await self.get_voice_by_name_or_id(voice_opt)
        voice_id = voice.get("voice_id", None)
        # If voice_id is still not found, log a warning
        # and use the first available voice
        if not voice_id:
            _LOGGER.warning(
                "Could not find voice with name %s, available voices: %s",
                voice_opt,
                [v["name"] for v in self._voices],
            )
            voice_id = self._voices[0]["voice_id"]
    if model.endswith("_v2"):
        style = (
            options.get(CONF_STYLE)
            or self.config_entry.options.get(CONF_STYLE)
            or DEFAULT_STYLE
        )
        use_speaker_boost = (
            options.get(CONF_USE_SPEAKER_BOOST)
            or self.config_entry.options.get(CONF_USE_SPEAKER_BOOST)
            or DEFAULT_USE_SPEAKER_BOOST
        )
        return (
            voice_id,
            stability,
            similarity,
            model,
            optimize_latency,
            api_key,
            style,
            use_speaker_boost,
        )
    return (
        voice_id,
        stability,
        similarity,
        model,
        optimize_latency,
        api_key,
    )
```

I also fixed a bug where the only recognized "v2" model was the multilingual one, despite the fact that there's now a turbo v2 model as well.
I can confirm this has fixed the issue!
I also noticed that the tail end of "raw" output often gets cut off, so I added a line that appends another second and a half of silence to the end of the audio before the conversion. That works with the 32 kHz variant, or the 16 kHz version with a sample width of 2, if that's your preference.
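A minimal sketch of that change, assuming pydub's `AudioSegment.silent` (duration is in milliseconds; the `pad_tail` helper name is mine, not from the integration):

```python
from pydub import AudioSegment

def pad_tail(audio: AudioSegment, ms: int = 1500) -> AudioSegment:
    """Append silence so the device doesn't clip the end of the clip."""
    return audio + AudioSegment.silent(duration=ms, frame_rate=audio.frame_rate)

# Inside get_tts_audio(), just before converting to raw PCM:
#     audio = pad_tail(audio)
```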
@egg82 thanks for figuring this out! I'm unfortunately on vacation for a few weeks and not able to work on this much right now, but if you'd like to make a PR, that'll make it a lot easier for me to review the changes.
Of course! These are just sort of my notes as I was trying to figure out how to fix this issue, and I wouldn't expect either of the changes I made to be particularly robust. For example, I assume that anything requesting raw output also requires the specific sample rate and bit depth provided, which may not always be the case; maybe there's some more info in the request parameters that could help? The silence delay should also be configurable, for sure (a sketch of one way to do that follows below). More broadly, making assumptions about the audio output based purely on the data type isn't a great way to do it, but it works fine for the specific thing I need, and at least it resolved this issue. Unfortunately I'm not very familiar with how HA integrations work; I made educated guesses for these fixes that turned out to be correct.
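Along those lines, one possible shape for making the silence delay configurable, mirroring the options → config entry → default fallback chain that `get_tts_options` already uses (the `CONF_SILENCE_PADDING_MS` key, its default, and the helper are hypothetical, not part of the integration):

```python
CONF_SILENCE_PADDING_MS = "silence_padding_ms"  # hypothetical option key
DEFAULT_SILENCE_PADDING_MS = 1500

def resolve_padding_ms(options: dict, entry_options: dict) -> int:
    """Resolve trailing-silence padding with the integration's usual
    options -> config-entry -> default fallback order."""
    return int(
        options.get(CONF_SILENCE_PADDING_MS)
        or entry_options.get(CONF_SILENCE_PADDING_MS)
        or DEFAULT_SILENCE_PADDING_MS
    )
```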
Related issues, specifically for the "end being cut off" and "choppy audio" portions: essentially, the logs for ESPHome devices start screaming the same two warning lines over and over. The related code is here:

```cpp
static const size_t RECEIVE_SIZE = 1024;
static const size_t SPEAKER_BUFFER_SIZE = 16 * RECEIVE_SIZE;
```

where the receive buffer is 1 KB and the speaker buffer is 16 KB, both hardcoded, despite the fact that ESP32 devices have 520 KB of SRAM available to them (with some caveats on how much can be allocated to what). This isn't something the elevenlabs_tts integration can fix, but it's good to track these as the source of the problem.
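To put those numbers in perspective, a quick back-of-the-envelope check (assuming the 16 kHz, 16-bit, mono PCM format from earlier) shows how little audio the 16 KB speaker buffer actually holds:

```python
BYTES_PER_SECOND = 16000 * 2 * 1       # sample_rate * sample_width * channels
SPEAKER_BUFFER_SIZE = 16 * 1024        # the hardcoded 16 KB buffer
print(SPEAKER_BUFFER_SIZE / BYTES_PER_SECOND)  # ~0.5 s of audio headroom
```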
This has been fixed in Home Assistant and this issue can be closed.
Thanks @balloob!
I wasn't 100% sure if this was fixed yet - glad it is! Thanks for the work in making this integration better.
Is your feature request related to a problem? Please describe.
Even though the voice assistant section in Home Assistant allows you to choose ElevenLabs as the TTS option, it doesn't seem to work (at least on an M5Stack Atom Echo).
Describe the solution you'd like
Try and make it work as a voice assistant.
Describe alternatives you've considered
None
Additional context
None