
Make it compatible as a voice assistant #64

Closed
3 tasks done
julianrinaldi opened this issue Nov 4, 2023 · 12 comments

Comments

@julianrinaldi

Checklist

  • I have filled out the template to the best of my ability.
  • This contains only one feature request (if you have multiple feature requests, open a separate issue for each one).
  • This issue is not a duplicate of a previous feature request.

Is your feature request related to a problem? Please describe.

Even though the voice assistant section in Home Assistant allows you to choose ElevenLabs as the TTS option, it doesn't seem to work (at least on an M5 Stack Echo).

Describe the solution you'd like

Try and make it work as a voice assistant.

Describe alternatives you've considered

None

Additional context

None

@carleeno
Owner

carleeno commented Nov 4, 2023

Hi, although I don't have an M5 Stack Echo to test with, it works as my voice assistant in both the browser and the phone app.

Can you try a couple of things?

  1. Go to Settings -> Voice Assistants -> edit your assistant -> Try Voice, and see if it works there.
  2. In your browser, go to your main dashboard, click the assistant icon in the top right, use the mic to speak a command, and see if it replies in the browser.

If 1 didn't work, make sure your config is correct and that you can hear it when making a tts.speak service call.
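
(If you'd rather script that check than click through the UI, here's a minimal sketch of the same tts.speak call from Python inside Home Assistant, e.g. in a custom component context. The entity ids are hypothetical placeholders for your own setup.)

# Hedged sketch: fire the tts.speak entity service from inside HA.
# "tts.elevenlabs" and "media_player.living_room" are hypothetical ids.
from homeassistant.core import HomeAssistant

async def test_elevenlabs_speak(hass: HomeAssistant) -> None:
    """Ask a media player to speak a test message via the TTS entity."""
    await hass.services.async_call(
        "tts",
        "speak",
        {
            "media_player_entity_id": "media_player.living_room",
            "message": "Testing ElevenLabs TTS",
        },
        target={"entity_id": "tts.elevenlabs"},
        blocking=True,
    )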

If 2 didn't work, restart core (Developer tools -> YAML -> Restart) and try 2 again, this happened to me the first time I tried it.

Let me know what you find.

@julianrinaldi
Author

Both work (now that I've restarted). However, the Atom Echo still doesn't work. I'm guessing it has something to do with the TTS taking longer than most other options and timing out, so it's probably not an issue with your integration. I can't find anything in the Echo's code about timeouts, though.

@egg82

egg82 commented Nov 12, 2023

If you take a look at the debug logs, this actually has to do with the output format that ESPHome requests from the ElevenLabs integration. ESPHome wants a raw output format (it literally requests "raw"), but the ElevenLabs integration here can only output MP3:

Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/components/esphome/voice_assistant.py", line 305, in _send_tts
    _extension, audio_bytes = await tts.async_get_media_source_audio(
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/tts/__init__.py", line 178, in async_get_media_source_audio
    return await manager.async_get_tts_audio(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/tts/__init__.py", line 561, in async_get_tts_audio
    await pending
  File "/usr/src/homeassistant/homeassistant/components/tts/__init__.py", line 608, in get_tts_data
    extension, data = await engine_instance.internal_async_get_tts_audio(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/tts/__init__.py", line 364, in internal_async_get_tts_audio
    return await self.async_get_tts_audio(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/config/custom_components/elevenlabs_tts/tts.py", line 88, in async_get_tts_audio
    return await self._client.get_tts_audio(message, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/config/custom_components/elevenlabs_tts/elevenlabs.py", line 123, in get_tts_audio
    tts_options = await self.get_tts_options(options)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/config/custom_components/elevenlabs_tts/elevenlabs.py", line 160, in get_tts_options
    raise ValueError("Only MP3 output is supported.")
ValueError: Only MP3 output is supported.

@egg82
Copy link

egg82 commented Nov 12, 2023

okay, so after doing some testing, it looks like this could work:

manifest.json:

...
"requirements": ["pydub==0.25.1"],
...

elevenlabs.py:

...
from pydub import AudioSegment
import io
...

    async def get_tts_audio(
        self, message: str, options: dict | None = None
    ) -> tuple[str, bytes]:
        """Get text-to-speech audio for the given message."""
        tts_options = await self.get_tts_options(options)
        voice_id, stability, similarity, model, optimize_latency, api_key = tts_options[
            :6
        ]

        endpoint = f"text-to-speech/{voice_id}"
        data = {
            "text": message,
            "model_id": model,
            "voice_settings": {
                "stability": stability,
                "similarity_boost": similarity,
            },
        }

        if model.endswith("_v2"):
            style, use_speaker_boost = tts_options[6:]
            data["voice_settings"]["style"] = style
            data["voice_settings"]["use_speaker_boost"] = use_speaker_boost

        params = {"optimize_streaming_latency": optimize_latency}
        _LOGGER.debug("Requesting TTS from %s", endpoint)
        _LOGGER.debug("Request data: %s", data)
        _LOGGER.debug("Request params: %s", params)

        resp = await self.post(endpoint, data, params, api_key=api_key)
        out_mp3 = resp.content

        if (options or {}).get(ATTR_AUDIO_OUTPUT, "mp3") == "raw":
            # Decode the MP3 from ElevenLabs and hand back the raw PCM that
            # ESPHome expects (options may be None per the signature above)
            mp3_io = io.BytesIO(out_mp3)
            audio = AudioSegment.from_file(mp3_io, format="mp3")
            return "raw", audio.set_frame_rate(16000).set_sample_width(2).set_channels(1).raw_data
            # - OR - if the audio is choppy, use this instead:
            # return "raw", audio.set_frame_rate(32000).set_sample_width(1).set_channels(1).raw_data
        else:
            return "mp3", out_mp3

    async def get_tts_options(
        self, options: dict | None
    ) -> (
        tuple[str, float, float, str, int, str]
        | tuple[str, float, float, str, int, str, float, bool]
    ):
        """Get the text-to-speech options for generating TTS audio."""
        # If options is None, assign an empty dictionary to options
        if not options:
            options = {}

        output_opts = options.get(ATTR_AUDIO_OUTPUT, "mp3")

        if output_opts not in ["mp3", "raw"]:
            raise ValueError("Only MP3 or raw output is supported: " + output_opts)

        # Get the voice from options, or fall back to the configured default voice
        voice_opt = (
            options.get(ATTR_VOICE)
            or self.config_entry.options.get(ATTR_VOICE)
            or DEFAULT_VOICE
        )

        # Get the stability, similarity, model, and optimize latency from options,
        # or fall back to the configured default values
        stability = (
            options.get(CONF_STABILITY)
            or self.config_entry.options.get(CONF_STABILITY)
            or DEFAULT_STABILITY
        )

        similarity = (
            options.get(CONF_SIMILARITY)
            or self.config_entry.options.get(CONF_SIMILARITY)
            or DEFAULT_SIMILARITY
        )

        model = (
            options.get(CONF_MODEL)
            or self.config_entry.options.get(CONF_MODEL)
            or DEFAULT_MODEL
        )

        optimize_latency = (
            options.get(CONF_OPTIMIZE_LATENCY)
            or self.config_entry.options.get(CONF_OPTIMIZE_LATENCY)
            or DEFAULT_OPTIMIZE_LATENCY
        )

        api_key = (
            options.get(CONF_API_KEY)
            or self.config_entry.options.get(CONF_API_KEY)
            or self._api_key
        )

        # Convert optimize_latency to an integer
        optimize_latency = int(optimize_latency)

        # Get the voice ID by name from the TTS service

        voice = await self.get_voice_by_name_or_id(voice_opt)
        voice_id = voice.get("voice_id", None)

        # If voice_id is not found, refresh the list of voices and try again
        if not voice_id:
            _LOGGER.debug("Could not find voice, refreshing voices")
            await self.get_voices()
            voice = await self.get_voice_by_name_or_id(voice_opt)
            voice_id = voice.get("voice_id", None)

            # If voice_id is still not found, log a warning
            #  and use the first available voice
            if not voice_id:
                _LOGGER.warning(
                    "Could not find voice with name %s, available voices: %s",
                    voice_opt,
                    [voice["name"] for voice in self._voices],
                )
                voice_id = self._voices[0]["voice_id"]

        if model.endswith("_v2"):
            style = (
                options.get(CONF_STYLE)
                or self.config_entry.options.get(CONF_STYLE)
                or DEFAULT_STYLE
            )
            use_speaker_boost = (
                options.get(CONF_USE_SPEAKER_BOOST)
                or self.config_entry.options.get(CONF_USE_SPEAKER_BOOST)
                or DEFAULT_USE_SPEAKER_BOOST
            )
            return (
                voice_id,
                stability,
                similarity,
                model,
                optimize_latency,
                api_key,
                style,
                use_speaker_boost,
            )

        return (
            voice_id,
            stability,
            similarity,
            model,
            optimize_latency,
            api_key,
        )

I also fixed a bug where the only recognized "v2" model was the multilingual one, even though there's now a turbo v2 model as well.
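
For reference, a minimal sketch of that fix as a standalone helper; the helper name is mine for illustration, and the "before" comparison is reconstructed from the description above rather than copied from the old code:

def supports_v2_settings(model_id: str) -> bool:
    """Return True if the model accepts the v2-only voice settings
    (style, use_speaker_boost).

    The old check (reconstructed) only matched the multilingual model:
        model_id == "eleven_multilingual_v2"
    Matching on the "_v2" suffix also covers newer models such as
    "eleven_turbo_v2".
    """
    return model_id.endswith("_v2")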

@julianrinaldi
Author

I can confirm this has fixed the issue!

@egg82

egg82 commented Nov 13, 2023

I also noticed that the tail end of the "raw" output often gets cut off, so I added a line that appends another second and a half of silence to the end of the audio to fix that.

Change

            return "raw", audio.set_frame_rate(32000).set_sample_width(1).set_channels(1).raw_data

to

            expanded = audio + AudioSegment.silent(duration=1500)
            return "raw", expanded.set_frame_rate(32000).set_sample_width(1).set_channels(1).raw_data

or the 16k version with a sample width of 2, if that's your preference.

@carleeno
Owner

@egg82 thanks for figuring this out! I'm unfortunately on vacation for a few weeks and not able to work on this much right now, but if you'd like to make a PR, that'll make it a lot easier for me to review the changes.
If you do make a PR, thank you, but let's also make the end silence a config option; I think not everybody will want the extra delay at the end, especially when chaining outputs.
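
For anyone picking this up, a minimal sketch of what that config option could look like inside the "raw" branch of get_tts_audio, following the option-lookup pattern already used in get_tts_options. CONF_END_SILENCE and DEFAULT_END_SILENCE are hypothetical names, not part of the integration yet:

# Hedged sketch of a configurable trailing silence (hypothetical option names).
# A default of 0 ms would keep chained outputs free of any extra delay.
end_silence_ms = int(
    options.get(CONF_END_SILENCE)
    or self.config_entry.options.get(CONF_END_SILENCE)
    or DEFAULT_END_SILENCE
)
if end_silence_ms > 0:
    audio = audio + AudioSegment.silent(duration=end_silence_ms)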

@egg82

egg82 commented Nov 13, 2023

Of course! These are just my notes from trying to figure out how to fix this issue, so I wouldn't expect either of the changes I made to be particularly robust. For example, I assume that anything requesting raw output also requires the specific sample rate and bit depth I provide, which may not always be the case; maybe there's more info in the request parameters that could help? The silence delay should also be configurable, for sure. Making decisions about the audio output quality based purely on the requested data type isn't a great way to do it, but it works fine for the specific thing I need. At least it resolved this issue.
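
(On the "more info in the request parameters" point: as I understand it, newer Home Assistant releases do pass preferred-format hints to TTS engines in the options dict. The sketch below shows how those could replace the hardcoded rate/width; the constant names reflect my reading of the tts component and should be verified against your HA version before relying on them.)

# Hedged sketch, assuming HA's preferred-format hints are present in options.
from homeassistant.components.tts import (
    ATTR_PREFERRED_FORMAT,
    ATTR_PREFERRED_SAMPLE_RATE,
)

preferred_format = options.get(ATTR_PREFERRED_FORMAT, "mp3")
preferred_rate = int(options.get(ATTR_PREFERRED_SAMPLE_RATE, 16000))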

Unfortunately I'm not very familiar with how HA integrations work. I made educated guesses for these fixes that turned out to be correct.

@egg82

egg82 commented Nov 17, 2023

Related issues, specifically for the "end being cut off" and "choppy audio" portions:
esphome/issues#5038
esphome/feature-requests#2239

Essentially, it looks like the logs for ESPHome devices start screaming the following two lines:

[W][voice_assistant:293]: Speaker buffer full.
[W][voice_assistant:283]: Receive buffer full.

The related code is here:
https://esphome.io/api/voice__assistant_8cpp_source.html

 static const size_t RECEIVE_SIZE = 1024;
 static const size_t SPEAKER_BUFFER_SIZE = 16 * RECEIVE_SIZE;

where the receive buffer is hardcoded to 1 KB and the speaker buffer to 16 KB, despite the fact that ESP32 devices have 520 KB of SRAM available to them (with some caveats on how much can be allocated to what). This isn't something that the elevenlabs_tts integration can fix, but it's good to track these issues as the source of the problem.
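
To put those numbers in perspective, a quick back-of-the-envelope calculation (assuming the 16 kHz / 16-bit / mono raw format from the patch above):

# How much audio fits in ESPHome's hardcoded 16 KB speaker buffer?
SAMPLE_RATE = 16_000             # samples per second
SAMPLE_WIDTH = 2                 # bytes per sample (16-bit)
CHANNELS = 1
SPEAKER_BUFFER_SIZE = 16 * 1024  # bytes, from the ESPHome source quoted above

bytes_per_second = SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS  # 32,000 B/s
print(f"{SPEAKER_BUFFER_SIZE / bytes_per_second:.2f} s")  # ~0.51 s of audio

So the device can only hold about half a second of audio at a time, and anything that delivers audio faster than the speaker drains it will trip those warnings.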

@balloob

balloob commented Feb 4, 2024

This has been fixed in Home Assistant and this issue can be closed.

@carleeno
Owner

carleeno commented Feb 4, 2024

> This has been fixed in Home Assistant and this issue can be closed.

Thanks @balloob!

@carleeno carleeno closed this as completed Feb 4, 2024
@egg82

egg82 commented Feb 5, 2024

I wasn't 100% sure if this was fixed yet - glad it is! Thanks for the work in making this integration better.
