mod_google_transcribe

A Freeswitch module that generates real-time transcriptions on a Freeswitch channel by using Google's Speech-to-Text API.

Optionally, the connection to the Google Cloud recognizer can be delayed until voice activity has been detected. This is useful when you want to minimize the cost of streaming audio for transcription. This behavior is governed by START_RECOGNIZING_ON_VAD and the channel variables starting with `RECOGNIZER_VAD`, as described below.

API

Commands

The freeswitch module exposes two versions of an API command to transcribe speech:

version 1

uuid_google_transcribe <uuid> start <lang-code> [interim]

When using this command, additional speech processing options can be provided through Freeswitch channel variables, described below.

version 2

uuid_google_transcribe2 <uuid> start <lang-code> [interim](bool) \
[single-utterance](bool) [separate-recognition](bool) [max-alternatives](int) \
[profanity-filter](bool) [word-time](bool) [punctuation](bool) \
[model](string) [enhanced](bool) [hints](words separated by commas, no spaces) \
[play-file](path of the file to play)

This command allows speech processing options to be provided on the command line, and has the ability to optionally play an audio file as a prompt.

Example:

bgapi uuid_google_transcribe2 312033b6-4b2a-48d8-be0c-5f161aec2b3e start en-US \
true true true 5 true true true command_and_search true \
yes,no,hello https://www2.cs.uic.edu/~i101/SoundFiles/CantinaBand60.wav
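Because the version 2 options are positional, it is easy to put a value in the wrong slot. The following is a minimal sketch of a helper that assembles the command string in the order shown in the syntax above. The function name `buildTranscribe2Command` and the camelCase option keys are hypothetical, and the sketch assumes that every option before play-file must be supplied so the positions line up.

```javascript
// Hypothetical helper (not part of the module): builds a
// uuid_google_transcribe2 command string in the documented positional order.
function buildTranscribe2Command(uuid, langCode, opts) {
  // Assumption: positional options cannot be skipped, so all are required here.
  const required = [
    'interim', 'singleUtterance', 'separateRecognition', 'maxAlternatives',
    'profanityFilter', 'wordTime', 'punctuation', 'model', 'enhanced', 'hints',
  ];
  for (const key of required) {
    if (opts[key] === undefined) throw new Error(`missing option: ${key}`);
  }
  // Hints are joined with commas and must not contain spaces or commas.
  if (opts.hints.length === 0 || opts.hints.some((h) => h === '' || /[\s,]/.test(h))) {
    throw new Error('hints must be non-empty words without spaces or commas');
  }
  const args = [
    uuid, 'start', langCode,
    opts.interim, opts.singleUtterance, opts.separateRecognition,
    opts.maxAlternatives, opts.profanityFilter, opts.wordTime,
    opts.punctuation, opts.model, opts.enhanced, opts.hints.join(','),
  ];
  // play-file is optional and always comes last.
  if (opts.playFile) args.push(opts.playFile);
  return `uuid_google_transcribe2 ${args.join(' ')}`;
}

// Rebuilds the example command above (without the bgapi prefix).
const exampleCommand = buildTranscribe2Command(
  '312033b6-4b2a-48d8-be0c-5f161aec2b3e', 'en-US', {
    interim: true, singleUtterance: true, separateRecognition: true,
    maxAlternatives: 5, profanityFilter: true, wordTime: true,
    punctuation: true, model: 'command_and_search', enhanced: true,
    hints: ['yes', 'no', 'hello'],
    playFile: 'https://www2.cs.uic.edu/~i101/SoundFiles/CantinaBand60.wav',
  });
console.log(exampleCommand);
```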

Attaches a media bug to the channel and performs a streaming recognize request.

  • uuid - unique identifier of Freeswitch channel
  • lang-code - a valid Google language code to use for speech recognition
  • interim - If the 'interim' keyword is present then both interim and final transcription results will be returned; otherwise only final transcriptions will be returned

uuid_google_transcribe <uuid> stop

Stop transcription on the channel.

Command Variables

Additional Google speech options can be set through Freeswitch channel variables for uuid_google_transcribe (some can alternatively be set on the command line for uuid_google_transcribe2).

| Variable | Description |
| --- | --- |
| GOOGLE_SPEECH_SINGLE_UTTERANCE | see Google's Speech-to-Text documentation |
| GOOGLE_SPEECH_SEPARATE_RECOGNITION_PER_CHANNEL | see Google's Speech-to-Text documentation |
| GOOGLE_SPEECH_MAX_ALTERNATIVES | see Google's Speech-to-Text documentation |
| GOOGLE_SPEECH_PROFANITY_FILTER | see Google's Speech-to-Text documentation |
| GOOGLE_SPEECH_ENABLE_WORD_TIME_OFFSETS | see Google's Speech-to-Text documentation |
| GOOGLE_SPEECH_ENABLE_AUTOMATIC_PUNCTUATION | see Google's Speech-to-Text documentation |
| GOOGLE_SPEECH_MODEL | see Google's Speech-to-Text documentation |
| GOOGLE_SPEECH_USE_ENHANCED | see Google's Speech-to-Text documentation |
| GOOGLE_SPEECH_HINTS | see Google's Speech-to-Text documentation |
| GOOGLE_SPEECH_ALTERNATIVE_LANGUAGE_CODES | A comma-separated list of language codes, per Google's Speech-to-Text documentation. |
| GOOGLE_SPEECH_SPEAKER_DIARIZATION | Set to 1 to enable speaker diarization. |
| GOOGLE_SPEECH_SPEAKER_DIARIZATION_MIN_SPEAKER_COUNT | see Google's Speech-to-Text documentation |
| GOOGLE_SPEECH_SPEAKER_DIARIZATION_MAX_SPEAKER_COUNT | see Google's Speech-to-Text documentation |
| GOOGLE_SPEECH_METADATA_INTERACTION_TYPE | Set to 'discussion', 'presentation', 'phone_call', 'voicemail', 'professionally_produced', 'voice_search', 'voice_command', or 'dictation', per Google's Speech-to-Text documentation. |
| GOOGLE_SPEECH_METADATA_INDUSTRY_NAICS_CODE | see Google's Speech-to-Text documentation |
| GOOGLE_SPEECH_METADATA_MICROPHONE_DISTANCE | Set to 'nearfield', 'midfield', or 'farfield', per Google's Speech-to-Text documentation. |
| GOOGLE_SPEECH_METADATA_ORIGINAL_MEDIA_TYPE | Set to 'audio' or 'video', per Google's Speech-to-Text documentation. |
| GOOGLE_SPEECH_METADATA_RECORDING_DEVICE_TYPE | Set to 'smartphone', 'pc', 'phone_line', 'vehicle', 'other_outdoor_device', or 'other_indoor_device', per Google's Speech-to-Text documentation. |
| START_RECOGNIZING_ON_VAD | If set to 1 or true, do not begin streaming audio to Google Cloud until voice activity is detected. |
| RECOGNIZER_VAD_MODE | An integer from 0 to 3, from less to more aggressive voice activity detection (default: 2). |
| RECOGNIZER_VAD_VOICE_MS | The number of milliseconds of voice activity required to trigger the connection to Google Cloud when START_RECOGNIZING_ON_VAD is set (default: 250). |
| RECOGNIZER_VAD_DEBUG | If greater than 0, VAD debug logs will be generated (default: 0). |
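As an illustration of the VAD-related variables, here is a small sketch that validates the settings and produces the channel variables as a plain object, using the defaults listed above. The function name `vadChannelVars` and its option keys are hypothetical. How the variables reach the channel is up to your application; with drachtio-fsmrf they could presumably be applied through the endpoint's `set` method, but that wiring is an assumption and is not shown in the code.

```javascript
// Hypothetical helper (not part of the module): returns the channel variables
// that enable VAD-gated recognition, with the documented defaults.
function vadChannelVars({ mode = 2, voiceMs = 250, debug = 0 } = {}) {
  // RECOGNIZER_VAD_MODE is documented as an integer from 0 to 3.
  if (!Number.isInteger(mode) || mode < 0 || mode > 3) {
    throw new RangeError('mode must be an integer from 0 to 3');
  }
  // Assumption: the voice duration should be a positive whole number of ms.
  if (!Number.isInteger(voiceMs) || voiceMs <= 0) {
    throw new RangeError('voiceMs must be a positive integer');
  }
  return {
    START_RECOGNIZING_ON_VAD: '1',
    RECOGNIZER_VAD_MODE: String(mode),
    RECOGNIZER_VAD_VOICE_MS: String(voiceMs),
    RECOGNIZER_VAD_DEBUG: String(debug),
  };
}

// A more aggressive detector that waits for 400 ms of speech.
console.log(vadChannelVars({ mode: 3, voiceMs: 400 }));
```

These variables need to be set on the channel before the transcribe command is issued, since they govern when the connection to Google Cloud is opened.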

Events

google_transcribe::transcription - returns an interim or final transcription. The event contains a JSON body describing the transcription result:

{
	"stability": 0,
	"is_final": true,
	"alternatives": [{
		"confidence": 0.96471,
		"transcript": "Donny was a good bowler, and a good man"
	}]
}
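A consumer of this event typically wants the most likely transcript and whether it is final. The sketch below parses a body shaped like the JSON above and picks the alternative with the highest confidence. The function name `bestTranscript` and the shape of its return value are hypothetical, and treating a missing confidence as 0 is an assumption.

```javascript
// Hypothetical helper (not part of the module): extracts the highest-confidence
// alternative from a google_transcribe::transcription event body.
function bestTranscript(body) {
  // Accept either the raw JSON string or an already-parsed object.
  const result = typeof body === 'string' ? JSON.parse(body) : body;
  const alternatives = Array.isArray(result.alternatives) ? result.alternatives : [];
  if (alternatives.length === 0) return null;
  // Assumption: a missing confidence value is treated as 0.
  const best = alternatives.reduce((a, b) =>
    ((b.confidence ?? 0) > (a.confidence ?? 0) ? b : a));
  return {
    isFinal: Boolean(result.is_final),
    transcript: best.transcript,
    confidence: best.confidence ?? 0,
  };
}

const sampleBody = JSON.stringify({
  stability: 0,
  is_final: true,
  alternatives: [
    { confidence: 0.96471, transcript: 'Donny was a good bowler, and a good man' },
  ],
});
console.log(bestTranscript(sampleBody));
```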

google_transcribe::end_of_utterance - returns an indication that an utterance has been detected. This may be returned prior to a final transcription. This event is only returned when GOOGLE_SPEECH_SINGLE_UTTERANCE is set to true.

google_transcribe::end_of_transcript - returned when a transcription operation has completed. If a final transcription has not been returned by now, it won't be. This event is only returned when GOOGLE_SPEECH_SINGLE_UTTERANCE is set to true.

google_transcribe::no_audio_detected - returned when Google has returned an error indicating that no audio was received for a lengthy period of time.

google_transcribe::max_duration_exceeded - returned when Google has indicated that a long-running transcription has been stopped due to a maximum duration limit (305 seconds) on their side. It is the application's responsibility to respond by starting a new transcription session, if desired.
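One way an application might handle this is to reissue the start command whenever the event arrives. The sketch below assumes an endpoint-like object that exposes `uuid`, an `api` method (as in the Usage section), and an `addCustomEventListener` method for subscribing to module events. That listener method name is an assumption about the client library, and `keepTranscribing` is a hypothetical helper.

```javascript
// Hypothetical helper (not part of the module): starts transcription and
// restarts it each time Google ends the session at its max duration.
// Assumes `ep` exposes uuid, api(command, args) and
// addCustomEventListener(eventName, handler).
function keepTranscribing(ep, langCode, { interim = false } = {}) {
  const args = `${ep.uuid} start ${langCode}${interim ? ' interim' : ''}`;
  const start = () => ep.api('uuid_google_transcribe', args);
  // Reissue the same start command when the session is cut off.
  ep.addCustomEventListener('google_transcribe::max_duration_exceeded', start);
  return start();
}

// Stand-in endpoint for demonstration; a real endpoint comes from your client library.
const demoCalls = [];
const demoHandlers = {};
const demoEndpoint = {
  uuid: '312033b6-4b2a-48d8-be0c-5f161aec2b3e',
  api: (command, args) => { demoCalls.push(`${command} ${args}`); },
  addCustomEventListener: (name, handler) => { demoHandlers[name] = handler; },
};
keepTranscribing(demoEndpoint, 'en-US', { interim: true });
```

A production version would likely also stop restarting once the call ends, which this sketch leaves out.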

Usage

When using drachtio-fsmrf, you can access this API command via the api method on the 'endpoint' object.

ep.api('uuid_google_transcribe', `${ep.uuid} start en-US`);  

Examples

google_transcribe.js