A Freeswitch module that generates real-time transcriptions on a Freeswitch channel by using Google's Speech-to-Text API.
Optionally, the connection to the Google Cloud recognizer can be delayed until voice activity has been detected. This can be useful when you want to minimize the cost of streaming audio for transcription. This setting is governed by the channel variables starting with `RECOGNIZER_VAD`, as described below.
The Freeswitch module exposes two versions of an API command to transcribe speech:
#### version 1
uuid_google_transcribe <uuid> start <lang-code> [interim]
When using this command, additional speech processing options can be provided through Freeswitch channel variables, described below.
#### version 2
uuid_google_transcribe2 <uuid> start <lang-code> [interim](bool) \
[single-utterance](bool) [separate-recognition](bool) [max-alternatives](int) \
[profanity-filter](bool) [word-time](bool) [punctuation](bool) \
[model](string) [enhanced](bool) [hints](words separated by commas, no spaces) \
[play-file](play file path)
This command allows speech processing options to be provided on the command line, and has the ability to optionally play an audio file as a prompt.
Example:
bgapi uuid_google_transcribe2 312033b6-4b2a-48d8-be0c-5f161aec2b3e start en-US \
true true true 5 true true true command_and_search true \
yes,no,hello https://www2.cs.uic.edu/~i101/SoundFiles/CantinaBand60.wav
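Because every option in version 2 is positional, it is easy to put a value in the wrong slot. The sketch below assembles the argument string from a named options object, following the order in the usage line above. The helper name `buildTranscribe2Args` and the option property names are hypothetical and not part of the module; the sketch also assumes that every option is supplied, with only the play file left optional.

```javascript
// Hypothetical helper (not part of the module): builds the positional
// argument string for uuid_google_transcribe2 in the documented order.
function buildTranscribe2Args(uuid, opts) {
  const bool = (v) => (v ? 'true' : 'false');
  const hints = opts.hints || [];
  if (hints.length === 0) {
    throw new Error('this sketch assumes at least one hint is supplied');
  }
  // Hints are joined with commas and must not contain spaces,
  // so reject any value that would corrupt the list.
  for (const hint of hints) {
    if (/[\s,]/.test(hint)) throw new Error(`invalid hint: "${hint}"`);
  }
  const args = [
    uuid,
    'start',
    opts.language,
    bool(opts.interim),
    bool(opts.singleUtterance),
    bool(opts.separateRecognition),
    String(opts.maxAlternatives),
    bool(opts.profanityFilter),
    bool(opts.wordTime),
    bool(opts.punctuation),
    opts.model,
    bool(opts.enhanced),
    hints.join(','),
  ];
  // The prompt file is the last positional argument and may be omitted.
  if (opts.playFile) args.push(opts.playFile);
  return args.join(' ');
}
```

The returned string is what follows `uuid_google_transcribe2` on the command line, so it could be passed as the argument portion of a `bgapi` call.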
Attaches a media bug to the channel and performs a streaming recognize request.
- `uuid` - unique identifier of the Freeswitch channel
- `lang-code` - a valid Google language code to use for speech recognition
- `interim` - if the 'interim' keyword is present then both interim and final transcription results will be returned; otherwise only final transcriptions will be returned
uuid_google_transcribe <uuid> stop
Stop transcription on the channel.
Additional Google speech options can be set through Freeswitch channel variables for `uuid_google_transcribe` (some can alternatively be set on the command line for `uuid_google_transcribe2`).
variable | Description |
---|---|
GOOGLE_SPEECH_SINGLE_UTTERANCE | read this |
GOOGLE_SPEECH_SEPARATE_RECOGNITION_PER_CHANNEL | read this |
GOOGLE_SPEECH_MAX_ALTERNATIVES | read this |
GOOGLE_SPEECH_PROFANITY_FILTER | read this |
GOOGLE_SPEECH_ENABLE_WORD_TIME_OFFSETS | read this |
GOOGLE_SPEECH_ENABLE_AUTOMATIC_PUNCTUATION | read this |
GOOGLE_SPEECH_MODEL | read this |
GOOGLE_SPEECH_USE_ENHANCED | read this |
GOOGLE_SPEECH_HINTS | read this |
GOOGLE_SPEECH_ALTERNATIVE_LANGUAGE_CODES | a comma-separated list of language codes, per this |
GOOGLE_SPEECH_SPEAKER_DIARIZATION | set to 1 to enable speaker diarization |
GOOGLE_SPEECH_SPEAKER_DIARIZATION_MIN_SPEAKER_COUNT | read this |
GOOGLE_SPEECH_SPEAKER_DIARIZATION_MAX_SPEAKER_COUNT | read this |
GOOGLE_SPEECH_METADATA_INTERACTION_TYPE | set to 'discussion', 'presentation', 'phone_call', 'voicemail', 'professionally_produced', 'voice_search', 'voice_command', or 'dictation' per this |
GOOGLE_SPEECH_METADATA_INDUSTRY_NAICS_CODE | read this |
GOOGLE_SPEECH_METADATA_MICROPHONE_DISTANCE | set to 'nearfield', 'midfield', or 'farfield' per this |
GOOGLE_SPEECH_METADATA_ORIGINAL_MEDIA_TYPE | set to 'audio', or 'video' per this |
GOOGLE_SPEECH_METADATA_RECORDING_DEVICE_TYPE | set to 'smartphone', 'pc', 'phone_line', 'vehicle', 'other_outdoor_device', or 'other_indoor_device' per this |
START_RECOGNIZING_ON_VAD | if set to 1 or true, do not begin streaming audio to google cloud until voice activity is detected. |
RECOGNIZER_VAD_MODE | An integer value 0-3 from less to more aggressive vad detection (default: 2). |
RECOGNIZER_VAD_VOICE_MS | The number of milliseconds of voice activity that is required to trigger the connection to google cloud, when START_RECOGNIZING_ON_VAD is set (default: 250). |
RECOGNIZER_VAD_DEBUG | if >0 vad debug logs will be generated (default: 0). |
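As an illustration of the VAD-related variables in the table, the sketch below builds the set of channel variables that delays the connection to Google Cloud until voice activity is detected. The helper name `vadChannelVariables` and its parameter names are hypothetical; the defaults mirror those listed in the table, and the range check on the mode follows the documented 0-3 range.

```javascript
// Hypothetical helper (not part of the module): returns the channel
// variables for a VAD-delayed start, as name/value string pairs.
function vadChannelVariables({ mode = 2, voiceMs = 250, debug = 0 } = {}) {
  if (!Number.isInteger(mode) || mode < 0 || mode > 3) {
    throw new RangeError('RECOGNIZER_VAD_MODE must be an integer from 0 to 3');
  }
  if (!Number.isInteger(voiceMs) || voiceMs <= 0) {
    throw new RangeError('RECOGNIZER_VAD_VOICE_MS must be a positive integer');
  }
  return {
    START_RECOGNIZING_ON_VAD: '1',
    RECOGNIZER_VAD_MODE: String(mode),
    RECOGNIZER_VAD_VOICE_MS: String(voiceMs),
    RECOGNIZER_VAD_DEBUG: String(debug),
  };
}
```

How the variables reach the channel is up to the application (for example the dialplan or the Freeswitch `uuid_setvar` command); the sketch assumes they are set before the start command is issued.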
google_transcribe::transcription - returns an interim or final transcription. The event contains a JSON body describing the transcription result:
{
"stability": 0,
"is_final": true,
"alternatives": [{
"confidence": 0.96471,
"transcript": "Donny was a good bowler, and a good man"
}]
}
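A consumer of this event typically wants the finality flag and the most likely transcript. The following sketch parses a body shaped like the example above and picks the alternative with the highest confidence. The function name `summarizeTranscription` is hypothetical, and the sketch assumes that missing `confidence` values can be treated as 0.

```javascript
// Hypothetical helper: reduces a transcription event body to the
// fields most applications need.
function summarizeTranscription(body) {
  const result = typeof body === 'string' ? JSON.parse(body) : body;
  const alternatives = result.alternatives || [];
  // Keep the alternative with the highest confidence (missing => 0).
  const best = alternatives.reduce(
    (top, alt) =>
      top === null || (alt.confidence || 0) > (top.confidence || 0) ? alt : top,
    null
  );
  return {
    isFinal: result.is_final === true,
    transcript: best ? best.transcript : '',
    confidence: best ? (best.confidence || 0) : 0,
  };
}
```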
google_transcribe::end_of_utterance - returns an indication that an utterance has been detected. This may be returned prior to a final transcription. This event is only returned when GOOGLE_SPEECH_SINGLE_UTTERANCE is set to true.
google_transcribe::end_of_transcript - returned when a transcription operation has completed. If a final transcription has not been returned by now, it won't be. This event is only returned when GOOGLE_SPEECH_SINGLE_UTTERANCE is set to true.
google_transcribe::no_audio_detected - returned when google has returned an error indicating that no audio was received for a lengthy period of time.
google_transcribe::max_duration_exceeded - returned when Google has returned an indication that a long-running transcription has been stopped due to a max duration limit (305 seconds) on their side. It is the application's responsibility to respond by starting a new transcription session, if desired.
google_transcribe::no_audio_detected - returned when google has not received any audio for some reason.
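The events above can be routed through a single handler. The sketch below dispatches on the event name and restarts transcription when the max duration limit is hit, since that is left to the application. Everything here is an assumption for illustration: the `(eventName, body)` signature, the `createEventHandler` name, the `onTranscript` and `restart` callbacks, and the cap on restarts are not defined by the module.

```javascript
// Hypothetical dispatcher: how events are delivered depends on the event
// library in use; this sketch assumes an (eventName, body) pair.
function createEventHandler({ onTranscript, restart, maxRestarts = 3 }) {
  let restarts = 0;
  return function handle(eventName, body) {
    switch (eventName) {
      case 'google_transcribe::transcription':
        onTranscript(JSON.parse(body));
        return 'transcript';
      case 'google_transcribe::max_duration_exceeded':
        // The module does not restart on its own; cap the retries so a
        // persistent failure cannot loop forever (a choice of this sketch).
        if (restarts >= maxRestarts) return 'gave_up';
        restarts += 1;
        restart();
        return 'restarted';
      case 'google_transcribe::end_of_utterance':
      case 'google_transcribe::end_of_transcript':
      case 'google_transcribe::no_audio_detected':
        return 'ignored';
      default:
        return 'unknown';
    }
  };
}
```

In a real application the `restart` callback would issue a new start command on the same channel.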
When using drachtio-fsmrf, you can access this API command via the `api` method on the 'endpoint' object.
ep.api('uuid_google_transcribe', `${ep.uuid} start en-US`);
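The argument string for the version 1 command can also be built by a small helper, which keeps the optional `interim` keyword and the `stop` form in one place. This is a minimal sketch; the name `transcribeArgs` is hypothetical, and the commented usage assumes an `ep` endpoint object like the one in the example above.

```javascript
// Hypothetical helper: builds the argument string for uuid_google_transcribe.
function transcribeArgs(uuid, action, lang, interim = false) {
  if (action === 'stop') return `${uuid} stop`;
  if (action !== 'start') throw new Error(`unsupported action: ${action}`);
  if (!lang) throw new Error('a language code is required to start');
  // The 'interim' keyword is only appended when interim results are wanted.
  return [uuid, 'start', lang, interim ? 'interim' : ''].filter(Boolean).join(' ');
}

// Usage, assuming an endpoint object `ep` as in the example above:
//   ep.api('uuid_google_transcribe', transcribeArgs(ep.uuid, 'start', 'en-US', true));
//   ep.api('uuid_google_transcribe', transcribeArgs(ep.uuid, 'stop'));
```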