Training your own wake word
Precise comes with a few executables used to train and test models.
First, run through the Source Install procedure in the README.
Once installed, to gain access to these executables in the current terminal session,
run the command:
source .venv/bin/activate
Here's a summary of all the executables:
- precise-collect - Record audio samples for use with Precise
- precise-convert - Convert wake-word model from Keras to TensorFlow
- precise-eval - Evaluate a list of models on a dataset
- precise-listen - Run a model on microphone audio input
- precise-engine - Run a model on raw audio data from stdin
- precise-test - Test a model against a dataset
- precise-train - Train a new model on a dataset
- precise-train-incremental - Train a model to inhibit activation by marking false activations and retraining
For more info on each individual script, you can run <script-name> -h.
The rough process for training a model is as follows:
1. precise-collect - Record wake word samples
2. precise-train - Initial training
3. precise-train-incremental - Reduce false activations
4. precise-test - Statistics on dataset accuracy
5. precise-listen - Real world test with your microphone
6. precise-convert - Convert .net to .pb
The first thing you'll want to do is record some audio samples of your
wake word. To do that, use the precise-collect tool, which will
guide you through recording a few samples. The default settings should be fine.
Use this tool to collect around 12 samples, making sure to leave a second or two of silence at the start of each recording, but with no silence after the wake word.
$ precise-collect
Audio name (Ex. recording-##): hey-computer.##
ALSA lib pcm_dsnoop.c:638:(snd_pcm_dsnoop_open) unable to open slave
ALSA lib pcm_dmix.c:1099:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm_dmix.c:1099:(snd_pcm_dmix_open) unable to open slave
Press space to record (esc to exit)...
Recording...
Saved as hey-computer.00.wav
Press space to record (esc to exit)...
Audio files from precise-collect are WAV files in little-endian, 16-bit, mono, 16000 Hz PCM format. FFmpeg calls this "pcm_s16le". If you collect samples using another program, they must be converted to this format with an ffmpeg command:
$ ffmpeg -i input.mp3 -acodec pcm_s16le -ar 16000 -ac 1 output.wav
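Before training, it can help to confirm that each recording really is in this format. The following is a minimal sketch, not part of Precise: the helper names le_uint and is_precise_wav are hypothetical, and it assumes a standard RIFF/WAVE header in which the "fmt " chunk comes first, so that the format tag, channel count, sample rate, and bit depth sit at byte offsets 20, 22, 24, and 34.

```shell
# Sketch only: le_uint and is_precise_wav are hypothetical helpers, not Precise tools.
# Assumes the "fmt " chunk directly follows the 12-byte RIFF/WAVE preamble.

# Print COUNT bytes starting at OFFSET in FILE as an unsigned little-endian integer.
# Usage: le_uint FILE OFFSET COUNT
le_uint() (
    value=0
    scale=1
    for byte in $(od -An -v -tu1 -j "$2" -N "$3" "$1" 2>/dev/null); do
        value=$((value + byte * scale))
        scale=$((scale * 256))
    done
    echo "$value"
)

# Succeed only if FILE is uncompressed PCM, mono, 16000 Hz, 16 bits per sample.
is_precise_wav() {
    [ "$(le_uint "$1" 20 2)" -eq 1 ] &&      # format tag 1 = PCM
    [ "$(le_uint "$1" 22 2)" -eq 1 ] &&      # channels
    [ "$(le_uint "$1" 24 4)" -eq 16000 ] &&  # sample rate in Hz
    [ "$(le_uint "$1" 34 2)" -eq 16 ]        # bits per sample
}
```

With these defined, a loop such as for f in *.wav; do is_precise_wav "$f" || echo "Needs conversion: $f"; done lists the files that should be run through the ffmpeg command above. Reading the header one byte at a time keeps the check independent of the machine's own byte order.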
Now, place most of these files under hey-computer/wake-word/ and the rest under hey-computer/test/wake-word/:
hey-computer/
├── wake-word/
│   ├── hey-computer.00.wav
│   ├── hey-computer.01.wav
│   ├── hey-computer.02.wav
│   ├── hey-computer.03.wav
│   ├── hey-computer.04.wav
│   ├── hey-computer.05.wav
│   ├── hey-computer.06.wav
│   ├── hey-computer.07.wav
│   └── hey-computer.08.wav
├── not-wake-word/
└── test/
    ├── wake-word/
    │   ├── hey-computer.09.wav
    │   ├── hey-computer.10.wav
    │   ├── hey-computer.11.wav
    │   └── hey-computer.12.wav
    └── not-wake-word/
This tells Precise to train on the first 9 samples (00 through 08) and evaluate the model's accuracy using the last 4 (09 through 12).
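Sorting the recordings by hand is easy to get wrong once there are more than a handful. As a sketch, assuming all recordings currently sit together in one flat folder, a small helper can create the layout above and hold out the last few files for testing. The name split_samples and its arguments are hypothetical; nothing like it ships with Precise.

```shell
# Sketch only: split_samples is a hypothetical helper, not a Precise command.
# Usage: split_samples SOURCE_DIR DATASET_DIR TEST_COUNT
# Moves the last TEST_COUNT .wav files (in sorted order) into test/wake-word
# and all earlier ones into wake-word.
split_samples() (
    src=$1
    dataset=$2
    test_count=$3

    # Create the four folders Precise expects, including the empty ones.
    mkdir -p "$dataset/wake-word" "$dataset/not-wake-word" \
             "$dataset/test/wake-word" "$dataset/test/not-wake-word"

    # Count the recordings so we know where the held-out portion starts.
    total=0
    for f in "$src"/*.wav; do
        [ -e "$f" ] || continue
        total=$((total + 1))
    done
    train_count=$((total - test_count))

    # The glob is expanded once, in sorted order, before any file is moved.
    index=0
    for f in "$src"/*.wav; do
        [ -e "$f" ] || continue
        if [ "$index" -lt "$train_count" ]; then
            mv "$f" "$dataset/wake-word/"
        else
            mv "$f" "$dataset/test/wake-word/"
        fi
        index=$((index + 1))
    done
)
```

For example, split_samples recordings hey-computer 4 run on the thirteen files above would reproduce the split shown in the tree. Because the split follows sorted file names, zero-padded numbering such as hey-computer.00.wav keeps the order predictable.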
Now, we can start to train a model with the precise-train tool:
$ precise-train -e 60 hey-computer.net hey-computer/
...
Epoch 1/60
2018-02-23 11:32:05.235740: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
9/9 [==============================] - 0s 31ms/step - loss: 0.2620 - acc: 0.3333 - val_loss: 0.0836 - val_acc: 0.5000
...
Epoch 60/60
9/9 [==============================] - 0s 1ms/step - loss: 0.0025 - acc: 1.0000 - val_loss: 5.6518e-05 - val_acc: 1.0000
Now, we can run this model against live microphone input using precise-listen. It will listen to the microphone and output confidence bars. Each line represents one measurement: the more Xs there are, the more confident the model is that the wake word was uttered. Any Xs over the threshold are shown as a lowercase x.
$ precise-listen hey-computer.net
Using TensorFlow backend.
2018-02-23 12:46:22.622717: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
ALSA lib pcm_dmix.c:1099:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm_route.c:867:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_dmix.c:1099:(snd_pcm_dmix_open) unable to open slave
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
XXX-----------------------------------------------------------------------------
XXXXXXXX------------------------------------------------------------------------
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx---------------------------------------
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxxxx------------------------------------
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxxxxxxxxx-------------------------------
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxxxxxxxxxxxxxxxxxxx---------------------
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxxxxxxxxxxxxxxxxxxxxx-------------------
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxxxxxxxxxxxxxxxxxxxxxxx-----------------
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxxxxxxxxxxxxxxxxxxxxxxxxx---------------
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxx--------------------------------------
XXXXXXXXXXXXX-------------------------------------------------------------------
XX------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
As you can see, the model has simply learned to activate on every noise rather than specifically on our wake word. What we need to do now is reduce false activations by adding data that the model should not activate on.
There are two ways of reducing false activations: recording your own
false activations and putting them in hey-computer/not-wake-word, or
using an automated process to find false positives in large audio files
filled with everyday noise.
To record your own false activations, launch precise-listen in save mode:
precise-listen hey-computer.net -d hey-computer/not-wake-word
Now you can say words similar to your wake word, and every time the model activates,
it will save that recording into the hey-computer/not-wake-word folder. Just make
sure never to say the actual wake word while in save mode.
Once you've gathered a few samples of new false activations, retrain your model with
the same precise-train command:
precise-train hey-computer.net hey-computer/ -e 600
You can stop training with Ctrl+C once the accuracy (acc) gets close to 1.0.
Now, you can repeat the process, running precise-listen again. You should notice
that the model has learned not to activate on the sounds it previously got wrong.
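Since each round of this process adds clips to hey-computer/not-wake-word, it can be useful to see how many samples each folder holds before retraining. A minimal sketch, assuming the saved clips end in .wav; count_wavs and dataset_summary are hypothetical helper names and not part of Precise:

```shell
# Sketch only: count_wavs and dataset_summary are hypothetical helpers.

# Print the number of .wav files directly inside DIR (0 if DIR is missing or empty).
count_wavs() (
    count=0
    for f in "$1"/*.wav; do
        [ -e "$f" ] || continue
        count=$((count + 1))
    done
    echo "$count"
)

# Print one line per dataset folder with its sample count.
# Usage: dataset_summary DATASET_DIR
dataset_summary() (
    for sub in wake-word not-wake-word test/wake-word test/not-wake-word; do
        printf '%-20s %s\n' "$sub" "$(count_wavs "$1/$sub")"
    done
)
```

Running dataset_summary hey-computer before and after a save-mode session shows how many new false activations were captured.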
While the first method works to a certain degree, you will still notice a large number of false activations on ordinary everyday noise. To reduce the number of times the model activates when it shouldn't, we need a set of long audio files that don't contain the wake word. You can use pretty much any set of sounds, but a diverse set of audio is better. A good place to start is the Public Domain Sounds Backup. You can download it with:
cd data/random
wget http://downloads.tuxfamily.org/pdsounds/pdsounds_march2009.7z
# Install p7zip
7z x pdsounds_march2009.7z
cd ../..
After downloading a set of sounds, they probably won't be in the right format. They need
to be 16-bit signed integer mono WAV files with a sample rate of 16000 Hz. However, don't
worry if that's not the case. All we need is the command line tool ffmpeg and the following
script:
SOURCE_DIR=data/random/mp3
DEST_DIR=data/random
for i in "$SOURCE_DIR"/*.mp3; do
    echo "Converting $i..."
    ffmpeg -i "$i" -acodec pcm_s16le -ar 16000 -ac 1 -f wav "$DEST_DIR/$(basename "$i" .mp3).wav"
done
This runs ffmpeg -i input.mp3 -acodec pcm_s16le -ar 16000 -ac 1 -f wav output.wav on
all the mp3 files, placing the results in data/random. Now we are ready to reduce false
activations. Begin the process with the command:
precise-train-incremental hey-computer.net hey-computer/ -r data/random/
Now you will see it run through all the WAV files in data/random, picking out clips where
it falsely activated, placing them into the hey-computer/not-wake-word directory, and
retraining. This process will take a while, depending on the total length of audio in the
dataset and your processor speed.
Once it finishes, we can look at how it performs against the test dataset with:
precise-test hey-computer.net hey-computer/
And, we can test it again through the microphone with:
precise-listen hey-computer.net
Finally, if there are still too many false activations, we can add more audio to
data/random and repeat the process. If it looks good, continue below.
So far, we've only dealt with .net files. This extension, used throughout
Precise, denotes a Keras model saved as an HDF5 file. To reduce
runtime dependencies, you must convert the .net Keras model into a
.pb TensorFlow model. Do this with the following command:
precise-convert hey-computer.net
That's it! Now the final, exported model consists of the following two files:
hey-computer.pb
hey-computer.pb.params
The first contains the TensorFlow neural network and the second contains details specific to Precise for how the audio was processed for the network.
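Both files are needed wherever the model is deployed, so it is worth checking that neither went missing when copying them around. A minimal sketch, assuming the exported files sit next to the .net file and share its base name as in the example above; check_export is a hypothetical helper, not a Precise command:

```shell
# Sketch only: check_export is a hypothetical helper, not a Precise command.
# Usage: check_export MODEL.net
# Succeeds only if MODEL.pb and MODEL.pb.params both exist and are non-empty.
check_export() (
    base=${1%.net}
    status=0
    for f in "$base.pb" "$base.pb.params"; do
        if [ -s "$f" ]; then
            echo "found:   $f"
        else
            echo "missing: $f" >&2
            status=1
        fi
    done
    exit "$status"
)
```

For example, check_export hey-computer.net && echo "export complete" prints the confirmation only when both hey-computer.pb and hey-computer.pb.params are present.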