Model predictions vary significantly depending on position of wakeword in audio #237
Comments
Precise is meant to operate on a continuous stream of audio. For this reason, it is only trained to output a high score for the frames immediately after the wake word. If you want to test a model against an entire audio sample, you should take the maximum of all the output values. Let me know if that makes sense.
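The suggestion above (reduce the per-frame outputs of a clip to their maximum) can be sketched as follows. This is a minimal illustration, not Precise's own evaluation code: `clip_score` is a hypothetical helper, and it assumes the per-frame scores have already been collected into a sequence of floats.

```python
import numpy as np


def clip_score(frame_scores):
    """Reduce per-frame model outputs to a single clip-level score.

    `frame_scores` is assumed to hold one model output per frame of the
    clip (hypothetical input; how the scores are obtained is not shown).
    """
    scores = np.asarray(frame_scores, dtype=float)
    if scores.size == 0:
        raise ValueError("no frame scores to reduce")
    # The model is only expected to fire right after the wake word,
    # so the peak is the meaningful value for the whole clip.
    return float(scores.max())
```

For example, `clip_score([0.02, 0.23, 0.05])` picks out the single peak value, regardless of where in the clip it occurs.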
@MatthewScholefield, yes, that's what the plot is showing: the model score for every frame of the two input audio samples. For the first clip, the maximum score is ~0.23, and for the second clip (where the only difference is a single extra frame of zero-padding), the maximum score is only ~0.06. It might be clearer if I make the two plots separate.
This is the model's score for all of the frames of the first input clip:
And this is the model's score for the second input clip:
So even if I use the maximum of all the outputs, I get a very different value for an otherwise identical audio clip.
Oh, I see, thanks for clarifying. This is definitely not intended. Just for some clarity on how it works, it feeds audio features (MFCCs) for the last buffer_t seconds to independently produce one output. You can see the value of
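The windowing described in this comment, where each output is computed independently from the most recent span of features, might look roughly like the sketch below. It is an assumption-laden illustration rather than Precise's implementation: `sliding_window_scores`, `window`, and `score_fn` are hypothetical names, `features` stands in for a 2-D array of MFCC frames, and `window` stands in for the number of feature frames covered by buffer_t seconds.

```python
import numpy as np


def sliding_window_scores(features, window, score_fn):
    """Score every full window of `window` consecutive feature frames.

    features: array of shape (n_frames, n_coeffs), e.g. MFCCs (assumed).
    window:   number of feature frames per model input (assumed to
              correspond to buffer_t seconds).
    score_fn: hypothetical callable mapping one window to one score.
    """
    features = np.asarray(features)
    scores = []
    # Each output depends only on the frames inside its own window,
    # so outputs are produced independently of one another.
    for end in range(window, len(features) + 1):
        scores.append(score_fn(features[end - window:end]))
    return np.array(scores)
```

With this layout, a clip of `n_frames` feature frames yields `n_frames - window + 1` outputs, one per window position.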
Looks like for this model
That's a great point about zero-padding potentially causing an issue with the MFCC features. Here are some plots where I just duplicate the initial mic background noise for ~1 second as padding instead of zeros (so now there is ~2.5 seconds of background noise before the wake word):
Clip 1
Clip 2
Again, the only difference between clip 1 and clip 2 is 1024 more samples of background noise padding in clip 2; the actual wake word utterance is identical. There still seems to be a significant difference between the two, in both the maximum score and the overall trend of the frame scores over time.
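The two padding strategies discussed in this thread (leading zeros versus duplicated background noise) could be constructed along these lines. This is a sketch under stated assumptions, not the script used to produce the plots: the helper names are hypothetical, the audio is assumed to be a 1-D NumPy array of samples, and `noise_len` is assumed to mark how many leading samples contain only background noise.

```python
import numpy as np


def pad_with_zeros(audio, n_samples):
    """Prepend `n_samples` of digital silence to the clip."""
    return np.concatenate([np.zeros(n_samples, dtype=audio.dtype), audio])


def pad_with_noise(audio, noise_len, n_samples):
    """Prepend `n_samples` of padding built by repeating the clip's own
    leading background noise (its first `noise_len` samples)."""
    if noise_len <= 0:
        raise ValueError("noise_len must be positive")
    noise = audio[:noise_len]
    reps = -(-n_samples // len(noise))  # ceiling division
    padding = np.tile(noise, reps)[:n_samples]
    return np.concatenate([padding, audio])
```

Calling either helper twice, once with `n_samples` and once with `n_samples + 1024`, gives a pair of clips that differ only by one extra 1024-sample frame of padding.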
Describe the bug
When using the Python bindings for Precise, I've noticed that the model predictions can vary substantially depending on where in the input audio the wake word is located. For example, the plot below shows the default "hey mycroft" model score for two repetitions of the same audio clip, where the only difference is that the second clip has one additional frame (1024 samples) of zero-padding compared to the first clip:
I'm currently evaluating Precise against other wakeword solutions, and this behavior makes it difficult to assess performance accurately, since the length and padding of the test clips can cause significant differences in false-positive and false-negative metrics.
Is this behavior expected? If so, is there a recommended way to evaluate the model to minimize such effects?
To Reproduce
The following code should reproduce the plot above, using the attached audio file below and the model versions referenced in the code:
test_clip.zip
Expected behavior
Precise should have very similar scores for otherwise identical audio that just occurs at a different position in the audio stream.
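One way to put a number on this expectation is to score the same clip at several padding offsets and look at the spread of the maximum scores. The sketch below is illustrative only: `position_sensitivity` and `score_frames` are hypothetical names, and `score_frames` is assumed to be any callable that returns the per-frame model outputs for a 1-D array of samples.

```python
import numpy as np


def position_sensitivity(audio, score_frames, offsets):
    """Return (spread, maxima) of the clip-level score across offsets.

    audio:        1-D array of samples containing the wake word.
    score_frames: hypothetical callable, samples -> per-frame scores.
    offsets:      numbers of zero samples to prepend, e.g. [0, 1024].
    """
    maxima = []
    for offset in offsets:
        padded = np.concatenate([np.zeros(offset, dtype=audio.dtype), audio])
        maxima.append(float(np.max(score_frames(padded))))
    # A position-insensitive model would give a spread close to zero.
    return max(maxima) - min(maxima), maxima
```

For the two clips described in this issue, the reported maxima of roughly 0.23 and 0.06 would correspond to a spread of about 0.17.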