Week 6
This week's topic was decided by a poll, so we're going to talk about sound and language.
These topics were voted down, but each had at least one person who was "superduper interested", so I'll mention them briefly.
OCR works well for printed material. If you want high accuracy, you need to train on the font you're detecting. Handwriting recognition is a whole different topic. Get started with OCR using ofxTesseract.
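To give a sense of how little code this takes, here's a hypothetical minimal sketch. The `setup()`/`findText()` names reflect my memory of ofxTesseract's interface and may differ between versions, so check the addon's README.

```cpp
// Hypothetical minimal ofxTesseract usage; method names may differ by version.
#include "ofMain.h"
#include "ofxTesseract.h"

class ofApp : public ofBaseApp {
public:
    ofxTesseract tess;
    string found;
    void setup() {
        tess.setup(); // expects the tessdata language files in bin/data
        ofImage img;
        img.load("printed-text.png"); // works best on clean printed material
        found = tess.findText(img);
    }
    void draw() {
        ofDrawBitmapString(found, 20, 20);
    }
};
```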
One of the more popular gesture recognition algorithms is the $1 Unistroke Recognizer, or just the "Dollar Recognizer". roxlu has an addon implementing this technique. Around 2006/2007, after the release of the Nintendo Wii and the first iPhone, there was a huge renaissance of people playing with gesture recognition, and it was eventually incorporated into most of our devices.
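The core of the $1 recognizer is simple enough to sketch out. This is a condensed version of the published pipeline (resample, rotate by the "indicative angle", scale, translate, compare) using openFrameworks' glm types; the real algorithm additionally runs a golden-section search over candidate rotations, which this skips.

```cpp
// Condensed $1 recognizer core: normalize a stroke, then score it against
// stored templates with the average point-to-point distance.
#include "ofMain.h"
typedef vector<glm::vec2> Stroke;

Stroke resample(Stroke pts, int n = 64) {
    float pathLen = 0;
    for (size_t i = 1; i < pts.size(); i++) pathLen += glm::distance(pts[i-1], pts[i]);
    float interval = pathLen / (n - 1), accum = 0;
    Stroke out { pts[0] };
    for (size_t i = 1; i < pts.size(); i++) {
        float d = glm::distance(pts[i-1], pts[i]);
        if (accum + d >= interval && d > 0) {
            glm::vec2 q = pts[i-1] + ((interval - accum) / d) * (pts[i] - pts[i-1]);
            out.push_back(q);
            pts.insert(pts.begin() + i, q); // keep walking from the inserted point
            accum = 0;
        } else {
            accum += d;
        }
    }
    while ((int) out.size() < n) out.push_back(pts.back()); // rounding guard
    return out;
}

Stroke normalize(Stroke pts) {
    // rotate so the "indicative angle" (centroid to first point) is zero
    glm::vec2 c(0, 0);
    for (auto& p : pts) c += p;
    c /= (float) pts.size();
    float a = -atan2(pts[0].y - c.y, pts[0].x - c.x);
    for (auto& p : pts) {
        glm::vec2 d = p - c;
        p = c + glm::vec2(cos(a) * d.x - sin(a) * d.y, sin(a) * d.x + cos(a) * d.y);
    }
    // scale the bounding box to a unit square, then center on the origin
    glm::vec2 lo = pts[0], hi = pts[0];
    for (auto& p : pts) { lo = glm::min(lo, p); hi = glm::max(hi, p); }
    for (auto& p : pts) p = (p - lo) / (hi - lo) - glm::vec2(0.5, 0.5);
    return pts;
}

// lower = closer match; score a candidate against each stored template
float pathDistance(const Stroke& a, const Stroke& b) {
    float sum = 0;
    for (size_t i = 0; i < a.size(); i++) sum += glm::distance(a[i], b[i]);
    return sum / a.size();
}
```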
OpenCV has a huge slew of tools for shape matching. Most of the techniques are based on calculating "descriptors", or invariant features (normally rotation- and scale-invariant), and then matching those features. See cv::matchShapes() and cv::moments(). Another algorithm, called Shape Contexts, creates a point-by-point mapping between two point distributions using local "signatures".
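Here's a short self-contained example of those two OpenCV calls. cv::matchShapes() compares Hu moments and returns a dissimilarity score (0 means identical), invariant to translation, scale, and rotation; the image filenames are placeholders.

```cpp
// Compare two binary shapes with cv::matchShapes(), and pull a centroid
// out of the raw moments with cv::moments().
#include <opencv2/opencv.hpp>
#include <iostream>

int main() {
    cv::Mat a = cv::imread("shapeA.png", cv::IMREAD_GRAYSCALE);
    cv::Mat b = cv::imread("shapeB.png", cv::IMREAD_GRAYSCALE);
    cv::threshold(a, a, 128, 255, cv::THRESH_BINARY);
    cv::threshold(b, b, 128, 255, cv::THRESH_BINARY);

    std::vector<std::vector<cv::Point>> contoursA, contoursB;
    cv::findContours(a, contoursA, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    cv::findContours(b, contoursB, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    double score = cv::matchShapes(contoursA[0], contoursB[0], cv::CONTOURS_MATCH_I1, 0);
    std::cout << "dissimilarity: " << score << std::endl;

    // the raw moments behind the matcher: e.g. the shape's centroid
    cv::Moments m = cv::moments(contoursA[0]);
    std::cout << "centroid: " << m.m10 / m.m00 << ", " << m.m01 / m.m00 << std::endl;
    return 0;
}
```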
I wanted to talk about some things related to hiding data, automating data retrieval, and minimizing the fallout from pushing boundaries.
There are some great sites out there that let you practice classic intrusion techniques in a legal way. I would really like to see these ideas used for non-malicious artistic use, but haven't seen any good examples yet. A short note about Bender from the Future.
There is a huge genre of work that is just about transcoding, though the motivations of the artists are usually placed elsewhere. See Campbell's Formula for Computer Art (2001). Steganography is the act of hiding data while you're doing this kind of transcoding. Bar codes and QR codes are related in that they are a sort of hidden message simply because they are not human-readable. (Pictures of People Scanning QR Codes, WTFQRCODES)
Multitouch has a long history, but it rose to prominence with Jeff Han's demos in 2006, with the demo video that inspired a million demo videos (especially around nuigroup). Now it's built into our trackpads, and on OSX you can access the data with sub-millimeter precision at 125Hz starting with MTDeviceCreateDefault. If you want to work with this in openFrameworks, talk to me.
Here we'll consider analysis of a sound that is primarily sinusoidal (like a whistle, or a soft flute).
Sound is a buffer of samples, just like pixels in an image.
- Instead of RGB, we have L and R channels (or more, or fewer).
- Instead of 8 bit values from 0 to 255, we have floats from -1 to +1 (though they're often 16-bit integers on the sound card).
- Instead of a 2D space we have a 1D space (though even images are stored in a 1D array).
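In openFrameworks that buffer arrives through the audioIn() callback, exactly as the analogy above describes: one long 1D array of floats from -1 to +1. This uses the legacy callback style; newer versions pass an ofSoundBuffer instead.

```cpp
// Receiving interleaved float samples in openFrameworks (legacy callback style).
void ofApp::setup() {
    ofSoundStreamSetup(0, 2, 44100, 256, 4); // 0 outputs, 2 inputs, 44.1kHz, 256-sample buffers
}

void ofApp::audioIn(float* input, int bufferSize, int nChannels) {
    for (int i = 0; i < bufferSize; i++) {
        float left  = input[i * nChannels + 0]; // interleaved: L R L R ...
        float right = input[i * nChannels + 1];
    }
}
```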
A simple technique for detecting pitch is based on counting zero crossings of the signal. You can do a lot with this; here's Zach playing Mario. You can take it a step further by looking at the distance between the zero crossings.
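A minimal sketch of the counting version: a sine wave crosses zero twice per cycle, so for a clean, whistle-like signal the frequency is roughly (crossings / 2) / duration.

```cpp
// Zero-crossing pitch estimate: count sign changes over the buffer.
float zeroCrossingPitch(const float* samples, int n, float sampleRate) {
    int crossings = 0;
    for (int i = 1; i < n; i++) {
        if ((samples[i - 1] < 0) != (samples[i] < 0)) crossings++;
    }
    float duration = n / sampleRate;
    return (crossings / 2.0f) / duration; // Hz
}
```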
Phase can be determined by looking at the offset in time of the zero crossings while doing pitch detection. Phase offsets are one of the first clues for detecting the movement in space of a source emitting a constant signal. This is related to (but not quite the same as) the way gunfire locator systems, like Ears, work.
Amplitude (or volume) can be determined by looking at the distance between peaks in the signal. Another common metric for amplitude is called root mean square (RMS) amplitude, which is calculated by squaring all the samples in a signal, averaging these, and taking the square root. Neither of these techniques corresponds exactly to perceived loudness, which is a little more complicated due to psychoacoustics.
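RMS is a one-liner in spirit: square every sample, take the mean, take the square root.

```cpp
// Root mean square amplitude of a buffer.
#include <cmath>

float rms(const float* samples, int n) {
    float sum = 0;
    for (int i = 0; i < n; i++) sum += samples[i] * samples[i];
    return std::sqrt(sum / n);
}
```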
Fourier analysis is about taking a chunk of time from a signal and looking at it in frequency space. This is a really important idea: that signals can be converted to and from frequency space.
- Metastasis and Mycenae Alpha by Xenakis
- Equation by Aphex Twin
- MetaSynth is one of the most popular tools for working with sound this way.
- SonART
- AudioPaint
- Ohm Pie using Sonic Visualizer
- The Voice seeing system
- Speaking Piano by Peter Ablinger
An important thing to note about all these examples: we hear sounds as evenly spaced in logarithmic frequency space, but Fourier analysis happens in linear frequency space. This means you get less apparent frequency resolution at the lower end, and more at the upper end. There are other techniques that attempt to reconcile this, but you're always making a tradeoff between time resolution and frequency resolution.
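The mismatch is easy to see in numbers: FFT bins are evenly spaced in Hz, while pitch perception is even in log frequency (e.g., MIDI note numbers).

```cpp
// Linear bin spacing vs. logarithmic pitch spacing.
#include <cmath>

float binToFrequency(int bin, float sampleRate, int fftSize) {
    return bin * sampleRate / fftSize; // linear: every bin covers the same Hz range
}

float frequencyToMidi(float hz) {
    return 69 + 12 * std::log2(hz / 440.0f); // logarithmic: equal steps per semitone
}
// at 44100Hz with a 1024-point FFT, bins are ~43Hz apart: almost an octave
// between bins near 55Hz, but a tiny fraction of a semitone near 10kHz
```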
Looking for the pitch of a sound corresponds (usually) to finding the dominant peak in the frequency domain. Detecting chords is a matter of accounting for multiple peaks (but also ignoring overtones). We can also take a weighted average of all the frequencies to get the spectral centroid.
Instead of the centroid, you can compute the spectral deviation. This will give you an idea of how "pitch-like" a sound is versus how "noisy" it is.
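Both statistics fall out of the same loop over the FFT magnitudes: the centroid is a magnitude-weighted mean frequency, and the deviation is the spread around it. A sketch, assuming you already have magnitude bins from an FFT:

```cpp
// Spectral centroid and deviation from FFT magnitudes.
// A whistle gives a low deviation; white noise gives a high one.
#include <cmath>
#include <vector>

void spectralStats(const std::vector<float>& mag, float binBandwidth,
                   float& centroid, float& deviation) {
    float total = 0, weighted = 0;
    for (size_t i = 0; i < mag.size(); i++) {
        total += mag[i];
        weighted += i * binBandwidth * mag[i];
    }
    if (total == 0) { centroid = deviation = 0; return; } // silent buffer
    centroid = weighted / total;
    float variance = 0;
    for (size_t i = 0; i < mag.size(); i++) {
        float d = i * binBandwidth - centroid;
        variance += d * d * mag[i];
    }
    deviation = std::sqrt(variance / total);
}
```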
If you can detect the percussive sounds in a signal, and then measure how much time passes between them, you can create a basic beat detector. This works in some exceptionally clean cases.
Autocorrelation is a little more robust. It's based on multiplying a time-shifted version of a signal against itself, and summing the result. If the value is large, then the signal is "in time" with itself at that offset. If the value is small, the signal is "out of time" with itself. You can do this at multiple time offsets and look for the largest value to find the peak. Sometimes it makes more sense to try autocorrelation on the amplitude rather than the signal directly.
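A sketch of that multiply-and-sum loop, once per lag; for beat tracking you would usually feed in an amplitude envelope rather than raw samples:

```cpp
// Autocorrelation: multiply the signal by a time-shifted copy of itself and sum.
#include <vector>

std::vector<float> autocorrelate(const std::vector<float>& x, int maxLag) {
    std::vector<float> r(maxLag, 0.0f);
    for (int lag = 0; lag < maxLag; lag++) {
        for (size_t i = 0; i + lag < x.size(); i++) {
            r[lag] += x[i] * x[i + lag];
        }
    }
    return r; // the largest value at lag > 0 suggests the period
}
```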
See ofxBeatTracking and BeatDetectionAlgorithms.pdf for more advanced techniques.
Vowels can be detected continuously based on tracking formant peaks. Formants are the resonances that make vowels identifiable. Most vowels can be distinguished by tracking two formants (f1 and f2). Because the formants are based on resonance, the formant frequency is the same regardless of the pitch of the voice (f0).
Speech to text/speech recognition is very difficult. The details are way beyond the kind of explanation I give above, because it has a lot to do with prediction and building models about what sounds and words follow other sounds and words. If you've ever listened to someone in a noisy room for 5 seconds, not understanding what they're saying, and then the last word of the sentence brings everything together, you understand why speech to text is a difficult and interconnected problem.
Some of the best services are currently maintained privately and only accessible via web services. The Google Speech API is a good example, and it's been wrapped for openFrameworks by ofxGSTT (Linux only). OSX has some built in speech recognition at the OS level, and that's been wrapped into ofxSpeech. It needs a little love to get it running again. The idea with OSX speech recognition is that it's much easier to detect what someone is saying when there are only a few options, so you initialize it with a database of words.
There is some work getting Sphinx from CMU to work with openFrameworks, including the addon ofxAsr.
Siri is pretty good at recognition, and the protocol has been opened and appropriated. Speaking of Siri and the voice, it's worth mentioning that UK Siri uses a male voice while the US Siri uses a female one.
Speech recognition has recently been a heavily military-funded technology.
What is going on with ofxOnlineTimeWarp?
We're going to dig through the examples inside ofxFft and figure out what's going on.
Messa di Voce is an early 2000s piece by Zach Lieberman and Golan Levin (tmema). I got a copy of the source from Golan, and we're going to dig into the sound processing and look around a little bit.
In Messa, Golan put together a hacky formant detector based on an audio stream at a 22050Hz sample rate, with a 256 sample buffer size. This was followed by an FFT, which yields 128 bins at 86Hz bandwidth each, up to 11kHz. Bins 4 through 124 are considered for formant detection (344Hz - 10kHz). The FFT data is averaged over time by lerping with an amount of .5 from frame to frame. The averaged data is smoothed with three passes of a 5-element gaussian kernel, and peaks are detected from the zero crossings of the derivative (where the slope flips from rising to falling). Some very open vowels like "aw" and "ah" have formants that are within 100 Hz of each other, which is below the ability of this filter to distinguish. A final step involved testing for the "stability" of the peaks over time, and rejecting unstable peaks.
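Since the pipeline is described step by step, here's a rough reconstruction of what those steps could look like. This is my sketch, not the Messa di Voce source; the gaussian kernel weights in particular are an assumption.

```cpp
// Reconstruction (not the original source) of the formant detection steps:
// lerp-average FFT frames, gaussian-smooth, then pick peaks in bins 4-124.
#include <vector>
using std::vector;

vector<float> averaged(128, 0.0f);

vector<int> findFormantPeaks(const vector<float>& fftBins) {
    // average over time: lerp toward the new frame by 0.5
    for (int i = 0; i < 128; i++) {
        averaged[i] = 0.5f * averaged[i] + 0.5f * fftBins[i];
    }
    // three passes of a 5-element gaussian kernel (weights are approximate)
    float kernel[5] = { 0.06f, 0.24f, 0.40f, 0.24f, 0.06f };
    vector<float> smoothed = averaged;
    for (int pass = 0; pass < 3; pass++) {
        vector<float> next = smoothed;
        for (int i = 2; i < 126; i++) {
            float sum = 0;
            for (int k = -2; k <= 2; k++) sum += kernel[k + 2] * smoothed[i + k];
            next[i] = sum;
        }
        smoothed = next;
    }
    // peaks: the slope flips from rising to falling, within the 344Hz-10kHz band
    vector<int> peaks;
    for (int i = 5; i < 124; i++) {
        if (smoothed[i] > smoothed[i - 1] && smoothed[i] >= smoothed[i + 1]) {
            peaks.push_back(i); // frequency ≈ bin * 86Hz
        }
    }
    return peaks;
}
```

A full version would also keep a history of peaks across frames and reject the unstable ones, as the final step describes.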
This week there is no assignment specific to what we've discussed in class. The goal between now and the last class is to finish a single project that you're excited to share with the entire class. This could be something new, or it could be an improvement of a previous assignment for this class.
To clarify: you do not necessarily need to make something completely new for your final project, you could simply take a previous assignment further. You should be ready to talk about your project for 5-10 minutes.
If you would like to develop something with the ideas I've discussed in class, but are having trouble getting started, let's talk and work out some examples.