Raw audio generation experiments.
Basic idea: compute mel-spectrogram of audio, pre-process, generate new mel-spectrograms step-by-step, apply inverse mel-transform, apply Griffin-Lim algorithm to estimate phase of audio.
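The repo's own Griffin-Lim code isn't reproduced here, but the phase-estimation step can be sketched with plain numpy. This is the standard Griffin-Lim iteration (alternate between the target magnitude and a consistent time-domain signal); the function names `stft`, `istft`, and `griffin_lim` are mine, and the window/hop choices are illustrative.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    # frame the signal with a Hann window and take the real FFT per frame
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(S, n_fft=512, hop=128):
    # windowed overlap-add with squared-window normalization
    win = np.hanning(n_fft)
    n = (S.shape[0] - 1) * hop + n_fft
    x, norm = np.zeros(n), np.zeros(n)
    for i, spec in enumerate(S):
        x[i * hop:i * hop + n_fft] += np.fft.irfft(spec, n_fft) * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=32, n_fft=512, hop=128):
    # start from random phase, then repeatedly project between
    # the magnitude constraint and the set of consistent spectrograms
    angles = np.exp(2j * np.pi * np.random.rand(*mag.shape))
    for _ in range(n_iter):
        x = istft(mag * angles, n_fft, hop)
        angles = np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(mag * angles, n_fft, hop)
```

With 50% (or here 75%) overlap and a Hann window, the overlap-add normalization keeps the reconstruction length equal to the input length, so the iteration's shapes stay consistent.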
General to-do:
- Re-write GL reconstruction algorithm
- Apply Le Roux's reconstruction algorithm
- Save spectrograms after original pre-processing (for faster loading)
Idea: extract features, feed to regression algorithm, generate new audio frame.
Idea: feed raw audio frame(s) to neural network.
Approaches:
- RNN-based generation (via layered LSTMs)
- CNN-based generation
- Convolutional RNNs
- GAN-based generation (combined with CNNs?)
Approaches tried in 2017; the code may contain bugs.
Basic idea: compute the STFT of some audio, extract features, generate new STFT step-by-step (extracting new features during each step). Currently replacing phase data with random noise and only predicting amplitudes, which leads to everything sounding pretty bad. The feature extraction has not been optimized (at all), so it takes up a lot of the computation time.
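The step-by-step loop above can be sketched as follows. This is a minimal illustration, not the repo's code: `extract_features` is a toy stand-in for the (slow) real feature extraction, `predict` is whatever regressor produces the next amplitude frame, and the final line shows the phase being replaced with random noise as described.

```python
import numpy as np

def extract_features(context):
    # toy feature extractor: mean over the context window plus the last frame
    # (the real extraction is far more elaborate, and slow)
    return np.concatenate([context.mean(axis=0), context[-1]])

def generate_spectrogram(seed, predict, n_frames=50, context_len=4):
    # seed: (T0, n_bins) magnitude frames; predict: features -> next frame
    frames = [f for f in seed]
    for _ in range(n_frames):
        feats = extract_features(np.array(frames[-context_len:]))
        frames.append(np.maximum(predict(feats), 0.0))  # magnitudes stay non-negative
    mag = np.array(frames)
    # phase data is replaced with random noise; only amplitudes are predicted
    phase = np.exp(2j * np.pi * np.random.rand(*mag.shape))
    return mag * phase
```

The complex spectrogram this returns would then go through an inverse STFT (or Griffin-Lim) to get audio back.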
Audio generation using logistic regression. There are a lot of bugs, but since this method does not produce very good results in general, I've abandoned its development.
The good:
- fast training and generation
The bad:
- can "blow up" (amplitude goes to infinity); attempts to set a maximum volume lead to unpleasant clicks and frequent "resetting" of the values
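The blow-up and the clicky volume cap can be seen in a stripped-down version of this approach. As a sketch I use an ordinary least-squares linear predictor rather than the repo's logistic regression, and raw samples rather than the real features, but the generation loop and the hard amplitude cap are the same idea.

```python
import numpy as np

def fit_ar(x, order=16):
    # least-squares linear predictor: x[t] ~ w . x[t-order:t]
    X = np.array([x[i:i + order] for i in range(len(x) - order)])
    w, *_ = np.linalg.lstsq(X, x[order:], rcond=None)
    return w

def generate(w, seed, n_samples, max_amp=1.0):
    out = list(seed)
    for _ in range(n_samples):
        pred = float(np.dot(out[-len(w):], w))
        # hard volume cap: prevents the amplitude from running off to
        # infinity, but the abrupt flattening is what produces the clicks
        out.append(np.clip(pred, -max_amp, max_amp))
    return np.array(out)
```

Feeding the model's own output back in means any gain above 1 in the learned recurrence compounds every step, which is where the blow-up comes from.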
Audio generation using decision trees (though any regressor can be substituted).
Trees take a long time to train. A single multi-output tree gives a steady droning output, which is less interesting than the results from training separate trees per output (but those take even longer to train). This version can pre-load features from a file, or write extracted features to a file.
The good:
- More interesting results
The bad:
- Excruciatingly slow
- Still end up giving drones
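The feature pre-loading mentioned above amounts to a small caching wrapper around the slow extraction step; a minimal sketch (the function name `load_or_extract` is mine, not the repo's):

```python
import os
import numpy as np

def load_or_extract(cache_path, audio, extractor):
    # pre-load features from a file if one exists; otherwise run the
    # (slow) extractor and write its output so the next run can skip it
    if os.path.exists(cache_path):
        return np.load(cache_path)
    feats = extractor(audio)
    np.save(cache_path, feats)
    return feats
```

Note that `np.save` appends a `.npy` extension when the path lacks one, so the cache path should already end in `.npy` for the existence check to match.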