Neuromatch Academy Deep Learning 2021 project by Maryam Faramarzi, Siobhan Hall, Máté Mohácsi, Pablo Oyarzo, Jonathan Reus and Katherine Baquero
Special credit to our TAs: Pedro F da Costa and Beatrix Benkő
Over the last decade, CNNs have become increasingly powerful models for computer vision. Their development has gone hand in hand with the exploration of their internal informational structure (1) and feature-inference processes. One such approach, known as "dreaming" and initially developed at Google (2), has proven to be an effective method for maximizing learned features and pushing convolutional neural networks into the territory of stimulus generation. In this work, we ask whether audible waveforms can be reconstructed from learned features by applying dreaming algorithms to audio. Dreaming can serve as an introspective technique for understanding internal network representations, and potentially as a generative approach for producing novel audio. CNNs have achieved state-of-the-art performance in classifying music genres from (Mel-scale) spectrograms (3), so we chose music genre classification as our starting point for exploring internal representations. While most of the literature on CNN-based music genre classifiers uses Mel-scale spectrograms, this representation can be limiting when deep dreaming on spectrograms, where the end goal is to reconstruct an audible time-domain waveform. For this reason we investigated two audio transforms: Short-Time Fourier Transforms (STFTs) and Mel-spectrograms, collectively referred to as spectrograms throughout.
- To determine whether audible waveforms can be reconstructed, via the "dreaming" process, from features learned by a convolutional neural network.
- To investigate the best approach to training a classifier, and which audio transform can be converted back to music once "dreamed" upon.
- 1000 audio tracks (30 seconds each; 22050 Hz, mono, 16-bit .wav files)
- 10 genres, with 100 examples each (blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, rock).
- The STFTs were computed in two versions: one containing both the real and imaginary components, and one containing only the real component. Both versions were used separately.
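As an illustration of the transforms described above, here is a minimal sketch of loading a GTZAN track with librosa and computing the STFT variants and a Mel-spectrogram. The exact parameters (n_fft, hop_length, n_mels) and the file name are illustrative assumptions, not necessarily the values used in the project.

```python
import numpy as np
import librosa

# Load one GTZAN track (22050 Hz, mono), e.g. "blues.00000.wav"
y, sr = librosa.load("blues.00000.wav", sr=22050, mono=True)

# Complex STFT; window parameters are illustrative assumptions
D = librosa.stft(y, n_fft=2048, hop_length=512)

# Version 1: real and imaginary parts stacked as two channels
stft_real_imag = np.stack([D.real, D.imag], axis=0)

# Version 2: real component only
stft_real = D.real

# Mel-spectrogram (power), converted to dB for use as classifier input
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                    hop_length=512, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)

# A full complex STFT can be inverted exactly back to a waveform
y_rec = librosa.istft(D, hop_length=512)
```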
Reference papers
Code:
The dreaming procedure supports two update rules (a minimal sketch of a single dream step follows below):
- Additive
- Subtractive

and two optimization targets:
- Maximize the activation of a single genre
- Maximize the activation difference between a single genre and another (or all other) genre(s)

For either update rule, one can choose whether or not to normalize the gradient; normalization usually speeds up the dreaming, at the cost of sometimes producing less nuanced dream features.
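Below is a minimal PyTorch sketch of what one such dream step could look like. The classifier `model`, the step size, and the interpretation of the subtractive update as moving against the gradient are assumptions for illustration, not the project's exact implementation.

```python
import torch

def dream_step(model, spec, target_genre, other_genres=None,
               lr=0.01, normalize_grad=True, subtractive=False):
    """One dreaming step on a spectrogram 'image' of shape (1, C, H, W)."""
    spec = spec.clone().detach().requires_grad_(True)

    logits = model(spec)                    # genre scores, shape (1, 10)
    objective = logits[0, target_genre]     # activation of a single genre
    if other_genres is not None:
        # Maximize the difference to another (or all other) genre(s),
        # where other_genres is a list of genre indices
        objective = objective - logits[0, other_genres].mean()

    objective.backward()
    grad = spec.grad

    if normalize_grad:
        # Normalizing the gradient usually speeds up the dreaming
        grad = grad / (grad.abs().mean() + 1e-8)

    # Additive update: gradient ascent on the objective;
    # subtractive update (as interpreted here): move against the gradient
    step = -lr * grad if subtractive else lr * grad
    return (spec + step).detach()
```

In practice this step would be repeated many times, starting either from the STFT of a GTZAN track or from a synthetic frequency sweep, as in the examples below.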
- Input STFT: disco.0 from GTZAN, optimize difference for classical using the subtractive update, reconstruct Green channel
- Input STFT: reggae.0 from GTZAN, optimize for metal using the subtractive update, reconstruct Blue channel
- Input STFT: sinusoidal frequency sweep (a sketch of this input follows the list), optimize for hiphop using the additive update, reconstruct Green channel
- Input STFT: sinusoidal frequency sweep, optimize for jazz using the additive update, reconstruct Green channel
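For completeness, here is one way the sinusoidal frequency sweep input could be generated with librosa; the frequency range and STFT parameters are illustrative assumptions.

```python
import librosa

# A 30-second linear sinusoidal frequency sweep used as a "neutral" dream input
sr = 22050
sweep = librosa.chirp(fmin=20, fmax=sr // 2, sr=sr, duration=30.0, linear=True)

# Converted to the same STFT representation as the training data
D_sweep = librosa.stft(sweep, n_fft=2048, hop_length=512)
```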
By applying dreaming to the spectrograms, we gain insight into the learned representations the model uses to perform the classification. We can represent these features visually, as well as interpret them as audio.
We were limited by our audio reconstruction techniques, which immediately revealed a catch-22.
We achieved high classification accuracies when training the models on Mel-spectrograms, but these cannot be cleanly converted back to audio. Conversely, STFTs can be converted back to audio without adding reconstruction noise, but training on STFTs gave lower classification accuracies. This poor accuracy suggests the model did not learn useful internal representations, i.e. features that can be maximised during dreaming to transform an input while still obeying the structure of a valid spectrogram, so that the result can be reconstructed into audio.
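To make the reconstruction constraint concrete, here is a hedged sketch of converting a dreamed spectrogram back to audio. Treating the dreamed image channel as a real-only STFT (consistent with the real-component representation described above) allows a direct inverse STFT, whereas a dreamed Mel-spectrogram additionally needs an approximate inversion of the Mel filterbank and phase estimation, both of which add noise. The file names, scaling, and parameters are illustrative assumptions.

```python
import numpy as np
import librosa
import soundfile as sf

# Dreamed real-only STFT, e.g. one colour channel of the dreamed image
# rescaled back to STFT values (shape: freq_bins x frames)
dreamed_real = np.load("dreamed_stft_channel.npy")

# Treat the dreamed values as a complex STFT with zero imaginary part and invert
y_dream = librosa.istft(dreamed_real.astype(np.complex64), hop_length=512)
sf.write("dreamed.wav", y_dream, 22050)

# A dreamed Mel-spectrogram cannot be inverted exactly: the Mel filterbank must
# be (pseudo-)inverted and the phase estimated, adding reconstruction noise
# dreamed_mel = np.load("dreamed_mel.npy")  # power Mel-spectrogram
# y_mel = librosa.feature.inverse.mel_to_audio(dreamed_mel, sr=22050, hop_length=512)
```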
Our focus on visual (image-like) representations of the audio also constrained the results, since everything had to pass through an audio reconstruction step. Future work could use the raw audio signal directly, with models better suited to signal data (e.g. RNNs, Transformers, or autoregressive models such as WaveNet). The limitation was also evident in our use of networks pre-trained only on ImageNet; future work could pre-train on more applicable data, such as STFT spectrograms, to help the model learn internal representations better suited to being maximised during dreaming.
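As a reference for the classifier setup, here is a hedged sketch of adapting an ImageNet-pretrained CNN to 10-way genre classification on spectrogram images. The VGG16 backbone (4) and the fine-tuning details are assumptions for illustration; the project's exact configuration may differ.

```python
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained VGG16 (4); the choice of backbone is an assumption
model = models.vgg16(pretrained=True)

# Replace the final 1000-way ImageNet head with a 10-way genre head
model.classifier[6] = nn.Linear(in_features=4096, out_features=10)

# Spectrogram "images" (3 x 224 x 224 tensors assembled from STFT/Mel
# channels) can then be fine-tuned with standard cross-entropy
criterion = nn.CrossEntropyLoss()
```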
- Olah C, Mordvintsev A, Schubert L. Feature visualization. Distill. 2017 Nov 7;2(11).
- Google AI Blog: Inceptionism: Going Deeper into Neural Networks [Internet]. [cited 2021 Aug 16]. Available from: https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html
- Palanisamy K, Singhania D, Yao A. Rethinking CNN Models for Audio Classification. arXiv. 2020 Jul 22;
- Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. 2014;