Deep learning has revitalized computer vision and improved the accuracy of machine interpretation of images in recent years. In particular, convolutional neural networks (CNNs) are so far the go-to machine learning algorithm, with great performance. In this project, I build CNN models to classify facial expressions into six emotion categories: happy (😄), sad (😢), fear (😨), surprise (😲), neutral (😐) and angry (😠).
```python
label_dict = {
    1: ['Angry', '😠'],
    2: ['Fear', '😨'],
    3: ['Happy', '😄'],
    4: ['Sad', '😢'],
    5: ['Surprise', '😲'],
    6: ['Neutral', '😐'],
}
```
The dataset is taken from a Kaggle challenge. I train three different CNN models: a 4-layer model, a 5-layer model with dropout to prevent overfitting, and a 34-layer residual network.
Facial expression recognition can bring humans and machines closer together. It has gained significant interest in social and behavioral science as well as in clinical and security fields.
The dataset contains 35,887 images with a resolution of 48x48 pixels. There is a train.csv file that consists of three columns: 'emotion', 'pixels' and 'Usage'. The 'emotion' column holds the label, an integer from 0 to 6 denoting the emotion category. In the 'pixels' column, each 48x48 matrix is flattened into a single string of space-separated values. The 'Usage' column is just a split label from the original Kaggle competition and is not important here. Here is the distribution of the class labels.
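To make the format concrete, here is a minimal sketch of loading the CSV and reshaping the space-separated pixel strings into 48x48 arrays. The file and column names follow the description above; the variable names are my own.

```python
import numpy as np
import pandas as pd

# Load the Kaggle CSV described above: columns 'emotion', 'pixels', 'Usage'.
df = pd.read_csv('train.csv')

# Each 'pixels' entry is 48*48 = 2304 space-separated grayscale values;
# parse each string and stack the results into an (N, 48, 48) array.
images = np.stack(
    df['pixels'].map(lambda s: np.array(s.split(), dtype=np.uint8))
).reshape(-1, 48, 48)
labels = df['emotion'].to_numpy()

print(images.shape, labels.shape)  # (35887, 48, 48) (35887,)
```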
The original dataset contains seven classes, including 'disgust'. However, looking at the distribution across categories, 'disgust' accounts for only about 1.5% of the total dataset, as shown in Figure 2. Therefore, I drop this category. The modified dataset shown here is the dataset after this preprocessing step.
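A minimal sketch of this preprocessing step, continuing from the loading snippet above. It assumes 'disgust' is encoded as emotion label 1, as in the original FER2013 release, and the remapping to a contiguous 0..5 range is my own convention.

```python
DISGUST_LABEL = 1  # assumption: FER2013 encodes 'disgust' as label 1

# Drop the under-represented 'disgust' class (~1.5% of the data).
keep = labels != DISGUST_LABEL
images, labels = images[keep], labels[keep]

# Remap the six remaining labels to a contiguous 0..5 range so they can
# index a six-way softmax output directly (my convention).
old_to_new = {old: new for new, old in enumerate(sorted(np.unique(labels)))}
labels = np.array([old_to_new[l] for l in labels])
```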
The first convolutional neural network model is a simple 4-layer model: two convolutional layers and two densely (fully) connected layers. For simplicity, this model uses no regularization or dropout to counter overfitting. The average accuracy of this simple model is around 62% between 3,000 and 10,000 iterations. The architecture is summarized in the table below; a Keras sketch follows it.
Layer Name | Output Size | Process |
---|---|---|
conv1 | 24 x 24 | filter = 5x5, conv_stride = 1, max_pool_stride = 2, channels = 16 |
conv2 | 12 x 12 | filter = 5x5, conv_stride = 1, max_pool_stride = 2, channels = 36 |
dc1 | 128 | |
dc2 | 6 | softmax |
FLOPs (ignoring biases) | 3 x 10^6 | |
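The notebook's exact training code is not reproduced here, but below is a minimal Keras sketch consistent with the table; the ReLU activations, 'same' padding and Adam optimizer are my assumptions.

```python
from tensorflow.keras import layers, models

# 4-layer model from the table: two 5x5 conv + 2x2 max-pool stages,
# followed by two densely connected layers.
model = models.Sequential([
    layers.Conv2D(16, 5, strides=1, padding='same', activation='relu',
                  input_shape=(48, 48, 1)),                             # conv1
    layers.MaxPooling2D(pool_size=2, strides=2),                        # -> 24x24
    layers.Conv2D(36, 5, strides=1, padding='same', activation='relu'), # conv2
    layers.MaxPooling2D(pool_size=2, strides=2),                        # -> 12x12
    layers.Flatten(),
    layers.Dense(128, activation='relu'),   # dc1
    layers.Dense(6, activation='softmax'),  # dc2: six emotion categories
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```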
This is a very shallow network and relatively easy to train. Here are the loss curve and confusion matrix for this model. As we can conclude from this model, prediction accuracy is high on the Happy, Surprise and Neutral categories. The algorithm has a hard time classifying the Angry, Sad and Fear categories, which are quite often confused with Neutral, as illustrated in the confusion matrix. This model can be reproduced by running First Convolutional Neural Net-fv.ipynb.
Figure 5. Loss curve and accuracy on the validation dataset (left) and the confusion matrix (right).
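For reference, a confusion matrix like the one in Figure 5 can be computed with scikit-learn; x_val and y_val are hypothetical held-out arrays shaped like the training data above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Predicted class indices on a hypothetical validation split.
y_pred = np.argmax(model.predict(x_val), axis=1)
print(confusion_matrix(y_val, y_pred))  # rows: true labels, columns: predictions
```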
Instead of applying the best single model to predict the categories, I use multiple models, combine all of their predictions, and pick the category with the highest combined score. So when I train the model in First Convolutional Neural Net-fv.ipynb, I save the model every 1000 iterations. Each saved model then gives six probabilistic predictions, one per category, for an image, and I combine these predictions per category across the 9 models. Judging from the figure, ensemble learning has boosted the accuracy on categories such as Sad, Fear and Angry. As expected, the prediction accuracy increases by ~2%, to 63.7%.
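A minimal sketch of this ensembling scheme, assuming the nine checkpoints have been reloaded into a list called snapshots; the names are illustrative.

```python
import numpy as np

# Each snapshot outputs per-image probabilities over the six categories;
# summing them per category and taking the argmax gives the ensemble vote.
probs = np.sum([m.predict(x_val) for m in snapshots], axis=0)  # (N, 6)
ensemble_pred = np.argmax(probs, axis=1)
print('ensemble accuracy:', np.mean(ensemble_pred == y_val))
```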
The previous model used no dropout or regularization to counter overfitting. In this model, we go one layer deeper and also use 30% dropout; a sketch again follows the table.
Layer Name | Output Size | Process |
---|---|---|
conv1 | 24 x 24 | filter = 5x5, conv_stride = 1, max_pool_stride = 2, channels = 32 |
conv2 | 12 x 12 | filter = 5x5, conv_stride = 1, max_pool_stride = 2, channels = 64 |
conv3 | 6 x 6 | filter = 5x5, conv_stride = 1, max_pool_stride = 2, channels = 64 |
dc1 | 128 | |
dc2 | 6 | softmax |
FLOPs (ignoring biases) | 1.2 x 10^7 | |
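As before, a minimal Keras sketch consistent with the table; where exactly the 30% dropout sits in the original notebook is my assumption, here placed before the output layer.

```python
from tensorflow.keras import layers, models

# 5-layer model from the table: three conv + max-pool stages, then two
# densely connected layers with 30% dropout between them.
model2 = models.Sequential([
    layers.Conv2D(32, 5, strides=1, padding='same', activation='relu',
                  input_shape=(48, 48, 1)),                             # conv1
    layers.MaxPooling2D(pool_size=2, strides=2),                        # -> 24x24
    layers.Conv2D(64, 5, strides=1, padding='same', activation='relu'), # conv2
    layers.MaxPooling2D(pool_size=2, strides=2),                        # -> 12x12
    layers.Conv2D(64, 5, strides=1, padding='same', activation='relu'), # conv3
    layers.MaxPooling2D(pool_size=2, strides=2),                        # -> 6x6
    layers.Flatten(),
    layers.Dense(128, activation='relu'),   # dc1
    layers.Dropout(0.3),                    # 30% dropout
    layers.Dense(6, activation='softmax'),  # dc2
])
model2.compile(optimizer='adam',
               loss='sparse_categorical_crossentropy',
               metrics=['accuracy'])
```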
The second model is slightly better than the first, with an average accuracy of around 63.6% compared to ~62% for the first model. But relative to the added computation, a 1.6% gain is not a good improvement. Therefore, I again turn to ensemble learning, which reaches an accuracy of about 65.9%. That is about a 2.2% improvement in accuracy over the ensemble of the first, shallower CNN model. Accuracy in differentiating between Sad and Neutral gains a boost in this model.
Figure 8. Images the model predicted incorrectly. In the caption below each image, the true label is on the left and the prediction is on the right.
Deep Convolutional Neural Net Model.ipynb shows how to set up and run the 34-layer residual neural network introduced by Kaiming He et al.
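The notebook has the full architecture; for orientation, here is a minimal sketch of the basic residual block that such a network stacks, simplified from He et al.'s design rather than copied from the notebook.

```python
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    """Basic ResNet block: two 3x3 convs plus a shortcut connection."""
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, strides=1, padding='same')(y)
    y = layers.BatchNormalization()(y)
    # Project the shortcut when the spatial size or channel count changes.
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride)(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))
```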