This repository contains the implementation of three laboratories from the course "Deep Learning Application", part of the Artificial Intelligence degree programme at the University of Florence.
The labs focus on deep learning models in various areas of research. For every lab, its related folder contains the source code (`main.py`), a configuration file (`config.yaml`) specifying the model, the dataset and the run settings, and a results folder.
Apart from these folders, the `src` folder contains functions used throughout the labs.
Weights & Biases was used to log the runs of the exercises.
All runs used early stopping during the training phase.
- Laboratory 1: Convolutional Neural Networks
- Laboratory 2: Natural Language Processing & LLMs
- Laboratory 4: Adversarial Learning and OOD Detection
This laboratory focuses on working with simple architectures to gain experience with deep learning models, specifically using PyTorch.
The task is to reproduce, on a small scale, the results from the ResNet paper:
Deep Residual Learning for Image Recognition, Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, CVPR 2016.
demonstrating that deeper networks do not always lead to better training loss or validation accuracy. This is done first with a Multilayer Perceptron (MLP) on the MNIST dataset, then with a Convolutional Neural Network (CNN).
Objective: Implement a simple Multilayer Perceptron to classify the 10 digits of MNIST (e.g. two narrow layers).
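For reference, a minimal sketch of the kind of MLP used here (the actual architecture and hyperparameters are defined in `main.py` and `config.yaml`; the layer sizes below are illustrative):

```python
import torch.nn as nn

class SimpleMLP(nn.Module):
    """Minimal two-hidden-layer MLP for 28x28 MNIST digits (illustrative sizes)."""
    def __init__(self, in_features=28 * 28, hidden=64, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                        # (B, 1, 28, 28) -> (B, 784)
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),      # raw logits for cross-entropy
        )

    def forward(self, x):
        return self.net(x)
```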
Results on test set:
Dataset | Loss | Accuracy | Precision | Recall |
---|---|---|---|---|
MNIST | 0.055 | 98.590% | 98.587% | 98.577% |
Link to W&B report for MLP: https://api.wandb.ai/links/alessio-boetti/qcepp0cs
Objective: Repeat the verification you did above, but with Convolutional Neural Networks, preferably on a more complex dataset.
To start, I chose to implement a CNN with 9 convolutional layers on CIFAR10. The model achieved a test set accuracy of 82.11%.
To push things further and see whether it is true that more depth does not increase performance for plain CNNs, I increased the number of conv layers. The following table shows the trained models along with their test set performances:
Model | Loss | Accuracy | Precision | Recall |
---|---|---|---|---|
9 ConvLayers CNN | 0.647 | 81.930% | 81.816% | 81.930% |
17 ConvLayers CNN | 0.701 | 81.080% | 80.984% | 81.080% |
33 ConvLayers CNN | 1.090 | 70.680% | 70.604% | 70.680% |
49 ConvLayers CNN | 1.620 | 51.920% | 51.656% | 51.920% |
Link to W&B report for CNNs with various depth: https://api.wandb.ai/links/alessio-boetti/lpiry17k
Indeed the performance got drastically worse, dropping by 30 percentage points (from 81.93% to 51.92%, a relative drop of about 36%). The worsening of the deeper models is even clearer in the report above.
Then I tried adding Residual Connections, aka Skip Connections, to try to stabilize training and get better performance. Here are the results:
Model | Loss | Accuracy | Precision | Recall |
---|---|---|---|---|
9 ConvLayers ResNet | 0.747 | 79.770% | 79.650% | 79.770% |
17 ConvLayers ResNet | 0.690 | 80.400% | 80.338% | 80.400% |
33 ConvLayers ResNet | 0.707 | 80.550% | 80.413% | 80.550% |
49 ConvLayers ResNet | 0.704 | 80.930% | 80.755% | 80.930% |
Link to W&B report for ResNets with various depth: https://api.wandb.ai/links/alessio-boetti/s802vpyf
As we can see both from the table and the report, the residual connections first of all helped stabilize training: nearly the same performance was achieved even as depth increased. They did not strongly support the assumption that deeper is better, though, as performance improved only marginally for the deeper ResNets.
It is interesting to note from the table, though, that adding skip connections to the 9 ConvLayer CNN (giving the 9 ConvLayer ResNet) slightly decreased performance. This could be due to chance; to check this, more training runs with different seeds could be performed.
However this pattern is also found throughout the training phase, when assessing the model on the validation set, as shown in the following image:
This could be due to the following:
- Residual connections add unnecessary complexity for a shallow CNN where the vanishing gradient problem is not so relevant.
- For shallow nets the identity mapping could interfere with the learning process by bypassing the non-linear transformations of the conv layers.
- Residual connections introduce a bias towards learning an identity function. Added to shallow CNNs, this can be counterproductive if the task requires more complex feature transformations.
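For reference, the kind of residual (skip) connection discussed here can be sketched as a standard basic block (a minimal sketch with illustrative channel sizes; the actual blocks used in the lab code may differ):

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Conv block whose output is added back to its input (identity shortcut)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection: the identity is added back
```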
It should also be noted that the deepest ResNet (49 conv layers) is still worse than shallow CNNs (9 and 17 conv layers):
Objective: Use the CNN model you trained in Exercise 1.2 and implement Class Activation Maps (below). Use your implementation to demonstrate how your trained CNN attends to specific image features to recognize specific classes.
B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. CVPR'16 (arXiv:1512.04150, 2015).
I decided to use Grad-CAM instead of CAM to explain the CNN predictions on CIFAR10.
In Grad-CAM, an image is given to the model along with its class label. The gradient of the logit of the class w.r.t. the activation maps of the final convolutional layer is computed, and the gradients are then averaged across each feature map to give an importance score:

$ \alpha_{k}^{c} = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial y^{c}}{\partial A_{ij}^{k}}$

where the two summations are the Global Average Pooling mechanism, the derivative represents the gradient obtained through backpropagation, $y^{c}$ is the logit of class $c$, and $A_{ij}^{k}$ is the activation at location $(i, j)$ of the $k$-th feature map of the final convolutional layer.
Then we multiply each activation map by its importance score (i.e. alpha) and sum the values. To only consider the pixels that have a positive influence on the score of the class of interest, a ReLU nonlinearity is also applied:
$ \mathcal{L}^{c} = ReLU(\sum_{k}\alpha_{k}^{c}A^{k})$
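A minimal sketch of how such a Grad-CAM heatmap can be computed with PyTorch hooks (function and variable names are illustrative; the lab's actual implementation may differ):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class, target_layer):
    """Compute a Grad-CAM heatmap for `image` (1, C, H, W) w.r.t. `target_class`."""
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["value"] = output

    def bwd_hook(_, __, grad_output):
        gradients["value"] = grad_output[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    logits = model(image)
    model.zero_grad()
    logits[0, target_class].backward()          # gradient of the class logit

    h1.remove(); h2.remove()

    # alpha_k^c: global average pooling of the gradients over the spatial dims
    alpha = gradients["value"].mean(dim=(2, 3), keepdim=True)    # (1, K, 1, 1)
    cam = F.relu((alpha * activations["value"]).sum(dim=1))      # (1, H', W')
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                        mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)    # normalize to [0, 1]
```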
The Grad-CAM heatmaps can be found in the results folder of the lab exercise. Unfortunately, the heatmaps did not turn out very detailed.
In this laboratory we will get our hands dirty working with Large Language Models (e.g. GPT and BERT) to do various useful things.
Objective: In this first exercise you will train a small autoregressive GPT model for character generation to generate text in the style of Dante Alighieri.
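The character-level setup can be sketched roughly as follows: build a character vocabulary from the raw text and sample autoregressively from the trained model (the file name, function names, and the assumption that the model returns raw logits are all illustrative; the actual code lives in the lab folder):

```python
import torch

# Build a character-level vocabulary from the raw text (hypothetical file name).
text = open("inferno.txt", encoding="utf-8").read()
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

encode = lambda s: [stoi[c] for c in s]               # string -> list of token ids
decode = lambda ids: "".join(itos[i] for i in ids)    # token ids -> string

@torch.no_grad()
def sample(model, context, max_new_tokens=1000, block_size=256):
    """Autoregressively sample characters from a trained GPT-like model."""
    idx = torch.tensor([encode(context)], dtype=torch.long)
    for _ in range(max_new_tokens):
        # Assumes the model returns raw logits of shape (B, T, vocab_size).
        logits = model(idx[:, -block_size:])
        probs = torch.softmax(logits[:, -1, :], dim=-1)   # distribution over next char
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return decode(idx[0].tolist())
```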
Results on validation set:
Dataset | Loss |
---|---|
Dante's Inferno | 3.651 |
From the above image we can clearly see that the model overfits quite early in the 5000-epoch run.
Here is the generated output in Dante's style (1000 tokens):
come Suo rote la` dove testa la ritte e la dove parte queta,
la` di qua, dopo al pologno del sasso; porsentando e` questa gente stanca non puote fu rico de la belle stelle,
'altro fu l'usuria castra Giace. meco d'un su Nome ch'intrato acquesio.
Cola` e` colui che si` presso s'appressa, che dal cominciar, con l'occhio fora del garonso, ciglio` ne li altri piedi sparte.
Ma dimmi, de l'alto socchio si scovra, guardoliar piu` convien ch'io ti 'ntesi, raguardar piu` a via diversa leggia
gente viscende ove non fasse, faldo venir le mostra persona fronte.
<<I s'appura di colorno scio` accascio>>, comincio` olsi con l'occhio a volse; <<e se la voce lucia tanto non dispiaccia.
Noi si discende mai a lor vieta in forte de la sua scurgelme, o lo 'nfernio dolce che parlar via memolte me ali,
mosse, per fuor de la bola la marca dove 'l sangue in lor non si puo`, com'io mi al certo perco` a Maoneta,
quando la brina in su la terra e lassa, si leva, giacendo l
The output doesn't make sense in Italian, but it is clearly recognizable as text with an antique style. The way syllables and punctuation are concatenated gives a quite impressive result.
It is also worth noting the sequence of two-line paragraphs, and even the insertion of the double quotes <<>>, meaning the model learned to capture the structural connection between a spoken phrase and the rest of the text.
Objective: In this exercise we will see how to use the Hugging Face model and dataset ecosystem to access a huge variety of pre-trained transformer models. Instantiate the `GPT2Tokenizer` and experiment with encoding text into integer tokens. Compare the length of the input with the encoded sequence length. Then instantiate a pre-trained `GPT2LMHeadModel` and use the `generate()` method to generate text from a prompt.
I performed some experiments on encoding with two input phrases.
First phrase:
- Input: "Hello World!"
- Input text length: 2
- Encoded text length: 3
- Decoded text: "Hello World!"
Second phrase:
- Input: "Why did the transformer go to Hugging Face? It needed a little attention!"
- Input text length: 13
- Encoded text length: 16
- Decoded text: "Why did the transformer go to Hugging Face? It needed a little attention!"
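A minimal sketch of this kind of encoding experiment (the exact script used in the lab may differ):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "Why did the transformer go to Hugging Face? It needed a little attention!"
token_ids = tokenizer.encode(text)

print("Input text length (words):", len(text.split()))   # 13 words
print("Encoded sequence length:  ", len(token_ids))       # 16 token ids, as reported above
print("Decoded text:             ", tokenizer.decode(token_ids))
```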
Then, regarding the generation experiments:
- Prompt: "Thanks for all the"
- Generated text: "Thanks for all the hard work. I'm excited to hear from you guys! (Thanks for reading)"
- Prompt: "Peter Piper picked a peck of pickled peppers."
- Generated text: "Peter Piper picked a peck of pickled peppers. He was going to make a pot of them and put them in the fridge for an hour and a half. When he finished, he walked away and picked the peppers, put them in the fridge"
Objective: Peruse the text classification datasets on Hugging Face. Choose a moderately sized dataset and use a LLM to train a classifier to solve the problem.
I chose DistilRoBERTa as the LLM for the exercise. DistilRoBERTa is a small LLM, but it can learn powerful and complex representations rivaling those of other BERT-like models.
I fine-tuned the model on the Yelp Review Full text classification dataset. The fine-tuning was performed end-to-end, like the BERT paper suggests, using a cross-entropy loss.
I also fine-tuned DistilRoBERTa on the smaller Stanford IMDB text classification dataset, again end-to-end with a cross-entropy loss.
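For reference, a minimal sketch of this kind of end-to-end fine-tuning using the Hugging Face `Trainer` (the actual lab code uses its own training loop with early stopping and W&B logging; the hyperparameters and output path below are illustrative):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Yelp Review Full has 5 star-rating classes; IMDB would use num_labels=2 instead.
dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForSequenceClassification.from_pretrained("distilroberta-base", num_labels=5)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="results/yelp_distilroberta",   # illustrative path
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
print(trainer.evaluate())
```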
Results on the test sets:
Dataset | Loss | Accuracy | Precision | Recall |
---|---|---|---|---|
Yelp Review Full | 2.290 | 10.326% | 42.139% | 5.163% |
Stanford IMDB | 2.303 | 16.576% | 16.834% | 3.315% |
The test set performances, as we can see from the table, are very low, possibly due to suboptimal hyperparameters. However, given the scope of the exercise and the time required to fine-tune the model, hyperparameter optimization was not performed.
Surprisingly, for both fine-tuned models the best epoch on the validation set came early in training: on the Yelp dataset the 8th epoch was the best in terms of validation accuracy, while on the IMDB dataset the 1st epoch was the best (both runs were then stopped by early stopping after a few more epochs).
For the model fine-tuned on the Yelp dataset, while accuracy and recall on the validation set were stagnant, validation precision showed an increasing trend. Since early stopping was based on validation accuracy, this could mean that accuracy is not the best metric to assess model performance here (see images below). Also, the early stopping patience was set to a low value (20 epochs) because of the long time taken by a single training epoch.
Using the F1 score as the target for early stopping might have been a better choice, possibly leading to better overall performance.
In this laboratory session we will develop a methodology for detecting OOD samples and measuring the quality of OOD detection. We will also experiment with incorporating adversarial examples during training to render models more robust to adversarial attacks.
Objective: In this first exercise you will build a simple OOD detection pipeline and implement some performance metrics to evaluate its performance. There are several metrics used to evaluate OOD detection performance; we will concentrate on two threshold-free approaches: the area under the Receiver Operating Characteristic (ROC) curve for ID classification, and the area under the Precision-Recall curve for both ID and OOD scoring.
I used a 9 Conv Layer CNN due to its fast training time and high accuracy. This model will be used throughout all the following exercises.
I chose CIFAR10 as ID dataset and SVHN as OOD dataset.
The model was retrained with different settings, so the resulting test set performance differs from the first training (see Exercise 1.2). This is mainly due to not training with augmented samples (`augment: False` in the `config.yaml` file), since the model needed to serve as a baseline for the following exercises, and using augmentation together with OOD evaluation could have produced misleading or less clear results.
Results on the test set:
Dataset | Loss | Accuracy | Precision | Recall |
---|---|---|---|---|
CIFAR10 (ID) | 1.031 | 76.790% | 76.675% | 76.790% |
SVHN (OOD) | 6.996 | 9.826% | 7.313% | 9.347% |
From the table it seems like the model clearly distinguished the OOD images as coming from a different distribution than the ID images. Let's check the plots.
ID confusion matrix:
OOD confusion matrix:
As we can see from the confusion matrices, when predicting the class for OOD images, the model predicted classes 0, 3, 5 and 8 with high frequency. In other words, the model mostly predicted OOD images as class 0 (airplane), 3 (cat), 5 (dog) or 8 (ship).
The precision-recall curve and the ROC curve show good performance, indicating that the model separates the two distributions well enough across many decision thresholds.
However, the score histograms do not appear very well separated, although the line plot shows that many OOD scores have higher values than the ID scores.
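For reference, a minimal sketch of this kind of OOD scoring pipeline, assuming the maximum softmax probability is used as the ID score (loader names are hypothetical; the score function and implementation actually used in the lab may differ):

```python
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score, average_precision_score

@torch.no_grad()
def max_softmax_scores(model, loader, device="cuda"):
    """Return the maximum softmax probability for every sample in `loader`."""
    model.eval()
    scores = []
    for images, _ in loader:
        logits = model(images.to(device))
        scores.append(F.softmax(logits, dim=1).max(dim=1).values.cpu())
    return torch.cat(scores)

# Hypothetical loaders for the CIFAR10 (ID) and SVHN (OOD) test sets.
id_scores = max_softmax_scores(model, cifar10_test_loader)
ood_scores = max_softmax_scores(model, svhn_test_loader)

# Label ID samples as 1 and OOD samples as 0, then compute threshold-free metrics.
labels = torch.cat([torch.ones_like(id_scores), torch.zeros_like(ood_scores)])
scores = torch.cat([id_scores, ood_scores])
auroc = roc_auc_score(labels, scores)
aupr_id = average_precision_score(labels, scores)         # ID as the positive class
aupr_ood = average_precision_score(1 - labels, -scores)   # OOD as the positive class
print(f"AUROC: {auroc:.3f}  AUPR (ID): {aupr_id:.3f}  AUPR (OOD): {aupr_ood:.3f}")
```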
In this second exercise we will experiment with enhancing our base model to be (more) robust to adversarial attacks.
Objective: Implement FGSM and generate some adversarial examples using your trained ID model. Evaluate these samples qualitatively and quantitatively, and evaluate how dependent the quality of these samples is on $\varepsilon$.
Using the model from the previous exercise, I performed a randomized (fast) FGSM attack:
Fast is better than free: Revisiting adversarial training, Eric Wong, Leslie Rice, J. Zico Kolter. 2020
to obtain corrupted images for various values of $\varepsilon$.
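A minimal sketch of the randomized (fast) FGSM step described in the paper (function names and the assumption that inputs are scaled to [0, 1] are illustrative; the actual implementation used in the lab may differ):

```python
import torch
import torch.nn.functional as F

def fast_fgsm(model, images, labels, epsilon, alpha=None):
    """Randomized single-step FGSM (Wong et al. 2020): random start + one signed-gradient step."""
    alpha = alpha if alpha is not None else 1.25 * epsilon  # step size suggested in the paper
    # Random initialization inside the epsilon-ball.
    delta = torch.empty_like(images).uniform_(-epsilon, epsilon)
    delta.requires_grad_(True)

    loss = F.cross_entropy(model(images + delta), labels)
    loss.backward()

    # One step in the direction of the gradient sign, then project back into the ball.
    delta = (delta + alpha * delta.grad.sign()).clamp(-epsilon, epsilon)
    adv_images = (images + delta).clamp(0, 1).detach()      # assumes inputs scaled to [0, 1]
    return adv_images
```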
In the lab exercise result folder the corrupted images (for the various values of $\varepsilon$) can be found.
In the same folder there are the result plots created when evaluating the adversarial images (for the various values of $\varepsilon$).
Here are the results on various CIFAR10 test sets:
Epsilon ($\varepsilon$) | Loss | Accuracy | Precision | Recall |
---|---|---|---|---|
0 (original test set) | 0.998 | 76.820% | 76.782% | 76.820% |
 | 1.902 | 62.450% | 62.614% | 62.450% |
 | 2.959 | 49.290% | 49.787% | 49.290% |
 | 4.018 | 39.670% | 40.733% | 39.670% |
 | 4.983 | 32.260% | 33.610% | 32.260% |
 | 5.846 | 27.010% | 28.491% | 27.010% |
 | 6.583 | 23.180% | 24.680% | 23.180% |
 | 7.206 | 19.880% | 21.412% | 19.880% |
 | 7.718 | 17.460% | 18.939% | 17.460% |
 | 8.136 | 15.590% | 17.069% | 15.590% |
 | 8.474 | 14.370% | 15.782% | 14.370% |
As we can see from the table, the performance decreases as $\varepsilon$ increases.
Objective: Use your implementation of FGSM to augment your training dataset with adversarial samples. Ideally, you should implement this data augmentation on the fly so that the adversarial samples are always generated using the current model. Evaluate whether the model is more (or less) robust to ID samples using your OOD detection pipeline and metrics you implemented in Exercise 1.
I trained the 9 Conv Layer CNN with a loss composed of a "clean" loss on the original samples and an adversarial loss on FGSM-perturbed samples generated on the fly.
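A minimal sketch of one such training step, generating the adversarial samples on the fly with the `fast_fgsm` sketch above (the loss weighting and function names are illustrative assumptions, not the lab's exact formulation):

```python
import torch.nn.functional as F

def adversarial_training_step(model, images, labels, optimizer, epsilon, adv_weight=0.5):
    """One step combining the clean loss with a loss on FGSM samples made on the fly."""
    model.train()
    # Adversarial samples crafted against the *current* model parameters.
    adv_images = fast_fgsm(model, images, labels, epsilon)   # from the sketch above

    optimizer.zero_grad()   # discard gradients accumulated while crafting the attack
    clean_loss = F.cross_entropy(model(images), labels)
    adv_loss = F.cross_entropy(model(adv_images), labels)

    # Weighted combination of clean and adversarial loss; the actual weighting may differ.
    loss = (1 - adv_weight) * clean_loss + adv_weight * adv_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```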
Here are the results on the CIFAR10 test sets:
Epsilon ($\varepsilon$) | Loss | Accuracy | Precision | Recall |
---|---|---|---|---|
0 (original test set) | 0.978 | 76.800% | 76.589% | 76.800% |
 | 1.121 | 74.120% | 73.836% | 74.120% |
 | 1.262 | 71.750% | 71.418% | 71.750% |
 | 1.402 | 69.340% | 68.992% | 69.340% |
 | 1.537 | 67.310% | 66.902% | 67.310% |
 | 1.668 | 65.670% | 65.235% | 65.670% |
 | 1.789 | 63.970% | 63.517% | 63.970% |
 | 1.902 | 62.330% | 61.875% | 62.330% |
 | 2.006 | 60.910% | 60.431% | 60.910% |
 | 2.097 | 60.040% | 59.546% | 60.040% |
 | 2.178 | 59.050% | 58.552% | 59.050% |
The model accuracy on the original CIFAR10 test set decreased only marginally (from 76.820% to 76.800%), thus remaining quite stable. Also, the attack still degraded the adversarially trained model, but not nearly as drastically as when applied to the original model.
Let's see the performance and the plots showing original CIFAR10 as ID and SVHN as OOD sets.
Dataset | Loss | Accuracy | Precision | Recall |
---|---|---|---|---|
CIFAR10 (ID) | 0.978 | 76.800% | 76.589% | 76.800% |
SVHN (OOD) | 6.686 | 11.305% | 8.351% | 9.926% |
ID confusion matrix:
OOD confusion matrix:
Comparing these two confusion matrices with the ones from exercise 1, we can see that the ID performance remained quite stable, while the OOD performance varied somewhat. In particular, the model seemed more uncertain, predicting with higher frequency other classes (especially 2) in addition to 0, 3, 5 and 8 (whose frequency dropped a bit, especially class 3).
The AUPR score and the AUROC score increased a little, and the model's confidence on the OOD set decreased a bit (fewer softmax scores equal to 1).
In the lab result folder, the plots treating CIFAR10 as the ID set and corrupted CIFAR10 as the OOD set can be found.
Notes: ODIN is a very simple approach, and you can already start experimenting by implementing a temperature hyperparameter in your base model and doing a grid search on the temperature $T$ and the perturbation magnitude $\varepsilon$.
I performed two experiments:
- First, I added the ODIN postprocessing method to the exercise 1 OOD detection pipeline (ID test set: CIFAR10, OOD test set: SVHN). I ran the experiment with different values of $T$ and $\varepsilon$. In particular, $T \in \lbrace1, 5, 10, 50, 100, 500, 1000\rbrace$, while $\varepsilon \in \lbrace0, 1/255, 2/255, 3/255, 4/255, 5/255\rbrace$.
- Then I added ODIN to the adversarially trained model (ID test set: CIFAR10, OOD test set: SVHN). Again, $T \in \lbrace1, 5, 10, 50, 100, 500, 1000\rbrace$, while $\varepsilon \in \lbrace0, 1/255, 2/255, 3/255, 4/255, 5/255\rbrace$.
The plots of experiment 1 can be found here, under the name `ood_odin-<T>-<ε>_...`
The plots of experiment 2 can be found here, under the name `ood_odin-<T>-<ε>_...`
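For reference, a minimal sketch of the ODIN scoring (temperature scaling plus input preprocessing) that the grid search over $T$ and $\varepsilon$ applies to (function names are illustrative; the actual implementation used in the lab may differ):

```python
import torch
import torch.nn.functional as F

def odin_score(model, images, temperature, epsilon):
    """ODIN: input preprocessing + temperature scaling; returns max softmax scores."""
    images = images.clone().requires_grad_(True)

    # Temperature-scaled prediction on the original input.
    logits = model(images) / temperature
    preds = logits.argmax(dim=1)

    # Perturb the input so as to increase the softmax score of the predicted class.
    loss = F.cross_entropy(logits, preds)
    loss.backward()
    perturbed = (images - epsilon * images.grad.sign()).detach()

    with torch.no_grad():
        scaled = model(perturbed) / temperature
        return F.softmax(scaled, dim=1).max(dim=1).values
```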
Objective: Implement the targeted Fast Gradient Sign Method to generate adversarial samples that imitate samples from a specific class. Evaluate your adversarial samples qualitatively and quantitatively.
I applied targeted FGSM on both the original model of exercise 1 and the adversarially trained model of exercise 2.2 in order to generate the images. The targeted FGSM was applied in 3 different ways:
- Standard: minimizing the loss of the target label
- One-vs-One (OVO): minimizing the loss of the target label while at the same time maximizing the loss of the true label.
- One-vs-Rest (OVR): minimizing the loss of the target label while at the same time maximizing the loss of all other labels.
This resulted in a total of 6 experiments.
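A minimal sketch of the three targeted variants described above (the exact loss formulations and names used in the lab may differ; inputs are assumed to be scaled to [0, 1]):

```python
import torch
import torch.nn.functional as F

def targeted_fgsm(model, images, true_labels, target_class, epsilon, mode="standard"):
    """Targeted FGSM towards `target_class` (e.g. 5 = dog) in the three variants above."""
    images = images.clone().requires_grad_(True)
    logits = model(images)
    targets = torch.full_like(true_labels, target_class)

    objective = F.cross_entropy(logits, targets)              # standard: push towards the target
    if mode == "ovo":                                          # also push away from the true label
        objective = objective - F.cross_entropy(logits, true_labels)
    elif mode == "ovr":                                        # also push away from all other labels
        num_classes = logits.shape[1]
        for c in range(num_classes):
            if c != target_class:
                other = torch.full_like(true_labels, c)
                objective = objective - F.cross_entropy(logits, other) / (num_classes - 1)

    objective.backward()
    # Descend the objective: step *against* the gradient sign to minimize it.
    return (images - epsilon * images.grad.sign()).clamp(0, 1).detach()
```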
Finally, I also tried generating adversarial images after training the usual 9 Conv Layer CNN with a targeted attack, to see how a targeted attack behaves against a model adversarially trained with one.
In all the experiments the target class was 5 (dog).
The adversarial images and their difference from the original images can be found in the lab exercise result folder, inside the `imgs_adv_samples` subfolders of every experiment.