- Title: Fast R-CNN
- Authors: Ross Girshick
- Link: https://arxiv.org/abs/1504.08083
- Tags: Neural Network, RCNN, VGG, AlexNet, selective search
- Year: 2015
-
What
- The original R-CNN had three major disadvantages:
- Two-stage training pipeline: Instead of training only a CNN, one had to first train a CNN and then multiple SVMs.
- Expensive training: Training was slow and required lots of disk space (feature vectors had to be written to disk for all region proposals, about 2000 per image, before training the SVMs).
- Slow testing: Each region proposal had to be processed independently.
- Fast R-CNN is an improved version of R-CNN and tackles the mentioned problems.
- It no longer uses SVMs, only CNNs (single-stage).
- It does one single feature extraction per image instead of per region, making it much faster (9x faster at training, 213x faster at test).
- It is more accurate than R-CNN.
-
How
- The basic architecture, training and testing methods are mostly copied from R-CNN.
- For each image at test time they do:
- They generate region proposals via selective search.
- They feed the image once through the convolutional layers of a pre-trained network, usually VGG16.
- For each region proposal they extract the respective region from the features generated by the network.
- The regions can have different sizes, but the following steps need fixed-size vectors. So each region is downscaled via max-pooling so that it has a size of 7x7 (so apparently they ignore regions of sizes below 7x7...?).
- This is called Region of Interest Pooling (RoI-Pooling). (A sketch is shown after this list.)
- During the backward pass, partial derivatives are routed to the maximum value (as usual in max pooling). These derivative values are summed up over the different regions (in the same image).
- They reshape the 7x7 regions to vectors of length `F*7*7`, where `F` is the number of filters in the last convolutional layer.
- They feed these vectors through another network which predicts:
- The class of the region (including background class).
- Top left x-coordinate, top left y-coordinate, log height and log width of the bounding box (i.e. it fine-tunes the region proposal's bounding box). These values are predicted once for every class (so `K*4` values).
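Below is a minimal numpy sketch of the RoI max-pooling step. The feature map size, the example region and the function name `roi_max_pool` are illustrative assumptions, and the grid splitting is only an approximation of the actual implementation.

```python
# Minimal RoI max-pooling sketch (numpy). Shapes and the example region are illustrative.
import numpy as np

def roi_max_pool(feature_map, roi, output_size=7):
    """feature_map: (F, H, W) conv features of the whole image.
    roi: (x0, y0, x1, y1) region proposal in feature-map coordinates.
    Returns a fixed-size (F, output_size, output_size) array."""
    x0, y0, x1, y1 = roi
    region = feature_map[:, y0:y1, x0:x1]              # crop the proposal's features
    num_filters, h, w = region.shape
    out = np.zeros((num_filters, output_size, output_size))
    # Split the region into an output_size x output_size grid and max-pool each cell.
    y_edges = np.linspace(0, h, output_size + 1).astype(int)
    x_edges = np.linspace(0, w, output_size + 1).astype(int)
    for i in range(output_size):
        for j in range(output_size):
            y_hi = max(y_edges[i + 1], y_edges[i] + 1)  # keep cells non-empty for tiny regions
            x_hi = max(x_edges[j + 1], x_edges[j] + 1)
            cell = region[:, y_edges[i]:y_hi, x_edges[j]:x_hi]
            out[:, i, j] = cell.max(axis=(1, 2))
    return out

features = np.random.randn(512, 38, 50)                # e.g. VGG16 conv5_3 features
pooled = roi_max_pool(features, roi=(10, 5, 30, 25))   # one region proposal
flat = pooled.reshape(-1)                              # the F*7*7 vector fed to the fc head
print(pooled.shape, flat.shape)                        # (512, 7, 7) (25088,)
```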
- Architecture as image:
- Sampling for training
- Efficiency
- If the batch size is `B`, it is inefficient to sample region proposals from `B` different images, as each image requires a full forward pass through the base network (e.g. VGG16).
- It is much more efficient to use few images and share most of the computation between their region proposals.
- They use two images per batch (64 region proposals each) during training (see the sampling sketch after this list).
- This technique introduces correlations between examples in batches, but they did not observe any problems from that.
- They call this technique "hierarchical sampling" (first images, then region proposals).
- IoUs
- Positive examples for specific classes during training are region proposals that have an IoU with a ground truth bounding box of `>= 0.5`.
- Background region proposals during training have a maximum IoU with any ground truth box in the interval `[0.1, 0.5)`.
- Not picking IoUs below 0.1 is similar to hard negative mining.
- They use 25% positive examples, 75% negative/background examples per batch.
- They apply horizontal flipping as data augmentation, nothing else.
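A rough sketch of this sampling scheme, assuming a simple `(roi, max_iou, class)` tuple representation for proposals (an illustrative choice, not from the paper):

```python
# Rough sketch of hierarchical sampling for one training mini-batch.
import random

ROIS_PER_IMAGE = 64           # two images per batch, 64 region proposals each (128 total)
FG_FRACTION = 0.25            # 25% positive, 75% background examples

def sample_rois(proposals):
    """proposals: list of (roi, max_iou_with_ground_truth, matched_class) for one image."""
    fg = [p for p in proposals if p[1] >= 0.5]            # positives: IoU >= 0.5
    bg = [p for p in proposals if 0.1 <= p[1] < 0.5]      # background: max IoU in [0.1, 0.5)
    n_fg = min(int(round(ROIS_PER_IMAGE * FG_FRACTION)), len(fg))
    n_bg = min(ROIS_PER_IMAGE - n_fg, len(bg))
    return random.sample(fg, n_fg) + random.sample(bg, n_bg)

# Hierarchical sampling: first draw 2 images, then 64 RoIs from each image, so the
# expensive convolutional forward pass is shared among the 64 RoIs of the same image.
dummy_proposals = [((0, 0, 10, 10), random.random(), 1) for _ in range(2000)]
minibatch = [sample_rois(dummy_proposals) for _ in range(2)]   # one entry per image
print(len(minibatch), len(minibatch[0]))                       # 2 64 (with these dummy proposals)
```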
- Outputs
- For their class predictions they use a simple softmax with negative log likelihood (see the loss sketch after this list).
- For their bounding box regression they use a smooth L1 loss (similar to mean absolute error, but switches to mean squared error for very low values).
- Smooth L1 loss is less sensitive to outliers and less likely to suffer from exploding gradients.
- The smooth L1 loss is only active for positive examples (not background examples). (Not active means that it is zero.)
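A small numpy sketch of the two loss terms for one region proposal. The smooth L1 threshold of 1 follows the paper; the shapes, names and the per-class indexing of the box outputs are illustrative assumptions:

```python
# Sketch of the per-RoI loss (numpy): softmax negative log likelihood for the class
# prediction plus smooth L1 for the box regression, which is only active for positives.
import numpy as np

def softmax_nll(class_scores, true_class):
    """class_scores: (K+1,) raw scores including background (class 0)."""
    shifted = class_scores - class_scores.max()          # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[true_class]

def smooth_l1(x):
    """0.5*x^2 where |x| < 1, |x| - 0.5 elsewhere (elementwise, then summed)."""
    x = np.abs(x)
    return np.sum(np.where(x < 1.0, 0.5 * x ** 2, x - 0.5))

def roi_loss(class_scores, box_preds, true_class, true_box):
    """box_preds: (K, 4) per-class box values; background examples get no box loss."""
    loss = softmax_nll(class_scores, true_class)
    if true_class != 0:                                  # smooth L1 only for positive examples
        loss += smooth_l1(box_preds[true_class - 1] - true_box)
    return loss

scores = np.array([0.1, 2.0, -1.0])                      # K+1 = 3 scores (background + 2 classes)
box_preds = np.random.randn(2, 4)                        # K*4 regression values
print(roi_loss(scores, box_preds, true_class=1, true_box=np.zeros(4)))
```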
- Training schedule
- They use SGD (settings sketched after this list).
- They train 30k mini-batches with learning rate 0.001, then another 10k mini-batches with learning rate 0.0001. (These numbers are for Pascal VOC; on larger datasets they use more mini-batches.)
- They use twice the learning rate for the biases.
- They use momentum of 0.9.
- They use parameter decay of 0.0005.
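A purely illustrative sketch that restates these hyperparameters as a config and shows the step schedule and the doubled learning rate for biases (names are assumptions; this is not the authors' training code):

```python
# Reported SGD settings as a plain config, plus the learning rate rules.
MOMENTUM = 0.9
WEIGHT_DECAY = 0.0005
SCHEDULE = [(30_000, 0.001), (10_000, 0.0001)]   # (number of mini-batches, learning rate)

def learning_rate(batch_index):
    """Step schedule: 0.001 for the first 30k mini-batches, then 0.0001."""
    seen = 0
    for num_batches, lr in SCHEDULE:
        seen += num_batches
        if batch_index < seen:
            return lr
    return SCHEDULE[-1][1]

def param_learning_rate(param_name, lr):
    """Biases are trained with twice the global learning rate."""
    return 2.0 * lr if param_name.endswith("bias") else lr

print(learning_rate(0), learning_rate(35_000), param_learning_rate("fc6.bias", 0.001))
# -> 0.001 0.0001 0.002
```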
- Truncated SVD
- The final network for class prediction and bounding box regression has to be applied to every region proposal.
- It contains one large fully connected hidden layer and one fully connected output layer (`K+1` classes plus `K*4` regression values).
- For 2000 proposals that becomes slow.
- So, after training, they compress these layers to fewer weights via truncated SVD.
- A weight matrix `W` (of size `u x v`) is approximated via `W ≈ U Sigma_t V^T`.
- `U` (`u x t`) are the first `t` left-singular vectors of `W`.
- `Sigma_t` is a `t x t` diagonal matrix of the top `t` singular values.
- `V` (`v x t`) are the first `t` right-singular vectors of `W`.
- `W` is then replaced by two layers: one contains `Sigma_t V^T` as weights (no biases), the other contains `U` as weights (with the original biases).
- The parameter count goes down from `uv` to `t(u+v)` (see the sketch below).
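A minimal numpy sketch of this compression; the layer sizes and the choice of `t` are illustrative, and random weights stand in for trained ones:

```python
# Sketch of compressing a fully connected layer with truncated SVD (numpy).
# Sizes are illustrative; the fc layers in VGG16 are larger.
import numpy as np

u, v, t = 1024, 1024, 256             # output size, input size, number of kept singular values
W = np.random.randn(u, v)             # original fc weight matrix (stand-in for trained weights)
b = np.random.randn(u)                # original fc bias

U, S, Vt = np.linalg.svd(W, full_matrices=False)
U_t, S_t, Vt_t = U[:, :t], S[:t], Vt[:t, :]

W1 = np.diag(S_t) @ Vt_t              # first replacement layer: Sigma_t V^T (t x v), no bias
W2 = U_t                              # second replacement layer: U (u x t), keeps the bias b

x = np.random.randn(v)                # input to the fc layer
y_exact = W @ x + b                   # original layer: u*v weights
y_approx = W2 @ (W1 @ x) + b          # compressed layers: t*(u+v) weights
print(np.max(np.abs(y_exact - y_approx)))   # approximation error (trained fc weights compress much better than random ones)
```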
-
Results
- They try three base models:
- AlexNet (Small, S)
- VGG-CNN-M-1024 (Medium, M)
- VGG16 (Large, L)
- On VGG16 and Pascal VOC 2007, compared to original R-CNN:
- Training time down to 9.5h from 84h (8.8x faster).
- Test time with truncated SVD (top 1024 singular values) improves from 47 seconds per image to 0.22 seconds per image (213x faster).
- Test time without SVD improves similarly, to 0.32 seconds per image.
- mAP improves from 66.0% to 66.6% (66.9% without SVD).
- Per class accuracy results:
- Fixing the weights of VGG16's convolutional layers and only fine-tuning the fully connected layers (those are applied to each region proposal) decreases the accuracy to 61.4%.
- This decrease in accuracy is most significant for the later convolutional layers, but marginal for the first layers.
- Therefore they only train the convolutional layers starting with `conv3_1` (9 out of 13 layers), which speeds up training.
- Multi-task training
- Training models on classification and bounding box regression instead of only on classification improves the mAP (from 62.6% to 66.9%).
- Doing this in one hierarchy instead of two separate models (one for classification, one for bounding box regression) increases mAP by roughly 2-3 percentage points.
- They did not find a significant benefit of training the model on multiple scales (e.g. same image sometimes at 400x400, sometimes at 600x600, sometimes at 800x800 etc.).
- Note that their raw CNN (everything before RoI-Pooling) is fully convolutional, so they can feed the images at any scale through the network.
- Increasing the amount of training data seemed to improve mAP a bit, but not as much as one might hope for.
- Using a softmax loss instead of an SVM seemed to marginally increase mAP (0-1 percentage points).
- Using more region proposals from selective search does not simply increase mAP. Instead it can lead to higher recall, but lower precision.
- Using densely sampled region proposals (as in sliding window) significantly reduces mAP (from 59.2% to 52.9%). If SVMs instead of softmaxes are used, the results are even worse (49.3%).
- They try three base models: