Paper

  • Title: Identity Mappings in Deep Residual Networks
  • Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
  • Link: http://arxiv.org/abs/1603.05027v2
  • Tags: Neural Network, residual
  • Year: 2016

Summary

  • What

    • The authors reexamine the design of the original residual unit in deep residual networks.
    • They compare several variations of the residual unit and find one that performs noticeably better.
  • How

    • The new variation starts the transformation branch of each residual unit with BN and a ReLU.
    • It removes BN and ReLU after the last convolution.
    • As a result, the information from previous layers can flow completely unaltered through the shortcut branch of each residual unit.
    • They tested several variations of where to place BN and ReLU within the unit. The new and better design ("full pre-activation") places both before each convolution; a minimal code sketch of this design follows the Results list below. (Figure: BN and ReLU positions.)
    • They also tried various alternative designs for the shortcut connections. However, all of these performed worse than the original identity shortcut; only one, (d), came close under certain conditions. The recommendation is therefore to stick with the original design. (Figure: Shortcut designs.)
  • Results

    • Significantly faster training for very deep residual networks (1001 layers).
    • Better regularization due to the placement of BN.
    • CIFAR-10 and CIFAR-100 results, old vs. new design. (Figure: Old vs. new results.)
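
To make the new ordering concrete, here is a minimal sketch of a full pre-activation residual unit, assuming a PyTorch implementation. The class name PreActBlock, the 3x3 convolutions, and the projection shortcut for shape changes are illustrative choices, not taken from the paper:

```python
# Sketch of a "full pre-activation" residual unit (assumed PyTorch
# implementation; names and hyperparameters are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreActBlock(nn.Module):
    """Full pre-activation residual unit: BN -> ReLU -> conv (twice) on the
    transformation branch, nothing after the last conv, identity shortcut."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                               padding=1, bias=False)
        # Projection only when the shapes differ; otherwise a pure identity.
        self.shortcut = None
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1,
                                      stride=stride, bias=False)

    def forward(self, x):
        # Transformation branch starts with BN and ReLU ("pre-activation").
        out = F.relu(self.bn1(x))
        # If a projection is needed, it sees the pre-activated input.
        identity = x if self.shortcut is None else self.shortcut(out)
        out = self.conv1(out)
        out = self.conv2(F.relu(self.bn2(out)))
        # No BN/ReLU after the last conv, so the shortcut stays unaltered.
        return out + identity

block = PreActBlock(64, 64)
y = block(torch.randn(2, 64, 32, 32))  # same shape out: (2, 64, 32, 32)
```

Note how, when the shortcut is the identity, the input reaches the addition completely unaltered; stacking such blocks keeps a clean signal path through the whole network, which is the property the paper argues makes very deep networks trainable.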