- Title: Resnet in Resnet: Generalizing Residual Architectures
- Authors: Sasha Targ, Diogo Almeida, Kevin Lyman
- Link: http://arxiv.org/abs/1603.08029
- Tags: Neural Network, residual
- Year: 2016
-
What
- They describe an architecture that merges classical convolutional networks and residual networks.
- The architecture can (theoretically) learn anything that a classical convolutional network or a residual network can learn, as it contains both of them.
- The architecture can (theoretically) learn how many convolutional layers it should use per residual block (up to the total number of convolutional layers in the network).
-
How
- Just like residual networks, they have "blocks". Each block contains convolutional layers.
- Each block contains residual units and non-residual units.
- They have two "streams" of data flowing through the network (i.e. the tensors produced by each block):
  - Residual stream: The residual units write to this stream (i.e. it is their output).
  - Transient stream: The non-residual units write to this stream.
- Both residual and non-residual units receive both streams as input, but each writes only to its own stream as output (see the code sketch at the end of this section).
- Their architecture is visualized in the paper's figures.
- Because of this architecture, their model can learn the number of layers per residual block (though BN and ReLU might cause problems here?).
- The easiest way to implement this should be along the lines of the following (some of the visualized convolutions can be merged); a code sketch follows this list:
  - Input of size CxHxW (both streams, each C/2 planes)
  - Concat both streams (C planes)
  - Residual block: Apply C/2 convolutions to the C input planes, with shortcut addition afterwards.
  - Transient block: Apply C/2 convolutions to the C input planes.
  - Apply BN
  - Apply ReLU
  - Concat both results
  - Output of size CxHxW (both streams, each C/2 planes), which becomes the input of the next unit.
- The whole operation can also be implemented with just a single convolutional layer, but then one has to make sure that some weights stay at zero.
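A minimal PyTorch-style sketch of one such unit, following the steps listed above. This is an illustration under my own assumptions, not the authors' code: the class and attribute names, the 3x3 kernel size, and the exact placement of BN and ReLU relative to the shortcut addition are guesses.

```python
import torch
import torch.nn as nn


class RiRUnit(nn.Module):
    """One generalized residual unit with a residual and a transient stream."""

    def __init__(self, c):
        super().__init__()
        self.c = c  # planes per stream (C/2 in the notation above)
        # Each stream reads both streams (2*c planes) and writes c planes.
        self.conv_residual = nn.Conv2d(2 * c, c, kernel_size=3, padding=1)
        self.conv_transient = nn.Conv2d(2 * c, c, kernel_size=3, padding=1)
        self.bn_r = nn.BatchNorm2d(c)
        self.bn_t = nn.BatchNorm2d(c)

    def forward(self, x):
        # x: both streams concatenated, shape (N, 2*c, H, W);
        # planes 0..c-1 are the residual stream, planes c..2c-1 the transient stream.
        r_in = x[:, : self.c]
        r = self.conv_residual(x) + r_in   # C/2 convolutions + shortcut addition
        t = self.conv_transient(x)         # C/2 convolutions, no shortcut
        r = torch.relu(self.bn_r(r))       # BN, then ReLU
        t = torch.relu(self.bn_t(t))
        return torch.cat([r, t], dim=1)    # concat -> output of size (N, 2*c, H, W)
```

The output is kept as one concatenated tensor so that it can directly serve as the next unit's input; stacking such units (with widening/striding wherever the paper's architecture does it) would give the full two-stream network.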
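The single-convolution variant from the last bullet might look roughly like this: one convolution produces both streams at once, and the identity shortcut is added only to the residual half of its output. Again an assumed sketch, not the paper's code; if the shortcut and the stream separation were folded into the convolution weights themselves, the corresponding parts of the weight tensor would indeed have to be kept fixed (at zero or identity), as noted above.

```python
import torch
import torch.nn as nn


class MergedRiRUnit(nn.Module):
    """Both stream convolutions merged into a single Conv2d (assumed sketch)."""

    def __init__(self, c):
        super().__init__()
        self.c = c  # planes per stream (C/2 in the notation above)
        # One convolution: output planes 0..c-1 form the residual stream,
        # planes c..2c-1 form the transient stream.
        self.conv = nn.Conv2d(2 * c, 2 * c, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(2 * c)

    def forward(self, x):
        out = self.conv(x)
        # Shortcut addition only for the residual half of the output planes.
        r = out[:, : self.c] + x[:, : self.c]
        t = out[:, self.c:]
        return torch.relu(self.bn(torch.cat([r, t], dim=1)))
```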
-
Results
- They test on CIFAR-10 and CIFAR-100.
- They search for optimal hyperparameters (learning rate, optimizer, L2 penalty, initialization method, type of shortcut connection in residual blocks) using a grid search (a sketch of such a search follows this list).
- Their model improves upon a wide ResNet and an equivalent non-residual CNN by a good margin (roughly 0.5-1 percentage points on CIFAR-10, 1-2 percentage points on CIFAR-100).
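A minimal sketch of such a grid search; the candidate values and the train_and_evaluate helper below are hypothetical placeholders, not taken from the paper.

```python
from itertools import product

# Hypothetical candidate values; the paper is not quoted here.
grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "optimizer": ["sgd", "adam"],
    "l2_penalty": [1e-4, 5e-4],
    "init": ["he_normal", "glorot_uniform"],
    "shortcut": ["identity", "projection"],
}


def train_and_evaluate(config):
    """Hypothetical stand-in: train a model with this config, return test accuracy."""
    return 0.0


# Enumerate every combination and keep the best-scoring configuration.
best_config = max(
    (dict(zip(grid, values)) for values in product(*grid.values())),
    key=train_and_evaluate,
)
print(best_config)
```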