Paper

  • Title: Resnet in Resnet: Generalizing Residual Architectures
  • Authors: Sasha Targ, Diogo Almeida, Kevin Lyman
  • Link: http://arxiv.org/abs/1603.08029
  • Tags: Neural Network, residual
  • Year: 2016

Summary

  • What

    • They describe an architecture that merges classical convolutional networks and residual networks.
    • The architecture can (theoretically) learn anything that a classical convolutional network or a residual network can learn, as it contains both of them.
    • The architecture can (theoretically) learn how many convolutional layers it should use per residual block (up to the total number of convolutional layers in the whole network).
  • How

    • Just like residual networks, they have "blocks". Each block contains convolutional layers.
    • Each block contains residual units and non-residual ("transient") units.
    • They have two "streams" of data in their network (simply the feature maps produced by each block):
      • Residual stream: The residual units write to this stream (i.e. it is their output).
      • Transient stream: The non-residual units write to this stream.
    • Both kinds of units receive both streams as input, but each writes only to its own stream as output.
    • Their architecture visualized: (figure: "Architecture")
    • Because of this architecture, their model can learn the number of layers per residual block (though BN and ReLU might cause problems here?): (figure: "Learning layer count")
    • The easiest way to implement this should be along the lines of the following (some of the visualized convolutions can be merged; a minimal code sketch is given at the end of this summary):
      • Input of size CxHxW (both streams concatenated, each stream has C/2 planes)
        • Concat
          • Residual unit: Apply C/2 convolutions to the C input planes, with shortcut addition afterwards.
          • Transient unit: Apply C/2 convolutions to the C input planes.
        • Apply BN
        • Apply ReLU
      • Output of size CxHxW.
    • The whole operation can also be implemented with just a single convolutional layer, but then one has to make sure that some weights stay at zero.
  • Results

    • They test on CIFAR-10 and CIFAR-100.
    • They search for optimal hyperparameters (learning rate, optimizer, L2 penalty, initialization method, type of shortcut connection in residual blocks) using a grid search.
    • Their model improves upon a wide ResNet and an equivalent non-residual CNN by a good margin (CIFAR-10: 0.5-1%, CIFAR-100: 1-2%).
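
A minimal code sketch of the generalized residual unit described in the "How" section above, assuming PyTorch (the paper does not prescribe a framework; the class and variable names here are made up for illustration): two streams of C/2 planes each are concatenated, each convolution sees all C input planes but writes only C/2 output planes, the residual half receives a shortcut addition, and BN + ReLU are applied to the concatenated result.

```python
# Sketch only: a generalized residual ("ResNet in ResNet") unit with a residual
# and a transient stream, following the recipe summarized above. The 3x3 kernel
# size and all names are assumptions for illustration, not taken from the paper.
import torch
import torch.nn as nn


class GeneralizedResidualUnit(nn.Module):
    """Operates on two streams of C/2 channels each (C channels in total)."""

    def __init__(self, channels: int):
        super().__init__()
        assert channels % 2 == 0, "channels must split evenly into two streams"
        half = channels // 2
        # Both convolutions read the full concatenated input (C planes),
        # but each writes only to its own stream (C/2 planes).
        self.conv_residual = nn.Conv2d(channels, half, kernel_size=3, padding=1)
        self.conv_transient = nn.Conv2d(channels, half, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, residual: torch.Tensor, transient: torch.Tensor):
        x = torch.cat([residual, transient], dim=1)  # C x H x W
        r = self.conv_residual(x) + residual         # shortcut addition (residual stream only)
        t = self.conv_transient(x)                   # no shortcut (transient stream)
        out = self.relu(self.bn(torch.cat([r, t], dim=1)))
        half = r.size(1)
        return out[:, :half], out[:, half:]          # split back into the two streams


# Example usage: two streams of 16 planes each (C = 32) on 32x32 feature maps.
if __name__ == "__main__":
    unit = GeneralizedResidualUnit(channels=32)
    r0 = torch.randn(8, 16, 32, 32)
    t0 = torch.randn(8, 16, 32, 32)
    r1, t1 = unit(r0, t0)
    print(r1.shape, t1.shape)  # both: torch.Size([8, 16, 32, 32])
```

Since both convolutions read the same concatenated input and their outputs are simply concatenated, they could also be merged into a single Conv2d mapping C to C channels, in the spirit of the single-convolution implementation mentioned above.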