The main request for DecentralizedSGD is to implement a training regime where averaging can run continuously with the latest parameters. The issue with the current implementation is that DecentralizedSGD spends up to half of its time looking for groups, and by the time it actually averages model parameters, those parameters are a stale snapshot taken before the averager began looking for a group. Finally, when the averager writes the averaged parameters back into the model, it disregards any local changes made to the parameters during averaging.
Here are a few ideas on how to improve DecentralizedSGD:
after an averaging round, instead of overwriting the parameters, it would be better to compute weight = weight + averaged_weight - weight_before_averaging_step. This will prevent DecentralizedSGD from discarding local updates made concurrently with averager.step (see the first sketch after this list).
in DecentralizedAverager, we can implement a callback that allows the user to update the model parameters right before the beginning of AllReduce (i.e. after the group is formed). This should significantly reduce the staleness of the averaged parameters (second sketch below).
in DecentralizedSGD, we can modify the code that calls the averager's step to allow for concurrent matchmaking and AllReduce. In other words, once the averager has found one group, let it immediately start looking for the next group while AllReduce for the current group is still running (third sketch below).
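A minimal sketch of the delta-style update from the first idea, in plain PyTorch. All names here are illustrative: `run_averaging_round` stands for whatever call launches the averaging round (e.g. the averager's step) and returns the averaged tensors.

```python
import torch

@torch.no_grad()
def apply_averaging_delta(params, run_averaging_round):
    # Snapshot the parameters that will be sent into the averaging round.
    before = [p.detach().clone() for p in params]
    # Run the (potentially slow) averaging round; local training may keep
    # updating `params` in the meantime.
    averaged = run_averaging_round(before)
    # Instead of overwriting, apply only the averaging delta so that local
    # updates made during the round are preserved:
    #   weight = weight + averaged_weight - weight_before_averaging_step
    for param, avg, old in zip(params, averaged, before):
        param.add_(avg - old)
```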
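For the second idea, usage could look roughly like the sketch below. The `on_group_assembled` keyword is purely hypothetical and only illustrates the intended semantics; the callback refreshes the averager's local tensors (assumed to be accessible via get_tensors) with the latest model parameters once the group is formed, right before AllReduce begins.

```python
def refresh_averager_buffers():
    # Runs after matchmaking succeeds, right before AllReduce starts:
    # copy the *latest* model parameters into the averager's local tensors
    # so that the averaged result is as fresh as possible.
    with averager.get_tensors() as local_tensors:  # assumed accessor
        for tensor, param in zip(local_tensors, model.parameters()):
            tensor.copy_(param.detach())

# hypothetical callback argument, not part of the current API
averager.step(on_group_assembled=refresh_averager_buffers)
```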
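The third idea could be prototyped along these lines. `look_for_group` and `run_allreduce` are hypothetical stand-ins for the averager's matchmaking and AllReduce phases, which are not exposed separately today; the point is only that matchmaking for round N+1 overlaps with AllReduce for round N.

```python
from concurrent.futures import ThreadPoolExecutor

def averaging_loop(averager, stop_event):
    with ThreadPoolExecutor(max_workers=2) as pool:
        next_group = pool.submit(averager.look_for_group)  # hypothetical helper
        while not stop_event.is_set():
            group = next_group.result()
            # Start matchmaking for the next round immediately...
            next_group = pool.submit(averager.look_for_group)
            # ...while AllReduce for the current group runs in this thread.
            averager.run_allreduce(group)  # hypothetical helper
```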
Implementing this in an example will require the following steps:
create a root folder, e.g. ./hivemind/examples/swav, containing...
a modified training runner that uses DecentralizedSGD (minimal sketch below)
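A rough sketch of what that runner could look like. The DecentralizedSGD constructor arguments, `compute_swav_loss`, and `data_loader` are assumptions for illustration only; check the actual hivemind and vissl APIs when implementing.

```python
import torch
import hivemind

dht = hivemind.DHT(start=True)          # each peer would also pass initial_peers=[...]
model = torch.nn.Linear(512, 10)        # stand-in for the actual ViT / ResNet50 trunk

opt = hivemind.DecentralizedSGD(
    model.parameters(), lr=0.01,
    dht=dht, prefix="swav_run",         # peers sharing a prefix average with each other
    target_group_size=16,               # assumed keyword, see hivemind docs
)

for batch in data_loader:               # data_loader comes from vissl / the example (placeholder)
    loss = compute_swav_loss(model, batch)   # hypothetical SwAV loss helper
    loss.backward()
    opt.step()                          # also participates in background averaging
    opt.zero_grad()
```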
Let's add a tutorial for training ViT/ResNet50 with Decentralized SGD.
The intent is to use the DecentralizedSGD optimizer with the vissl library for SwAV.
Here's a basic tutorial for training SimCLR in vissl: https://colab.research.google.com/drive/1Rt3Plt3ph84i1A-eolLFafybwjrBFxYe?usp=sharing
The engineering is up to you, but it appears that the two hardest tasks will be to