The main request for DecentralizedSGD is to implement a training regime where averaging can run continuously with the latest parameters. The issue with the current implementation is that DecentralizedSGD spends up to half of its time looking for groups, and by the time it actually averages model parameters, those parameters are a stale snapshot taken before the averager began looking for a group. Finally, when the averager writes the averaged parameters back into the model, it disregards any local changes made to the parameters during averaging.
Here are a few ideas on how to improve DecentralizedSGD:
after an averaging round, instead of overwriting the parameters, it would be better to compute weight = weight + averaged_weight - weight_before_averaging_step. This will prevent DecentralizedSGD from discarding local updates made concurrently with averager.step (see the first sketch after this list).
in DecentralizedAverager, we can implement a callback that allows the user to update the model parameters right before the beginning of AllReduce (i.e. after the group is formed). This should significantly reduce the staleness of the averaged parameters (second sketch below).
in DecentralizedSGD, we can modify the code that calls the averager's step to allow for concurrent matchmaking and AllReduce. In other words, once the averager has found one group, let it immediately start looking for the next group while AllReduce for the current group is still running (third sketch below).
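A minimal sketch of the delta-style update from the first idea, in plain PyTorch. All names here are illustrative: `run_averaging_round` stands for whatever call launches the averaging round (e.g. the averager's step) and returns the averaged tensors.

```python
import torch

@torch.no_grad()
def apply_averaging_delta(params, run_averaging_round):
    # Snapshot the parameters that will be sent into the averaging round.
    before = [p.detach().clone() for p in params]
    # Run the (potentially slow) averaging round; local training may keep
    # updating `params` in the meantime.
    averaged = run_averaging_round(before)
    # Instead of overwriting, apply only the averaging delta so that local
    # updates made during the round are preserved:
    #   weight = weight + averaged_weight - weight_before_averaging_step
    for param, avg, old in zip(params, averaged, before):
        param.add_(avg - old)
```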
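For the second idea, usage could look roughly like the sketch below. The `on_group_assembled` keyword is purely hypothetical and only illustrates the intended semantics; the callback refreshes the averager's local tensors (assumed to be accessible via get_tensors) with the latest model parameters once the group is formed, right before AllReduce begins.

```python
def refresh_averager_buffers():
    # Runs after matchmaking succeeds, right before AllReduce starts:
    # copy the *latest* model parameters into the averager's local tensors
    # so that the averaged result is as fresh as possible.
    with averager.get_tensors() as local_tensors:  # assumed accessor
        for tensor, param in zip(local_tensors, model.parameters()):
            tensor.copy_(param.detach())

# hypothetical callback argument, not part of the current API
averager.step(on_group_assembled=refresh_averager_buffers)
```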
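The third idea could be prototyped along these lines. `look_for_group` and `run_allreduce` are hypothetical stand-ins for the averager's matchmaking and AllReduce phases, which are not exposed separately today; the point is only that matchmaking for round N+1 overlaps with AllReduce for round N.

```python
from concurrent.futures import ThreadPoolExecutor

def averaging_loop(averager, stop_event):
    with ThreadPoolExecutor(max_workers=2) as pool:
        next_group = pool.submit(averager.look_for_group)  # hypothetical helper
        while not stop_event.is_set():
            group = next_group.result()
            # Start matchmaking for the next round immediately...
            next_group = pool.submit(averager.look_for_group)
            # ...while AllReduce for the current group runs in this thread.
            averager.run_allreduce(group)  # hypothetical helper
```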
Implementing this in an example will require the following steps:
create a root folder, e.g. ./hivemind/examples/swav, containing...
a modified training runner that uses DecentralizedSGD (minimal sketch below)
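A rough sketch of what that runner could look like. The DecentralizedSGD constructor arguments, `compute_swav_loss`, and `data_loader` are assumptions for illustration only; check the actual hivemind and vissl APIs when implementing.

```python
import torch
import hivemind

dht = hivemind.DHT(start=True)          # each peer would also pass initial_peers=[...]
model = torch.nn.Linear(512, 10)        # stand-in for the actual ViT / ResNet50 trunk

opt = hivemind.DecentralizedSGD(
    model.parameters(), lr=0.01,
    dht=dht, prefix="swav_run",         # peers sharing a prefix average with each other
    target_group_size=16,               # assumed keyword, see hivemind docs
)

for batch in data_loader:               # data_loader comes from vissl / the example (placeholder)
    loss = compute_swav_loss(model, batch)   # hypothetical SwAV loss helper
    loss.backward()
    opt.step()                          # also participates in background averaging
    opt.zero_grad()
```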
Let's add a tutorial for training ViT/ResNet50 with Decentralized SGD.
The intent is to use the DecentralizedSGD optimizer with the vissl library for SwAV.
Here's a basic tutorial for training SimCLR in vissl: https://colab.research.google.com/drive/1Rt3Plt3ph84i1A-eolLFafybwjrBFxYe?usp=sharing
The engineering is up to you, but it appears that the two hardest tasks will be to