Presented at SoCC 2020.
Authors: Yangrui Chen, Yanghua Peng, Yixin Bao, Chuan Wu, Yibo Zhu, Chuanxiong Guo
A collaboration between HKU and ByteDance.
- Parameter servers (PS) are widely used in distributed DNN training, but their performance can be degraded by stragglers arising for various reasons (e.g., imbalanced parameter distribution, bandwidth contention, or computation interference).
- Few studies have investigated efficient parameter (i.e., load) distribution among PSs.
- Proposes PSLD, a dynamic parameter server load distribution scheme.
- PSLD mitigates PS straggler issues and accelerates distributed model training.
- An exploitation-exploration method is used to 1) scale parameter servers in and out and 2) adjust the parameter distribution among PSs (a rough illustrative sketch follows at the end of these notes).
- Implemented on BytePS and vanilla MXNet PS architectures.
I have not read the details of the algorithms.
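Since the paper's actual algorithm isn't summarized here, the following is only a minimal sketch of what an exploitation-exploration load-rebalancing loop could look like. The epsilon-greedy policy, the `PSState`/`measure_load`/`rebalance_step` names, and the synthetic load measurements are my own assumptions for illustration, not PSLD's scheme or the BytePS/MXNet API.

```python
import random

# Illustrative only: an epsilon-greedy exploitation-exploration loop that
# shifts parameter blocks away from straggling parameter servers. All names
# here are hypothetical and not taken from PSLD, BytePS, or MXNet.

class PSState:
    """Tracks which parameter blocks each (hypothetical) server holds."""
    def __init__(self, num_servers, blocks):
        self.assignment = {s: [] for s in range(num_servers)}
        for i, block in enumerate(blocks):
            self.assignment[i % num_servers].append(block)  # round-robin start


def measure_load(state):
    """Stand-in for measured per-server push/pull time; a real system would
    profile communication instead of simulating it."""
    return {s: sum(b["size"] for b in blocks) * random.uniform(0.9, 1.3)
            for s, blocks in state.assignment.items()}


def rebalance_step(state, epsilon=0.2):
    """One step: exploit by moving the largest block from the slowest server
    to the fastest one, or explore (with probability epsilon) by trying a
    random move to avoid overreacting to transient stragglers."""
    load = measure_load(state)
    slowest = max(load, key=load.get)
    fastest = min(load, key=load.get)
    non_empty = [s for s in state.assignment if state.assignment[s]]
    if not non_empty:
        return
    if random.random() < epsilon:
        src = random.choice(non_empty)                    # exploration
        dst = random.choice(list(state.assignment))
        block = random.choice(state.assignment[src])
    else:
        if not state.assignment[slowest]:
            return
        src, dst = slowest, fastest                       # exploitation
        block = max(state.assignment[src], key=lambda b: b["size"])
    state.assignment[src].remove(block)
    state.assignment[dst].append(block)


if __name__ == "__main__":
    blocks = [{"name": f"layer{i}", "size": random.randint(1, 64)}
              for i in range(16)]
    state = PSState(num_servers=4, blocks=blocks)
    for _ in range(10):
        rebalance_step(state)
    print({s: [b["name"] for b in bs] for s, bs in state.assignment.items()})
```

The real PSLD also scales the number of servers in and out, which this toy loop does not attempt; a practical version would feed real profiling data into something like `measure_load` and coordinate block migration with the training workers.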