Presented at SoCC 2020.
Authors: Yangrui Chen, Yanghua Peng, Yixin Bao, Chuan Wu, Yibo Zhu, Chuanxiong Guo
A collaboration between HKU and ByteDance.
- Parameter servers (PS) are widely used in distributed DNN training, but their performance can be degraded by stragglers arising for various reasons (e.g., imbalanced parameter distribution, bandwidth contention, or computation interference).
- Few studies have investigated efficient parameter (i.e., load) distribution among PSs.
- Proposes PSLD, a dynamic parameter server load distribution scheme.
- PSLD mitigates PS straggler issues and accelerates distributed model training.
- An exploitation-exploration method is used to 1) scale parameter servers in and out and 2) adjust the parameter distribution among PSs (a rough illustrative sketch follows at the end of these notes).
- Implemented on BytePS and vanilla MXNet PS architectures.
I have not read the details of the algorithms.
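Since the paper's actual algorithm isn't summarized here, the following is only a minimal sketch of what an exploitation-exploration load-rebalancing loop could look like. The epsilon-greedy policy, the `PSState`/`measure_load`/`rebalance_step` names, and the synthetic load measurements are my own assumptions for illustration, not PSLD's scheme or the BytePS/MXNet API.

```python
import random

# Illustrative only: an epsilon-greedy exploitation-exploration loop that
# shifts parameter blocks away from straggling parameter servers. All names
# here are hypothetical and not taken from PSLD, BytePS, or MXNet.

class PSState:
    """Tracks which parameter blocks each (hypothetical) server holds."""
    def __init__(self, num_servers, blocks):
        self.assignment = {s: [] for s in range(num_servers)}
        for i, block in enumerate(blocks):
            self.assignment[i % num_servers].append(block)  # round-robin start


def measure_load(state):
    """Stand-in for measured per-server push/pull time; a real system would
    profile communication instead of simulating it."""
    return {s: sum(b["size"] for b in blocks) * random.uniform(0.9, 1.3)
            for s, blocks in state.assignment.items()}


def rebalance_step(state, epsilon=0.2):
    """One step: exploit by moving the largest block from the slowest server
    to the fastest one, or explore (with probability epsilon) by trying a
    random move to avoid overreacting to transient stragglers."""
    load = measure_load(state)
    slowest = max(load, key=load.get)
    fastest = min(load, key=load.get)
    non_empty = [s for s in state.assignment if state.assignment[s]]
    if not non_empty:
        return
    if random.random() < epsilon:
        src = random.choice(non_empty)                    # exploration
        dst = random.choice(list(state.assignment))
        block = random.choice(state.assignment[src])
    else:
        if not state.assignment[slowest]:
            return
        src, dst = slowest, fastest                       # exploitation
        block = max(state.assignment[src], key=lambda b: b["size"])
    state.assignment[src].remove(block)
    state.assignment[dst].append(block)


if __name__ == "__main__":
    blocks = [{"name": f"layer{i}", "size": random.randint(1, 64)}
              for i in range(16)]
    state = PSState(num_servers=4, blocks=blocks)
    for _ in range(10):
        rebalance_step(state)
    print({s: [b["name"] for b in bs] for s, bs in state.assignment.items()})
```

The real PSLD also scales the number of servers in and out, which this toy loop does not attempt; a practical version would feed real profiling data into something like `measure_load` and coordinate block migration with the training workers.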