# Elastic Parameter Server Load Distribution in Deep Learning Clusters

## Metadata

Presented at SoCC 2020.

Authors: Yangrui Chen, Yanghua Peng, Yixin Bao, Chuan Wu, Yibo Zhu, Chuanxiong Guo

## Understanding the paper

A collaboration between HKU and ByteDance.

### Motivation

- Parameter servers (PS) are widely used in distributed DNN training, but their performance can be degraded by stragglers arising from causes such as imbalanced parameter distribution, bandwidth contention, or computation interference.
- Few studies have investigated efficient parameter (i.e., load) distribution among parameter servers.

### Solution

- Propose PSLD, a dynamic parameter server load distribution scheme.
  - Mitigates PS straggler issues and accelerates distributed model training.
  - Uses an exploitation-exploration method to 1) scale parameter servers in and out, and 2) adjust the parameter distribution among them.
  - Implemented on both the BytePS and vanilla MXNet PS architectures.
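To make the exploitation-exploration idea concrete, here is a minimal sketch of one rebalancing step: with probability epsilon it explores a random shard migration, otherwise it exploits current measurements by moving a parameter shard from the slowest PS to the fastest. All names (`rebalance_step`, `ps_load`, `shard_assignment`) and the epsilon-greedy policy are my assumptions for illustration, not the paper's actual PSLD algorithm.

```python
import random

def rebalance_step(ps_load, shard_assignment, epsilon=0.1):
    """One hypothetical exploitation-exploration rebalancing step.

    ps_load: dict mapping server name -> measured push/pull time (seconds)
             over the last monitoring interval (higher = more loaded).
    shard_assignment: dict mapping parameter-shard id -> server name.
    Returns (shard, src, dst) for the migration performed, or None.
    """
    servers = list(ps_load)
    if random.random() < epsilon:
        # Explore: pick a random source/destination pair to gather signal.
        src, dst = random.sample(servers, 2)
    else:
        # Exploit: move load from the slowest (straggler) to the fastest PS.
        src = max(servers, key=ps_load.get)
        dst = min(servers, key=ps_load.get)
    # Migrate one shard currently hosted on the straggler.
    for shard, server in shard_assignment.items():
        if server == src:
            shard_assignment[shard] = dst
            return shard, src, dst
    return None
```

In a real PS architecture the load signal would come from per-server push/pull latency counters, and scaling in/out would add or remove servers before redistributing shards; this sketch only shows the shard-migration decision.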

### The workflow of PSLD

I have not yet read the details of the algorithms.