Presented in arXiv:2202.07896.
Authors: Jiamin Li, Hong Xu, Yibo Zhu, Zherui Liu, Chuanxiong Guo, Cong Wang
Affiliations: ByteDance, City University of Hong Kong, The Chinese University of Hong Kong
This paper presents Aryl, a cluster scheduler that introduces capacity loaning, in which idle inference GPU servers are lent to training jobs.
It also exploits elastic scaling, which adjusts a training job’s GPU allocation so that loaned resources are better utilized.
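
To make the two ideas concrete, here is a minimal toy sketch in Python of how capacity loaning and elastic scaling could interact. It is not Aryl's actual algorithm: the names `ElasticJob`, `loanable_gpus`, `scale_out`, `scale_in`, and the `headroom` parameter are all hypothetical, and the greedy first-come policy is only a stand-in for whatever policy the paper uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ElasticJob:
    """Hypothetical elastic training job with a flexible GPU range."""
    name: str
    min_gpus: int  # smallest allocation the job can run with
    max_gpus: int  # largest allocation the job can make use of
    gpus: int = 0  # current allocation


def loanable_gpus(inference_total: int, inference_in_use: int,
                  headroom: int = 0) -> int:
    """GPUs the inference side could lend right now.

    `headroom` is an assumed safety margin kept back for inference spikes.
    """
    return max(0, inference_total - inference_in_use - headroom)


def scale_out(jobs: List[ElasticJob], spare: int) -> int:
    """Greedily give spare GPUs to jobs, never exceeding max_gpus.

    Returns the number of GPUs that could not be placed.
    """
    for job in jobs:
        grant = min(spare, job.max_gpus - job.gpus)
        job.gpus += grant
        spare -= grant
    return spare


def scale_in(jobs: List[ElasticJob], needed: int) -> int:
    """Shrink jobs toward min_gpus to return `needed` GPUs to inference.

    Returns the shortfall: GPUs that scaling in alone could not free.
    """
    for job in jobs:
        give = min(needed, job.gpus - job.min_gpus)
        job.gpus -= give
        needed -= give
    return needed
```

In this sketch, a non-zero shortfall from `scale_in` would mean that shrinking elastic jobs is not enough to satisfy a reclaim request, so some other mechanism (for example, stopping a job) would be needed; how that case is handled is outside the scope of this example.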