- EasyScale: Elastic Training with Consistent Accuracy and Improved Utilization on GPUs (SC 2023) [Paper] [Code]
- BUAA & Alibaba
- Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency (SC 2023) [Paper] [Code]
- NUS
- Supporting Very Large Models using Automatic Dataflow Graph Partitioning (EuroSys 2019) [Paper]
- NYU
- Tofu: automatically partitions a dataflow graph of fine-grained tensor operations across multiple GPUs (toy sketch below).
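A minimal, hypothetical sketch of the intra-operator partitioning idea in plain PyTorch (not Tofu's actual system or API): split one tensor operation across two "workers" along a data dimension and check that the partial results recombine into the unpartitioned result.

```python
# Toy illustration of operator partitioning in the spirit of Tofu (not its real API):
# split a matmul along the row dimension across two workers and verify that concatenating
# the partial results reproduces the unpartitioned operator.
import torch

def partitioned_matmul(x, w, num_parts=2):
    """Compute x @ w by partitioning x row-wise across `num_parts` workers."""
    shards = torch.chunk(x, num_parts, dim=0)    # each worker holds one shard of x
    partials = [shard @ w for shard in shards]   # per-worker local computation
    return torch.cat(partials, dim=0)            # gather the partial outputs

x = torch.randn(8, 16)
w = torch.randn(16, 4)
assert torch.allclose(partitioned_matmul(x, w), x @ w, atol=1e-6)
```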
- One weird trick for parallelizing convolutional neural networks (arXiv 1404.5997) [Paper]
- Data parallelism for convolutional layers; model parallelism for fully-connected layers.
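A toy, single-process sketch of this hybrid scheme, assuming PyTorch (sizes and variable names are made up): the convolutional layer runs data-parallel on batch shards, while the fully-connected layer is model-parallel with its weight split column-wise across replicas.

```python
# Illustration only: data parallelism for the conv layer, model parallelism for the FC layer.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
conv_w = torch.randn(8, 3, 3, 3)        # conv weight, replicated on every "device"
fc_w = torch.randn(8 * 30 * 30, 10)     # full FC weight, to be split column-wise

batch = torch.randn(4, 3, 32, 32)
shards = torch.chunk(batch, 2, dim=0)   # data parallelism: each replica gets half the batch

# Each replica runs the conv layer on its own shard, then activations are gathered.
feats = torch.cat([F.conv2d(s, conv_w).flatten(1) for s in shards], dim=0)

# Model parallelism: replica i holds one column slice of fc_w and computes its output slice.
fc_w_cols = torch.chunk(fc_w, 2, dim=1)
out = torch.cat([feats @ w_i for w_i in fc_w_cols], dim=1)

assert torch.allclose(out, feats @ fc_w, atol=1e-4)
```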
- A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (OSDI 2020) [Personal Notes] [Paper] [Code]
- THU & ByteDance
- BytePS: Communication framework
- Leverage spare CPU and bandwidth resources
- Consider network topology
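A toy single-process simulation of BytePS-style partitioned aggregation, assuming PyTorch (the real framework runs workers and CPU summation servers as separate processes and tunes the partitioning for bandwidth and topology).

```python
# Toy simulation: gradients are partitioned, each partition is assigned to a CPU "summation
# server", workers push their slices, servers add them up, and workers pull the result back.
import torch

NUM_WORKERS, NUM_SERVERS = 4, 2

# Each worker produced a gradient for the same parameter.
worker_grads = [torch.randn(10) for _ in range(NUM_WORKERS)]

# Partition the gradient index space across servers.
partitions = torch.chunk(torch.arange(10), NUM_SERVERS)

# Push phase: every server sums the slice of every worker's gradient assigned to it.
server_sums = [sum(g[idx] for g in worker_grads) for idx in partitions]

# Pull phase: workers reassemble the fully aggregated gradient from all servers.
aggregated = torch.cat(server_sums)
assert torch.allclose(aggregated, sum(worker_grads))
```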
- Zico: Efficient GPU Memory Sharing for Concurrent DNN Training (ATC 2021) [Personal Notes] [Paper]
- UNIST & Ajou & Alibaba & KAIST
- Reduce the overall GPU memory consumption of co-located DNN training jobs
- Utilize NVIDIA MPS
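A toy illustration (made-up numbers, plain Python) of the observation Zico exploits: per-iteration GPU memory grows during the forward pass and shrinks during the backward pass, so staggering two co-located jobs lowers their combined peak.

```python
# One job's memory footprint (GB) over a single training iteration: forward ramps up,
# backward ramps back down. Numbers are invented for illustration.
profile = [1, 2, 3, 4, 4, 3, 2, 1]
period = len(profile)

def combined_peak(offset):
    """Peak memory when job B runs `offset` steps behind job A (both repeat every iteration)."""
    return max(profile[t % period] + profile[(t - offset) % period] for t in range(period))

aligned = combined_peak(0)                       # both jobs hit their forward peak together
best_offset = min(range(period), key=combined_peak)
print(f"aligned peak: {aligned} GB, staggered peak: {combined_peak(best_offset)} GB "
      f"(offset {best_offset} steps)")
```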
- Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications (MLSys 2020) [Paper] [Code]
- UMich SymbioticLab
- Fine-grained GPU sharing; customized TensorFlow.
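An illustrative sketch of iteration-granularity job switching using Python generators; Salus's actual primitives (fast job switching and GPU memory sharing) live inside a customized TensorFlow runtime, not in user code like this.

```python
# Each job yields control at mini-batch boundaries; a scheduler decides which job may use
# the GPU next, so switching happens at iteration granularity instead of job granularity.
def training_job(name, iterations):
    for step in range(iterations):
        # ... run one mini-batch on the GPU here ...
        yield f"{name} finished step {step}"

def round_robin(jobs):
    """Interleave jobs one iteration at a time until all of them finish."""
    while jobs:
        job = jobs.pop(0)
        try:
            print(next(job))
            jobs.append(job)      # job gets another turn after the others
        except StopIteration:
            pass                  # job completed; its share of the GPU is released

round_robin([training_job("job-A", 3), training_job("job-B", 2)])
```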
- Gandiva: Introspective Cluster Scheduling for Deep Learning (OSDI 2018) [Paper]
- MSRA
- Time slicing; suspend and resume; mini-batch granularity.
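A toy sketch of suspend-and-resume at a mini-batch boundary, assuming PyTorch: snapshot the job's state to host memory where its GPU footprint is smallest, then restore it when the scheduler hands the GPU back. Gandiva implements this inside the framework rather than in user code.

```python
import copy
import torch

model = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def train_one_minibatch():
    opt.zero_grad()
    loss = model(torch.randn(8, 16)).pow(2).mean()
    loss.backward()
    opt.step()

train_one_minibatch()

# Suspend: snapshot state at the iteration boundary (held in host memory; GPU can be released).
snapshot = {"model": copy.deepcopy(model.state_dict()),
            "optim": copy.deepcopy(opt.state_dict())}

# ... the scheduler time-slices the GPU to another job here ...

# Resume: restore state and continue training exactly where the job left off.
model.load_state_dict(snapshot["model"])
opt.load_state_dict(snapshot["optim"])
train_one_minibatch()
```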
- SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping (ASPLOS 2020) [Paper]
- NYU
- Tensor swapping
- Consider both GPU memory allocation and operator scheduling
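A simplified planning toy in the spirit of SwapAdvisor, with a made-up operator schedule and tensor sizes: when the working set exceeds the memory budget, swap out the resident tensor whose next use is furthest away. The real system instead searches jointly over operator schedules, memory allocation, and swap decisions with a genetic algorithm.

```python
schedule = ["A", "B", "C", "A", "D", "B"]   # operator order = tensor use order (made up)
size = {"A": 4, "B": 3, "C": 2, "D": 3}     # tensor sizes in GB (made up)
BUDGET = 8

def next_use(x, t):
    """Index of the next operator that needs tensor x after step t (inf if never)."""
    later = [i for i in range(t + 1, len(schedule)) if schedule[i] == x]
    return later[0] if later else float("inf")

resident, swaps = [], []
for t, tensor in enumerate(schedule):
    if tensor not in resident:
        resident.append(tensor)
    while sum(size[x] for x in resident) > BUDGET:
        victim = max((x for x in resident if x != tensor), key=lambda x: next_use(x, t))
        resident.remove(victim)
        swaps.append((t, victim))   # swap out now; prefetch back before its next use

print("swap-out decisions (step, tensor):", swaps)
```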
- Capuchin: Tensor-based GPU Memory Management for Deep Learning (ASPLOS 2020) [Paper]
- HUST & MSRA & USC
- Combination of tensor swapping and recomputation.
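A toy per-tensor policy in the spirit of Capuchin, with made-up timings: evict each candidate tensor via whichever of swapping or recomputation is cheaper. Capuchin derives such timings by profiling tensor accesses during early iterations and applies a more detailed cost-benefit analysis.

```python
candidates = [
    # (tensor name, measured swap time, measured recomputation time) in ms; numbers are invented
    ("conv1_out", 6.0, 9.0),
    ("relu1_out", 6.0, 0.4),
    ("fc1_out",   0.8, 2.5),
]

# Evict each tensor by the cheaper of the two mechanisms.
plan = {name: ("recompute" if recompute_ms < swap_ms else "swap")
        for name, swap_ms, recompute_ms in candidates}
print(plan)   # {'conv1_out': 'swap', 'relu1_out': 'recompute', 'fc1_out': 'swap'}
```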
- Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization (MLSys 2020) [Paper] [Code]
- UC Berkeley
- Formulate tensor rematerialization (recomputation) as an optimization problem (mixed-integer linear program).
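A brute-force toy version of that formulation on a tiny linear chain with made-up costs (Checkmate solves a MILP over realistic graphs): choose which activations to keep so that total recomputation cost is minimal under a peak-memory budget.

```python
from itertools import combinations

compute = [4, 1, 1, 4, 1, 1]   # cost to (re)compute each layer's activation (made up)
memory  = [2, 2, 2, 2, 2, 2]   # size of each layer's activation (made up)
BUDGET  = 8
n = len(compute)

def peak_mem(kept):
    """Kept activations stay resident; a dropped segment is rematerialized all at once."""
    segment, worst = 0, 0
    for i in range(n):
        segment = 0 if i in kept else segment + memory[i]
        worst = max(worst, segment)
    return sum(memory[i] for i in kept) + worst

best = None
for k in range(n + 1):
    for kept in combinations(range(n), k):
        if peak_mem(set(kept)) > BUDGET:
            continue
        recompute_cost = sum(compute[i] for i in range(n) if i not in kept)
        if best is None or recompute_cost < best[0]:
            best = (recompute_cost, kept)

print("extra recompute cost:", best[0], "checkpointed layers:", best[1])
```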
- SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks (PPoPP 2018) [Paper]
- Brown & UESTC & Los Alamos National Laboratory & Pacific Northwest National Laboratory & MIT
- Cost-aware recomputation
- Free and recompute tensors from layers with low computational overhead (e.g., pooling, activation); keep convolution outputs, which are costly to recompute
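An illustrative encoding of that rule (hypothetical layer names, plain Python, not SuperNeurons' runtime): free the outputs of cheap-to-recompute layer types and keep convolution outputs resident.

```python
# Cheap layer types whose outputs are freed and regenerated during the backward pass.
CHEAP_TO_RECOMPUTE = {"POOL", "RELU", "BN", "LRN"}

def plan_recomputation(layers):
    """layers: list of (name, type) in forward order -> names of tensors to free."""
    return {name for name, kind in layers if kind in CHEAP_TO_RECOMPUTE}

net = [("conv1", "CONV"), ("bn1", "BN"), ("relu1", "RELU"),
       ("pool1", "POOL"), ("conv2", "CONV"), ("relu2", "RELU")]
print(sorted(plan_recomputation(net)))   # ['bn1', 'pool1', 'relu1', 'relu2']; conv outputs kept
```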
- vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design (MICRO 2016) [Paper]
- NVIDIA
- Proactively swap feature maps out to CPU memory and prefetch them back, overlapping the CPU-GPU transfers with computation.
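A minimal sketch of stream-overlapped offloading in PyTorch, assuming a CUDA device (vDNN itself is implemented at the cuDNN/runtime level): copy a feature map to pinned host memory on a side stream while the default stream keeps computing, then prefetch it back before the backward pass needs it.

```python
import torch

assert torch.cuda.is_available(), "this sketch needs a CUDA device"
copy_stream = torch.cuda.Stream()

# Pretend this is a convolutional layer's feature map produced on the default stream.
feature_map = torch.randn(64, 64, 56, 56, device="cuda")
host_buffer = torch.empty(feature_map.shape, dtype=feature_map.dtype,
                          device="cpu", pin_memory=True)

# Offload: run the device-to-host copy on a side stream so it overlaps with later kernels.
copy_stream.wait_stream(torch.cuda.current_stream())   # feature_map must be ready first
with torch.cuda.stream(copy_stream):
    host_buffer.copy_(feature_map, non_blocking=True)

# ... the default stream keeps executing the next layers' forward kernels here ...

# Prefetch: bring the feature map back before its backward computation.
torch.cuda.current_stream().wait_stream(copy_stream)   # make sure the offload completed
feature_map = host_buffer.to("cuda", non_blocking=True)
torch.cuda.synchronize()
```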
- Training Deep Nets with Sublinear Memory Cost (arXiv 1604.06174) [Personal Notes] [Paper] [Code]
- UW & Dato Inc. & MIT
- Memory Monger
- Sublinear memory cost; trade computation for memory.
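The same trade-computation-for-memory idea is available in PyTorch as torch.utils.checkpoint (the paper's released code is the MXNet memory monger); a minimal example with a 16-layer chain split into 4 (~sqrt(n)) segments, so only segment boundaries are stored during the forward pass.

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# A 16-layer chain; with 4 segments, intra-segment activations are dropped and recomputed.
layers = [torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU()) for _ in range(16)]
model = torch.nn.Sequential(*layers)

x = torch.randn(32, 256, requires_grad=True)
out = checkpoint_sequential(model, 4, x)   # forward pass stores only segment boundaries
out.sum().backward()                       # backward recomputes each segment as needed
```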