Deep Learning Training

Elastic Training

  • EasyScale: Elastic Training with Consistent Accuracy and Improved Utilization on GPUs (SC 2023) [Paper] [Code]
    • BUAA & Alibaba

Parallelism

  • Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency (SC 2023) [Paper] [Code]
    • NUS
  • Supporting Very Large Models using Automatic Dataflow Graph Partitioning (EuroSys 2019) [Paper]
    • NYU
    • Tofu: automatically partitions a dataflow graph of fine-grained tensor operations.
  • One weird trick for parallelizing convolutional neural networks (arXiv 1404.5997) [Paper]
    • Google
    • Data parallelism for convolutional layers; model parallelism for fully-connected layers (see the arithmetic sketch after this list).
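
A back-of-the-envelope Python sketch of the reasoning behind this split: convolutional layers hold few parameters but most of the FLOPs, while fully-connected layers hold most of the parameters but few FLOPs, so gradient synchronization is cheap for the former and weight replication is expensive for the latter. The layer shapes below are illustrative AlexNet-like numbers chosen here, not figures from the paper.

```python
# Rough parameter/FLOP arithmetic behind the conv/FC split in "One weird trick".
# Layer shapes are illustrative AlexNet-like assumptions, not taken from the paper.

def conv_stats(c_in, c_out, k, h_out, w_out):
    params = c_out * c_in * k * k            # weights only, ignoring bias
    flops = 2 * params * h_out * w_out       # 2 FLOPs per multiply-add, at every output location
    return params, flops

def fc_stats(n_in, n_out):
    params = n_in * n_out
    flops = 2 * params
    return params, flops

conv_p, conv_f = conv_stats(c_in=256, c_out=384, k=3, h_out=13, w_out=13)
fc_p, fc_f = fc_stats(n_in=9216, n_out=4096)

print(f"conv: {conv_p / 1e6:.1f}M params, {conv_f / 1e9:.2f} GFLOPs")
print(f"fc:   {fc_p / 1e6:.1f}M params, {fc_f / 1e9:.2f} GFLOPs")
# conv: ~0.9M params but ~0.30 GFLOPs  -> gradients are cheap to all-reduce: data parallelism
# fc:   ~37.7M params but ~0.08 GFLOPs -> weights are expensive to replicate: model parallelism
```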

Optimizing Network Communication

  • A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (OSDI 2020) [Personal Notes] [Paper] [Code]
    • THU & ByteDance
    • BytePS: a unified communication framework for distributed DNN training (see the toy push-sum-pull sketch after this list)
    • Leverage spare CPU and bandwidth resources
    • Consider network topology
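
A toy, single-process sketch of the push, sum-on-CPU, pull pattern that a BytePS-style summation service performs; the worker count, server count, and partitioning below are illustrative assumptions, not BytePS's actual API.

```python
import numpy as np

# Toy sketch of push -> sum-on-CPU -> pull: each worker pushes gradient
# partitions to CPU "summation servers", which add them up and return the
# aggregate. All sizes and names here are illustrative.
NUM_WORKERS = 4
NUM_SERVERS = 2  # stand-ins for spare CPU machines

rng = np.random.default_rng(0)
grad_size = 8
worker_grads = [rng.normal(size=grad_size) for _ in range(NUM_WORKERS)]

def partition(grad, num_parts):
    # Split a gradient so different servers aggregate different slices,
    # spreading bandwidth load across the CPU machines.
    return np.array_split(grad, num_parts)

# "Push": server s collects slice s from every worker and sums it.
server_sums = []
for s in range(NUM_SERVERS):
    slices = [partition(g, NUM_SERVERS)[s] for g in worker_grads]
    server_sums.append(np.sum(slices, axis=0))

# "Pull": workers fetch the summed slices and reassemble the full gradient.
aggregated = np.concatenate(server_sums)
assert np.allclose(aggregated, np.sum(worker_grads, axis=0))
print("aggregated gradient:", aggregated)
```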

Reducing GPU Memory Footprint

GPU Sharing

  • Zico: Efficient GPU Memory Sharing for Concurrent DNN Training (ATC 2021) [Personal Notes] [Paper]
    • UNIST & Ajou & Alibaba & KAIST
    • Reduce the overall GPU consumption for co-located DNN training jobs
    • Utilize NVIDIA MPS
  • Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications (MLSys 2020) [Paper] [Code]
    • UMich SymbioticLab
    • Fine-grained GPU sharing; customized TensorFlow.
  • Gandiva: Introspective Cluster Scheduling for Deep Learning (OSDI 2018) [Paper]
    • MSRA
    • Time slicing with suspend/resume at mini-batch granularity (see the sketch after this list).
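
A minimal sketch of time slicing at mini-batch granularity in the spirit of Gandiva's suspend/resume: a job yields the GPU only at iteration boundaries, where its live state is smallest. The Job class and slice length are hypothetical stand-ins, not Gandiva's implementation.

```python
from collections import deque

class Job:
    """Hypothetical stand-in for a co-located training job."""
    def __init__(self, name, total_iters):
        self.name = name
        self.remaining = total_iters

    def run_minibatch(self):
        # Placeholder for one forward/backward/update step on the GPU.
        self.remaining -= 1

def time_slice(jobs, iters_per_slice=3):
    # Round-robin over jobs; each job is "suspended" only at a mini-batch
    # boundary, where little GPU state needs to be saved.
    queue = deque(jobs)
    while queue:
        job = queue.popleft()              # "resume" the job
        for _ in range(iters_per_slice):
            if job.remaining == 0:
                break
            job.run_minibatch()
        print(f"{job.name}: {job.remaining} iterations left")
        if job.remaining > 0:
            queue.append(job)              # "suspend" at a mini-batch boundary

time_slice([Job("resnet", 7), Job("bert", 5)])
```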

Tensor Swapping / Recomputation

  • SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping (ASPLOS 2020) [Paper]
    • NYU
    • Tensor swapping
    • Consider both GPU memory allocation and operator scheduling
  • Capuchin: Tensor-based GPU Memory Management for Deep Learning (ASPLOS 2020) [Paper]
    • HUST & MSRA & USC
    • Combination of tensor swapping and recomputation.
  • Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization (MLSys 2020) [Paper] [Code]
    • UC Berkeley
    • Formulates tensor rematerialization as an optimization problem (a mixed-integer linear program).
  • SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks (PPoPP 2018) [Paper]
    • Brown & UESTC & Los Alamos National Laboratory & Pacific Northwest National Laboratory & MIT
    • Cost-aware recomputation: free and recompute only tensors that are cheap to recompute (e.g., pooling and activation layers)
    • Offload convolutional layer tensors to host memory rather than recomputing them
  • vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design (MICRO 2016) [Paper]
    • NVIDIA
    • Offload feature maps during the forward pass and prefetch them during the backward pass, overlapping CPU-GPU transfers with computation.
  • Training Deep Nets with Sublinear Memory Cost (arXiv 1604.06174) [Personal Notes] [Paper] [Code]
    • UW & Dato Inc. & MIT
    • Memory Monger
    • Sublinear memory cost; trade computation for memory (see the checkpointing sketch after this list).
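
A small PyTorch sketch of the general trade-computation-for-memory idea: keep only segment boundaries during the forward pass and recompute the interiors during the backward pass, using the stock torch.utils.checkpoint.checkpoint_sequential utility with roughly sqrt(n) segments. This illustrates the technique from the sublinear memory cost paper, not Memory Monger's exact strategy or Checkmate's optimal policy.

```python
import math
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack of identical blocks stands in for a real network.
depth = 16
model = nn.Sequential(*[nn.Sequential(nn.Linear(256, 256), nn.ReLU())
                        for _ in range(depth)])

# sqrt(n) segments gives the classic O(sqrt(n)) activation-memory trade-off:
# only segment-boundary activations are stored during the forward pass.
segments = int(math.sqrt(depth))

x = torch.randn(32, 256, requires_grad=True)
y = checkpoint_sequential(model, segments, x, use_reentrant=False)
y.sum().backward()  # segment interiors are recomputed here
print("grad norm:", x.grad.norm().item())
```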

Compression

  • Echo: Compiler-based GPU Memory Footprint Reduction for LSTM RNN Training (ISCA 2020) [Paper]
    • UofT
    • LSTM RNN training
  • Gist: Efficient Data Encoding for Deep Neural Network Training (ISCA 2018) [Paper]
    • MSR & UMich & UofT
    • Layer-specific lossless and lossy encodings of stashed feature maps (see the sketch after this list)
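
A toy NumPy sketch of the kind of lossless encoding Gist describes for ReLU layers: the backward pass only needs to know which inputs were positive, so a packed 1-bit mask can be stashed instead of the full-precision feature map. The packing scheme and helper functions here are illustrative, not Gist's actual implementation.

```python
import numpy as np

def relu_forward(x):
    y = np.maximum(x, 0.0)
    # Stash only the sign mask, packed to 1 bit per element, for backward.
    packed_mask = np.packbits(x > 0)
    return y, (packed_mask, x.shape)

def relu_backward(grad_out, stash):
    packed_mask, shape = stash
    mask = np.unpackbits(packed_mask, count=grad_out.size).reshape(shape)
    return grad_out * mask

x = np.random.randn(4, 8).astype(np.float32)
y, stash = relu_forward(x)
grad_in = relu_backward(np.ones_like(y), stash)

full_bytes, packed_bytes = x.nbytes, stash[0].nbytes
print(f"stashed {packed_bytes} bytes instead of {full_bytes} bytes")
assert np.allclose(grad_in, (x > 0).astype(np.float32))
```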