MLSys 2024

Meta Info

Homepage: https://mlsys.org/Conferences/2024

Paper list: https://mlsys.org/Conferences/2024/AcceptedPapers

Papers

Large Language Models (LLMs)

  • LoRA serving
    • S-LoRA: Serving Thousands of Concurrent LoRA Adapters [Paper] [arXiv] [Code]
      • UC Berkeley
        • A system to serve many LoRA adapters
        • Store all adapters in main memory and fetch the adapters used by currently running queries into GPU memory
        • Unified Paging: a unified memory pool that manages dynamic adapter weights with different ranks together with KV cache tensors with varying sequence lengths (see the paging sketch after this list)
        • Employ a tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation
        • Built on top of LightLLM
    • Punica: Multi-Tenant LoRA Serving [arXiv] [Code]
      • UW & Duke
        • A system to serve multiple LoRA models in a shared GPU cluster
        • A CUDA kernel, Segmented Gather Matrix-Vector Multiplication (SGMV) (see the reference sketch after this list)
          • Batch GPU operations for concurrent execution of different LoRA models
          • A GPU only needs to store a single copy of the pre-trained model
        • A request scheduling mechanism to consolidate multi-tenant LoRA serving workloads
          • Route the new request to a small set of active GPUs
          • Allocate additional GPU resources when the existing GPUs are fully utilized
          • Periodically migrate existing requests for consolidation
  • LLM inference
    • Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference [Paper] [Code]
      • UBC & d-Matrix
    • Prompt Cache: Modular Attention Reuse for Low-Latency Inference [Paper]
      • Yale & Google
    • HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices [Paper]
      • NUS
    • Vidur: A Large-scale Simulation Framework for LLM Inference [Paper] [Code]
      • GaTech & MSR India
    • FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics [Paper]
      • THU & Infinigence-AI
  • LLM fine-tuning
    • Fine-Tuning Language Models Using Formal Methods Feedback: A Use Case in Autonomous Systems [Paper]
      • UT-Austin
  • LLM for data manipulation
    • UniDM: A Unified Framework for Data Manipulation with Large Language Models [Paper]
      • Alibaba & USTC
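
The S-LoRA entry above describes a unified memory pool shared by KV cache tensors and adapter weights. Below is a minimal, illustrative sketch of that paging idea in PyTorch; it is not S-LoRA's actual implementation, and the class and method names (`UnifiedPagedPool`, `alloc`, `free`) are invented for this example.

```python
# Illustrative sketch (not S-LoRA's real code): one page-granular GPU buffer
# shared by KV-cache tensors (varying sequence length) and LoRA adapter
# weights (varying rank). All names are invented for this example.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"


class UnifiedPagedPool:
    def __init__(self, num_pages: int, page_size: int, hidden: int):
        # One physical buffer; each page holds `page_size` vectors of width `hidden`.
        self.pages = torch.empty(num_pages, page_size, hidden, device=device)
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.owner = {}  # page index -> ("kv", request_id) or ("lora", adapter_name)

    def alloc(self, n_vectors: int, owner) -> list[int]:
        n_pages = -(-n_vectors // self.page_size)  # ceiling division
        if n_pages > len(self.free_pages):
            raise MemoryError("pool exhausted: evict idle adapters or preempt a request")
        pages = [self.free_pages.pop() for _ in range(n_pages)]
        for p in pages:
            self.owner[p] = owner
        return pages

    def free(self, pages: list[int]) -> None:
        for p in pages:
            self.owner.pop(p, None)
            self.free_pages.append(p)


pool = UnifiedPagedPool(num_pages=1024, page_size=16, hidden=4096)
# KV cache for a running query (300 token positions) and a rank-64 adapter
# fetched from host memory share the same pool.
kv_pages = pool.alloc(n_vectors=300, owner=("kv", "request-7"))
adapter_pages = pool.alloc(n_vectors=2 * 64, owner=("lora", "adapter-A"))
pool.free(adapter_pages)  # adapter evicted once no running query needs it
```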
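
Punica's SGMV kernel batches low-rank updates across requests that use different adapters while keeping a single shared copy of the base weights. The plain-PyTorch function below only illustrates the computation SGMV fuses (segment the batch by adapter, apply each adapter's low-rank factors to its segment); the real system implements this as a custom CUDA kernel, and the function and variable names here are illustrative.

```python
# Plain-PyTorch reference for the computation Punica's SGMV kernel fuses.
# A batch is segmented by which LoRA adapter each request uses; each
# segment's rows get that adapter's low-rank update on top of one shared
# base-weight GEMM. Names are illustrative.
import torch


def sgmv_reference(x, base_w, lora_a, lora_b, seg_starts, seg_adapter):
    """
    x:           (batch, hidden) activations, one row per request
    base_w:      (hidden, out) pre-trained weight, stored once per GPU
    lora_a[i]:   (hidden, rank_i); lora_b[i]: (rank_i, out) for adapter i
    seg_starts:  segment boundaries into the batch, e.g. [0, 3, 5]
    seg_adapter: adapter index for each segment
    """
    y = x @ base_w  # one dense GEMM shared by every request
    for s, adapter in enumerate(seg_adapter):
        lo, hi = seg_starts[s], seg_starts[s + 1]
        # Per-segment low-rank update: (x @ A) @ B for this segment's adapter.
        y[lo:hi] += (x[lo:hi] @ lora_a[adapter]) @ lora_b[adapter]
    return y


# Toy usage: 5 requests, 2 adapters with different ranks, one base-weight copy.
hidden, out = 64, 64
x = torch.randn(5, hidden)
base_w = torch.randn(hidden, out)
lora_a = [torch.randn(hidden, 8), torch.randn(hidden, 16)]
lora_b = [torch.randn(8, out), torch.randn(16, out)]
y = sgmv_reference(x, base_w, lora_a, lora_b,
                   seg_starts=[0, 3, 5], seg_adapter=[0, 1])
```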

Mixture-of-Experts (MoEs)

  • MoE training
    • Lancet: Accelerating Mixture-of-Experts Training by Overlapping Weight Gradient Computation and All-to-All Communication [Paper]
      • HKU & AWS & Boson AI
  • MoE inference
    • QMoE: Sub-1-Bit Compression of Trillion Parameter Models [Paper] [Code]
      • Institute of Science and Technology Austria
    • SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models

Diffusion Models

  • DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines [Paper]
    • HKU & AWS

Deep Learning Recommendation Models (DLRMs)

  • Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large Scale Recommendation [Paper]
    • Meta AI

ML Compilation

  • ACRoBat: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time [Paper]
    • CMU
    • Perform hybrid static+dynamic compiler optimizations and end-to-end tensor code generation

Quantization

  • FP8
    • Efficient Post-training Quantization with FP8 Formats [Paper]
      • Intel
  • LLM
    • AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration [Paper] [Code]
      • MIT
      • Best Paper Award
    • Atom: Low-bit Quantization for Efficient and Accurate LLM Serving [Paper] [Code] [Slides] [Poster]
      • UW
    • Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache [Paper] [Code]
      • UT-Austin & Oxford & Eindhoven University of Technology & Lawrence Livermore National Laboratory & CMU
  • ML training
    • JIT-Q: Just-in-time Quantization with Processing-In-Memory for Efficient ML Training [Paper] [Slides]
      • AMD

Model Adaptation

  • FLASH: Fast Model Adaptation in ML-Centric Cloud Platforms [Paper] [Code] [Slides]

Cloud Configuration Generation

  • CloudEval-YAML: A Practical Benchmark for Cloud Native YAML Configuration Generation [Paper] [Homepage] [Code] [Benchmark]
    • Alibaba Cloud & UMich & UCLA & UC Merced

Acronyms

  • ML: Machine Learning
  • LLM: Large Language Model
  • LoRA: Low-Rank Adaptation
  • MoE: Mixture-of-Experts