Homepage: https://mlsys.org/Conferences/2024
Paper list: https://mlsys.org/Conferences/2024/AcceptedPapers
- LoRA serving
  - S-LoRA: Serving Thousands of Concurrent LoRA Adapters [Paper] [arXiv] [Code]
    - UC Berkeley
    - A system to serve many LoRA adapters
    - Stores all adapters in main memory and fetches the adapters used by the currently running queries into GPU memory
    - Unified Paging: a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths
    - Employs a tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation (a reference sketch follows this entry)
    - Built on top of LightLLM
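    A minimal sketch of the heterogeneous batched LoRA computation described above, assuming a toy `BaseLinearWithLoRA` layer (the class and parameter names are illustrative, not S-LoRA's API): one shared copy of the base weight serves the whole mixed batch, and each request's low-rank update is applied per adapter group. S-LoRA fuses this into custom CUDA kernels; the Python loop below is only a readable reference.

    ```python
    # Naive reference for heterogeneous batched LoRA (illustrative only;
    # S-LoRA implements this with fused custom CUDA kernels).
    import torch

    class BaseLinearWithLoRA(torch.nn.Module):
        def __init__(self, in_features, out_features, adapters):
            super().__init__()
            # A single shared copy of the base weight serves every request.
            self.weight = torch.nn.Parameter(torch.randn(out_features, in_features))
            # adapters: list of (A, B) pairs, A: (in, r_i), B: (r_i, out); ranks may differ.
            self.adapters = adapters

        def forward(self, x, adapter_ids):
            # x: (batch, in_features); adapter_ids: (batch,), adapter index per request.
            y = x @ self.weight.t()              # one base GEMM for the whole mixed batch
            for aid in adapter_ids.unique():     # group requests by adapter
                mask = adapter_ids == aid
                A, B = self.adapters[int(aid)]
                y[mask] += (x[mask] @ A) @ B     # low-rank update for this adapter group
            return y

    # Toy usage: three adapters with ranks 4, 8, 16 serving one mixed batch.
    adapters = [(torch.randn(64, r), torch.randn(r, 128)) for r in (4, 8, 16)]
    layer = BaseLinearWithLoRA(64, 128, adapters)
    x = torch.randn(5, 64)
    out = layer(x, torch.tensor([0, 2, 1, 0, 2]))  # (5, 128)
    ```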
  - Punica: Multi-Tenant LoRA Serving [arXiv] [Code]
    - UW & Duke
    - A system to serve multiple LoRA models in a shared GPU cluster
    - A CUDA kernel: Segmented Gather Matrix-Vector Multiplication (SGMV)
      - Batches GPU operations for concurrent execution of different LoRA models (a reference sketch of the SGMV semantics follows this entry)
      - A GPU only needs to store a single copy of the pre-trained model
    - A request scheduling mechanism to consolidate multi-tenant LoRA serving workloads
      - Routes new requests to a small set of active GPUs
      - Allocates additional GPU resources when the existing GPUs are fully utilized
      - Periodically migrates existing requests for consolidation
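    A minimal sketch of the semantics that SGMV provides, assuming a toy `sgmv_reference` helper (the function name and argument layout are illustrative, not Punica's actual kernel interface): the batch is split into contiguous segments, and each segment is multiplied by the weights of the adapter it is bound to, gathered from one stacked tensor. Punica fuses this into a single CUDA kernel; the Python loop is only a reference.

    ```python
    # Readable reference for Segmented Gather Matrix-Vector Multiplication (SGMV).
    # Illustrative only; Punica implements this as a single fused CUDA kernel.
    import torch

    def sgmv_reference(y, x, weights, seg_starts, adapter_ids):
        """y: (batch, out), x: (batch, in), weights: (num_adapters, in, out).
        seg_starts: segment boundaries, e.g. [0, 3, 5, batch]; all requests in
        segment k use adapter adapter_ids[k]."""
        for k, aid in enumerate(adapter_ids):
            s, e = seg_starts[k], seg_starts[k + 1]
            y[s:e] += x[s:e] @ weights[aid]   # gather one adapter, apply it to its segment
        return y

    # Toy usage: 6 requests in 3 segments, each segment bound to one adapter.
    x = torch.randn(6, 64)
    y = torch.zeros(6, 16)
    weights = torch.randn(4, 64, 16)          # 4 adapters' low-rank factors, stacked
    sgmv_reference(y, x, weights, seg_starts=[0, 3, 5, 6], adapter_ids=[2, 0, 3])
    ```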
- LLM inference
  - Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference [Paper] [Code]
    - UBC & d-Matrix
  - Prompt Cache: Modular Attention Reuse for Low-Latency Inference [Paper]
    - Yale & Google
  - HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices [Paper]
    - NUS
  - Vidur: A Large-scale Simulation Framework for LLM Inference [Paper] [Code]
    - GaTech & MSR India
  - FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics [Paper]
    - THU & Infinigence-AI
- LLM fine-tuning
  - Fine-Tuning Language Models Using Formal Methods Feedback: A Use Case in Autonomous Systems [Paper]
    - UT-Austin
- LLM for data manipulation
  - UniDM: A Unified Framework for Data Manipulation with Large Language Models [Paper]
    - Alibaba & USTC
- MoE training
  - Lancet: Accelerating Mixture-of-Experts Training by Overlapping Weight Gradient Computation and All-to-All Communication [Paper]
    - HKU & AWS & Boson AI
- Diffusion model training
  - DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines [Paper]
    - HKU & AWS
- Recommendation systems
  - Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large Scale Recommendation [Paper]
    - Meta AI
- Auto-batching
  - ACRoBat: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time [Paper]
    - CMU
    - Performs hybrid static+dynamic compiler optimizations and end-to-end tensor code generation (a toy sketch of the auto-batching idea follows this entry)
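    A toy sketch of the payoff of auto-batching, under the assumption of a simple two-branch model (the `unbatched`/`auto_batched` names and the runtime grouping are illustrative; ACRoBat derives such batching automatically at compile time and generates the tensor code): requests that take data-dependent control-flow paths still reach identical operators, which can be grouped into one batched GEMM instead of many small ones.

    ```python
    # Toy illustration of batching across dynamic control flow (illustrative only).
    import torch

    W_SHORT = torch.randn(64, 64)
    W_LONG = torch.randn(64, 64)

    def unbatched(inputs, take_long_path):
        # Naive dynamic execution: one small matmul per request.
        return [x @ (W_LONG if long else W_SHORT) for x, long in zip(inputs, take_long_path)]

    def auto_batched(inputs, take_long_path):
        # Group requests that reach the same operator, then issue one GEMM per group.
        outs = [None] * len(inputs)
        for flag, W in ((False, W_SHORT), (True, W_LONG)):
            idx = [i for i, f in enumerate(take_long_path) if f == flag]
            if idx:
                batched = torch.stack([inputs[i] for i in idx]) @ W   # one batched GEMM
                for j, i in enumerate(idx):
                    outs[i] = batched[j]
        return outs

    inputs = [torch.randn(64) for _ in range(8)]
    flags = [i % 3 == 0 for i in range(8)]
    ref, opt = unbatched(inputs, flags), auto_batched(inputs, flags)
    assert all(torch.allclose(a, b, atol=1e-5) for a, b in zip(ref, opt))
    ```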
- FP8 quantization
  - Efficient Post-training Quantization with FP8 Formats [Paper]
    - Intel
- LLM quantization
  - AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration [Paper] [Code]
    - MIT
    - Best Paper Award
  - Atom: Low-bit Quantization for Efficient and Accurate LLM Serving [Paper] [Code] [Slides] [Poster]
    - UW
  - Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache [Paper] [Code]
    - UT-Austin & Oxford & Eindhoven University of Technology & Lawrence Livermore National Laboratory & CMU
- LLM benchmark
  - CloudEval-YAML: A Practical Benchmark for Cloud Native YAML Configuration Generation [Paper] [Homepage] [Code] [Benchmark]
    - Alibaba Cloud & UMich & UCLA & UC Merced
- ML: Machine Learning
- LLM: Large Language Model
- LoRA: Low-Rank Adaptation
- MoE: Mixture-of-Experts