Homepage: https://neurips.cc/Conferences/2024
Paper list: https://neurips.cc/virtual/2024/papers.html?filter=titles
- Total: 15671
- Accept: 25.8% (4037)
- Poster: 23.3% (3650)
- Spotlight: 2.1% (326)
- Oral: 0.4% (61)
- LLM Inference
- SGLang: Efficient Execution of Structured Language Model Programs [Paper] [Code] [arXiv]
- Stanford & UC Berkeley
- Co-design both the front-end language (programming interface) and the back-end runtime
- SGLang Primitives
- Enable the manipulation of prompts and generations
- `gen`: call LLM generation
- `select`: let the LLM choose the option with the highest probability from a list
- `extend` or `+=`: extend the current prompt
- Control of parallelism
- `fork`: fork the current prompt state
- `join`: rejoin the forked prompt states
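A minimal sketch of a program composed from these primitives, closely following the examples in the SGLang paper and README; exact signatures may differ across library versions, and running it requires a configured backend:

```python
import sglang as sgl

@sgl.function
def tip_suggestion(s):
    s += "Here are two tips for staying healthy: "
    s += "1. Balanced Diet. 2. Regular Exercise.\n\n"
    # fork: duplicate the current prompt state for parallel generation
    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += f"Now, expand tip {i + 1} into a paragraph:\n"
        f += sgl.gen("detailed_tip", max_tokens=128, stop="\n\n")
    # join: reading the forked results merges them back into the main state
    s += "Tip 1: " + forks[0]["detailed_tip"] + "\n"
    s += "Tip 2: " + forks[1]["detailed_tip"] + "\n"
    # select: constrained choice among options by probability
    s += "The most important tip is " + sgl.select("best", ["tip 1", "tip 2"])
```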
- Compilation optimizations
- Code movement for improving prefix sharing
- Doesn't strictly preserve the original computation, making it an aggressive optimization
- Prompt GPT-4 to re-order graph nodes
- Runtime
- RadixAttention
- Utilize a radix tree (w/ efficient prefix search, reuse, insertion, eviction)
- LRU eviction policy
- Cache-aware scheduling → Increase the cache hit rate
- Key idea: Sort the requests by matched prefix length
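An illustrative toy version of the idea, assuming requests are token-ID sequences; SGLang's actual implementation compresses edges into a radix tree, pins GPU KV-cache tensors at the nodes, and evicts leaves in LRU order:

```python
from dataclasses import dataclass, field

@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)  # token id -> TrieNode
    # a real implementation also stores KV-cache handles and LRU timestamps

class PrefixCache:
    """Toy token-level prefix tree illustrating cache reuse and scheduling."""
    def __init__(self):
        self.root = TrieNode()

    def match_len(self, tokens):
        """Length of the longest cached prefix of `tokens` (reusable KV entries)."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node, n = node.children[t], n + 1
        return n

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

def schedule(requests, cache):
    # cache-aware scheduling: serve requests with the longest matched
    # prefix first, which raises the overall cache hit rate
    return sorted(requests, key=cache.match_len, reverse=True)
```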
- Efficient LLM Scheduling by Learning to Rank [Paper] [Code]
- UCSD & THU & Snowflake & UC-Berkeley
- Insight: it is possible to predict the relative ranks of output lengths in a batch of requests.
- Develop a scheduler for LLM inference that can approximate the shortest-job-first (SJF) schedule better than existing approaches
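A sketch of the scheduling idea, assuming a hypothetical `rank_model` trained (e.g., with a ranking loss) to order requests by expected output length:

```python
def rank_schedule(batch, rank_model):
    """Approximate shortest-job-first: rank_model.predict returns one score
    per request whose ordering tracks relative output lengths within the
    batch (hypothetical interface). Lower score = shorter expected output."""
    scores = rank_model.predict(batch)
    order = sorted(range(len(batch)), key=lambda i: scores[i])
    return [batch[i] for i in order]  # shortest predicted jobs run first
```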
- Compound AI Systems
- Are More LM Calls All You Need? Towards the Scaling Properties of Compound AI Systems [Paper] [Code]
- Stanford & UC Berkeley & Princeton
- Systematically study how the number of LM calls affects the performance of two natural inference strategies:
- Vote: Aggregate LM responses via majority voting
- Filter-Vote: Majority voting after filtering results with an LM
- Insight
- More LM calls lead to higher performance on “easy” queries, but lower performance on “hard” queries, and nonmonotone behavior can emerge when a task contains both types of queries.
- Propose an analytical scaling model that predicts the performance of Vote and Filter-Vote systems and finds the optimal number of LM calls to make.
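A minimal sketch of the two strategies, assuming an `lm(query)` callable that returns one sampled answer and a hypothetical `verifier(query, answer)` predicate for the filtering step:

```python
from collections import Counter

def vote(lm, query, n_calls):
    """Vote: sample n_calls answers and return the majority answer."""
    answers = [lm(query) for _ in range(n_calls)]
    return Counter(answers).most_common(1)[0][0]

def filter_vote(lm, verifier, query, n_calls):
    """Filter-Vote: discard answers the verifier LM rejects, then majority-vote."""
    answers = [a for a in (lm(query) for _ in range(n_calls))
               if verifier(query, a)]
    if not answers:  # everything filtered out: fall back to plain voting
        return vote(lm, query, n_calls)
    return Counter(answers).most_common(1)[0][0]
```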
- Adapter Selection
- Stylus: Automatic Adapter Selection for Diffusion Models [Paper] [Homepage] [Code]
- UC Berkeley & CMU & Google DeepMind
- Problem: how to match the prompt to a set of relevant adapters
- Stylus
- Select and automatically compose task-specific adapters based on a prompt's keywords
- Three-stage approach
- Refiner: Leverage vision-language foundation models (VLMs) to generate semantic descriptions of adapters, then translate them into embeddings
- Retriever: Fetch the most relevant adapters over the entirety of the user’s prompt using cosine similarity
- Composer: Segment the prompt into tasks from a prompt’s keywords and assign retrieved adapters to tasks
- StylusDocs
- An adapter dataset consisting of 75K LoRAs (sourced from Civitai) with pre-computed adapter embeddings
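A sketch of the Retriever stage, assuming pre-computed adapter embeddings as in StylusDocs:

```python
import numpy as np

def retrieve_adapters(prompt_emb, adapter_embs, k=5):
    """Rank adapters by cosine similarity between the prompt embedding and
    pre-computed adapter description embeddings.
    Shapes: prompt_emb (d,), adapter_embs (n, d)."""
    a = adapter_embs / np.linalg.norm(adapter_embs, axis=1, keepdims=True)
    p = prompt_emb / np.linalg.norm(prompt_emb)
    sims = a @ p                   # cosine similarity per adapter
    return np.argsort(-sims)[:k]   # indices of the top-k adapters
```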
- Inference
- Reverse Transition Kernel: A Flexible Framework to Accelerate Diffusion Inference [Paper]
- HKUST & HKU & Salesforce AI Research & UIUC
- Develop a general RTK (reverse transition kernel) framework that enables a more balanced subproblem decomposition
- Propose leveraging two fast sampling algorithms, the Metropolis-Adjusted Langevin Algorithm (MALA) and Underdamped Langevin Dynamics (ULD), for solving these strongly log-concave subproblems
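For reference, one MALA step in isolation; this is a generic sketch of the algorithm, not the paper's RTK-specific instantiation:

```python
import numpy as np

def mala_step(x, log_prob, grad_log_prob, step):
    """One Metropolis-Adjusted Langevin step targeting density exp(log_prob);
    in the RTK framework this would sample a strongly log-concave subproblem."""
    noise = np.random.randn(*x.shape)
    # Langevin proposal: gradient drift plus Gaussian noise
    y = x + step * grad_log_prob(x) + np.sqrt(2 * step) * noise

    def log_q(b, a):  # log proposal density q(b | a), up to constants
        diff = b - a - step * grad_log_prob(a)
        return -np.sum(diff ** 2) / (4 * step)

    # Metropolis correction removes the discretization bias
    log_alpha = log_prob(y) + log_q(x, y) - log_prob(x) - log_q(y, x)
    if np.log(np.random.rand()) < log_alpha:
        return y  # accept
    return x      # reject: keep the current state
```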
- Accelerating Diffusion Models with Parallel Sampling: Inference at Sub-Linear Time Complexity [Paper]
- Stanford
- Propose to divide the sampling process into $O(1)$ blocks with parallelizable Picard iterations within each block
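A sketch of Picard iteration on one block, assuming a generic `drift(x, t)` function for the reverse-time dynamics; within each iteration the drift evaluations at all grid points are independent, so they can run in parallel:

```python
import numpy as np

def picard_block(x0, drift, ts, n_iters=8):
    """Solve x'(t) = drift(x, t) over the time grid ts by Picard iteration:
    x_{k+1}(t_i) = x0 + sum_{j<i} drift(x_k(t_j), t_j) * (t_{j+1} - t_j)."""
    n = len(ts)
    xs = np.tile(x0, (n, 1))   # initial guess: constant trajectory
    dts = np.diff(ts)
    for _ in range(n_iters):
        # evaluate the drift at every grid point (parallelizable in practice)
        fs = np.stack([drift(xs[j], ts[j]) for j in range(n - 1)])
        incr = np.cumsum(fs * dts[:, None], axis=0)  # left-endpoint quadrature
        xs = np.vstack([x0[None, :], x0[None, :] + incr])
    return xs[-1]              # state at the end of the block
```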
- Talking Face Video Generation
- VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time [Paper] [Homepage]
- MSRA
- A framework to generate lifelike talking faces with appealing visual affective skills (VAS).
- A diffusion-based holistic facial dynamics and head movement generation model that works in a face latent space.
- Support the online generation of 512×512 videos at up to 40 FPS.
- Facial Parts Swapping
- FuseAnyPart: Diffusion-Driven Facial Parts Swapping via Multiple Reference Images [Paper] [Code (coming...)]
- Alibaba
- Facial parts from different people are assembled into a complete face in latent space within the Mask-based Fusion Module
- The consolidated feature is dispatched to the Addition-based Injection Module for fusion within the UNet of the diffusion model to create novel characters
- Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction [Paper] [Code] [arXiv]
- PKU & ByteDance
- Best Paper Award
- VAR: Visual Autoregressive Modeling
- Redefine the autoregressive learning on images as coarse-to-fine “next-scale prediction” or “next-resolution prediction”
- Multi-scale token maps are autoregressively generated from coarse to fine scales (lower to higher resolutions), with parallel token generation within each scale
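Schematically, generation proceeds scale by scale; `transformer` and `decoder` below are hypothetical stand-ins for VAR's autoregressive transformer and multi-scale VQ decoder:

```python
import torch

def var_generate(transformer, decoder, scales=(1, 2, 4, 8, 16)):
    """Next-scale prediction: at each scale the model predicts an entire
    token map in parallel, conditioned on all coarser token maps."""
    token_maps = []
    for s in scales:
        # one forward pass predicts all s*s tokens of this scale at once,
        # attending to the previously generated coarser maps (hypothetical API)
        logits = transformer(token_maps, target_hw=(s, s))  # (s*s, vocab)
        tokens = torch.distributions.Categorical(logits=logits).sample()
        token_maps.append(tokens.view(s, s))
    return decoder(token_maps)  # decode all scales into the final image
```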
- Autoregressive Image Generation without Vector Quantization [Paper] [Code] [arXiv]
- MIT & Google DeepMind & THU
- Propose to model the per-token probability distribution using a diffusion procedure
- Define a Diffusion Loss function to model the per-token probability
- Evaluated across a wide range of cases, including standard autoregressive models and generalized masked autoregressive (MAR) variants
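A sketch of the Diffusion Loss, assuming a hypothetical denoising network `eps_net(x_t, t, z)` conditioned on the AR model's output `z`; schedule details are illustrative:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(z, x, eps_net, T=1000):
    """Per-token Diffusion Loss: instead of a categorical cross-entropy over
    a VQ codebook, the continuous token x (shape (B, d)) is noised and a
    small network predicts the noise, conditioned on z."""
    t = torch.randint(0, T, (x.shape[0],), device=x.device)
    # cosine noise schedule (illustrative; the paper follows standard DDPM choices)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / T) ** 2
    eps = torch.randn_like(x)
    x_t = alpha_bar.sqrt()[:, None] * x + (1 - alpha_bar).sqrt()[:, None] * eps
    return F.mse_loss(eps_net(x_t, t, z), eps)
```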
- Inference
- Fast and Memory-Efficient Video Diffusion Using Streamlined Inference [Paper] [Code]
- NEU
- Streamlined Inference: Leverage the temporal and spatial properties of video diffusion models
- Three core components
- Feature Slicer: Partition input features into sub-features
- Operator Grouping: Process each sub-feature with a group of consecutive operators
- Step Rehash: Accelerate inference through skipping unnecessary steps
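A sketch of the Step Rehash idea, with a hypothetical similarity test and a placeholder update rule (a real sampler would use the full DDPM/DDIM formulas):

```python
import torch
import torch.nn.functional as F

def step_rehash_sample(unet, x, timesteps, sim_threshold=0.99, max_skips=2):
    """When UNet outputs at adjacent steps are highly similar, reuse the
    cached output for a few steps instead of recomputing the UNet."""
    eps_prev, eps_prev2, skipped = None, None, 0
    for t in timesteps:
        similar = (eps_prev is not None and eps_prev2 is not None and
                   F.cosine_similarity(eps_prev.flatten(),
                                       eps_prev2.flatten(), dim=0) > sim_threshold)
        if similar and skipped < max_skips:
            eps, skipped = eps_prev, skipped + 1  # rehash: skip the UNet call
        else:
            eps, skipped = unet(x, t), 0          # recompute and reset
        eps_prev2, eps_prev = eps_prev, eps
        x = x - 0.01 * eps                        # placeholder denoising update
    return x
```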
- Evaluation
- VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models [Paper] [Homepage] [Code] [Dataset]
- UTS & ZJU
- 1.67M unique text-to-video prompts from real users.
- 6.69M videos generated by four state-of-the-art diffusion models (Pika, VideoCraft2, Text2Video-Zero, ModelScope).
- Evaluation of Text-to-Video Generation Models: A Dynamics Perspective [Paper] [Homepage] [Code]
- UCAS & HIT & Adelaide & Baidu
- Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content.
- DEVIL: An evaluation protocol that centers on the dynamics dimension to evaluate T2V generation models
- Boosting Text-to-Video Generative Model with MLLMs Feedback [Paper]
- MSRA
- Utilize Multimodal Large Language Models (MLLMs) to perform fine-grained video preference annotations → VideoPrefer (13.5K preference annotations)
- VideoRM: The reward model for text-to-video alignment