Description: LLM Inference
Presented in arXiv:2312.07104.
Authors: Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng (UCB, Stanford)
- A single generation call = The process of taking a sequence of input tokens and generating a sequence of output tokens
- Programming LLMs = The development of computer programs that control the generation processes of LLMs
- The paradigm of programming LLMs is changing
- Single and straightforward generation calls → Multiple and interdependent generation calls interleaved with control flow
- Use cases
- Prompting techniques
- Basic: few-shot prompting, chain-of-thought prompting, retrieval-augmented generation
- Meta-reasoning
- The LLM first generates multiple solutions through chain-of-thought prompting;
- It then performs meta-reasoning across these solutions to derive the final answer.
- Skeleton-of-thought
- LLM first generates a skeleton of the answer;
- It then issues multiple parallel calls to elaborate the skeleton into a detailed explanation.
- Tree-of-thought
- Integrate LLM calls within tree search algorithms (e.g., BFS).
- LLM-based agents
- The behavior of the agent is controlled by loops and conditional statements
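To make "multiple and interdependent generation calls interleaved with control flow" concrete, here is a minimal sketch of the skeleton-of-thought pattern listed above. `fake_llm` is a hypothetical stand-in that returns canned text, not a real model API; only the call structure is the point.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; returns canned text."""
    if prompt.startswith("Outline:"):
        return "1. Define the term\n2. Give an example"
    return f"[details for: {prompt}]"


def skeleton_of_thought(question: str, llm=fake_llm) -> str:
    # Step 1: a single call produces the skeleton (a short list of points).
    skeleton = llm(f"Outline: {question}").splitlines()
    # Step 2: each point is expanded by an independent call, so the calls
    # can be issued in parallel.
    with ThreadPoolExecutor() as pool:
        details = list(pool.map(lambda p: llm(f"Expand: {p}"), skeleton))
    # Step 3: ordinary control flow stitches the pieces back together.
    return "\n".join(f"{p}\n  {d}" for p, d in zip(skeleton, details))


answer = skeleton_of_thought("What is a radix tree?")
```

The second step depends on the output of the first, which is the kind of inter-call dependency a single generation call cannot express.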
- How to manage the multiple API calls of LLMs?
- A good system should include LLM state management, decoding process control, and smooth integration with external tools.
- Co-design both the front-end language (programming interface) and the back-end runtime
- SGLang
- An embedded domain-specific language in Python
- Interpreter
- Manage the prompt state and execute SGLang programs
- Treat a prompt as a stream and submit primitive operations to this stream for asynchronous execution
- Compiler
- Compile SGLang programs into a computational graph, which enables further optimizations
- SGVM
- A custom serving engine/runtime
- RadixAttention
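The interpreter's "prompt as a stream" idea above can be sketched with a queue and a background worker. This is a toy illustration only: `StreamExecutor`, `submit`, and `sync` are hypothetical names, not SGLang's API, and the operations are plain string functions instead of model calls.

```python
import queue
import threading


class StreamExecutor:
    """Toy prompt stream: operations are queued and applied in order."""

    def __init__(self):
        self.text = ""
        self._ops = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def submit(self, op):
        # Returns immediately; the operation runs later on the worker thread.
        self._ops.put(op)

    def _run(self):
        while True:
            op = self._ops.get()
            try:
                self.text = op(self.text)
            finally:
                self._ops.task_done()

    def sync(self) -> str:
        # Block until every submitted operation has been applied.
        self._ops.join()
        return self.text


stream = StreamExecutor()
stream.submit(lambda t: t + "Q: 1+1?\nA: ")
stream.submit(lambda t: t + "2")  # stand-in for a generation call
result = stream.sync()
```

Because `submit` does not block, the host Python program can keep running (or drive other streams) until it actually needs the result.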
- SGLang Primitives
- Enable the manipulation of prompts and generations
- `gen`: call LLM generation
- `select`: let the LLM choose the option with the highest probability from a list
- `extend` or `+=`: extend the current prompt
- Control of parallelism
- `fork`: fork the current prompt state
- `join`: rejoin the forked prompt states
- Enable the manipulation of prompts and generations
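The semantics of these primitives can be illustrated with a toy prompt-state class. This is a sketch under stated assumptions, not the SGLang implementation: `PromptState`, `stub_generate`, and `stub_score` are hypothetical, and a real system would call a model and compare token probabilities.

```python
from dataclasses import dataclass, field
from typing import Dict, List


def stub_generate(prompt: str) -> str:
    """Hypothetical model: always answers with a fixed string."""
    return "Paris"


def stub_score(prompt: str, option: str) -> float:
    """Hypothetical scorer; a real one would use model probabilities."""
    return 1.0 if option == "yes" else 0.0


@dataclass
class PromptState:
    text: str = ""
    variables: Dict[str, str] = field(default_factory=dict)

    def extend(self, s: str) -> "PromptState":
        self.text += s
        return self

    def __iadd__(self, s: str) -> "PromptState":  # `state += "..."`
        return self.extend(s)

    def gen(self, name: str, model=stub_generate) -> "PromptState":
        out = model(self.text)
        self.variables[name] = out
        return self.extend(out)

    def select(self, name: str, options: List[str], score=stub_score):
        best = max(options, key=lambda o: score(self.text, o))
        self.variables[name] = best
        return self.extend(best)

    def fork(self, n: int) -> List["PromptState"]:
        # Each branch starts from a copy of the current prompt state.
        return [PromptState(self.text, dict(self.variables)) for _ in range(n)]

    def join(self, branches: List["PromptState"]) -> "PromptState":
        # Append what each branch added beyond the shared prefix.
        base = len(self.text)
        for b in branches:
            self.text += b.text[base:]
        return self


s = PromptState()
s += "Q: What is the capital of France?\nA: "
s.gen("answer")
s += "\nIs that in Europe? "
s.select("europe", ["yes", "no"])
branches = s.fork(2)
branches[0] += "\nTip 1: visit the Louvre."
branches[1] += "\nTip 2: try a croissant."
s.join(branches)
```

Note that every forked branch shares the same prefix, which is exactly the structure the runtime's prefix caching can exploit.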
- Compilation optimizations
- Code movement for improving prefix sharing
- Aggressive: does not strictly preserve the original computation
- Prompt GPT-4 to re-order graph nodes
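Why code movement helps prefix sharing can be seen in a small example. The notes say the reordering is done by prompting GPT-4; the sketch below swaps in a much simpler hypothetical rule (hoist constant segments ahead of variable ones) just to show the effect on the shared prefix. All names here are illustrative.

```python
from os.path import commonprefix


def render(segments, values):
    """Build the prompt text; 'var' segments are looked up in `values`."""
    return "".join(
        text if kind == "const" else values[text] for kind, text in segments
    )


def hoist_constants(segments):
    # Simple stand-in rule: move constant segments to the front.
    # This changes the prompt, so the original computation is not preserved.
    consts = [s for s in segments if s[0] == "const"]
    others = [s for s in segments if s[0] != "const"]
    return consts + others


program = [
    ("var", "question"),
    ("const", "\nAnswer concisely and cite one source.\n"),
]
requests = [{"question": "What is LRU?"}, {"question": "What is a trie?"}]


def shared_prefix_len(segments):
    # Length (in characters) of the prefix common to all rendered requests.
    return len(commonprefix([render(segments, r) for r in requests]))


before = shared_prefix_len(program)
after = shared_prefix_len(hoist_constants(program))
```

After hoisting, the constant instruction sits at the start of every request, so `after` is larger than `before` and more of the prompt can be served from a shared cache.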
- Runtime
- RadixAttention
- Utilize a radix tree (w/ efficient prefix search, reuse, insertion, eviction)
- Least Recently Used (LRU) eviction policy
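The prefix search, insertion, and LRU eviction described above can be sketched as follows. As a simplifying assumption, this uses an uncompressed token-level trie (a real radix tree merges single-child chains) and stores no KV tensors; `PrefixCache` and its methods are hypothetical names.

```python
import itertools


class Node:
    def __init__(self, parent=None, token=None):
        self.parent, self.token = parent, token
        self.children = {}
        self.last_used = 0


class PrefixCache:
    def __init__(self):
        self.root = Node()
        self._clock = itertools.count(1)  # logical time for LRU

    def _touch(self, node):
        node.last_used = next(self._clock)

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            self._touch(node)
            n += 1
        return n

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            if t not in node.children:
                node.children[t] = Node(node, t)
            node = node.children[t]
            self._touch(node)

    def _leaves(self):
        stack, out = [self.root], []
        while stack:
            n = stack.pop()
            if n.children:
                stack.extend(n.children.values())
            elif n is not self.root:
                out.append(n)
        return out

    def evict(self, num_tokens):
        """Drop the least recently used leaves, one token at a time."""
        for _ in range(num_tokens):
            leaves = self._leaves()
            if not leaves:
                return
            victim = min(leaves, key=lambda n: n.last_used)
            del victim.parent.children[victim.token]


cache = PrefixCache()
cache.insert([1, 2, 3, 4])  # request A
cache.insert([1, 2, 9])     # request B shares the prefix [1, 2]
hit = cache.match_prefix([1, 2, 3, 7])
cache.evict(1)              # removes the stalest leaf
```

Evicting only leaves keeps shared prefixes alive for as long as any longer cached sequence still depends on them.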
- Cache-aware scheduling → Increase the cache hit rate
- Key idea: Sort the requests by matched prefix length
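The key idea above, sorting requests by matched prefix length, can be sketched in a few lines. The cache is modeled as a plain list of token sequences, which is an assumption made for brevity; `schedule` and `common_len` are hypothetical helper names.

```python
def common_len(a, b):
    """Number of leading tokens shared by sequences a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def schedule(waiting, cached):
    """Order waiting requests by longest matched cached prefix first."""
    def matched(req):
        return max((common_len(req, c) for c in cached), default=0)

    # sorted() is stable, so ties keep their arrival order.
    return sorted(waiting, key=matched, reverse=True)


cached = [[1, 2, 3, 4], [7, 8]]
waiting = [[5, 6], [1, 2, 3, 9], [7, 8, 8], [1, 2]]
order = schedule(waiting, cached)
```

Requests that can reuse the most cached tokens run first, before their prefixes are evicted, which is what raises the cache hit rate.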
- A new “extend” CUDA kernel
- Approximately 9k lines of Python code
- SGVM reuses some components from LightLLM