Checklist
Motivation
cc @HandH1998 @sleepcoo @ispobock
Beam search is a common decoding method in LLM generation and is supported by several LLM engines, e.g., vLLM and Transformers.
This issue proposes our implementation of beam search support in SGLang and discusses its rationale, similar to an RFC.
vLLM's beam search implementation used to be performant, but in a recent release beam search support was dropped from the core (vllm-project/vllm#6226) and became much slower. Our implementation aims for minimal code modifications and minimal overhead. We found that in vLLM's high-level implementation (https://github.com/vllm-project/vllm/blob/2fc6944c5e69d5d0ce15d09a855452c795d75c3c/vllm/entrypoints/llm.py#L507), each decoding iteration becomes a prefill (max tokens=1), which is the primary source of overhead.
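To make that overhead concrete, the high-level pattern looks roughly like the following. This is a schematic sketch only, not vLLM's actual code; `engine_generate` is a hypothetical stand-in for an engine call that returns the top next-token log probabilities and, because each call is a fresh request limited to one output token, re-prefills the entire prefix.

```python
from typing import Callable, Dict, List, Tuple

def high_level_beam_search(
    engine_generate: Callable[[List[int]], Dict[int, float]],  # hypothetical engine call
    prompt: List[int],
    beam_width: int,
    max_steps: int,
) -> List[Tuple[List[int], float]]:
    # Each beam is (token ids, cumulative log probability).
    beams: List[Tuple[List[int], float]] = [(prompt, 0.0)]
    for _ in range(max_steps):
        candidates: List[Tuple[List[int], float]] = []
        for ids, score in beams:
            # The full sequence is resubmitted as a new request, so the engine
            # runs a prefill over the whole prefix for every beam at every step.
            for token, logprob in engine_generate(ids).items():
                candidates.append((ids + [token], score + logprob))
        # Greedy selection: keep the top-beam-width candidates by cumulative score.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams
```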
Our implementation is shown in the figure below. The shared prefix is determined in the prefilling stage; in the decoding stage, we duplicate the prefix indices when allocating kv cache space for the next decoding iteration. This approach avoids most prefilling work and the need to modify much code to maintain a tree cache for beam search sequences. After that, each iteration overwrites the top-beam-width kv cache indices into the request-to-token map (req_to_token) from the last iteration and frees the parent kv cache indices whose subsequent sequences have all been dropped. In this case, the green parent will be dropped because its children are not within the top-beam-width.
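The per-iteration index bookkeeping can be sketched as follows. This is an illustrative toy, not the actual SGLang code: req_to_token mirrors the request-to-token map described above, while free_slots, parent_rows, child_rows, and chosen_parents are hypothetical names for a simple pool allocator and the beam bookkeeping. Only the dropped parent's own last-token kv slot is freed here; any slot shared with surviving beams is left alone, and deeper-ancestor reference counting is omitted.

```python
import torch

def beam_decode_step(req_to_token, free_slots, parent_rows, child_rows,
                     prefix_len, chosen_parents):
    """One decoding iteration of the kv-index bookkeeping (toy version).

    req_to_token   : [num_rows, max_len] int tensor mapping a request row to
                     its kv cache slot indices
    free_slots     : list of currently free kv cache slot indices
    parent_rows    : req_to_token rows of the current beam-width parents
    child_rows     : rows reserved for the surviving children (disjoint from
                     parent_rows in this toy to avoid aliasing)
    prefix_len     : number of tokens already cached per parent
    chosen_parents : for each surviving child beam, the parent it extends
    """
    for child, parent_idx in enumerate(chosen_parents):
        parent_row = parent_rows[parent_idx]
        # Duplicate the parent's prefix indices into the child row, so the
        # shared prefix is reused without re-running prefill.
        req_to_token[child_rows[child], :prefix_len] = \
            req_to_token[parent_row, :prefix_len]
        # Overwrite the next position with one freshly allocated kv slot for
        # the child's newly decoded token.
        req_to_token[child_rows[child], prefix_len] = free_slots.pop()

    # Free the last-token kv slot of any parent whose children were all
    # pruned out of the top-beam-width (the "green parent" case above).
    surviving = set(chosen_parents)
    for idx, parent_row in enumerate(parent_rows):
        if idx not in surviving:
            free_slots.append(int(req_to_token[parent_row, prefix_len - 1]))

# Toy usage: beam width 2, a cached prefix of 3 tokens, 16 kv slots in total.
req_to_token = torch.zeros(4, 16, dtype=torch.long)
free_slots = list(range(15, 3, -1))
req_to_token[0, :3] = torch.tensor([0, 1, 2])   # parent beam 0
req_to_token[1, :3] = torch.tensor([0, 1, 3])   # parent beam 1
beam_decode_step(req_to_token, free_slots, parent_rows=[0, 1],
                 child_rows=[2, 3], prefix_len=3, chosen_parents=[0, 0])
# Both children extend parent 0, so parent 1's slot 3 is returned to free_slots.
```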
From the algorithmic perspective, we follow the greedy strategy of the vLLM version. Support for other variants of beam search is left for future work.
We benchmarked our implementation for both accuracy and efficiency on Qwen2.5-1.5B-Instruct, with batch size 32, beam width 5, and warmup. The beam search outputs achieve 80% accuracy compared with Transformers, and the top-1 beam search outputs score 0.734 on the MMLU benchmark. In terms of efficiency, our implementation introduces at least 65% overhead but significantly outperforms both Transformers and the vLLM fork; see below.
We plan to support:

- the overlap mode
- replacing the overwrite operation with a Triton kernel
Related resources
No response