
[Feature] Support Beam Search #3032

Open
2 of 4 tasks
laixinn opened this issue Jan 21, 2025 · 0 comments
Labels
enhancement New feature or request

Comments


laixinn commented Jan 21, 2025

Checklist

Motivation

cc @HandH1998 @sleepcoo @ispobock

Beam search is a common decoding method for LLM generation and is supported by several LLM engines, e.g., vLLM and Transformers.

This issue proposes our implementation to support beam search in SGLang and discusses its rationale, similar to an RFC.

vLLM's beam search implementation used to be performant, but in a recent release beam search support was dropped from the core engine (vllm-project/vllm#6226) and became much slower. Our implementation aims for minimal modifications and minimal overhead. We found that in vLLM's high-level implementation (https://github.com/vllm-project/vllm/blob/2fc6944c5e69d5d0ce15d09a855452c795d75c3c/vllm/entrypoints/llm.py#L507), each decoding iteration is issued as a prefill request (max tokens=1), which is the primary source of overhead.
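
For illustration, below is a simplified sketch (not vLLM's actual code; `engine.generate_step` is a hypothetical helper) of the high-level pattern described above: every beam is re-submitted each step as a one-token request, so the engine pays prefill cost over the whole sequence at every iteration.

```python
import heapq

def naive_beam_search(engine, prompt_ids, beam_width, max_steps):
    """Illustrative only: `engine.generate_step(token_ids, num_logprobs)` is a
    hypothetical call that runs one request and returns the top candidate
    (token_id, logprob) pairs for the next position."""
    # Each beam is (cumulative_logprob, token_ids).
    beams = [(0.0, list(prompt_ids))]
    for _ in range(max_steps):
        candidates = []
        for cum_logprob, token_ids in beams:
            # The whole sequence is re-submitted and re-prefilled here on
            # every step -- this is the overhead source described above.
            topk = engine.generate_step(token_ids, num_logprobs=2 * beam_width)
            for tok_id, logprob in topk:
                candidates.append((cum_logprob + logprob, token_ids + [tok_id]))
        # Greedy pruning: keep only the best beam_width continuations.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beams
```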

Our implementation is shown in the figure below. The shared prefix is determined in the prefilling stage; in the decoding stage, we duplicate the prefix indices when allocating kv cache space for the next decoding iteration. This avoids most of the prefilling work, as well as the need to modify much code to maintain a tree cache for beam search sequences. After that, each iteration overwrites the top-beam-width kv cache indices into the request-to-token retrieval table (req_to_token) from the previous retrieval and frees the parent kv cache indices whose subsequent sequences have all been dropped. In this case, the green parent will be dropped because none of its children are within the top beam width.
[Figure: shared-prefix index duplication and top-beam-width kv cache index overwriting during decoding]
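
To make the decoding-stage bookkeeping concrete, here is a minimal sketch of the per-step index updates, assuming a dense `req_to_token` tensor (that name comes from the description above; all other names and the exact layout are illustrative, not SGLang's actual code):

```python
import torch

def beam_step_update(req_to_token, seq_len, alive_rows, topk_parents,
                     new_slots, free_slot):
    """One decoding step of index bookkeeping (illustrative).

    req_to_token : LongTensor [num_rows, max_len]; row i holds the kv-cache
                   slot indices of beam i (the shared prefix is represented
                   by copied indices, not copied kv data).
    seq_len      : number of tokens already cached per beam before this step.
    alive_rows   : rows that held beams at the previous step.
    topk_parents : for each of the beam_width surviving beams, the parent row
                   it extends (the same parent may appear more than once).
    new_slots    : LongTensor [beam_width]; newly allocated kv slots for the
                   new token of each surviving beam.
    free_slot    : callback returning an exclusively owned slot to the pool.
    """
    beam_width = len(topk_parents)
    # Snapshot the parent rows first, so overwriting a row that is itself a
    # parent cannot corrupt another beam's copy.
    parent_prefix = req_to_token[topk_parents, :seq_len].clone()
    # Overwrite: surviving beams inherit their parent's prefix indices and
    # append the slot holding their own new token (no re-prefill of the prefix).
    req_to_token[:beam_width, :seq_len] = parent_prefix
    req_to_token[:beam_width, seq_len] = new_slots
    # Free: a parent whose children were all pruned (the green parent in the
    # figure) releases the last-token slot only it owns; shared prefix slots
    # are assumed to be reference-counted / managed by the cache elsewhere.
    for row in set(alive_rows) - set(topk_parents):
        free_slot(req_to_token[row, seq_len - 1].item())
```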

From the algorithmic perspective, we follow the greedy strategy used in the vLLM version. Support for other beam search variants is left for future work.

We benchmark our implementation for both accuracy and efficiency on Qwen2.5-1.5B-Instruct, with batch size=32, beam width=5, and warmup. The beam search outputs reach 80% accuracy when checked against Transformers, and the top-one beam search outputs score 0.734 on the MMLU benchmark. In terms of efficiency, our implementation introduces at least 65% overhead, but it still significantly outperforms Transformers and the vLLM fork; see below.
[Figure: efficiency benchmark against Transformers and the vLLM fork]

We plan to support:

  • the overlap mode
  • replacing the overwrite operation with a Triton kernel (see the sketch below)
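
As a rough illustration of the second item, a Triton kernel replacing the host-side overwrite could look like the following sketch (hypothetical, not the planned SGLang kernel): each program instance copies one surviving beam's prefix indices from its parent's row of `req_to_token` into the beam's own row, assuming the destination rows are distinct from the parent rows being read.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def copy_parent_rows_kernel(req_to_token_ptr, parent_ids_ptr,
                            seq_len, row_stride, BLOCK: tl.constexpr):
    # One program per surviving beam: copy the parent's first seq_len kv-cache
    # indices into this beam's row, BLOCK elements at a time.
    beam = tl.program_id(0)
    parent = tl.load(parent_ids_ptr + beam)
    offs = tl.arange(0, BLOCK)
    for start in range(0, seq_len, BLOCK):
        idx = start + offs
        mask = idx < seq_len
        vals = tl.load(req_to_token_ptr + parent * row_stride + idx, mask=mask)
        tl.store(req_to_token_ptr + beam * row_stride + idx, vals, mask=mask)

def copy_parent_rows(req_to_token: torch.Tensor, parent_ids: torch.Tensor,
                     seq_len: int, block_size: int = 256):
    # Launch one program instance per surviving beam.
    beam_width = parent_ids.numel()
    copy_parent_rows_kernel[(beam_width,)](
        req_to_token, parent_ids, seq_len, req_to_token.stride(0),
        BLOCK=block_size,
    )
```

The sketch only covers the per-row copy; the freeing logic from the description above would stay on the host.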

Related resources

No response

zhyncs added the enhancement (New feature or request) label on Jan 21, 2025
laixinn mentioned this issue on Jan 23, 2025