Checklist
Motivation
cc @HandH1998 @sleepcoo @ispobock
Beam search is a common decoding method in LLM generation and is supported by several LLM engines, e.g., vLLM and Transformers.
This issue proposes our implementation of beam search support in SGLang and discusses its rationale, similar to an RFC.
vLLM's beam search implementation used to be performant, but in a recent release beam search support was dropped from the core (vllm-project/vllm#6226) and became much slower. Our implementation aims for minimal code modifications and minimal overhead. We found that in vLLM's high-level implementation (https://github.com/vllm-project/vllm/blob/2fc6944c5e69d5d0ce15d09a855452c795d75c3c/vllm/entrypoints/llm.py#L507), each decoding iteration becomes a prefill (max tokens=1), which is the primary source of overhead.
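To make that overhead concrete, the high-level pattern looks roughly like the following. This is a schematic sketch only, not vLLM's actual code; `engine_generate` is a hypothetical stand-in for an engine call that returns the top next-token log probabilities and, because each call is a fresh request limited to one output token, re-prefills the entire prefix.

```python
from typing import Callable, Dict, List, Tuple

def high_level_beam_search(
    engine_generate: Callable[[List[int]], Dict[int, float]],  # hypothetical engine call
    prompt: List[int],
    beam_width: int,
    max_steps: int,
) -> List[Tuple[List[int], float]]:
    # Each beam is (token ids, cumulative log probability).
    beams: List[Tuple[List[int], float]] = [(prompt, 0.0)]
    for _ in range(max_steps):
        candidates: List[Tuple[List[int], float]] = []
        for ids, score in beams:
            # The full sequence is resubmitted as a new request, so the engine
            # runs a prefill over the whole prefix for every beam at every step.
            for token, logprob in engine_generate(ids).items():
                candidates.append((ids + [token], score + logprob))
        # Greedy selection: keep the top-beam-width candidates by cumulative score.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams
```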
Our implementation is shown in the figure below. The shared prefix is determined in the prefilling stage; in the decoding stage, we duplicate the prefix indices when allocating kv cache space for the next decoding iteration. This approach avoids most prefilling work and the need to modify much code to maintain a tree cache for beam search sequences. After that, each iteration overwrites the top-beam-width kv cache indices into the request-to-token map (req_to_token) from the last iteration and frees the parent kv cache indices whose subsequent sequences have all been dropped. In this case, the green parent will be dropped because its children are not within the top-beam-width.
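The per-iteration index bookkeeping can be sketched as follows. This is an illustrative toy, not the actual SGLang code: req_to_token mirrors the request-to-token map described above, while free_slots, parent_rows, child_rows, and chosen_parents are hypothetical names for a simple pool allocator and the beam bookkeeping. Only the dropped parent's own last-token kv slot is freed here; any slot shared with surviving beams is left alone, and deeper-ancestor reference counting is omitted.

```python
import torch

def beam_decode_step(req_to_token, free_slots, parent_rows, child_rows,
                     prefix_len, chosen_parents):
    """One decoding iteration of the kv-index bookkeeping (toy version).

    req_to_token   : [num_rows, max_len] int tensor mapping a request row to
                     its kv cache slot indices
    free_slots     : list of currently free kv cache slot indices
    parent_rows    : req_to_token rows of the current beam-width parents
    child_rows     : rows reserved for the surviving children (disjoint from
                     parent_rows in this toy to avoid aliasing)
    prefix_len     : number of tokens already cached per parent
    chosen_parents : for each surviving child beam, the parent it extends
    """
    for child, parent_idx in enumerate(chosen_parents):
        parent_row = parent_rows[parent_idx]
        # Duplicate the parent's prefix indices into the child row, so the
        # shared prefix is reused without re-running prefill.
        req_to_token[child_rows[child], :prefix_len] = \
            req_to_token[parent_row, :prefix_len]
        # Overwrite the next position with one freshly allocated kv slot for
        # the child's newly decoded token.
        req_to_token[child_rows[child], prefix_len] = free_slots.pop()

    # Free the last-token kv slot of any parent whose children were all
    # pruned out of the top-beam-width (the "green parent" case above).
    surviving = set(chosen_parents)
    for idx, parent_row in enumerate(parent_rows):
        if idx not in surviving:
            free_slots.append(int(req_to_token[parent_row, prefix_len - 1]))

# Toy usage: beam width 2, a cached prefix of 3 tokens, 16 kv slots in total.
req_to_token = torch.zeros(4, 16, dtype=torch.long)
free_slots = list(range(15, 3, -1))
req_to_token[0, :3] = torch.tensor([0, 1, 2])   # parent beam 0
req_to_token[1, :3] = torch.tensor([0, 1, 3])   # parent beam 1
beam_decode_step(req_to_token, free_slots, parent_rows=[0, 1],
                 child_rows=[2, 3], prefix_len=3, chosen_parents=[0, 0])
# Both children extend parent 0, so parent 1's slot 3 is returned to free_slots.
```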
From the algorithmic perspective, we follow the greedy strategy of the vLLM version. Support for other variants of beam search is left for future work.
We benchmarked our implementation for both accuracy and efficiency on Qwen2.5-1.5B-Instruct, with batch size 32, beam width 5, and warmup. The beam search outputs achieve 80% accuracy compared with Transformers, and the top-1 beam search outputs score 0.734 on the MMLU benchmark. In terms of efficiency, our implementation introduces at least 65% overhead but significantly outperforms both Transformers and the vLLM fork; see below.
We plan to support:

- the overlap mode
- replacing the overwrite operation with a Triton kernel
Related resources
No response