[V1][WIP] Hybrid allocator for full attention & sliding window attention interleaved models (Reference PR, do not merge) #11938

Draft · heheda12345 wants to merge 15 commits into main
Conversation

heheda12345 (Collaborator) commented Jan 10, 2025

This PR implements step 1 of #11382, so that:

  1. we don't waste memory on models that interleave sliding window and full attention layers
  2. prefix caching works for sliding window attention, where a cache hit only requires that the tokens inside the sliding window have not been evicted (see the sketch below)
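
A minimal sketch of the sliding-window cache-hit condition in point 2 (a hypothetical helper, not the PR's actual code; block_size, sliding_window, and resident_blocks are illustrative names):

    def sliding_window_hit(num_cached_tokens: int,
                           sliding_window: int,
                           block_size: int,
                           resident_blocks: set[int]) -> bool:
        """Can a cached prefix of num_cached_tokens tokens be reused by a
        sliding-window layer? Only the blocks holding the last sliding_window
        tokens of the prefix must still be resident; earlier blocks may have
        been evicted without breaking the hit."""
        if num_cached_tokens == 0:
            return True
        first_needed_token = max(0, num_cached_tokens - sliding_window)
        first_needed_block = first_needed_token // block_size
        last_block = (num_cached_tokens - 1) // block_size
        return all(b in resident_blocks
                   for b in range(first_needed_block, last_block + 1))

For example, with block_size=16 and sliding_window=4096, reusing a 6144-token prefix only requires blocks 128 through 383 to still be resident; blocks 0 through 127 may already have been evicted.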

Benchmark results (this PR accelerates hybrid models and adds very little overhead on standard full-attention models):

  • this PR:
VLLM_USE_V1=1 python3 benchmark_throughput.py --model google/gemma-2-27b-it --input-len 6144 --output-len 1024 --num-prompts 50
Throughput: 0.17 requests/s, 1239.96 total tokens/s, 177.14 output tokens/s
VLLM_USE_V1=1 python3 benchmark_throughput.py --model meta-llama/Llama-3.1-8B-Instruct --input-len 6144 --output-len 1024 --num-prompts 50
Throughput: 1.48 requests/s, 10609.31 total tokens/s, 1515.62 output tokens/s
  • main:
VLLM_USE_V1=1 python3 benchmark_throughput.py --model google/gemma-2-27b-it --input-len 6144 --output-len 1024 --num-prompts 50
Throughput: 0.15 requests/s, 1077.11 total tokens/s, 153.87 output tokens/s
VLLM_USE_V1=1 python3 benchmark_throughput.py --model meta-llama/Llama-3.1-8B-Instruct --input-len 6144 --output-len 1024 --num-prompts 50
Throughput: 1.49 requests/s, 10682.79 total tokens/s, 1526.11 output tokens/s

Key modifications

  1. KV cache initialization:
     • original workflow:
       1. num_gpu_blocks, _ = self.model_executor.determine_num_available_blocks()
       2. self.model_executor.initialize(num_gpu_blocks) (allocate the KV cache)
     • modified workflow:
       # Get the KV cache spec of each layer by parsing the model
       kv_cache_spec = self.model_executor.get_kv_cache_spec()
       # Get available_gpu_memory via profile_run (instead of having the
       # executor determine num_blocks from it)
       available_gpu_memory = self.model_executor.get_available_memory()
       # EngineCore decides the page size & how to create each KV cache tensor
       kv_cache_config, num_gpu_blocks = get_kv_cache_config(
           vllm_config, kv_cache_spec, available_gpu_memory)
       # The executor initializes the KV cache based on that decision
       self.model_executor.initialize(kv_cache_config)
  2. grouped allocation:
     • original: one KVCacheManager that allocates memory for all layers
     • modified (see the sketch after this list):
       • multiple KVCacheManagers, one for each group of layers (see "group the layers" for details). All KVCacheManagers use the same page size and allocate memory from the same pool.
       • add group_id to kv_block_hash
       • block_table in the worker: add a new dimension for groups
       • two KVCacheManager implementations, for full attention and sliding window attention respectively
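
As a rough illustration of the new initialization flow and the grouped managers, here is a minimal sketch (hypothetical names and fields such as KVCacheSpec and page_size_bytes; not the actual vLLM code) of grouping layers by their KV cache spec and sizing the shared block pool from the profiled memory:

    from collections import defaultdict
    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class KVCacheSpec:
        """Per-layer KV cache requirement (illustrative fields only)."""
        attn_type: str                    # "full_attention" or "sliding_window"
        page_size_bytes: int              # bytes one block occupies for this layer
        sliding_window: Optional[int] = None

    def get_kv_cache_config(kv_cache_spec: dict, available_gpu_memory: int):
        # Layers with identical specs share one KVCacheManager (one "group").
        groups = defaultdict(list)
        for layer_name, spec in kv_cache_spec.items():
            groups[spec].append(layer_name)
        # Every group uses the same page size and draws blocks from one shared
        # pool, so the pool is sized by the bytes a block costs across all layers.
        bytes_per_block = sum(spec.page_size_bytes for spec in kv_cache_spec.values())
        num_gpu_blocks = available_gpu_memory // bytes_per_block
        return {"groups": dict(groups), "num_gpu_blocks": num_gpu_blocks}, num_gpu_blocks

    # Example: one sliding-window layer interleaved with one full-attention layer.
    spec = {
        "layers.0.attn": KVCacheSpec("sliding_window", 128 * 1024, sliding_window=4096),
        "layers.1.attn": KVCacheSpec("full_attention", 128 * 1024),
    }
    kv_cache_config, num_gpu_blocks = get_kv_cache_config(spec, 8 * 1024**3)

Sharing one pool is what avoids wasting memory on a hybrid model: blocks not needed by the sliding-window group remain available to the full-attention group.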

I plan to split it into the following PRs:

  1. KV cache initialization, as discussed above.
  2. add a new "group" dimension to the block_table, to represent the different memory allocated for different types of KV cache (see the sketch after this list).
  3. change AttentionMetadata to dict[layer_name, AttentionMetadata]
  4. a very large PR implementing HybridKVCacheManager, a pluggable alternative to KVCacheManager that won't touch the code path for standard models with only full attention layers.
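
To make items 2 and 3 concrete, here is a small hypothetical sketch (shapes, layer names, and the AttentionMetadata fields are illustrative, not the PR's actual code) of a block table with a leading group dimension and of attention metadata keyed by layer name:

    from dataclasses import dataclass

    import torch

    @dataclass
    class AttentionMetadata:
        """Placeholder for whatever per-layer metadata an attention backend needs."""
        seq_lens: torch.Tensor
        block_table: torch.Tensor   # [max_num_reqs, max_blocks_per_req] for this layer's group

    num_groups = 2          # e.g. group 0: full attention, group 1: sliding window
    max_num_reqs = 8
    max_blocks_per_req = 64

    # The worker-side block table gains a leading "group" dimension; each group's
    # KVCacheManager fills its own slice, while all block ids index one shared pool.
    block_table = torch.zeros(num_groups, max_num_reqs, max_blocks_per_req,
                              dtype=torch.int32)

    # AttentionMetadata becomes dict[layer_name, AttentionMetadata], so interleaved
    # layers can each look up the block-table slice of the group they belong to.
    layer_to_group = {"layers.0.attn": 1, "layers.1.attn": 0}
    attn_metadata: dict[str, AttentionMetadata] = {
        name: AttentionMetadata(
            seq_lens=torch.zeros(max_num_reqs, dtype=torch.int32),
            block_table=block_table[group],
        )
        for name, group in layer_to_group.items()
    }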


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀


mergify bot commented Jan 10, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Jan 10, 2025
heheda12345 changed the title from [V1][WIP] Hybrid allocator for full attention & sliding window attention interleaved models to [V1][WIP] Hybrid allocator for full attention & sliding window attention interleaved models (Reference PR, do not merge) Jan 11, 2025