
[V1] Move more control of kv cache initialization from model_executor to EngineCore #11960

Open · wants to merge 6 commits into main
Conversation

@heheda12345 (Collaborator) commented Jan 11, 2025

This PR changes the workflow of EngineCore._initialize_kv_caches to enable more flexible control of the kv cache format in the future.
It is split from #11938 and is preparation for #11382
Original workflow:

num_gpu_blocks, _ = self.model_executor.determine_num_available_blocks()
self.model_executor.initialize(num_gpu_blocks)

New workflow:

# Get all kv cache tensor needed by the model
kv_cache_spec = self.model_executor.get_kv_cache_spec()

# Profiles the peak memory usage of the model to determine how much
# memory can be allocated for kv cache.
available_gpu_memory = self.model_executor.get_available_memory()

# Get the kv cache tensor size
kv_cache_config, num_gpu_blocks = get_kv_cache_config(
    vllm_config, kv_cache_spec, available_gpu_memory)


# Initialize kv cache and warmup the execution
self.model_executor.initialize(kv_cache_config)
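
For intuition, the core arithmetic behind get_kv_cache_config is a block-count computation: divide the available GPU memory by the number of bytes one block costs across all layers. Below is a minimal sketch under the current same-size assumption; the field name page_size_bytes is an illustrative guess at the spec's contents, not this PR's exact API:

def num_gpu_blocks_sketch(kv_cache_spec: dict, available_gpu_memory: int) -> int:
    # Every attention layer needs one page per block, so one block costs
    # the sum of the per-layer page sizes.
    bytes_per_block = sum(
        spec.page_size_bytes for spec in kv_cache_spec.values())
    return available_gpu_memory // bytes_per_block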

This PR introduces two new concepts (a sketch of both structures follows this list):

  1. KVCacheSpec, a data structure representing the kv cache needed by each attention layer, constructed by asking the model runner to analyze all Attention modules. More spec types will be added in the future, e.g., SlidingWindowSpec, MLASpec.
  2. KVCacheConfig, a class representing how to allocate the kv cache tensors. It is quite simple now, i.e., all tensors have the same size, but it may be extended to the following cases:
    1. tensors with different sizes, to support MLA & spec decode
    2. allocating a global buffer and making the kv_cache tensors point to different offsets, so that multiple types of layers can share the same memory pool.
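
A hedged sketch of what the two structures could look like; the class and field names are assumptions inferred from the description above, not the code in this PR. page_size_bytes is the same per-layer quantity used in the block-count sketch earlier.

from dataclasses import dataclass

import torch

@dataclass
class FullAttentionSpec:
    """KVCacheSpec entry for one standard full-attention layer (assumed)."""
    block_size: int      # tokens per kv cache block
    num_kv_heads: int
    head_size: int
    dtype: torch.dtype

    @property
    def page_size_bytes(self) -> int:
        # Factor 2 accounts for storing both K and V.
        # torch.dtype.itemsize is available in PyTorch >= 2.1.
        return (2 * self.block_size * self.num_kv_heads
                * self.head_size * self.dtype.itemsize)

# KVCacheSpec: maps each attention layer's name to its spec. Future entry
# types (SlidingWindowSpec, MLASpec, ...) would live beside FullAttentionSpec.
KVCacheSpec = dict[str, FullAttentionSpec]

@dataclass
class KVCacheConfig:
    """How to allocate the kv cache tensors (all same-sized for now)."""
    num_blocks: int
    # Per-layer tensor sizes in bytes; identical today, but they could
    # differ (MLA, spec decode) or become offsets into one shared buffer.
    tensor_sizes: dict[str, int]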

Signed-off-by: Chen Zhang <[email protected]>

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@heheda12345 heheda12345 changed the title [V1] Move more control of kv cache initialization from model_executor to to EngineCore [V1] Move more control of kv cache initialization from model_executor to EngineCore Jan 11, 2025
@comaniac comaniac self-assigned this Jan 11, 2025