
[V1] Move more control of kv cache initialization from model_executor to EngineCore #11960

Open · wants to merge 6 commits into main
Conversation

@heheda12345 (Collaborator) commented Jan 11, 2025

This PR changes the workflow of EngineCore._initialize_kv_caches to enable more flexible control of the kv cache format in the future.
It is split from #11938 and is preparation for #11382
Original workflow:

num_gpu_blocks, _ = self.model_executor.determine_num_available_blocks()
self.model_executor.initialize(num_gpu_blocks)

New workflow:

# Get all kv cache tensor needed by the model
kv_cache_spec = self.model_executor.get_kv_cache_spec()

# Profiles the peak memory usage of the model to determine how much
# memory can be allocated for kv cache.
available_gpu_memory = self.model_executor.get_available_memory()

# Get the kv cache tensor size
kv_cache_config, num_gpu_blocks = get_kv_cache_config(
    vllm_config, kv_cache_spec, available_gpu_memory)


# Initialize kv cache and warmup the execution
self.model_executor.initialize(kv_cache_config)
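
For intuition, the core arithmetic behind get_kv_cache_config is a block-count computation: divide the available GPU memory by the number of bytes one block costs across all layers. Below is a minimal sketch under the current same-size assumption; the field name page_size_bytes is an illustrative guess at the spec's contents, not this PR's exact API:

def num_gpu_blocks_sketch(kv_cache_spec: dict, available_gpu_memory: int) -> int:
    # Every attention layer needs one page per block, so one block costs
    # the sum of the per-layer page sizes.
    bytes_per_block = sum(
        spec.page_size_bytes for spec in kv_cache_spec.values())
    return available_gpu_memory // bytes_per_block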

This PR introduces two new concepts (a sketch of both structures follows this list):

  1. KVCacheSpec, a data structure representing the kv cache needed by each attention layer, constructed by asking the model runner to analyze all Attention modules. More spec types will be added in the future, e.g., SlidingWindowSpec, MLASpec.
  2. KVCacheConfig, a class representing how to allocate the kv cache tensors. It is quite simple now, i.e., all tensors have the same size, but it may be extended to the following cases:
    1. tensors with different sizes, to support MLA & spec decode
    2. allocating a global buffer and making the kv_cache tensors point to different offsets, so that multiple types of layers can share the same memory pool.
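
A hedged sketch of what the two structures could look like; the class and field names are assumptions inferred from the description above, not the code in this PR. page_size_bytes is the same per-layer quantity used in the block-count sketch earlier.

from dataclasses import dataclass

import torch

@dataclass
class FullAttentionSpec:
    """KVCacheSpec entry for one standard full-attention layer (assumed)."""
    block_size: int      # tokens per kv cache block
    num_kv_heads: int
    head_size: int
    dtype: torch.dtype

    @property
    def page_size_bytes(self) -> int:
        # Factor 2 accounts for storing both K and V.
        # torch.dtype.itemsize is available in PyTorch >= 2.1.
        return (2 * self.block_size * self.num_kv_heads
                * self.head_size * self.dtype.itemsize)

# KVCacheSpec: maps each attention layer's name to its spec. Future entry
# types (SlidingWindowSpec, MLASpec, ...) would live beside FullAttentionSpec.
KVCacheSpec = dict[str, FullAttentionSpec]

@dataclass
class KVCacheConfig:
    """How to allocate the kv cache tensors (all same-sized for now)."""
    num_blocks: int
    # Per-layer tensor sizes in bytes; identical today, but they could
    # differ (MLA, spec decode) or become offsets into one shared buffer.
    tensor_sizes: dict[str, int]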

Signed-off-by: Chen Zhang <[email protected]>

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@heheda12345 heheda12345 changed the title [V1] Move more control of kv cache initialization from model_executor to to EngineCore [V1] Move more control of kv cache initialization from model_executor to EngineCore Jan 11, 2025
@comaniac comaniac self-assigned this Jan 11, 2025