[V1] Move more control of kv cache initialization from model_executor to EngineCore #11960
+409
−101
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This pr changes the workflow of
EngineCore._initialize_kv_caches
to enable more flexible control of kv cache format in the future.It is splitted from #11938 and is a preparation for #11382
Original workflow:
New workflow:
This pr introduces 2 new concepts:
KVCacheSpec
, a data structure to represent the kv cache needed by each attention layer, which is constructed by asking the model runner to analyze all Attention modules. Will add more types of Spec in the future, e.g.,SlidingWindowSpec
,MLASpec
KVCacheConfig
, a class to represent how to allocate the kv cache Tensor. It is quite simple now, i.e., tensors with the same size. But it may be extended to the following cases: