
Prompt caching in mlx_lm.server #1026

Merged
awni merged 5 commits into main from server_cache on Oct 14, 2024

Conversation

awni (Member) commented on Oct 9, 2024:

Added a basic prompt cache in mlx_lm.server for chat mode; it supports chatting with cache reuse.
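
For readers unfamiliar with how such a cache works: the usual approach is to keep the KV cache from the previous request together with the token IDs it covers, and then run the model only on the portion of the new prompt that extends that prefix, which is exactly the pattern a multi-turn chat produces. The sketch below is a hypothetical illustration of that bookkeeping, not the code from this PR; the names CachedPrompt, common_prefix_len, and reuse_prefix are invented for the example (the real implementation lives in llms/mlx_lm/server.py).

```python
# Hypothetical sketch of prefix-based prompt-cache reuse in a chat server.
# None of these names come from mlx_lm; they only illustrate the idea.
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class CachedPrompt:
    tokens: List[int] = field(default_factory=list)  # token IDs already processed
    cache: Any = None                                 # opaque per-layer KV cache


def common_prefix_len(a: List[int], b: List[int]) -> int:
    """Number of leading tokens shared by the cached and the new prompt."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reuse_prefix(cached: CachedPrompt, prompt: List[int]) -> List[int]:
    """Return only the suffix of `prompt` that still needs a forward pass."""
    shared = common_prefix_len(cached.tokens, prompt)
    if shared < len(cached.tokens):
        # The conversation diverged from what is cached: start over.
        cached.cache = None
        shared = 0
    # Keep at least one token so the model always has new input to process.
    shared = min(shared, len(prompt) - 1)
    cached.tokens = list(prompt)
    return prompt[shared:]
```

In chat mode each new request repeats the whole conversation so far plus the latest user turn, so the shared prefix is typically everything except that last turn, and only the new tokens need to be evaluated.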

awni requested a review from angeloskath on October 9, 2024 at 19:47.
chimezie (Contributor) commented:

This would not be backwards compatible with any later incorporation of batched input for generate (i.e., #948).

chimezie (Contributor) left a review comment:

Leave the door open for #948.

@@ -474,15 +531,15 @@ def handle_completion(
 def handle_stream(
     self,
-    prompt: mx.array,
+    prompt: List[int],

chimezie (Contributor) commented on this change:

This in particular. Can't we handle single or multiple (batched) prompts, falling back to the single-prompt behavior by default?

awni (Member, Author) replied:

We can always update this type to List[List[int]] when the time comes.
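
As a hypothetical illustration of how that widening could look later (not part of this PR), the handler could accept either a single tokenized prompt or a batch and normalize to a batch internally, so single-prompt callers keep working unchanged:

```python
# Hypothetical sketch of widening the prompt type for batched input (#948).
from typing import List, Union


def normalize_prompts(prompt: Union[List[int], List[List[int]]]) -> List[List[int]]:
    """Accept a single tokenized prompt or a batch; always return a batch."""
    if prompt and isinstance(prompt[0], int):
        return [prompt]  # single prompt becomes a batch of one
    return prompt        # already a batch


# A single prompt falls back to the batch-of-one path by default.
assert normalize_prompts([1, 2, 3]) == [[1, 2, 3]]
assert normalize_prompts([[1, 2], [3, 4]]) == [[1, 2], [3, 4]]
```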

angeloskath (Member) left a review:

Looks fantastic! I left a few comments that may or may not need addressing.

Review comment threads on llms/mlx_lm/server.py were resolved (one marked outdated).
awni merged commit 605c485 into main on Oct 14, 2024 (2 checks passed).
awni deleted the server_cache branch on October 14, 2024 at 17:57.