Enable caching for 'generate' and 'stream_generate' functions to ensure persistence of cache across multiple requests #989

Closed
wants to merge 3 commits

Conversation

@nath1295 commented Sep 17, 2024

  1. Add two new data classes, CacheHistory and StepOutput, for storing the cache history along with the token history.
  2. Add the option to return the cache from "generate" and "stream_generate" for further cache reuse.
  3. Add two functions to save and load the cache from disk.
  4. The "prompt" argument in "generate" and "stream_generate" is no longer treated as a suffix of the cache history; it is now the full prompt. In "generate_step", a check finds the length of the longest shared prefix between the token ids of the new prompt and the token ids of the cached prompt, so only the remaining tokens need to be processed (see the sketch after this list).

Usage

from mlx_lm import load, stream_generate
from mlx_lm.utils import save_cache, load_cache

model, tokenizer = load('/Path/to/model')

prompt = 'Your long prompt here...'

# First generation without prompt cache history
for i, cache in stream_generate(model=model, 
        tokenizer=tokenizer, prompt=prompt, max_tokens=100, return_cache=True, verbose=True):
    print(i, end='')
# Processing prompt (1431/1431): 100%|██████████| 3/3 [00:02<00:00,  1.50it/s]
# Prompt preprocessing time for 1431 tokens: 2.007s (713.1801 tok/sec)

# Second generation with prompt cache history
new = ''  # accumulate the generated text
for i, cache in stream_generate(model=model, 
        tokenizer=tokenizer, prompt=prompt, max_tokens=100, return_cache=True, verbose=True, cache_history=cache):
    new += i
    print(i, end='')
# Processing prompt (1/1): 100%|██████████| 1/1 [00:00<00:00, 595.61it/s]
# Prompt preprocessing time for 1 tokens: 0.001921s (520.6299 tok/sec)

# Save the cache history to use later
save_cache(cache, filename='test.safetensors', metadata=dict(model_id='My random model'))

# Load an existing cache from disk
cache, metadata = load_cache(filename='test.safetensors')
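
For reference, a hedged sketch of what save_cache and load_cache could do internally, assuming the cache is a list of per-layer objects exposing keys and values mlx arrays; the field names and layout here are assumptions, not necessarily the PR's actual implementation.

import mlx.core as mx

def save_cache_sketch(cache, filename, metadata=None):
    # Flatten each layer's key/value arrays into one dict for safetensors.
    arrays = {}
    for i, layer in enumerate(cache):
        arrays[f'layer_{i}_keys'] = layer.keys
        arrays[f'layer_{i}_values'] = layer.values
    mx.save_safetensors(filename, arrays, metadata=metadata or {})

def load_cache_sketch(filename):
    # Returns the flat dict of arrays plus the stored metadata.
    arrays, metadata = mx.load(filename, return_metadata=True)
    return arrays, metadata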

@nath1295 changed the title from "Cache prompt with "generate" and "generate_stream" in python function" to "Enable caching for 'generate' and 'stream_generate' functions to ensure persistence of cache across multiple requests" on Sep 18, 2024
@nath1295 (Author)

Just updating the title of the PR for clarity. With these changes, the KV cache from any generation can now be reused for other requests.

@nath1295 (Author)

The code in server.py has been updated to match the changes to generate_step, so prompt caching is enabled in server.py by default (a client-side illustration follows below).
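
As an illustration only (not part of this PR's diff), a client-side sketch of how server-side prompt caching would help; it assumes the server is running via python -m mlx_lm.server on the default 127.0.0.1:8080 and uses its /v1/chat/completions endpoint.

import requests

URL = 'http://127.0.0.1:8080/v1/chat/completions'
long_system_prompt = 'Your long system prompt here...'

def ask(question):
    # Both requests share the same long system prompt prefix.
    payload = {
        'messages': [
            {'role': 'system', 'content': long_system_prompt},
            {'role': 'user', 'content': question},
        ],
        'max_tokens': 100,
    }
    return requests.post(URL, json=payload).json()

# The first request processes the full prompt; the second shares the cached
# prefix, so only the new user message needs to be processed.
print(ask('Summarize the document.'))
print(ask('List three key points.'))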

@awni (Member) commented Oct 12, 2024

Thanks for the PR! However, most of this functionality should already be included in #1015 and #1026, so I will close this.

If there is anything here that those don't address, please feel free to submit a follow-up PR rebased on the latest. Thanks!

@awni closed this Oct 12, 2024