
Use HuggingFace's new KV Cache Implementation #9

Open
mostafaelhoushi opened this issue Oct 20, 2024 · 1 comment
Comments

@mostafaelhoushi (Contributor)

In order to enable Llama 3.2 1B (see #8), we had to upgrade from transformers v4.34.1 to v4.45.2.

This new version of transformers refactored the KV cache into a more efficient implementation (the new Cache classes). Adopting it would have required us to refactor forward_early(...) and forward_remainder(...) in self_speculation/llama_model_utils.py, so instead we opted to keep using the less efficient legacy KV cache.
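
For reference, here is a minimal sketch (not the repo's code) of the interface difference, assuming transformers v4.45.x: the legacy cache is a tuple of per-layer (key, value) tensor pairs, while the new API wraps the same tensors in a DynamicCache object and supports converting in both directions.

```python
# Minimal sketch of legacy vs. new KV cache interfaces, assuming transformers v4.45.x.
# Shapes and sizes below are illustrative only.
import torch
from transformers.cache_utils import DynamicCache

batch, n_heads, seq_len, head_dim, n_layers = 1, 8, 4, 64, 2

# Legacy format: tuple of (key, value) tensor pairs, one per layer.
legacy_cache = tuple(
    (torch.randn(batch, n_heads, seq_len, head_dim),
     torch.randn(batch, n_heads, seq_len, head_dim))
    for _ in range(n_layers)
)

# New format: a DynamicCache object; conversion works in both directions.
cache = DynamicCache.from_legacy_cache(legacy_cache)
print(cache.get_seq_length())          # 4

# Inside a layer's forward pass, new key/value states are appended in place.
new_k = torch.randn(batch, n_heads, 1, head_dim)
new_v = torch.randn(batch, n_heads, 1, head_dim)
keys, values = cache.update(new_k, new_v, layer_idx=0)
print(keys.shape)                      # torch.Size([1, 8, 5, 64])

# Convert back if legacy tuples are still needed elsewhere.
legacy_again = cache.to_legacy_cache()
```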

To ensure an apples-to-apples comparison, in 62debc0 we changed autoregressive decoding to use the legacy cache as well.

Ideally, we should update forward_early(...) and forward_remainder(...) to use transformers' new, more efficient KV cache implementation.
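
One possible direction, sketched below under assumptions (the names exit_layer and n_layers and the tensor shapes are illustrative, not the actual signatures in self_speculation/llama_model_utils.py): both passes could share a single DynamicCache, with forward_early(...) populating only the layers below the exit layer and forward_remainder(...) appending entries for the remaining layers of the same object.

```python
# Hypothetical illustration of how the two self-speculation passes could share
# one DynamicCache; not the repo's actual code.
import torch
from transformers.cache_utils import DynamicCache

n_layers, exit_layer = 4, 2              # assumed values for illustration
batch, n_heads, head_dim = 1, 8, 64

cache = DynamicCache()

# "forward_early": only the first `exit_layer` layers write to the cache.
for layer_idx in range(exit_layer):
    k = torch.randn(batch, n_heads, 1, head_dim)
    v = torch.randn(batch, n_heads, 1, head_dim)
    cache.update(k, v, layer_idx)

# "forward_remainder": the verification pass fills in the remaining layers,
# reusing the entries already written by the early pass.
for layer_idx in range(exit_layer, n_layers):
    k = torch.randn(batch, n_heads, 1, head_dim)
    v = torch.randn(batch, n_heads, 1, head_dim)
    cache.update(k, v, layer_idx)

print(cache.get_seq_length())   # 1 token cached so far
print(len(cache.key_cache))     # 4 layers now populated
```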

@HimanshuJanbandhu (Contributor) commented Dec 16, 2024

Hi @mostafaelhoushi, I would like to contribute to this.
I would need a bit more clarity on the task.
