
Use HuggingFace's new KV Cache Implementation #9

Open
mostafaelhoushi opened this issue Oct 20, 2024 · 1 comment
Comments

@mostafaelhoushi (Contributor)

In order to enable Llama 3.2 1B (see #8), we had to upgrade from transformers v4.34.1 to v4.45.2.

This new version of transformers refactored the KV cache into a more efficient implementation (the new Cache classes). Adopting it would have required us to refactor forward_early(...) and forward_remainder(...) in self_speculation/llama_model_utils.py, so instead we opted to keep using the less efficient legacy KV cache.
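
For reference, here is a minimal sketch (not the repo's code) of the interface difference, assuming transformers v4.45.x: the legacy cache is a tuple of per-layer (key, value) tensor pairs, while the new API wraps the same tensors in a DynamicCache object and supports converting in both directions.

```python
# Minimal sketch of legacy vs. new KV cache interfaces, assuming transformers v4.45.x.
# Shapes and sizes below are illustrative only.
import torch
from transformers.cache_utils import DynamicCache

batch, n_heads, seq_len, head_dim, n_layers = 1, 8, 4, 64, 2

# Legacy format: tuple of (key, value) tensor pairs, one per layer.
legacy_cache = tuple(
    (torch.randn(batch, n_heads, seq_len, head_dim),
     torch.randn(batch, n_heads, seq_len, head_dim))
    for _ in range(n_layers)
)

# New format: a DynamicCache object; conversion works in both directions.
cache = DynamicCache.from_legacy_cache(legacy_cache)
print(cache.get_seq_length())          # 4

# Inside a layer's forward pass, new key/value states are appended in place.
new_k = torch.randn(batch, n_heads, 1, head_dim)
new_v = torch.randn(batch, n_heads, 1, head_dim)
keys, values = cache.update(new_k, new_v, layer_idx=0)
print(keys.shape)                      # torch.Size([1, 8, 5, 64])

# Convert back if legacy tuples are still needed elsewhere.
legacy_again = cache.to_legacy_cache()
```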

To ensure an apples-to-apples comparison, in 62debc0 we changed autoregressive decoding to use the legacy cache as well.

Ideally, we should update forward_early(...) and forward_remainder(...) to use transformers' new, more efficient KV cache implementation.
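
One possible direction, sketched below under assumptions (the names exit_layer and n_layers and the tensor shapes are illustrative, not the actual signatures in self_speculation/llama_model_utils.py): both passes could share a single DynamicCache, with forward_early(...) populating only the layers below the exit layer and forward_remainder(...) appending entries for the remaining layers of the same object.

```python
# Hypothetical illustration of how the two self-speculation passes could share
# one DynamicCache; not the repo's actual code.
import torch
from transformers.cache_utils import DynamicCache

n_layers, exit_layer = 4, 2              # assumed values for illustration
batch, n_heads, head_dim = 1, 8, 64

cache = DynamicCache()

# "forward_early": only the first `exit_layer` layers write to the cache.
for layer_idx in range(exit_layer):
    k = torch.randn(batch, n_heads, 1, head_dim)
    v = torch.randn(batch, n_heads, 1, head_dim)
    cache.update(k, v, layer_idx)

# "forward_remainder": the verification pass fills in the remaining layers,
# reusing the entries already written by the early pass.
for layer_idx in range(exit_layer, n_layers):
    k = torch.randn(batch, n_heads, 1, head_dim)
    v = torch.randn(batch, n_heads, 1, head_dim)
    cache.update(k, v, layer_idx)

print(cache.get_seq_length())   # 1 token cached so far
print(len(cache.key_cache))     # 4 layers now populated
```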

@HimanshuJanbandhu (Contributor) commented Dec 16, 2024

Hi @mostafaelhoushi, I would like to contribute to this.
I would need a bit more clarity on the task.
