In order to enable Llama3.2 1B (see #8), we had to upgrade from transformers v4.34.1 to v4.45.2.
This new version of transformers refactored the KV cache into a more efficient implementation; adopting it would have required refactoring forward_early(...) and forward_remainder(...) in self_speculation/llama_model_utils.py. Instead, we opted to keep using the less efficient legacy KV cache.
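For context, the difference between the two cache representations in transformers v4.45.x, and how to convert between them, is roughly as follows. This is a minimal, self-contained sketch (random tensors, made-up shapes), not code from this repo:

```python
import torch
from transformers import DynamicCache

num_layers, batch, heads, seq, head_dim = 2, 1, 4, 5, 16

# Legacy format: a tuple with one (key, value) tensor pair per decoder layer.
legacy = tuple(
    (torch.randn(batch, heads, seq, head_dim), torch.randn(batch, heads, seq, head_dim))
    for _ in range(num_layers)
)

# New format: a Cache object that decoder layers grow in place via cache.update(...).
cache = DynamicCache.from_legacy_cache(legacy)
print(cache.get_seq_length())  # 5

# A decoder layer appends the new token's keys/values for its layer index
# and gets back the full concatenated keys/values for that layer.
new_k = torch.randn(batch, heads, 1, head_dim)
new_v = torch.randn(batch, heads, 1, head_dim)
k_all, v_all = cache.update(new_k, new_v, layer_idx=0)
print(k_all.shape)  # torch.Size([1, 4, 6, 16])

# Convert back for code paths that still expect the legacy tuples.
legacy_again = cache.to_legacy_cache()
print(len(legacy_again), legacy_again[0][0].shape)
```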
To ensure an apples-to-apples comparison, in 62debc0 we changed autoregressive decoding to use the legacy cache as well.
Ideally, we should update forward_early(...) and forward_remainder(...) to use transformers' new, more efficient KV cache implementation.
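One possible shape of that change is to thread a DynamicCache through the forward pass and reuse it across decoding steps, instead of passing legacy tuples around. The sketch below runs against the stock transformers LlamaForCausalLM forward; the checkpoint name and the greedy loop are assumptions for illustration, not the actual forward_early(...) / forward_remainder(...) code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_name = "meta-llama/Llama-3.2-1B"  # assumed; requires access to the gated checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

input_ids = tokenizer("Self-speculative decoding", return_tensors="pt").input_ids
cache = DynamicCache()  # updated in place as the model calls cache.update(...) per layer

generated = []
with torch.no_grad():
    for _ in range(20):
        out = model(input_ids=input_ids, past_key_values=cache, use_cache=True)
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)
        # Only the new token is fed back in; the keys/values of everything
        # before it already live in `cache`.
        input_ids = next_token

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
print(cache.get_seq_length())  # prompt length + 20 generated tokens
```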