Is GQA available for 7B and 13B Llama2 models? #55
-
GQA produces a smaller KV cache, so a larger context fits in VRAM. I'm confused by exllamav2's README, specifically: "this scheme allows Llama2 70B to run on a single 24 GB GPU with a 2048-token context, producing coherent and mostly stable output with 2.55 bits per weight. 13B models run at 2.65 bits within 8 GB of VRAM, although currently none of them uses GQA which effectively limits the context size to 2048." Is Llama2 13B trained with both (MQA?) and GQA? Or do you mean the 34B and 70B models are the ones missing GQA?
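For context, here is a rough sketch of why GQA shrinks the cache, using the published Llama2 config values (layer count, KV head count, head dimension). This is just the arithmetic with an assumed fp16 cache, not exllamav2's actual cache implementation:

```python
# Back-of-the-envelope KV-cache sizing from model config values.
# Assumes an fp16 cache (2 bytes per element); batch size 1.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # K and V tensors per layer, hence the factor of 2
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Llama2-13B: 40 layers, 40 attention heads, no GQA -> 40 KV heads
print(kv_cache_bytes(40, 40, 128, 2048) / 2**30)  # ~1.56 GiB at 2048 tokens

# Llama2-70B: 80 layers, 64 attention heads, GQA with 8 KV heads
print(kv_cache_bytes(80, 8, 128, 2048) / 2**30)   # ~0.63 GiB at 2048 tokens
```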
Replies: 1 comment
-
I meant that no 13B models currently use GQA. Even though Meta released Llama2-70B and CodeLlama-34B with GQA, none of the other models use it.
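In case it helps: one quick way to check whether a given checkpoint uses GQA is to compare `num_key_value_heads` against `num_attention_heads` in its config. A minimal sketch, assuming the `transformers` library and the standard `meta-llama` Hugging Face model ids:

```python
from transformers import AutoConfig

def uses_gqa(model_id: str) -> bool:
    cfg = AutoConfig.from_pretrained(model_id)
    # Fewer KV heads than query heads means GQA (or MQA if it is 1).
    kv_heads = getattr(cfg, "num_key_value_heads", cfg.num_attention_heads)
    return kv_heads < cfg.num_attention_heads

print(uses_gqa("meta-llama/Llama-2-70b-hf"))  # True  (8 KV heads vs 64 query heads)
print(uses_gqa("meta-llama/Llama-2-13b-hf"))  # False (40 KV heads == 40 query heads)
```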