Is GQA available for 7B and 13B Llama2 models? #55
-
GQA produces a smaller KV cache, so a larger context fits in VRAM. I'm confused by exllamav2's README, specifically: "this scheme allows Llama2 70B to run on a single 24 GB GPU with a 2048-token context, producing coherent and mostly stable output with 2.55 bits per weight. 13B models run at 2.65 bits within 8 GB of VRAM, although currently none of them uses GQA which effectively limits the context size to 2048." Is Llama2 13B trained with both (MQA?) and GQA? Or do you mean the 34B and 70B models are the ones missing GQA?
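For context, here is a rough sketch of why GQA shrinks the cache, using the published Llama2 config values (layer count, KV head count, head dimension). This is just the arithmetic with an assumed fp16 cache, not exllamav2's actual cache implementation:

```python
# Back-of-the-envelope KV-cache sizing from model config values.
# Assumes an fp16 cache (2 bytes per element); batch size 1.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # K and V tensors per layer, hence the factor of 2
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Llama2-13B: 40 layers, 40 attention heads, no GQA -> 40 KV heads
print(kv_cache_bytes(40, 40, 128, 2048) / 2**30)  # ~1.56 GiB at 2048 tokens

# Llama2-70B: 80 layers, 64 attention heads, GQA with 8 KV heads
print(kv_cache_bytes(80, 8, 128, 2048) / 2**30)   # ~0.63 GiB at 2048 tokens
```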
Replies: 1 comment
-
I meant that no 13B models currently use GQA. Even though Meta released Llama2-70B and CodeLlama-34B with GQA, none of the other models use it.
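In case it helps: one quick way to check whether a given checkpoint uses GQA is to compare `num_key_value_heads` against `num_attention_heads` in its config. A minimal sketch, assuming the `transformers` library and the standard `meta-llama` Hugging Face model ids:

```python
from transformers import AutoConfig

def uses_gqa(model_id: str) -> bool:
    cfg = AutoConfig.from_pretrained(model_id)
    # Fewer KV heads than query heads means GQA (or MQA if it is 1).
    kv_heads = getattr(cfg, "num_key_value_heads", cfg.num_attention_heads)
    return kv_heads < cfg.num_attention_heads

print(uses_gqa("meta-llama/Llama-2-70b-hf"))  # True  (8 KV heads vs 64 query heads)
print(uses_gqa("meta-llama/Llama-2-13b-hf"))  # False (40 KV heads == 40 query heads)
```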