Add q-cache 6 and 8 support for Exllamav2 #6280

Open · wants to merge 1 commit into dev
Conversation

randoentity (Contributor)

Checklist:

@GodEmperor785 (Contributor)

@oobabooga could this be merged into main? It would be useful for models that can become unstable with Q4 cache quantization (such as Qwen or Mistral Nemo, as some people have reported). Also, the current 8-bit cache implementation seems outdated: the author of exllamav2 says the new Q8 cache is better, and even Q4 can outperform the old 8-bit cache while using less memory.
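For reference, here is a minimal sketch of how the extra cache types could be wired up, assuming the `ExLlamaV2Cache_Q6` and `ExLlamaV2Cache_Q8` classes exported by recent exllamav2 releases; this is not the PR's actual diff, and `make_cache`, `CACHE_CLASSES`, and the `cache_type` key names are hypothetical, for illustration only:

```python
# Sketch only: map a hypothetical cache_type setting to exllamav2's cache classes.
# ExLlamaV2Cache_Q6 / ExLlamaV2Cache_Q8 require a recent exllamav2 release.
from exllamav2 import (
    ExLlamaV2Cache,       # FP16 cache (the default)
    ExLlamaV2Cache_8bit,  # older 8-bit (FP8-style) cache
    ExLlamaV2Cache_Q4,    # quantized caches: Q4 / Q6 / Q8
    ExLlamaV2Cache_Q6,
    ExLlamaV2Cache_Q8,
)

# Hypothetical lookup table; the key names are illustrative, not the PR's.
CACHE_CLASSES = {
    "fp16": ExLlamaV2Cache,
    "8bit": ExLlamaV2Cache_8bit,
    "q4": ExLlamaV2Cache_Q4,
    "q6": ExLlamaV2Cache_Q6,
    "q8": ExLlamaV2Cache_Q8,
}

def make_cache(model, cache_type="fp16", lazy=False):
    """Instantiate the requested KV-cache variant for a loaded ExLlamaV2 model.

    lazy=True defers allocation, as used with autosplit model loading.
    """
    try:
        cls = CACHE_CLASSES[cache_type.lower()]
    except KeyError:
        raise ValueError(f"Unknown cache_type: {cache_type!r}") from None
    return cls(model, lazy=lazy)
```

Q6 sits roughly between Q4 and Q8 in both memory use and accuracy, so exposing all three would let users trade VRAM for stability on a per-model basis.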


2 participants