
Why ollama and Q4 quantization? #7

Open
rmusser01 opened this issue Jun 22, 2024 · 0 comments

rmusser01 commented Jun 22, 2024

Hi, opening an issue since I can't PM you on Twitter.
First: I think the project is really cool.
Second: I just LARP as a red teamer and only recently started working on an LLM project (tldw), so by all means, I am far from knowledgeable.

That being said, why ollama instead of plain llama.cpp or llamafile? https://github.com/Mozilla-Ocho/llamafile
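
To make concrete what I mean by "plain llama.cpp", here's a minimal sketch using the llama-cpp-python bindings (the GGUF path and generation settings are just placeholders, not your actual setup):

```python
# Sketch: running a GGUF model directly with llama-cpp-python instead of ollama.
# Install with `pip install llama-cpp-python`; the model path below is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q8_0.gguf",  # hypothetical local file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me a one-line summary of CVE scoring."}]
)
print(out["choices"][0]["message"]["content"])
```

llamafile gets you roughly the same thing as a single self-contained executable, which is why I mentioned it as an option.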

Also, why a Q4 quant instead of a Q8 quant? Since you're using a 7B model, quantization has a much larger effect on its reasoning capability than it would on, say, a 70B.
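
Rough napkin math on why this seems avoidable at 7B (the bits-per-weight figures are approximate GGUF averages, and KV cache / runtime overhead are ignored):

```python
# Napkin math: approximate weight memory for common GGUF quant levels.
# Bits-per-weight values are rough averages (~4.5 bpw for Q4-class quants,
# ~8.5 bpw for Q8_0); KV cache and runtime overhead are not counted.
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for params in (7, 70):
    for name, bpw in (("Q4", 4.5), ("Q8_0", 8.5)):
        print(f"{params}B @ {name}: ~{weight_gb(params, bpw):.1f} GB")

# 7B  @ Q4: ~3.9 GB    7B  @ Q8_0: ~7.4 GB
# 70B @ Q4: ~39.4 GB   70B @ Q8_0: ~74.4 GB
```

A Q8 7B still fits comfortably on most consumer hardware, so the extra quality hit from Q4 doesn't seem necessary the way it does at 70B, where Q4 is often what makes local inference possible at all.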

Paper discussing the impact of quantization on Llama3: https://arxiv.org/abs/2404.14047
Granted, you're using Mistral, but if I remember your blog post correctly (don't have the URL handy), you also fine-tuned Mistral at Q4 quantization, and (again, talking out of my ass) I would assume it suffers similar issues at that quantization level as Llama3 does.

Pointing this out in case it wasn't known at the time; if it was, maybe you can point me to some newer info.
Thanks again for the project, I think it's pretty neat.
