
Why ollama and Q4 quantization? #7

Open
rmusser01 opened this issue Jun 22, 2024 · 0 comments

rmusser01 commented Jun 22, 2024

Hi, opening an issue since I can't PM you on Twitter.
First: I think the project is really cool.
Second: I just LARP as a red teamer and only recently started working on an LLM project (tldw), so by all means, I am far from knowledgeable.

That being said, why ollama instead of plain llama.cpp or llamafile? https://github.com/Mozilla-Ocho/llamafile
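
To make concrete what I mean by "plain llama.cpp", here's a minimal sketch using the llama-cpp-python bindings (the GGUF path and generation settings are just placeholders, not your actual setup):

```python
# Sketch: running a GGUF model directly with llama-cpp-python instead of ollama.
# Install with `pip install llama-cpp-python`; the model path below is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q8_0.gguf",  # hypothetical local file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me a one-line summary of CVE scoring."}]
)
print(out["choices"][0]["message"]["content"])
```

llamafile gets you roughly the same thing as a single self-contained executable, which is why I mentioned it as an option.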

Also, why a Q4 quant instead of a Q8 quant? Since you're using a 7B model, quantization has a much larger effect on its reasoning capability than it would on, say, a 70B.
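
Rough napkin math on why this seems avoidable at 7B (the bits-per-weight figures are approximate GGUF averages, and KV cache / runtime overhead are ignored):

```python
# Napkin math: approximate weight memory for common GGUF quant levels.
# Bits-per-weight values are rough averages (~4.5 bpw for Q4-class quants,
# ~8.5 bpw for Q8_0); KV cache and runtime overhead are not counted.
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for params in (7, 70):
    for name, bpw in (("Q4", 4.5), ("Q8_0", 8.5)):
        print(f"{params}B @ {name}: ~{weight_gb(params, bpw):.1f} GB")

# 7B  @ Q4: ~3.9 GB    7B  @ Q8_0: ~7.4 GB
# 70B @ Q4: ~39.4 GB   70B @ Q8_0: ~74.4 GB
```

A Q8 7B still fits comfortably on most consumer hardware, so the extra quality hit from Q4 doesn't seem necessary the way it does at 70B, where Q4 is often what makes local inference possible at all.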

Paper discussing the impact of quantization on Llama3: https://arxiv.org/abs/2404.14047
Granted, you're using Mistral, but if I remember your blog post correctly (don't have the URL handy), you also fine-tuned Mistral at Q4 quantization, and (again, talking out of my ass) I would assume it suffers similar issues at that quantization level as Llama3 does.

Pointing this out in case it wasn't known at the time; if it was, maybe you can point me to some newer info.
Thanks again for the project, I think it's pretty neat.
