Prerequisites - Linux
1. Download the release files for your OS from https://github.com/ggerganov/llama.cpp/releases (or build from source).
2. Download the LLM model and run the llama.cpp server (combined in one command):
2.1. No GPU - run the following command from a shell:
llama-server --hf-repo ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF --hf-file qwen2.5-coder-7b-q8_0.gguf --port 8012 -c 2048 -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
2.2. With Nvidia GPUs and the latest CUDA installed
- If you have more than 16 GB of VRAM, run the following command from a shell:
llama-server --hf-repo ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF --hf-file qwen2.5-coder-7b-q8_0.gguf --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
- If you have less than 16 GB of VRAM, run the following command from a shell:
llama-server --hf-repo ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF --hf-file qwen2.5-coder-1.5b-q8_0.gguf --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
If the model file is not available locally (the first time), it will be downloaded (this can take some time) and after that the llama.cpp server will start.
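To verify that the server is up, you can query its health endpoint (a quick check assuming the default port 8012 used in the commands above; it reports an "ok" status once the model has finished loading):
curl http://localhost:8012/health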
Now you can start using the llama-vscode extension.
Enjoy!
Prerequisites - Homebrew
1. Install llama.cpp with the following command:
brew install llama.cpp
2. Download the LLM model and run the llama.cpp server (combined in one command):
- If you have more than 16 GB of VRAM:
llama-server --hf-repo ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF --hf-file qwen2.5-coder-7b-q8_0.gguf --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
- If you have less than 16 GB of VRAM:
llama-server --hf-repo ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF --hf-file qwen2.5-coder-1.5b-q8_0.gguf --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
If the model file is not available locally (the first time), it will be downloaded (this can take some time) and after that the llama.cpp server will start.
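To check that completions work end to end, you can send a test request to the server's /infill (fill-in-the-middle) endpoint, which is used for code completion. This is only a rough smoke test assuming the default port 8012; field names may differ slightly between llama.cpp versions:
curl -X POST http://localhost:8012/infill -H "Content-Type: application/json" -d '{"input_prefix": "def add(a, b):\n    return ", "input_suffix": "\n", "n_predict": 16}'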
Now you can start using the llama-vscode extension.
Enjoy!
Prerequisites - Windows
1. Download the file qwen2.5-coder-1.5b-q8_0.gguf from https://huggingface.co/ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF/blob/main/qwen2.5-coder-1.5b-q8_0.gguf (a direct-download command is shown after this list).
2. Download the release files for Windows from https://github.com/ggerganov/llama.cpp/releases and extract them.
3. Run the llama.cpp server
3.1. No GPU
Put the model file qwen2.5-coder-1.5b-q8_0.gguf in the folder with the extracted files and start the llama.cpp server from a command window:
llama-server.exe -m qwen2.5-coder-1.5b-q8_0.gguf --port 8012 -c 2048 -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
3.2. With Nvidia GPUs and the latest CUDA installed
Put the model file qwen2.5-coder-1.5b-q8_0.gguf in the folder with the extracted files and start the llama.cpp server from a command window:
llama-server.exe -m qwen2.5-coder-1.5b-q8_0.gguf --port 8012 -c 2048 --n-gpu-layers 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
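For step 1, the model file can also be downloaded from the command line. This is a minimal sketch using curl (included with recent versions of Windows); the URL assumes Hugging Face's usual resolve/main download path for the file linked above:
curl -L -o qwen2.5-coder-1.5b-q8_0.gguf https://huggingface.co/ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF/resolve/main/qwen2.5-coder-1.5b-q8_0.gguf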
Now you can start using the llama-vscode extension.
Enjoy!
For all operating systems: if you have better hardware (GPUs), you can use bigger models from https://huggingface.co/ggml-org, such as qwen2.5-coder-3b-q8_0.gguf, qwen2.5-coder-7b-q8_0.gguf, or qwen2.5-coder-14b-q8_0.gguf (see the example command below). Any FIM-compatible model supported by llama.cpp can be used.
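For example, the 14B model could be started with the same command pattern as above. This is a sketch assuming the repository ggml-org/Qwen2.5-Coder-14B-Q8_0-GGUF follows the same naming scheme as the smaller models; adjust the repo and file names if they differ:
llama-server --hf-repo ggml-org/Qwen2.5-Coder-14B-Q8_0-GGUF --hf-file qwen2.5-coder-14b-q8_0.gguf --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256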
More details about the llama.cpp server can be found in the llama.cpp repository.