Prerequisites - Linux
1. Download the release files for your OS from https://github.com/ggerganov/llama.cpp/releases (or build from source).
2. Download the LLM model and run the llama.cpp server (combined in one command):
2.1. No GPU - run the following command from a shell:
llama-server --hf-repo ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF --hf-file qwen2.5-coder-7b-q8_0.gguf --port 8012 -c 2048 -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
2.2. With Nvidia GPUs and the latest CUDA installed
- If you have more than 16 GB of VRAM, run the following command from a shell:
llama-server --hf-repo ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF --hf-file qwen2.5-coder-7b-q8_0.gguf --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
- If you have less than 16 GB of VRAM, run the following command from a shell:
llama-server --hf-repo ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF --hf-file qwen2.5-coder-1.5b-q8_0.gguf --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
If the model file is not available locally (the first time), it will be downloaded (this can take some time) and after that the llama.cpp server will start.
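To verify that the server is up, you can query its health endpoint (a quick check assuming the default port 8012 used in the commands above; it reports an "ok" status once the model has finished loading):
curl http://localhost:8012/health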
Now you can start using the llama-vscode extension.
Enjoy!
Prerequisites - Homebrew
1. Install llama.cpp with the following command:
brew install llama.cpp
2. Download the LLM model and run the llama.cpp server (combined in one command):
- If you have more than 16 GB of VRAM:
llama-server --hf-repo ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF --hf-file qwen2.5-coder-7b-q8_0.gguf --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
- If you have less than 16 GB of VRAM:
llama-server --hf-repo ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF --hf-file qwen2.5-coder-1.5b-q8_0.gguf --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
If the model file is not available locally (the first time), it will be downloaded (this can take some time) and after that the llama.cpp server will start.
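To check that completions work end to end, you can send a test request to the server's /infill (fill-in-the-middle) endpoint, which is used for code completion. This is only a rough smoke test assuming the default port 8012; field names may differ slightly between llama.cpp versions:
curl -X POST http://localhost:8012/infill -H "Content-Type: application/json" -d '{"input_prefix": "def add(a, b):\n    return ", "input_suffix": "\n", "n_predict": 16}'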
Now you can start using the llama-vscode extension.
Enjoy!
Prerequisites - Windows
1. Download the file qwen2.5-coder-1.5b-q8_0.gguf from https://huggingface.co/ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF/blob/main/qwen2.5-coder-1.5b-q8_0.gguf (a direct-download command is shown after this list).
2. Download the release files for Windows from https://github.com/ggerganov/llama.cpp/releases and extract them.
3. Run the llama.cpp server
3.1. No GPU
Put the model file qwen2.5-coder-1.5b-q8_0.gguf in the folder with the extracted files and start the llama.cpp server from a command window:
llama-server.exe -m qwen2.5-coder-1.5b-q8_0.gguf --port 8012 -c 2048 -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
3.2. With Nvidia GPUs and the latest CUDA installed
Put the model file qwen2.5-coder-1.5b-q8_0.gguf in the folder with the extracted files and start the llama.cpp server from a command window:
llama-server.exe -m qwen2.5-coder-1.5b-q8_0.gguf --port 8012 -c 2048 --n-gpu-layers 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
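For step 1, the model file can also be downloaded from the command line. This is a minimal sketch using curl (included with recent versions of Windows); the URL assumes Hugging Face's usual resolve/main download path for the file linked above:
curl -L -o qwen2.5-coder-1.5b-q8_0.gguf https://huggingface.co/ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF/resolve/main/qwen2.5-coder-1.5b-q8_0.gguf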
Now you can start using the llama-vscode extension.
Enjoy!
For all operating systems: if you have better hardware (GPUs), you can use bigger models from https://huggingface.co/ggml-org, such as qwen2.5-coder-3b-q8_0.gguf, qwen2.5-coder-7b-q8_0.gguf, or qwen2.5-coder-14b-q8_0.gguf (see the example command below). Any FIM-compatible model supported by llama.cpp can be used.
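For example, the 14B model could be started with the same command pattern as above. This is a sketch assuming the repository ggml-org/Qwen2.5-Coder-14B-Q8_0-GGUF follows the same naming scheme as the smaller models; adjust the repo and file names if they differ:
llama-server --hf-repo ggml-org/Qwen2.5-Coder-14B-Q8_0-GGUF --hf-file qwen2.5-coder-14b-q8_0.gguf --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256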
More details about the llama.cpp server can be found in the llama.cpp repository.