All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- [Server] Remove deprecated self-extending context window from llama-server example
- [Server] Remove --log-disable from llama-server example
- [Server] Add -help option
- [Server] Add -chatTemplate option
- [Server] Add human-readable file size output
- [Benchmark] Add llama-bench example
- [Build] Update torch to 2.2.1+cu121
- [Build] Update OpenBLAS to 0.3.27
- [Build] Update Python to 3.12
- [Server] Default KV cache type to f16
- [Documentation] Use gemma-2-9b-it-IQ4_XS.gguf model across all examples
- [Build] Fix CUDA build after renaming in upstream llama.cpp
- [Build] Fix gguf_dump.py after renaming in upstream llama.cpp
- [Build] Add missing tiktoken package to support GLM models
- [Build] Fix wikitext URI
- [Server] Remove broken Chrome startup
- [Build] Simplify the python dependency installation
- [Build] Downgrade the "torch" package to 2.1.2+cu121
- [Build] Add build targets option
- [Server] Change binary "server" to "llama-server" to match renaming in llama.cpp project
- [Tools] Change binary "tokenize" to "llama-tokenize" to match renaming in llama.cpp project
- [Documentation] Update examples to match the state of the llama.cpp project
- [Server] Limit KV cache data types to f32, f16, q8_0 and q4_0
- [Build] Rename cuBLAS to CUDA
- [Server] Add kvCacheDataType option
- [Server] Automatically enable q4_0 quantized KV cache with Flash Attention
- [Server] Automatically enable Flash Attention on GPUs with Pascal architecture or newer
- [Build] Enable parallel building with CMake utilizing all CPU threads
- [Server] Add verbose option
- [Server] Output the exact invocation command of the llama.cpp server
- [Tools] Add count_tokens.ps1 script
- [Server] Add n-predict option
- [Build] Update "torch" package to 2.4.0.dev20240516+cu121
- [Server] Enable flash attention
- [Build] Fix installation of latest python packages
- [Server] Remove the --cont-batching option, which is now enabled by default
- [Server] Add -port option
- [Build] Add list of installed python packages
- [Build] Update "torch" package to 2.3.0.dev20240311+cu121
- [Server] Add fallback for empty head_count_kv values
- [Server] Add fallback if model details could not be read by gguf-dump.py
- [Server] Add filename of the model path as an alias
- [Server] Add support for self-extending the context window (SelfExtend)
- [Server] Add automatic calculation of numberOfGPULayers option
- [Server] Add formatted output of computed memory details
- [Server] Fix numberOfGPULayers option override
- [Server] Add contextSize option
- [Server] Add numberOfGPULayers option
- [Server] Add parallel option
- [Server] Add support for executing the server example script from any directory
- [Server] Add listing of available models if the model path is missing
- [Server] Add KV cache placeholder
- [Server] Add polling for the server before starting the browser
- [Server] Add a maximum of 10 parallel job executions
- [Build] Add automatic NVIDIA GPU detection in the build context
- [Server] Replace all server examples with one generic server.ps1 script
- [Build] Update OpenBLAS to v0.3.26
- [Build] Fix python requirements installation
- [Build] Add Falcon 180B convert script
- [Build] Add additional convert requirements for Falcon models
- [Server] Add example for Falcon 40B model
- [Server] Add example for FashionGPT 70B model
- [Server] Add example for Llama 2 7B model
- [Server] Add example for Llama 2 13B model
- [Server] Add example for Upstage Llama 2 70B model
- [Server] Add example for Phind CodeLlama 34B model
- [Server] Add example for Phind CodeLlama 34B model with 16k context
- [Server] Add example for Phind CodeLlama 34B model with 32k context
- [Server] Add example for WizardCoder 15B model
- [Server] Add example for Mistral 7B model
- [Prompt] Add prompt to chat with Llama 2
- [Prompt] Add German language prompt
- [Grammar] Add JSON grammar with floating point numbers support
- [Documentation] Add RoPE parameter to documentation
- [Documentation] Add JSON response to documentation
- [Documentation] Add version parameter to documentation
- [Documentation] Add prompt cache to documentation
- [Documentation] Add instructions for enabling Hardware Accelerated GPU Scheduling to documentation
- [Build] Fix python requirements installation
- [Build] Add optional version parameter
- [Build] Add console output and execution duration
- [Build] Default llama.cpp version to latest release tag
- [Build] Add server example to the build
- [Build] Add documentation on how to use the web interface
- [Build] Fix automatic update of the submodules
- [Build] Add dataset "wikitext-2-raw-v1"
- [Build] Add documentation on how to measure model perplexity
- [Build] Add OpenBLAS workaround for Windows
- [Build] Add rebuild script
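
The following is a minimal, hypothetical usage sketch of the generic server.ps1 script, combining several of the options mentioned in the entries above (-chatTemplate, -port, -kvCacheDataType, -contextSize, -numberOfGPULayers, -parallel, -verbose). The -model parameter name, the model path, and all values are illustrative assumptions and may not match the actual script; run the script with -help to see the real options.

```powershell
# Hypothetical invocation of the generic server.ps1 script.
# Option names follow the changelog entries above; the -model parameter,
# the model path, and all values are illustrative assumptions only.
.\server.ps1 `
    -model ".\models\gemma-2-9b-it-IQ4_XS.gguf" `
    -chatTemplate "gemma" `
    -contextSize 8192 `
    -numberOfGPULayers 42 `
    -parallel 2 `
    -port 8080 `
    -kvCacheDataType "q8_0" `
    -verbose
```

According to the entries above, omitting -numberOfGPULayers lets the script calculate the value automatically, and omitting the model path lists the available models.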