Releases: turboderp/exllamav2
0.2.4
- Support Pixtral
- Refactoring for more multimodal support
- Faster filter evaluation
- Various optimizations and bugfixes
- Various quality of life improvements
Full Changelog: v0.2.3...v0.2.4
0.2.3
- No longer uses the safetensors library for loading weights (fixes virtual memory issues, especially on Windows)
- Disabled the now-redundant fasttensors option
- Prioritize the HF Tokenizers model when both HF and SPM models are available
- Add XTC sampler
- Add YaRN support
- Various fixes and QoL improvements
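XTC ("Exclude Top Choices") removes the most probable tokens whenever more than one clears a probability threshold, keeping only the weakest qualifying choice so the model is nudged away from its most predictable continuations. A minimal sketch of the idea in plain Python; the parameter names (`threshold`, `xtc_probability`) are illustrative and not exllamav2's actual API:

```python
import random

def xtc_filter(probs, threshold=0.1, xtc_probability=1.0, rng=random.random):
    """With chance xtc_probability, zero out every token whose probability
    is >= threshold EXCEPT the weakest such token, then renormalize."""
    if rng() >= xtc_probability:
        return probs[:]                      # sampler not triggered this step
    above = [i for i, p in enumerate(probs) if p >= threshold]
    if len(above) < 2:
        return probs[:]                      # need two candidates to exclude any
    weakest = min(above, key=lambda i: probs[i])
    filtered = [0.0 if (i in above and i != weakest) else p
                for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]
```

With `[0.5, 0.3, 0.15, 0.05]` and a 0.1 threshold, the two strongest tokens are dropped and the remainder renormalizes to roughly `[0, 0, 0.75, 0.25]`.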
Full Changelog: v0.2.2...v0.2.3
0.2.2
- Small fixes related to LMFE
- Allow SDPA during normal inference with a custom bias
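SDPA here is scaled dot-product attention (as in `torch.nn.functional.scaled_dot_product_attention`, whose `attn_mask` argument is an additive bias on the pre-softmax scores). A pure-NumPy reference sketch of the math, not the fused kernel itself:

```python
import numpy as np

def sdpa(q, k, v, bias=None):
    """Scaled dot-product attention with an optional additive bias
    (the role attn_mask plays in torch's scaled_dot_product_attention)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if bias is not None:
        scores = scores + bias               # e.g. -inf above the diagonal
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# A causal bias: disallow attending to future positions.
causal = np.triu(np.full((3, 3), -np.inf), k=1)
```

With the causal bias, the first query position can only attend to the first key, so its output equals `v[0]` exactly.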
Full Changelog: v0.2.1...v0.2.2
0.2.1
- TP: fallback SDPA mode when flash-attn is unavailable
- Faster filter/grammar path
- Add DRY
- Fix issues introduced in 0.1.9 (streams/graphs) when loading certain models via Tabby
- Banish Râul
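DRY ("Don't Repeat Yourself") penalizes any token that would extend a sequence already seen earlier in the context, with a penalty that grows exponentially in the length of the repetition. A simplified sketch of the scheme, not exllamav2's implementation; the defaults (`multiplier`, `base`, `allowed_length`) are illustrative:

```python
def dry_penalty(context, token, multiplier=0.8, base=1.75, allowed_length=2):
    """Find the longest context suffix whose earlier occurrence was followed
    by `token`; once that match length reaches allowed_length, the penalty
    grows as multiplier * base ** (length - allowed_length)."""
    best = 0
    for i in range(len(context) - 1):        # earlier occurrences of `token`
        if context[i] != token:
            continue
        n = 0                                # suffix tokens matching before i
        while n < i and context[i - 1 - n] == context[-1 - n]:
            n += 1
        best = max(best, n)
    if best < allowed_length:
        return 0.0
    return multiplier * base ** (best - allowed_length)
```

For a context `A B C A B`, the candidate `C` would complete a second `A B C`, a match of length 2, and receives a nonzero penalty, while an unseen token is unaffected.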
Full Changelog: v0.2.0...v0.2.1
0.2.0
Small release to fix various issues in 0.1.9
Full Changelog: v0.1.9...v0.2.0
0.1.9
- Add experimental tensor-parallel mode; currently supports Llama (1, 2 and 3), Qwen2 and Mistral models
- CUDA Graphs to reduce overhead and CPU bottlenecking
- Various other optimizations
- Some bugfixes
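In tensor-parallel mode each GPU holds a slice of every weight matrix, computes its share of each matmul independently, and the partial results are then gathered. A toy NumPy illustration of column-parallel splitting across two hypothetical devices (the real implementation shards across CUDA devices, not array slices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))              # activations, replicated on each device
w = rng.standard_normal((8, 16))             # full weight matrix

# Column-parallel: each "GPU" holds a slice of the output columns.
shards = np.split(w, 2, axis=1)              # hypothetical device 0 and device 1
partials = [x @ shard for shard in shards]   # computed independently per device
gathered = np.concatenate(partials, axis=1)  # the all-gather step

assert np.allclose(gathered, x @ w)          # identical to the unsplit matmul
```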
Full Changelog: v0.1.8...v0.1.9
0.1.8
- Support Llama 3.1 (correct RoPE scaling etc.)
- Support IndexTeam architecture
- Some bugfixes and QoL improvements
Full Changelog: v0.1.7...v0.1.8
0.1.7
- Support Gemma2
- Support InternLM2
- Various bugfixes and optimizations
Full Changelog: v0.1.6...v0.1.7
0.1.6
- Fix dynamic generator fallback mode (was broken for prompts longer than max_input_len)
- Fix inference on ROCm wave64 devices
- Made the model conversion script part of the exllamav2 package
- CPU optimizations
Full Changelog: v0.1.5...v0.1.6
0.1.5
- Added Q6 and Q8 cache modes
- Defragment cache in dynamic generator
- Use SDPA with Torch 2.3.0+
- Updated wheels to Torch 2.3.1
- Added Python 3.12 wheels, plus Python 3.9 for ROCm
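Quantized cache modes store the K/V cache at reduced precision to cut VRAM use. A rough sketch of the Q8 idea, per-row symmetric 8-bit quantization with one float scale per row; this is illustrative only, not the actual CUDA kernel or exllamav2's exact scheme:

```python
import numpy as np

def quantize_q8(x):
    """Per-row symmetric 8-bit quantization: int8 values plus one
    float32 scale per row."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)        # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_q8(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
kv = rng.standard_normal((8, 64)).astype(np.float32)  # stand-in for a K/V page
q, s = quantize_q8(kv)
max_err = np.abs(dequantize_q8(q, s) - kv).max()      # bounded by scale / 2
```

Each row's round-trip error is bounded by half its scale, which is why 8-bit (and with a little more machinery, 6-bit) caches lose very little quality in practice.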
Full Changelog: v0.1.4...v0.1.5