Releases: turboderp/exllamav2

0.2.4

12 Nov 03:21
  • Support Pixtral
  • Refactoring for more multimodal support
  • Faster filter evaluation
  • Various optimizations and bugfixes
  • Various quality of life improvements

Full Changelog: v0.2.3...v0.2.4
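
"Filter evaluation" here refers to constrained (grammar-guided) generation: before each token is sampled, a filter decides which vocabulary tokens would keep the output valid, and evaluating that mask over the whole vocabulary every step is the cost being optimized. A minimal pure-Python sketch of the idea (the names and the toy digits-only constraint are illustrative, not exllamav2's API):

```python
def allowed_tokens(vocab, is_valid_prefix, text_so_far):
    """Sketch of filter evaluation in constrained generation.

    A filter keeps only the vocabulary tokens whose text, appended to the
    output so far, remains a valid prefix of the target format. Returns
    the indices of allowed tokens; the sampler then masks out the rest.
    """
    return [i for i, tok in enumerate(vocab)
            if is_valid_prefix(text_so_far + tok)]

# Toy constraint: the output must consist of digits only.
vocab = ["1", "a", "23", "4x"]
mask = allowed_tokens(vocab, str.isdigit, "4")  # indices of "1" and "23"
```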

0.2.3

29 Sep 11:04
  • No longer use the safetensors library for loading weights (fixes virtual memory issues, especially on Windows)
  • Disable fasttensors option (now redundant)
  • Prioritize the HF Tokenizers model when both HF and SPM models are available
  • Add XTC sampler
  • Add YaRN support
  • Various fixes and QoL improvements

Full Changelog: v0.2.2...v0.2.3
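
The XTC ("exclude top choices") sampler trims the most predictable tokens: with some probability per step, every token whose probability meets a threshold is removed except the least likely of them, and the remaining distribution is renormalized. A self-contained sketch of that transform (illustrative only, not exllamav2's implementation):

```python
import random

def xtc_transform(probs, threshold=0.1, xtc_probability=1.0, rng=random):
    """Apply the XTC ("exclude top choices") transform to a probability list.

    With probability `xtc_probability`, every token whose probability meets
    `threshold` is zeroed out EXCEPT the least likely of them, and the
    distribution is renormalized. This removes the most predictable
    choices while always keeping at least one viable candidate.
    """
    if rng.random() >= xtc_probability:
        return probs  # transform not applied this step
    above = [i for i, p in enumerate(probs) if p >= threshold]
    if len(above) < 2:
        return probs  # need at least two qualifying tokens to exclude any
    keep = min(above, key=lambda i: probs[i])  # least likely survivor
    trimmed = [0.0 if (i in above and i != keep) else p
               for i, p in enumerate(probs)]
    total = sum(trimmed)
    return [p / total for p in trimmed]
```

With `[0.6, 0.25, 0.1, 0.05]` and a threshold of 0.2, the 0.6 token is excluded and the 0.25 token becomes the most likely remaining choice after renormalization.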

0.2.2

14 Sep 19:20
  • Small fixes related to LMFE
  • Allow SDPA during normal inference with a custom bias

Full Changelog: v0.2.1...v0.2.2
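
SDPA is scaled dot-product attention, softmax(QKᵀ/√d)·V, and the "custom bias" is an additive term on the attention scores before the softmax (how, e.g., a causal or ALiBi mask enters the computation). A plain-Python single-head sketch of what that computes (illustrative; the library dispatches to fused kernels instead):

```python
import math

def sdpa(q, k, v, bias=None):
    """Scaled dot-product attention for one head, on plain Python lists.

    q: [n_q][d], k: [n_k][d], v: [n_k][d_v].
    bias (optional): [n_q][n_k], added to the scores before softmax --
    this is where a custom additive mask (causal, ALiBi, ...) plugs in.
    """
    d = len(q[0])
    out = []
    for qi, row_q in enumerate(q):
        # Scaled dot products against every key.
        scores = [sum(a * b for a, b in zip(row_q, row_k)) / math.sqrt(d)
                  for row_k in k]
        if bias is not None:
            scores = [s + bias[qi][ki] for ki, s in enumerate(scores)]
        # Numerically stable softmax.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of value rows.
        out.append([sum(w * row_v[j] for w, row_v in zip(weights, v))
                    for j in range(len(v[0]))])
    return out
```

A large negative bias on a position (e.g. `-1e9`) effectively masks it out, which is exactly how a causal mask is expressed as an additive bias.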

0.2.1

08 Sep 17:26
  • TP: fall back to SDPA mode when flash-attn is unavailable
  • Faster filter/grammar path
  • Add the DRY sampler
  • Fix issues since 0.1.9 (streams/graphs) when loading certain models via Tabby
  • Banish Râul

Full Changelog: v0.2.0...v0.2.1
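
DRY ("don't repeat yourself") is a repetition penalty: a candidate token is penalized when appending it would extend a sequence that already occurs earlier in the context, with the penalty growing exponentially in the length of the repeated match. A conceptual sketch of the scoring rule, with illustrative parameter defaults (not exllamav2's actual code):

```python
def dry_penalty(context, candidate, multiplier=0.8, base=1.75, allowed_length=2):
    """Sketch of a DRY repetition penalty for one candidate token.

    Finds the longest suffix of `context` that, followed by `candidate`,
    already occurs earlier in `context`. If that match exceeds
    `allowed_length`, returns multiplier * base**(match_len - allowed_length)
    to subtract from the candidate's logit; otherwise returns 0.0.
    """
    best = 0
    # For each earlier occurrence of `candidate`, count how many tokens
    # immediately before it match the tail of the context.
    for i, tok in enumerate(context):
        if tok != candidate:
            continue
        n = 0
        while n < i and n < len(context) and context[i - 1 - n] == context[-1 - n]:
            n += 1
        best = max(best, n)
    if best <= allowed_length:
        return 0.0
    return multiplier * base ** (best - allowed_length)
```

For the context `[1, 2, 3, 4, 1, 2, 3]`, the candidate `4` would repeat the earlier `1 2 3 4` run (match length 3), so it gets a nonzero penalty, while an unseen token costs nothing.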

0.2.0

28 Aug 21:00

Small release to fix various issues in 0.1.9

Full Changelog: v0.1.9...v0.2.0

0.1.9

22 Aug 11:54
  • Add experimental tensor-parallel mode; currently supports Llama (1, 2 and 3), Qwen2 and Mistral models
  • CUDA Graphs to reduce overhead and CPU bottlenecking
  • Various other optimizations
  • Some bugfixes

Full Changelog: v0.1.8...v0.1.9
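
Tensor parallelism splits each weight matrix across devices so that every GPU computes part of the same layer, rather than each GPU holding different layers. The core idea, reduced to a plain-Python matrix-vector product sharded by output rows (illustrative only; real TP runs shards on separate GPUs and adds inter-device gathers/reductions between layers):

```python
def matvec(w, x):
    """w: [out][in] weight matrix times vector x, as nested lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def column_parallel_matvec(w, x, n_shards):
    """Sketch of tensor parallelism: split the weight matrix by output
    rows across `n_shards` "devices", compute each shard independently,
    then concatenate (the gather step) to recover the full output.
    """
    shard_size = (len(w) + n_shards - 1) // n_shards
    shards = [w[i:i + shard_size] for i in range(0, len(w), shard_size)]
    partials = [matvec(shard, x) for shard in shards]  # in parallel on real hardware
    return [y for part in partials for y in part]      # gather: concatenate outputs
```

The sharded computation is exactly equal to the unsharded one; the win on real hardware is that each device holds and computes only its slice of the weights.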

0.1.8

24 Jul 06:36
  • Support Llama 3.1 (correct RoPE scaling etc.)
  • Support IndexTeam architecture
  • Some bugfixes and QoL improvements

Full Changelog: v0.1.7...v0.1.8

0.1.7

11 Jul 13:20
  • Support Gemma2
  • Support InternLM2
  • Various bugfixes and optimizations

Full Changelog: v0.1.6...v0.1.7

0.1.6

24 Jun 00:36
  • Fix dynamic generator fallback mode (was broken for prompts longer than max_input_len)
  • Fix inference on ROCm wave64 devices
  • Made the model conversion script part of the exllamav2 package
  • CPU optimizations

Full Changelog: v0.1.5...v0.1.6

0.1.5

09 Jun 00:19
  • Added Q6 and Q8 cache modes
  • Defragment cache in dynamic generator
  • Use SDPA with Torch 2.3.0+
  • Updated wheels to Torch 2.3.1
  • Added Python 3.12 wheels, plus Python 3.9 for ROCm

Full Changelog: v0.1.4...v0.1.5
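
Q6 and Q8 cache modes store the KV cache at 6 or 8 bits per value instead of FP16, trading a little fidelity for substantially less VRAM. The idea behind the 8-bit case is symmetric quantization: one float scale per block plus a small integer per value. A toy sketch (illustrative, not the library's actual packing format):

```python
def quantize_q8(values):
    """Sketch of symmetric 8-bit quantization, as used conceptually for
    quantized KV caches: store one float scale per block plus an int8
    per value, roughly halving memory versus FP16. (Q6 packs 6 bits per
    value for a further saving at slightly lower fidelity.)
    """
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid div-by-zero on all-zero blocks
    q = [round(v / scale) for v in values]            # int8 codes in [-127, 127]
    return scale, q

def dequantize_q8(scale, q):
    """Recover approximate float values from the scale and int8 codes."""
    return [scale * qi for qi in q]
```

Round-tripping a block reconstructs each value to within half a quantization step (`scale / 2`), which is why the quality loss from an 8-bit cache is usually small.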