
Releases: oobabooga/text-generation-webui

v1.15

01 Oct 17:48
3b06cb4

Backend updates

  • Transformers: bump to 4.45.
  • ExLlamaV2: bump to 0.2.3.
  • flash-attention: bump to 2.6.3.
  • llama-cpp-python: bump to 0.3.1.
  • bitsandbytes: bump to 0.44.
  • PyTorch: bump to 2.4.1.
  • ROCm: bump wheels to 6.1.2.
  • Remove AutoAWQ, AutoGPTQ, HQQ, and AQLM from requirements.txt:
    • AutoAWQ and AutoGPTQ were removed due to lack of support for PyTorch 2.4.1 and CUDA 12.1.
    • HQQ and AQLM were removed to make the project leaner since they're experimental with limited use.
    • You can still install those libraries manually if you are interested.

Changes

  • Exclude Top Choices (XTC): A sampler that boosts creativity, breaks writing clichés, and inhibits non-verbatim repetition (#6335). Thanks @p-e-w.
  • Make it possible to sort repetition penalties with "Sampler priority" (see the sketch after this list). The new keywords are:
    • repetition_penalty
    • presence_penalty
    • frequency_penalty
    • dry
    • encoder_repetition_penalty
    • no_repeat_ngram
    • xtc (not a repetition penalty but also added in this update)
  • Don't import PEFT unless necessary. This makes the web UI launch faster.
  • Add beforeunload event to add confirmation dialog when leaving page (#6279). Thanks @leszekhanusz.
  • Update API documentation with examples to list/load models (#5902). Thanks @joachimchauvet.
  • Training PRO: update script.py (#6359). Thanks @FartyPants.
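
The sketch below shows one way to exercise the new XTC sampler and the "Sampler priority" order from a script. It is a hedged example, not official documentation: it assumes the web UI is running with --api on the default port, and that the OpenAI-compatible /v1/completions endpoint passes extra keys such as xtc_probability, xtc_threshold, and sampler_priority through as generation parameters (check the project's API docs for the exact names).

```python
# Hedged sketch: parameter names and their pass-through behaviour are
# assumptions based on this release's notes, not guaranteed API contracts.
import requests

API_URL = "http://127.0.0.1:5000/v1/completions"  # default --api address

payload = {
    "prompt": "Write a short story about a lighthouse keeper.",
    "max_tokens": 200,
    # XTC sampler (assumed parameter names):
    "xtc_probability": 0.5,
    "xtc_threshold": 0.1,
    # Custom ordering for the penalties, using the new keywords:
    "sampler_priority": [
        "repetition_penalty",
        "presence_penalty",
        "frequency_penalty",
        "dry",
        "encoder_repetition_penalty",
        "no_repeat_ngram",
        "xtc",
    ],
}

response = requests.post(API_URL, json=payload, timeout=120)
print(response.json()["choices"][0]["text"])
```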

Bug fixes

  • Fix UnicodeDecodeError for BPE-based Models (especially GLM-4) (#6357). Thanks @GralchemOz.
  • API: Relax multimodal format, fixes HuggingFace Chat UI (#6353). Thanks @Papierkorb.
  • Force /bin/bash shell for conda (#6386). Thanks @Thireus.
  • Do not set value for histories in chat when --multi-user is used (#6317). Thanks @mashb1t.
  • Fix a typo in the OpenAI response format (#6365). Thanks @jsboige.

v1.14

20 Aug 04:29
073694b

Backend updates

  • llama-cpp-python: bump to 0.2.89.
  • Transformers: bump to 4.44.

Other changes

  • Model downloader: use a single session for all downloaded files to reduce the time to start each download.
  • Add a --tokenizer-dir flag to be used with llamacpp_HF.
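
For illustration, the single-session downloader change above boils down to a pattern like the following minimal sketch (assumed names, not the downloader's actual code): one requests.Session is created once and reused for every file, so the connection to the host is not re-established per download.

```python
import os
import requests

def download_all(urls: list[str], dest_dir: str) -> None:
    """Download every URL over a single reusable HTTP session."""
    session = requests.Session()  # one connection pool shared by all files
    os.makedirs(dest_dir, exist_ok=True)
    for url in urls:
        filename = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        with session.get(url, stream=True) as r:
            r.raise_for_status()
            with open(filename, "wb") as f:
                for chunk in r.iter_content(chunk_size=1024 * 1024):
                    f.write(chunk)
```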

v1.13

01 Aug 05:28
d011040

Backend updates

  • llama-cpp-python: bump to 0.2.85 (adds Llama 3.1 support).

UI updates

  • Make compress_pos_emb float (#6276). Thanks @hocjordan.
  • Make n_ctx, max_seq_len, and truncation_length numbers rather than sliders, to make it possible to type the context length manually.
  • Improve the style of headings in chat messages.
  • LaTeX rendering:
    • Add back single $ for inline equations.
    • Fix rendering for equations enclosed between \[ and \].
    • Fix rendering for multiline equations.
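
As a quick illustration (not from the release notes), a chat message written like this should now render the inline, display, and multiline forms:

```latex
Euler's identity $e^{i\pi} + 1 = 0$ renders inline, and

\[
\int_0^\infty e^{-x^2}\,dx = \frac{\sqrt{\pi}}{2}
\]

renders as a display equation, including multiline cases such as

\[
\begin{aligned}
(a+b)^2 &= a^2 + 2ab + b^2 \\
(a-b)^2 &= a^2 - 2ab + b^2
\end{aligned}
\]
```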

Bug fixes

  • Fix saving characters through the UI.
  • Fix instruct mode displaying "quotes" as ""double quotes"".
  • Fix chat sometimes not scrolling down after sending a message.
  • Fix the chat "stop" event.
  • Make --idle-timeout work for API requests.

Other changes

  • Model downloader: improve the progress bar by adding the filename, size, and download speed for each downloaded file.
  • Better handle the Llama 3.1 Jinja2 template by not including its optional "tools" headers.

v1.12

25 Jul 15:19
dd97a83

Backend updates

  • Transformers: bump to 4.43 (adds Llama 3.1 support).
  • ExLlamaV2: bump to 0.1.8 (adds Llama 3.1 support).
  • AutoAWQ: bump to 0.2.6 (adds Llama 3.1 support).

UI updates

  • Make text between quote characters colored in chat and chat-instruct modes.
  • Prevent LaTeX from being rendered for inline "$", as that caused problems for phrases like "apples cost $1, oranges cost $2".
  • Make the markdown cache infinite and clear it when switching to another chat. This cache exists because the markdown conversion is CPU-intensive; with an infinite cache, every message in a full 128k context stays cached, making the UI more responsive in long conversations.
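
The caching described above can be pictured roughly as the following minimal sketch (assumed function names, not the web UI's actual code): memoize the per-message markdown conversion without a size limit, and clear the cache whenever a different chat is loaded.

```python
import functools
import markdown  # pip install markdown

@functools.lru_cache(maxsize=None)  # "infinite" cache, bounded in practice by the chat length
def render_message(text: str) -> str:
    # The expensive markdown-to-HTML conversion runs only once per unique message.
    return markdown.markdown(text)

def load_chat(history: list[str]) -> list[str]:
    render_message.cache_clear()  # drop the previous chat's entries to avoid unbounded growth
    return [render_message(message) for message in history]
```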

Bug fixes

  • Fix a race condition that caused the default character to not be loaded correctly on startup.
  • Fix Linux shebangs (#6110). Thanks @LuNeder.

Other changes

  • Make the Google Colab notebook use the one-click installer instead of its own Python environment for better stability.
  • Disable flash-attention on Google Colab by default, as its GPU models do not support it.

v1.11

23 Jul 05:34
d1115f1

UI updates

  • Optimize the UI: events triggered by clicking on buttons, selecting values from dropdown menus, etc. have been refactored to minimize the number of connections made between the UI and the server. As a result, the UI is now significantly faster and more responsive.
  • Use chat-instruct mode by default: most models nowadays are instruction-following models, and this mode automatically uses the model's Jinja2 template to generate the prompt, leading to higher-quality outputs.
  • Improve the style of code blocks in light mode.
  • Increase the font weight of chat messages (for chat and chat-instruct modes).
  • Use gr.Number for RoPE scaling parameters (#6233). Thanks @Vhallo.
  • Don't export the instruction template to settings.yaml on "Save UI defaults to settings.yaml" (it gets ignored and replaced with the model template).

Backend updates

  • llama-cpp-python: bump to 0.2.83 (adds Mistral-Nemo support).

Other changes

  • training: Added ChatML-format.json format example (#5899). Thanks @FartyPants.
  • Customize the subpath for gradio, use with reverse proxy (#5106). Thanks @canoalberto.

Bug fixes

  • Fix an issue where the chat contents sometimes disappear for a split second during streaming (#6247). Thanks @Patronics.
  • Fix the chat UI losing its vertical scrolling position when the input area grows to more than 1 line.

v1.10.1

13 Jul 17:56
0315122

Library updates

  • FlashAttention: bump to v2.6.1. Now Gemma-2 works in ExLlamaV2 with FlashAttention without any quality loss.

Bug fixes

  • Fix for macOS users encountering model load errors with llama.cpp (#6227). Thanks @InvectorGator.

v1.10

11 Jul 23:43
d01c68f

Library updates

  • llama-cpp-python: bump to 0.2.82.
  • ExLlamaV2: bump to 0.1.7 (adds Gemma-2 support).

Changes

  • Add new --no_xformers and --no_sdpa flags for ExLlamaV2.
    • Note: to use Gemma-2 with ExLlamaV2, you currently must use the --no_flash_attn --no_xformers --no_sdpa flags, or check the corresponding checkboxes in the UI before loading the model, otherwise it will perform very badly.
  • Minor UI updates.

v1.9.1

05 Jul 10:38
e813b32

Bug fixes

  • UI: Fix some broken chat histories not showing in the "Past chats" menu.
  • Prevent llama.cpp from being monkey patched more than once, avoiding an infinite recursion error.

v1.9

05 Jul 03:24
3315d00

Backend updates

  • 4-bit and 8-bit kv cache options have been added to llama.cpp and llamacpp_HF. They reuse the existing --cache_8bit and --cache_4bit flags. Thanks @GodEmperor785 for figuring out what values to pass to llama-cpp-python.
  • Transformers:
    • Add eager attention option to make Gemma-2 work correctly (#6188). Thanks @GralchemOz.
    • Automatically detect bfloat16/float16 precision when loading models in 16-bit precision.
    • Automatically apply eager attention to models with Gemma2ForCausalLM architecture.
    • Gemma-2 support: Automatically detect and apply the optimal settings for this model with the two changes above. No need to set --bf16 --use_eager_attention manually.
  • Automatically obtain the EOT token from Jinja2 templates and add it to the stopping strings, fixing Llama-3-Instruct not stopping. No need to add <eot> to the custom stopping strings anymore.
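
The EOT detection can be pictured roughly as follows (a rough sketch of the idea under assumed names, not the project's actual logic): render the model's Jinja2 chat template around a sentinel assistant message and treat whatever the template appends after it as the end-of-turn stop string.

```python
from jinja2 import Template

def eot_from_chat_template(template_str: str) -> str:
    """Return the text a chat template emits right after an assistant message."""
    sentinel = "XSENTINELX"
    rendered = Template(template_str).render(
        messages=[
            {"role": "user", "content": "hi"},
            {"role": "assistant", "content": sentinel},
        ],
        add_generation_prompt=False,
    )
    # For Llama-3-Instruct-style templates this yields the end-of-turn token,
    # which can then be appended to the stopping strings.
    return rendered.split(sentinel, 1)[1].strip()
```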

UI updates

  • Whisper STT overhaul: this extension has been rewritten, replacing the Gradio microphone component with a custom microphone element that is much more reliable (#6194). Thanks @RandomInternetPreson, @TimStrauven, and @mamei16.
  • Make the character dropdown menu coexist in the "Chat" tab and the "Parameters > Character" tab, after some people pointed out that moving it entirely to the Chat tab makes it harder to edit characters.
  • Colors in the light theme have been improved, making it a bit more aesthetic.
  • Increase the chat area on mobile devices.

Bug fixes

  • Fix the API request to AUTOMATIC1111 in the sd-api-pictures extension.
  • Fix a glitch when switching tabs with "Show controls" unchecked in the chat tab and extensions loaded.

Library updates

  • llama-cpp-python: bump to 0.2.81 (adds Gemma-2 support).
  • Transformers: bump to 4.42 (adds Gemma-2 support).


v1.8

27 Jun 02:38
6915c50

Releases with version numbers are back! The last one was v1.7 on October 8th, 2023, so I am calling this one v1.8.

From this release on, it will be possible to install past releases by downloading the .zip source and running the start_ script in it. The installation script no longer updates to the latest version automatically. This doesn't apply to snapshots/releases before this one.

New backend

UI updates

  • Improved "past chats" menu: this menu is now a vertical list of text items instead of a dropdown menu, making it a lot easier to switch between past conversations. Only one click is required instead of two.
  • Store the chat history in the browser: if you restart the server and do not refresh the browser, your conversation will not be accidentally erased anymore.
  • Avoid some unnecessary calls to the backend, making the UI faster and more responsive.
  • Move the "Character" droprown menu to the main Chat tab, to make it faster to switch between different characters.
  • Change limits of RoPE scaling sliders in UI (#6142). Thanks @GodEmperor785.
  • Do not expose "alpha_value" for llama.cpp and "rope_freq_base" for transformers to keep things simple and avoid conversions.
  • Remove an obsolete info message intended for GPTQ-for-LLaMa.
  • Remove the "Tab" shortcut to switch between the generation tabs and the "Parameter" tabs, as it was awkward.
  • Improved streaming of lists, which would sometimes flicker and temporarily display horizontal lines.

Bug fixes

  • Revert the reentrant generation lock to a simple lock, fixing an issue caused by the change.
  • Fix GGUFs with no BOS token present, mainly qwen2 models (#6119). Thanks @Ph0rk0z.
  • Fix "500 error" issue caused by block_requests.py (#5976). Thanks @nero-dv.
  • Setting default alpha_value and fixing loading some newer DeepSeekCoder GGUFs (#6111). Thanks @mefich.

Library updates

  • llama-cpp-python: bump to 0.2.79 (after a month of wrestling with GitHub Actions).
  • ExLlamaV2: bump to 0.1.6.
  • flash-attention: bump to 2.5.9.post1.
  • PyTorch: bump to 2.2.2. That's the last 2.2 patch version.
  • HQQ: bump to 0.1.7.post3. Makes HQQ functional again.

Other updates

  • Do not "git pull" during installation, allowing previous releases (from this one on) to be installed.
  • Make logs more readable, no more \u7f16\u7801 (#6127). Thanks @Touch-Night.
