
VRAM management feature

Φφ edited this page Apr 17, 2023 · 5 revisions

The idea is simple: only a single model (LLM, SD, or TTS) needs to be loaded into VRAM at any given time.

If this feature is enabled, on a picture request, SD-api-pics would automatically:

  1. unload your language model from VRAM
  2. ask A1111 to reload last Stable Diffusion checkpoint
  3. prompt it for an image
  4. ask A1111 to free VRAM (unloading SD checkpoint)
  5. reload your LLM back to VRAM
  6. display results

(If VRAM management is disabled, steps 1, 2, 4 and 5 are omitted.)
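The steps above can be sketched as a small request handler. This is an illustrative sketch, not the extension's actual code: the `unload_llm`/`reload_llm` callbacks stand in for the extension's internal hooks into the text generation backend, the image parameters are made up, and the checkpoint load/unload endpoints are the ones exposed by recent A1111 builds.

```python
import json
import urllib.request

A1111_URL = "http://127.0.0.1:7860"  # assumed local A1111 instance


def api_post(path, payload=None):
    """POST JSON to the A1111 web API and return the decoded response."""
    req = urllib.request.Request(
        A1111_URL + path,
        data=json.dumps(payload or {}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def generate_picture(prompt, vram_management=True,
                     unload_llm=None, reload_llm=None, post=api_post):
    # unload_llm / reload_llm are hypothetical callbacks into the text
    # generation backend; the names are illustrative, not a real API.
    if vram_management:
        if unload_llm:
            unload_llm()                            # 1. free VRAM for SD
        post("/sdapi/v1/reload-checkpoint")         # 2. bring SD back into VRAM
    result = post("/sdapi/v1/txt2img",
                  {"prompt": prompt, "steps": 20})  # 3. render the image
    if vram_management:
        post("/sdapi/v1/unload-checkpoint")         # 4. free VRAM again
        if reload_llm:
            reload_llm()                            # 5. bring the LLM back
    return result.get("images", [])                 # 6. hand back the results
```

With `vram_management=False` only step 3 runs, matching the note above; `post` is injectable so the flow can be exercised without a running server.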

Beware that this option uses system resources rather heavily: to be fast, it needs at least TWICE AS MUCH free RAM as the VRAM it requires.

Here's what happens:

| STEP | Init | Textgen | Start Unload LLM | Done | Load SD | Picture gen | Unload SD | Done | Load LLM | Status quo |
|------|------|---------|------------------|------|---------|-------------|-----------|------|----------|------------|
| VRAM | –    | LLM     | LLM              | –    | –       | SD          | SD        | –    | –        | LLM        |
| RAM  | SD   | SD      | SD               | SD + LLM ¹ | SD + LLM ¹ | LLM | LLM | LLM + SD ¹ | LLM + SD ¹ | SD |

  • ¹ So the RAM usage peaks at (SD + LLM) size, which caps at 2× the total VRAM.
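The peak-RAM claim can be sanity-checked with quick arithmetic. The sizes below are illustrative assumptions, not measurements:

```python
# Illustrative model sizes in GB; actual figures depend on your checkpoints.
llm_gb = 4.0   # e.g. a 7B model quantized to 4 bits
sd_gb = 4.0    # e.g. an SD checkpoint in fp16

# During a swap both models sit in RAM at once (the "SD + LLM" columns above).
peak_ram_gb = llm_gb + sd_gb

# Each model must also fit in VRAM on its own, so the peak can never exceed
# twice the VRAM a single model is allowed to use.
print(peak_ram_gb)
```

So with these assumed sizes you would want roughly 8 GB of RAM free for caching, on top of whatever the rest of the system uses.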

If the RAM is insufficient for caching, the models have to be loaded from disk, which increases latency significantly.

Automatic1111's WebUI caches the last used SD checkpoint in RAM by default; and even if it did not, the OS itself caches recently read files in RAM, so with enough free RAM reloading is fast.

Unloading weights is almost instantaneous; reloading a model usually takes 5–15 seconds, depending on your system.


With this feature, I bet it's possible to run LLaMA-7B (4-bit) and SD at 320×320 on a single GPU with less than 6 GB of VRAM, but there are no guarantees, and it would surely be slow.

Additional testing is needed.
