
VRAM management feature

Φφ edited this page Apr 17, 2023 · 5 revisions

The idea is simple: only a single model (LLM, SD, or TTS) needs to be loaded into VRAM at any given time.

If this feature is enabled, on a picture request, SD-api-pics would automatically:

  1. unload your language model from VRAM
  2. ask A1111 to reload last Stable Diffusion checkpoint
  3. prompt it for an image
  4. ask A1111 to free VRAM (unloading SD checkpoint)
  5. reload your LLM back to VRAM
  6. display results

(If VRAM management is disabled, steps 1, 2, 4 and 5 are omitted.)
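The steps above can be sketched as a small request handler. This is an illustrative sketch, not the extension's actual code: the `unload_llm`/`reload_llm` callbacks stand in for the extension's internal hooks into the text generation backend, the image parameters are made up, and the checkpoint load/unload endpoints are the ones exposed by recent A1111 builds.

```python
import json
import urllib.request

A1111_URL = "http://127.0.0.1:7860"  # assumed local A1111 instance


def api_post(path, payload=None):
    """POST JSON to the A1111 web API and return the decoded response."""
    req = urllib.request.Request(
        A1111_URL + path,
        data=json.dumps(payload or {}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def generate_picture(prompt, vram_management=True,
                     unload_llm=None, reload_llm=None, post=api_post):
    # unload_llm / reload_llm are hypothetical callbacks into the text
    # generation backend; the names are illustrative, not a real API.
    if vram_management:
        if unload_llm:
            unload_llm()                            # 1. free VRAM for SD
        post("/sdapi/v1/reload-checkpoint")         # 2. bring SD back into VRAM
    result = post("/sdapi/v1/txt2img",
                  {"prompt": prompt, "steps": 20})  # 3. render the image
    if vram_management:
        post("/sdapi/v1/unload-checkpoint")         # 4. free VRAM again
        if reload_llm:
            reload_llm()                            # 5. bring the LLM back
    return result.get("images", [])                 # 6. hand back the results
```

With `vram_management=False` only step 3 runs, matching the note above; `post` is injectable so the flow can be exercised without a running server.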

Beware that this option uses system resources rather heavily: to be fast, it needs at least TWICE AS MUCH free RAM as the VRAM it requires.

Here's what happens:

| STEP | Init | Textgen | Start Unload LLM | Done | Load SD | Picture gen | Unload SD | Done | Load LLM | Status quo |
|------|------|---------|------------------|------|---------|-------------|-----------|------|----------|------------|
| VRAM | –    | LLM     | LLM              | –    | –       | SD          | SD        | –    | –        | LLM        |
| RAM  | SD   | SD      | SD               | SD + LLM ¹ | SD + LLM ¹ | LLM | LLM | LLM + SD ¹ | LLM + SD ¹ | SD |

  • ¹ So the RAM usage peaks at (SD + LLM) size, which caps at 2× the total VRAM.
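The peak-RAM claim can be sanity-checked with quick arithmetic. The sizes below are illustrative assumptions, not measurements:

```python
# Illustrative model sizes in GB; actual figures depend on your checkpoints.
llm_gb = 4.0   # e.g. a 7B model quantized to 4 bits
sd_gb = 4.0    # e.g. an SD checkpoint in fp16

# During a swap both models sit in RAM at once (the "SD + LLM" columns above).
peak_ram_gb = llm_gb + sd_gb

# Each model must also fit in VRAM on its own, so the peak can never exceed
# twice the VRAM a single model is allowed to use.
print(peak_ram_gb)
```

So with these assumed sizes you would want roughly 8 GB of RAM free for caching, on top of whatever the rest of the system uses.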

If the RAM is insufficient for caching, the models have to be loaded from disk, which increases latency significantly.

Automatic1111's WebUI caches the last used SD checkpoint in RAM by default; and even if it did not, the OS itself caches recently read files in RAM, so with enough free RAM reloading is fast.

Unloading weights is almost instantaneous; reloading a model usually takes 5–15 seconds, depending on your system.


With this feature, I bet it's possible to run LLaMA-7B (4-bit) and SD at 320×320 on a single GPU with less than 6 GB of VRAM, but there are no guarantees, and it would surely be slow.

Additional testing is needed.
