Running a model on multiple GPUs? How to do it? Does koboldcpp allow it easily? #1003
Replies: 3 comments 3 replies
-
Yes, multi-GPU is supported, and mixing different GPUs is supported too. Set the GPU type to "All" and then select the split ratio.
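From the command line, the same thing can be done with the --tensor_split flag (a minimal sketch; the model path and the 3:1 ratio are placeholders, and if I recall correctly the GUI exposes the same ratio as a "Tensor Split" field):

    koboldcpp.exe --usecublas all --gpulayers 99 --tensor_split 3 1 --model yourmodel.gguf

The two numbers after --tensor_split are relative weights per GPU, so 3 1 puts roughly three quarters of the layers on the first card.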
-
Having trouble getting that to work with an RTX 4070 12GB and a GTX 1050 Ti 4GB. It loads into both cards' VRAM, but won't split layers easily, and won't split rows at all. They use the same driver, 560.81, updated yesterday. Tried both koboldcpp and the _cu12 build, using a variety of GGUF models and sizes I use for benchmarking all the time.

Actually, I installed the 1050 Ti in this machine to offload the TTS from the 4070. Running the TTS on the 1050 Ti's CUDA cores made a big impact on usability combined with my RX 580. Anyone have experience doing that? Which TTS has a command-line parameter for selecting the GPU? It was easy with the 580: run the LLM on it using hipBLAS or Vulkan, and let Coqui/Whisper use the only CUDA device they could find. Also, no driver issues with two different brands.

When running only an LLM at Q4, the 1050 Ti is almost as fast (or slow) as the RX 580 8GB, even in the best combo I found for it (using the _rocm fork with hipBLAS or Vulkan). When gaming, the RX 580 is almost twice as fast. Will update when I find out more.
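On selecting the GPU for the TTS: I don't know of a universal flag, but one approach that works regardless of which TTS you run is to hide the other card from that process with the CUDA_VISIBLE_DEVICES environment variable (a sketch, assuming the 1050 Ti enumerates as CUDA device 1; your_tts_script.py is just a placeholder for whatever you launch):

    CUDA_VISIBLE_DEVICES=1 python your_tts_script.py    # only device 1 (the 1050 Ti here) is visible to this process

On Windows cmd the equivalent is `set CUDA_VISIBLE_DEVICES=1` before launching the script. The TTS then sees a single CUDA device and uses it without needing its own GPU-selection parameter.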
-
How would I specify multiple GPUs in the vast.ai docker options? I assume it doesn't automatically split across them. I think I have to do it through docker options, since I'm on cloud and can't see the initial .exe screen (I do get into the back end after it loads). So far I have a Mistral-Large model split into two parts, separated by commas. I'll be using 4x RTX 3090s. Here's what I have so far:

-e KCPP_MODEL="https://huggingface.co/bartowski/Tess-3-Mistral-Large-2-123B-GGUF/resolve/main/Tess-3-Mistral-Large-2-123B-Q4_K_S/Tess-3-Mistral-Large-2-123B-Q4_K_S-00001-of-00002.gguf?download=true, https://huggingface.co/bartowski/Tess-3-Mistral-Large-2-123B-GGUF/resolve/main/Tess-3-Mistral-Large-2-123B-Q4_K_S/Tess-3-Mistral-Large-2-123B-Q4_K_S-00002-of-00002.gguf?download=true" -e KCPP_ARGS="--usecublas --gpulayers 999 --contextsize 25000 --multiuser --flashattention"

Edit: Okay, never mind. It didn't error out, and 80.8 GB out of my 96 GB is filled, so it must be set up to do it automatically after all. I'm impressed, lol.
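For what it's worth, if you ever want to control how much goes on each card instead of relying on the automatic split, my understanding is you can add --tensor_split to KCPP_ARGS (a sketch with four equal weights for the 4x 3090s; everything else copied from your line):

    -e KCPP_ARGS="--usecublas --gpulayers 999 --tensor_split 1 1 1 1 --contextsize 25000 --multiuser --flashattention"

Uneven weights (e.g. 2 1 1 1) shift proportionally more layers onto the first card, which can help if one GPU also drives the display.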
-
Running a model on multiple GPUs? How to do it?
Can you show a simple example?
What are the restrictions? Do the GPUs have to be identical, or is it possible, for instance, to have one RTX 3070 and one RTX 3080? What about memory sharing?
Do Mistral and Llama models support these features?
Also, can you share a rig configuration with multiple GPUs for local LLM deployment?