Running a model on multiple GPUs? How to do it? Does koboldcpp allow it easily? #1003
Replies: 3 comments 3 replies
-
Yes, multi-GPU is supported, and mixing different GPUs is supported too. Set the GPU type to "All" and then select the split ratio.
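From the command line, the same thing can be done with the --tensor_split flag (a minimal sketch; the model path and the 3:1 ratio are placeholders, and if I recall correctly the GUI exposes the same ratio as a "Tensor Split" field):

    koboldcpp.exe --usecublas all --gpulayers 99 --tensor_split 3 1 --model yourmodel.gguf

The two numbers after --tensor_split are relative weights per GPU, so 3 1 puts roughly three quarters of the layers on the first card.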
-
Having trouble getting that to work with an RTX 4070 12GB and a GTX 1050 Ti 4GB. It loads into both cards' VRAM, but won't split layers easily, and won't split rows at all. They use the same driver, 560.81, updated yesterday. Tried both koboldcpp and the _cu12 build, using a variety of GGUF models and sizes I use for benchmarking all the time.

Actually, I installed the 1050 Ti in this machine to offload the TTS from the 4070. Running the TTS on the 1050 Ti's CUDA cores made a big impact on usability combined with my RX 580. Anyone have experience doing that? Which TTS has a command-line parameter for selecting the GPU? It was easy with the 580: run the LLM on it using hipBLAS or Vulkan, and let Coqui/Whisper use the only CUDA device they could find. Also, no driver issues with two different brands.

When running only an LLM at Q4, the 1050 Ti is almost as fast (or slow) as the RX 580 8GB, even in the best combo I found for it (using the _rocm fork with hipBLAS or Vulkan). When gaming, the RX 580 is almost twice as fast. Will update when I find out more.
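On selecting the GPU for the TTS: I don't know of a universal flag, but one approach that works regardless of which TTS you run is to hide the other card from that process with the CUDA_VISIBLE_DEVICES environment variable (a sketch, assuming the 1050 Ti enumerates as CUDA device 1; your_tts_script.py is just a placeholder for whatever you launch):

    CUDA_VISIBLE_DEVICES=1 python your_tts_script.py    # only device 1 (the 1050 Ti here) is visible to this process

On Windows cmd the equivalent is `set CUDA_VISIBLE_DEVICES=1` before launching the script. The TTS then sees a single CUDA device and uses it without needing its own GPU-selection parameter.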
-
How would I specify multiple GPUs in the vast.ai docker options? I assume it doesn't automatically split across them. I think I have to do it through docker options, since I'm on cloud and can't see the initial .exe screen (I do get into the back end after it loads). So far I have a Mistral-Large model split into two parts, separated by commas. I'll be using 4x RTX 3090s. Here's what I have so far:

-e KCPP_MODEL="https://huggingface.co/bartowski/Tess-3-Mistral-Large-2-123B-GGUF/resolve/main/Tess-3-Mistral-Large-2-123B-Q4_K_S/Tess-3-Mistral-Large-2-123B-Q4_K_S-00001-of-00002.gguf?download=true, https://huggingface.co/bartowski/Tess-3-Mistral-Large-2-123B-GGUF/resolve/main/Tess-3-Mistral-Large-2-123B-Q4_K_S/Tess-3-Mistral-Large-2-123B-Q4_K_S-00002-of-00002.gguf?download=true" -e KCPP_ARGS="--usecublas --gpulayers 999 --contextsize 25000 --multiuser --flashattention"

Edit: Okay, never mind. It didn't error out, and 80.8 GB out of my 96 GB is filled, so it must be set up to do it automatically after all. I'm impressed, lol.
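For what it's worth, if you ever want to control how much goes on each card instead of relying on the automatic split, my understanding is you can add --tensor_split to KCPP_ARGS (a sketch with four equal weights for the 4x 3090s; everything else copied from your line):

    -e KCPP_ARGS="--usecublas --gpulayers 999 --tensor_split 1 1 1 1 --contextsize 25000 --multiuser --flashattention"

Uneven weights (e.g. 2 1 1 1) shift proportionally more layers onto the first card, which can help if one GPU also drives the display.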
-
Running a model on multiple GPUs? How to do it?
Can you show a simple example?
What are the restrictions? Do the GPUs have to be identical, or is it possible, for instance, to have one RTX 3070 and one RTX 3080? What about memory sharing?
Do Mistral and Llama models support these features?
Also, can you share a rig configuration with multiple GPUs for local LLM deployment?