-
Hi, when running a CUDA-enabled build of llama.cpp on an A100 GPU, I get really slow performance: <1 tok/sec running inference on the 8B-parameter R1/Llama3 model. I would appreciate some tips on how to diagnose the issue. The same model runs at ~10 tok/sec on my Mac with llama.cpp built with Metal. Here's the beginning of the run:
Thanks in advance,
Replies: 2 comments
-
If nothing changed in the past months, then you'll still have to choose how many layers to offload to the GPU with -ngl.
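For example, something like the following (a minimal sketch; the model filename and prompt are placeholders, assuming the current llama-cli binary from a CUDA build):

    ./llama-cli -m DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -ngl 33 -p "Hello"

An 8B Llama3-style model has 32 transformer layers plus the output layer, so -ngl 33 offloads everything; passing a larger value than the layer count also just offloads all layers.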
-
Thanks! That did it. Running with -ngl 33 fixes the issue and has it blasting out tokens.

    llama_perf_sampler_print: sampling time = 197.68 ms / 1107 runs ( 0.18 ms per token, 5599.99 tokens per second)