-
Hi, when running a CUDA-enabled build of llama.cpp on an A100 GPU, I get really slow performance: <1 tok/sec running inference on the 8B-parameter R1/Llama3 model. I would appreciate some tips on how to diagnose the issue. The same model runs at ~10 tok/sec on my Mac with llama.cpp built with Metal. Here's the beginning of the run:
Thanks in advance,
Replies: 2 comments
-
If nothing changed in the past months, then you'll still have to choose how many layers to offload to the GPU with -ngl.
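For example, something like the following (a minimal sketch; the model filename and prompt are placeholders, assuming the current llama-cli binary from a CUDA build):

    ./llama-cli -m DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -ngl 33 -p "Hello"

An 8B Llama3-style model has 32 transformer layers plus the output layer, so -ngl 33 offloads everything; passing a larger value than the layer count also just offloads all layers.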
-
Thanks! That did it. Running with -ngl 33 fixes the issue and has it blasting out tokens.

    llama_perf_sampler_print: sampling time = 197.68 ms / 1107 runs ( 0.18 ms per token, 5599.99 tokens per second)