Hello,

I am planning to serve the Llama 3.3 70B model using Triton Inference Server with the TensorRT-LLM backend on a system equipped with 8 NVIDIA RTX 4090 GPUs. My questions are:

1. Is it feasible to serve the Llama 3.3 70B model in FP16 precision under this configuration? (A rough memory estimate is sketched below.)
2. Can this setup operate efficiently without NVLink, or would the absence of NVLink significantly impact performance?
3. Are there any additional considerations or recommendations for optimizing performance in this environment?
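To make questions 1 and 2 more concrete, here is my rough back-of-the-envelope estimate. It assumes FP16 weights at 2 bytes per parameter, tensor parallelism across all 8 GPUs, and the standard Llama 70B dimensions (80 layers, hidden size 8192); the real KV-cache and activation overhead depends on batch size, sequence length, and the TensorRT-LLM engine options, so treat these numbers only as a sanity check:

```python
# Back-of-the-envelope feasibility check (rough estimates only; actual
# memory use depends on the TensorRT-LLM version, engine build options,
# batch size, and sequence lengths).

NUM_PARAMS = 70e9          # Llama 3.3 70B parameter count
BYTES_PER_PARAM_FP16 = 2   # FP16 weights
GPU_MEM_GB = 24            # RTX 4090 memory per card
NUM_GPUS = 8               # tensor-parallel degree assumed here

weights_gb = NUM_PARAMS * BYTES_PER_PARAM_FP16 / 1e9   # ~140 GB total
total_vram_gb = GPU_MEM_GB * NUM_GPUS                  # 192 GB total
weights_per_gpu_gb = weights_gb / NUM_GPUS             # ~17.5 GB per GPU

print(f"FP16 weights:      {weights_gb:.0f} GB total, "
      f"{weights_per_gpu_gb:.1f} GB per GPU at TP={NUM_GPUS}")
print(f"Total VRAM:        {total_vram_gb} GB")
print(f"Headroom per GPU:  {GPU_MEM_GB - weights_per_gpu_gb:.1f} GB "
      f"for KV cache, activations, and runtime buffers")

# Rough tensor-parallel communication per generated token (decode phase):
# each transformer layer performs two all-reduces over the hidden state,
# and without NVLink this traffic goes over PCIe.
HIDDEN_SIZE = 8192   # Llama 70B hidden dimension
NUM_LAYERS = 80      # Llama 70B layer count
bytes_per_token = 2 * NUM_LAYERS * HIDDEN_SIZE * BYTES_PER_PARAM_FP16
print(f"All-reduce payload: ~{bytes_per_token / 1e6:.1f} MB per token per rank")
```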
Thank you for your assistance.