Hello,

I am planning to serve the Llama 3.3 70B model using Triton Inference Server with the TensorRT-LLM backend on a system equipped with 8 NVIDIA RTX 4090 GPUs. My questions are:

1. Is it feasible to serve the Llama 3.3 70B model in FP16 precision under this configuration? (A rough memory estimate is sketched below.)
2. Can this setup operate efficiently without NVLink, or would the absence of NVLink significantly impact performance?
3. Are there any additional considerations or recommendations for optimizing performance in this environment?
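To make questions 1 and 2 more concrete, here is my rough back-of-the-envelope estimate. It assumes FP16 weights at 2 bytes per parameter, tensor parallelism across all 8 GPUs, and the standard Llama 70B dimensions (80 layers, hidden size 8192); the real KV-cache and activation overhead depends on batch size, sequence length, and the TensorRT-LLM engine options, so treat these numbers only as a sanity check:

```python
# Back-of-the-envelope feasibility check (rough estimates only; actual
# memory use depends on the TensorRT-LLM version, engine build options,
# batch size, and sequence lengths).

NUM_PARAMS = 70e9          # Llama 3.3 70B parameter count
BYTES_PER_PARAM_FP16 = 2   # FP16 weights
GPU_MEM_GB = 24            # RTX 4090 memory per card
NUM_GPUS = 8               # tensor-parallel degree assumed here

weights_gb = NUM_PARAMS * BYTES_PER_PARAM_FP16 / 1e9   # ~140 GB total
total_vram_gb = GPU_MEM_GB * NUM_GPUS                  # 192 GB total
weights_per_gpu_gb = weights_gb / NUM_GPUS             # ~17.5 GB per GPU

print(f"FP16 weights:      {weights_gb:.0f} GB total, "
      f"{weights_per_gpu_gb:.1f} GB per GPU at TP={NUM_GPUS}")
print(f"Total VRAM:        {total_vram_gb} GB")
print(f"Headroom per GPU:  {GPU_MEM_GB - weights_per_gpu_gb:.1f} GB "
      f"for KV cache, activations, and runtime buffers")

# Rough tensor-parallel communication per generated token (decode phase):
# each transformer layer performs two all-reduces over the hidden state,
# and without NVLink this traffic goes over PCIe.
HIDDEN_SIZE = 8192   # Llama 70B hidden dimension
NUM_LAYERS = 80      # Llama 70B layer count
bytes_per_token = 2 * NUM_LAYERS * HIDDEN_SIZE * BYTES_PER_PARAM_FP16
print(f"All-reduce payload: ~{bytes_per_token / 1e6:.1f} MB per token per rank")
```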
Thank you for your assistance.