DCGM Policy Violation Notification channel reporting too many PCIe violations on P5 instance type from AWS EC2 (H100) #184

haardm · 2024-08-14T18:36:14Z

Hi team,

We are observing PCIe violations on P5 instances consistently from the time the instance is launched.
We act on these with terminate-and-replace logic, but that is an expensive operation, both in time and because the instance type is a P5 from EC2. Also, the threshold is a default value and can't be managed by the client when subscribing to the policy.

A few asks:

Is there an upstream fix planned from NVIDIA?
Are there any repercussions to temporarily not subscribing to this policy?
What would go wrong if we let the PCIe errors keep happening silently?

nikkon-dev · 2024-08-20T05:35:38Z

Hi,

This is a known issue that will be resolved in an upcoming release. In short, this diagnostic didn’t correctly normalize the error rates to the PCIe generation throughput.
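
To illustrate what normalizing an error rate to PCIe generation throughput could mean, here is a minimal sketch. It is not DCGM's actual logic: the function names, the Gen3 baseline, and the threshold of 8 replays per minute are all assumptions made for illustration. Only the per-lane transfer rates come from the PCIe specifications.

```python
# Per-lane transfer rate in GT/s for each PCIe generation (PCIe spec values).
PCIE_GT_PER_S = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0, 5: 32.0}


def normalized_replay_rate(replays_per_min, pcie_gen, baseline_gen=3):
    """Scale a raw replay rate to the equivalent rate at baseline_gen speed.

    Hypothetical helper: a faster link moves more data per minute, so the
    same per-byte error rate produces more replays per minute.
    """
    scale = PCIE_GT_PER_S[pcie_gen] / PCIE_GT_PER_S[baseline_gen]
    return replays_per_min / scale


def violates(replays_per_min, pcie_gen, threshold_per_min=8.0):
    """Return True if the normalized rate exceeds the (assumed) threshold."""
    return normalized_replay_rate(replays_per_min, pcie_gen) > threshold_per_min


if __name__ == "__main__":
    # The same raw count is judged differently on a Gen3 and a Gen5 link.
    print(violates(20, pcie_gen=3))
    print(violates(20, pcie_gen=5))
```

Under these assumptions, a fixed per-minute threshold applied without normalization would flag a Gen5 link (as on H100) far more often than a Gen3 link with the same underlying per-byte error rate, which is consistent with the behavior reported above.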

haardm changed the title ~~DCGM Policy Violation Notification channel reporting too many PCIe violations on P5~~ DCGM Policy Violation Notification channel reporting too many PCIe violations on P5 instance type from AWS EC2 (H100) Aug 15, 2024
