Hi team,

We are observing PCIe violations on p5 instances consistently, starting from the time an instance is launched.

We currently act on these violations with terminate-and-replace instance logic (sketched below), but that is an expensive operation, both in time and because the instance type is a P5 on EC2. Also, the policy uses a default threshold that cannot be adjusted by the client when subscribing.

A few asks:

Is there an upstream fix from NVIDIA planned?
Are there any repercussions of temporarily not subscribing to this policy?
What would go wrong if we let the PCIe errors keep happening silently?
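For context, our remediation path is roughly the following minimal sketch. The instance-ID lookup and the callback wiring are assumptions for illustration only, not our production code:

```python
# Simplified sketch of the current remediation: on a PCIe policy violation,
# terminate the affected P5 instance and let the provisioning layer replace it.
import boto3
import requests

def current_instance_id() -> str:
    # Look up the local instance ID via the instance metadata service (IMDSv2).
    token = requests.put(
        "http://169.254.169.254/latest/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
        timeout=2,
    ).text
    return requests.get(
        "http://169.254.169.254/latest/meta-data/instance-id",
        headers={"X-aws-ec2-metadata-token": token},
        timeout=2,
    ).text

def on_pcie_violation() -> None:
    # Expensive path: terminating a p5.48xlarge and waiting for replacement
    # capacity takes significant time, which is why false-positive PCIe
    # violations are costly for us.
    ec2 = boto3.client("ec2")
    ec2.terminate_instances(InstanceIds=[current_instance_id()])
```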
haardm changed the title from "DCGM Policy Violation Notification channel reporting too many PCIe violations on P5" to "DCGM Policy Violation Notification channel reporting too many PCIe violations on P5 instance type from AWS EC2 (H100)" on Aug 15, 2024.
This is a known issue that will be resolved in an upcoming release. In short, this diagnostic didn’t correctly normalize the error rates to the PCIe generation throughput.
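For illustration, here is a rough sketch of what "normalizing the error rate to the PCIe generation throughput" means. P5 instances use PCIe Gen5 links (H100); the thresholds, field names, and numbers below are made-up placeholders, not DCGM's actual values:

```python
# A fixed error-rate limit tuned for an older link generation trips too easily
# on a Gen5 link, because the same error rate per unit of traffic produces more
# absolute errors per second at higher throughput.

# Approximate per-direction x16 link throughput in GB/s by PCIe generation.
LINK_GBPS = {3: 16.0, 4: 32.0, 5: 64.0}

ERRORS_PER_SEC_LIMIT_GEN4 = 8.0  # hypothetical fixed limit tuned on Gen4

def fails_unnormalized(errors_per_sec: float) -> bool:
    # Buggy behavior: the same absolute limit regardless of link generation.
    return errors_per_sec > ERRORS_PER_SEC_LIMIT_GEN4

def fails_normalized(errors_per_sec: float, gen: int) -> bool:
    # Fixed behavior: scale the limit by the link generation's throughput, so a
    # Gen5 link is allowed proportionally more absolute errors per second.
    limit = ERRORS_PER_SEC_LIMIT_GEN4 * LINK_GBPS[gen] / LINK_GBPS[4]
    return errors_per_sec > limit

# A Gen5 link pushing twice the traffic at the same underlying error rate:
print(fails_unnormalized(12.0))       # True  -> spurious violation on P5
print(fails_normalized(12.0, gen=5))  # False -> passes once normalized
```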