DCGM Policy Violation Notification channel reporting too many PCIe violations on P5 instance type from AWS EC2 (H100) #184

Open
haardm opened this issue Aug 14, 2024 · 1 comment


haardm commented Aug 14, 2024

Hi team,

We are observing PCIe violations on P5 instances consistently from the time the instance is launched.
We handle this by terminating and replacing the instance, but that is expensive, both in time and because the instance type is a P5 from EC2. Also, the threshold is a default value and can't be configured by the client when subscribing to the policy.

A few asks:

Is there an upstream fix planned from NVIDIA?
Are there any repercussions to temporarily not subscribing to this policy?
What would go wrong if we let the PCIe errors keep happening silently?

haardm changed the title from "DCGM Policy Violation Notification channel reporting too many PCIe violations on P5" to "DCGM Policy Violation Notification channel reporting too many PCIe violations on P5 instance type from AWS EC2 (H100)" on Aug 15, 2024
nikkon-dev (Collaborator) commented

Hi,

This is a known issue that will be resolved in an upcoming release. In short, this diagnostic didn’t correctly normalize the error rates to the PCIe generation throughput.
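
To illustrate what normalizing an error rate to PCIe generation throughput could mean, here is a minimal Python sketch. It is not DCGM's implementation: the function names, the choice of Gen3 as the reference generation, and the threshold value are all hypothetical. Only the per-lane signaling rates come from the PCIe specifications. The idea is that a faster link moves more data per second, so the same per-transfer error rate yields more raw errors per second, and a fixed raw threshold would flag faster links more often.

```python
# Per-lane raw signaling rates in GT/s for each PCIe generation.
PCIE_GT_PER_S = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0, 5: 32.0}


def normalized_replay_rate(replays, interval_s, gen, reference_gen=3):
    """Return replays per second, rescaled to the reference generation.

    Hypothetical helper: scales the raw rate by the ratio of the reference
    generation's signaling rate to the link's actual signaling rate.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    scale = PCIE_GT_PER_S[reference_gen] / PCIE_GT_PER_S[gen]
    return replays / interval_s * scale


def violates(replays, interval_s, gen, threshold_per_s, reference_gen=3):
    """Compare the normalized rate against a threshold (assumed value)."""
    rate = normalized_replay_rate(replays, interval_s, gen, reference_gen)
    return rate > threshold_per_s


if __name__ == "__main__":
    # Same raw counter delta, evaluated on a Gen3 link and on a Gen5 link.
    print(violates(40, 10, gen=3, threshold_per_s=2.0))
    print(violates(40, 10, gen=5, threshold_per_s=2.0))
```

Under these assumptions, the same raw counter delta that exceeds the threshold on a Gen3 link stays below it on a Gen5 link, because the Gen5 link carried four times as many transfers in the same interval.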
