Use FI_MR_CACHE_MONITOR=kdreg2 for all nersc machines #6687

Open: wants to merge 1 commit into master
ndkeen commented Oct 15, 2024

With the new Slingshot software (s2.2 h11.0.1), I encountered hangs in our init for certain cases at higher node counts.
Setting FI_MR_CACHE_MONITOR=kdreg2 avoids these hangs.
HPE says this may become the default one day.
For one HR F-case at 256 nodes, using kdreg2 was about 1% slower.
Fixes #6655
I also found some other issues (some at lower node counts) that this fixes (even on pm-cpu currently):
Fixes #6516
Fixes #6451
Fixes #6521

[bfb]
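
The machine-file change itself is not shown in this conversation. As a minimal sketch of what the setting amounts to at launch time, the Python below copies an environment and sets `FI_MR_CACHE_MONITOR=kdreg2` before starting a child process. The variable name and value come from this PR; the helper names `build_job_env` and `run_with_kdreg2` are hypothetical and exist only for illustration.

```python
import os
import subprocess

# Variable name and value are taken from the PR title.
MR_CACHE_MONITOR_VAR = "FI_MR_CACHE_MONITOR"
MR_CACHE_MONITOR_VALUE = "kdreg2"


def build_job_env(base_env=None):
    """Return a copy of the environment with the kdreg2 monitor selected.

    Hypothetical helper: the input mapping is not modified, and any
    existing FI_MR_CACHE_MONITOR value is overridden in the copy.
    """
    env = dict(os.environ if base_env is None else base_env)
    env[MR_CACHE_MONITOR_VAR] = MR_CACHE_MONITOR_VALUE
    return env


def run_with_kdreg2(command):
    """Run `command` (a list of strings) with the variable set.

    Hypothetical helper: returns the child's exit code.
    """
    completed = subprocess.run(command, env=build_job_env(), check=False)
    return completed.returncode
```

For example, `run_with_kdreg2(["env"])` on a Unix-like system would list the child's environment, which should include the `FI_MR_CACHE_MONITOR` entry. Setting the variable per launch rather than globally keeps the roughly 1% slowdown noted above confined to the jobs that need the workaround.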

ndkeen self-assigned this Oct 15, 2024
ndkeen added the Machine Files, BFB (PR leaves answers BFB), pm-gpu (Perlmutter machine at NERSC, GPU nodes), and pm-cpu (Perlmutter at NERSC, CPU-only nodes) labels Oct 15, 2024

PR Preview Action v1.4.8
🚀 Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6687/
on branch gh-pages at 2024-10-15 17:24 UTC

ndkeen commented Oct 15, 2024

It now looks like this will also fix #6516, which would be great, as it would allow me to update the GCC compiler version.

Confirmed that with kdreg2, the tests that were hanging with newer GCC now work. I verified that I can update GCC to the latest version, and all testing seems OK so far. I would like to see this PR merged first, then follow with a PR to update module versions.

ndkeen commented Oct 17, 2024

Fixes #6451
Fixes #6521
