Broken AMD OpenCL locality on systems with multiple PCI domains #696

Open
bgoglin opened this issue Nov 15, 2024 · 0 comments

bgoglin commented Nov 15, 2024

The AMD OpenCL extension for querying GPU locality (CL_DEVICE_TOPOLOGY_TYPE_PCIE_AMD) doesn't report the PCI domain. Systems with multiple PCI domains used to be rare, but at least systems with multiple MI300A CPU+GPUs (like Adastra and maybe El Capitan) now use one domain per CPU.

The issue is confirmed at ROCm/clr#106 but it won't be fixed upstream, likely because the OpenCL runtime doesn't matter anymore.

Possible solutions:

  • remove the AMD OpenCL locality queries, since they apparently don't matter much anymore
  • if there are multiple PCI domains in the machine, ignore AMD OpenCL locality (just attach to root)
  • if multiple AMD GPUs have the same PCI BDF and differ only in PCI domain, attach all of them to root
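
For discussion, the second and third options could be sketched roughly as below. This is only an illustration and not hwloc code: `struct amd_ocl_bdf`, `ignore_locality_multi_domain()` and `bdf_is_ambiguous()` are hypothetical names, and the sketch assumes the caller has already collected the bus/device/function triple for each AMD OpenCL device and knows how many PCI domains the machine has.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record: the bus/device/function triple collected for one
 * AMD OpenCL device. There is no domain field because the extension
 * does not report one. */
struct amd_ocl_bdf {
    unsigned char bus;
    unsigned char device;
    unsigned char function;
};

/* Second option: with more than one PCI domain in the machine, do not
 * trust the BDF at all and attach every AMD OpenCL device to root. */
static bool ignore_locality_multi_domain(unsigned nr_pci_domains)
{
    return nr_pci_domains > 1;
}

static bool same_bdf(const struct amd_ocl_bdf *a, const struct amd_ocl_bdf *b)
{
    return a->bus == b->bus
        && a->device == b->device
        && a->function == b->function;
}

/* Third option: device idx is ambiguous if another device reports the
 * same BDF, which can only happen when they sit in different domains.
 * Ambiguous devices would be attached to root. */
static bool bdf_is_ambiguous(const struct amd_ocl_bdf *devs, size_t n, size_t idx)
{
    for (size_t i = 0; i < n; i++) {
        if (i != idx && same_bdf(&devs[i], &devs[idx]))
            return true;
    }
    return false;
}
```

The third option keeps locality for GPUs whose BDF is unique among the OpenCL devices, while the second is simpler but discards locality for every AMD OpenCL device on a multi-domain machine.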