Oversubscribed gasnet with GPU support is broken #25989

Open
e-kayrakli opened this issue Sep 24, 2024 · 4 comments

@e-kayrakli
Contributor

The runtime doesn't seem to report the correct number of devices in this config.

for loc in Locales do on loc {
  writeln(here, " ", here.gpus.size);
}

This reports 0 GPUs for each locale when run with more than one locale. Running it with -nl1 in the same config reports the correct number of GPUs. Things must be fine in actual multilocale configs, since we have a ton of nightly testing for those, but we don't really test the oversubscribed config with GPUs.

How to share multiple GPUs in an oversubscribed setting is not something we have completely answered. However, we have been giving all locales all GPUs and letting the GPU driver figure things out, which I believe just serializes requests from different processes. I think we should fix this and go back to that world.

@e-kayrakli
Contributor Author

@jhh67 I added you as an assignee as I believe this is fallout from #25734. I think chpl_topo_selectMyDevices sees 0 devices with the problematic setting, which causes the issue down the line. Could you take a look when you get the chance?

@e-kayrakli
Contributor Author

If anyone else bumps into this, I am working with the following hack to get this mode into a relatively more workable state:

diff --git a/runtime/src/gpu/nvidia/gpu-nvidia.c b/runtime/src/gpu/nvidia/gpu-nvidia.c
index d7e93173f3..0069b87002 100644
--- a/runtime/src/gpu/nvidia/gpu-nvidia.c
+++ b/runtime/src/gpu/nvidia/gpu-nvidia.c
@@ -169,7 +169,7 @@ void chpl_gpu_impl_init(int* num_devices) {
   chpl_topo_pci_addr_t *addrs = chpl_malloc(sizeof(*addrs) * numAddrs);

   int rc = chpl_topo_selectMyDevices(allAddrs, addrs, &numAddrs);
-  if (rc) {
+  if (true) {
     chpl_warning("unable to select GPUs for this locale, using them all",
                  0, 0);
     for (int i = 0; i < numAllDevices; i++) {
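
For readers following along, a narrower variant of the hack above might take the "use them all" path only when selection fails or comes back empty, rather than unconditionally. The following is a minimal standalone sketch of that idea, not the runtime's code: `pci_addr_t`, `select_my_devices`, and `choose_devices` are hypothetical stand-ins, and the stub selector simply simulates the suspected behavior (success with zero devices) described earlier in the thread.

```c
#include <assert.h>

/* Hypothetical stand-in for a PCI address; not the runtime's real type. */
typedef struct { int bus; } pci_addr_t;

/* Hypothetical selector. Returns 0 on success and stores the number of
 * devices chosen in *num_selected. This stub simulates the suspected
 * failure mode from the thread: it succeeds but selects zero devices. */
static int select_my_devices(const pci_addr_t *all, int num_all,
                             pci_addr_t *mine, int *num_selected) {
  (void)all; (void)num_all; (void)mine;
  *num_selected = 0;
  return 0;
}

/* Picks the devices for this process. Falls back to using every device
 * when selection reports an error OR reports success with no devices.
 * `mine` must have room for num_all entries. Returns the device count. */
static int choose_devices(const pci_addr_t *all, int num_all,
                          pci_addr_t *mine) {
  assert(num_all >= 0);
  int num_selected = num_all;
  int rc = select_my_devices(all, num_all, mine, &num_selected);
  if (rc != 0 || num_selected == 0) {
    /* Fallback: hand every device to this process. */
    for (int i = 0; i < num_all; i++) {
      mine[i] = all[i];
    }
    num_selected = num_all;
  }
  return num_selected;
}
```

Whether an empty selection should really be treated as a failure in the runtime is an assumption here; it only illustrates how the "all locales get all GPUs" behavior could be kept without forcing the fallback on every configuration.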

@jhh67
Contributor

jhh67 commented Sep 26, 2024

How do I replicate this problem?

@e-kayrakli
Contributor Author

Probably the key configs are:

CHPL_LLVM: system  # for GPU support
CHPL_LOCALE_MODEL: gpu
CHPL_COMM: gasnet
  CHPL_COMM_SUBSTRATE: udp
  CHPL_GASNET_SEGMENT: everything
GASNET_SPAWNFN: L

Compiling and running the code in the OP with -nl2 should generate the incorrect result.
