Kokkos::Tools::Experimental::device_id() seems to return ids only associated with cuda device 0 on multi-gpu nodes #224

Open
rppawlo opened this issue Dec 12, 2023 · 5 comments

rppawlo commented Dec 12, 2023

I have a kokkos tool that checks the stream and looks to see if a fence call is using the default stream. If I call the Kokkos function below to check what stream I am on:

auto phalanx_default_stream_device_id = Kokkos::Tools::Experimental::device_id(Kokkos::Cuda());

and then run a test that calls:

Kokkos::Cuda().fence()

where I have registered the callback below with kokkos-tools, I get a consistent stream id and an exception is thrown as expected:

  void phalanx_kt_fence_callback(char const *label, uint32_t device_id,
                                 uint64_t * /*fence_id*/)
  {
    TEUCHOS_TEST_FOR_EXCEPTION(device_id == phalanx_default_stream_device_id,
                               std::runtime_error,
                               "ERROR: the fence \"" << label
                               << "\" with device id=" << device_id
                               << " is the same as the default stream id="
                               << phalanx_default_stream_device_id);
  }

However, if I run the executable with --kokkos-device-id=3 to pick a different GPU on a node then the function does not return a consistent id. It looks like the device_id() function always returns the stream id for cuda device id 0. Is this intended? How do I get the default stream id for the device this particular mpi process has chosen?

rppawlo commented Dec 13, 2023

@crtrott @dalabre

masterleinad commented Dec 13, 2023

The problem is that the fence callback forwards, say, CudaInternal.impl_get_instance_id() instead of Kokkos::Tools::Experimental::device_id(exec), so the two values are not comparable.
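
To illustrate why a bare instance id and a full device id can disagree, here is a minimal, self-contained sketch of a packed 32-bit identifier. It does not use Kokkos itself: the names `SpaceId`, `pack`, and `unpack` are hypothetical, and the field widths (7 bits for the device number, 17 bits for the instance id, the remainder for the backend type) are assumptions chosen for illustration, so check Kokkos_Profiling_Interface.hpp for the layout your Kokkos version actually uses.

```cpp
#include <cassert>
#include <cstdint>

// Assumed field widths, for illustration only (not taken from this thread).
constexpr std::uint32_t kInstanceBits = 17;
constexpr std::uint32_t kDeviceBits   = 7;

// Hypothetical decoded form of a tool-facing identifier.
struct SpaceId {
  std::uint32_t type;      // backend kind (some value standing for Cuda, etc.)
  std::uint32_t device;    // which GPU on the node
  std::uint32_t instance;  // which stream / execution space instance
};

// Pack the three fields into one 32-bit value: [type | device | instance].
inline std::uint32_t pack(SpaceId id) {
  assert(id.device < (1u << kDeviceBits));
  assert(id.instance < (1u << kInstanceBits));
  return (id.type << (kInstanceBits + kDeviceBits)) |
         (id.device << kInstanceBits) | id.instance;
}

// Recover the three fields from a packed value.
inline SpaceId unpack(std::uint32_t packed) {
  return {packed >> (kInstanceBits + kDeviceBits),
          (packed >> kInstanceBits) & ((1u << kDeviceBits) - 1u),
          packed & ((1u << kInstanceBits) - 1u)};
}
```

Under these assumptions, `pack({2, 3, 0})` and a forwarded bare instance id of `0` refer to the same stream but compare unequal, and decoding the bare value yields device number 0 regardless of which GPU the process selected.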

vlkale commented Feb 21, 2024

@rppawlo Is this resolved for you given the comment from @masterleinad above?

masterleinad commented

No, it's not. The information we forward doesn't include the number of the device we are using. That shouldn't be too hard to fix, but it's also important to clarify what the Tools interface expects to receive as the identifier. If you can write a check that all callbacks forward the whole identification, that would be very helpful.
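
One possible shape for such a check, sketched without Kokkos so it stands alone: record every id a callback forwards, decode it, and list the callbacks whose id lacks the expected backend type or device number. The class `IdentificationCheck`, the helper names, and the bit layout (type, then a 7-bit device number, then a 17-bit instance id) are all assumptions made for illustration, not the actual Tools interface.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Assumed identifier layout, for illustration: [type | device | instance].
constexpr std::uint32_t kCheckInstanceBits = 17;
constexpr std::uint32_t kCheckDeviceBits   = 7;

inline std::uint32_t type_of(std::uint32_t id) {
  return id >> (kCheckInstanceBits + kCheckDeviceBits);
}

inline std::uint32_t device_of(std::uint32_t id) {
  return (id >> kCheckInstanceBits) & ((1u << kCheckDeviceBits) - 1u);
}

// Hypothetical helper: remembers which callbacks forwarded an id whose
// type or device field differs from what the test expects.
class IdentificationCheck {
 public:
  IdentificationCheck(std::uint32_t expected_type, std::uint32_t expected_device)
      : type_(expected_type), device_(expected_device) {}

  // Call from each tool callback with its name and the id it received.
  void observe(const std::string& callback, std::uint32_t forwarded_id) {
    if (type_of(forwarded_id) != type_ || device_of(forwarded_id) != device_) {
      offenders_.push_back(callback);
    }
  }

  bool passed() const { return offenders_.empty(); }
  const std::vector<std::string>& offenders() const { return offenders_; }

 private:
  std::uint32_t type_;
  std::uint32_t device_;
  std::vector<std::string> offenders_;
};
```

A test could construct the checker with the device number passed via --kokkos-device-id, call `observe` from every registered callback, and fail if `passed()` is false, printing `offenders()` to show which callbacks still forward a partial id.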

vlkale commented Feb 22, 2024

OK, got it and thanks for clarifying that. Yes, I will work on putting in a check for this.
