Creating tool to extract kernel launch configuration (block, grid, launch mechanism, ...) #238

Open
maartenarnst opened this issue Feb 9, 2024 · 4 comments

Comments

@maartenarnst
Contributor

It can be of interest to developers of Kokkos applications to have some insight into the configuration that Kokkos uses to launch kernels (block, grid, launch mechanism, ...).

However, Kokkos currently determines such a launch configuration inside the body of an execute() function, typically, so it cannot be accessed directly. To access launch configurations, developers are therefore led to use tools like ncu and rocprof. Another option (not sustainable) is to copy-paste pieces of the bodies of the execute() functions into custom functions.

One option may be to extract these functionalities from the execute() functions in Kokkos and put them into dedicated functions that could become part of the Kokkos API. However, implementing the launch configurations inside the bodies of execute() functions, and thus choosing not to expose them, may have been a deliberate design decision in Kokkos (?).

Hence the question: how best to proceed to make it possible to extract launch configuration properties?

If it's not an option to expose them in Kokkos itself, it would be interesting to explore whether gaining insight into launch configurations could be made a part of Kokkos Tools, i.e., whether it would be of interest to define new callbacks that can provide the launch configuration and to develop a new Kokkos Tools connector to collect such information.

@romintomasetti

@dalg24, @masterleinad, @vlkale

@vlkale
Contributor

vlkale commented Feb 15, 2024

@maartenarnst

Thanks for this. It is a good point.

I don't know how easy it would be to make modifications to Kokkos core and extract out launch configuration code from the execute() function.

I think the solution should involve a new Kokkos Tools callback. I haven't sketched it out in detail but you would need to make changes in profiling/all/ to add this new callback.
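To make the idea of a new callback concrete, here is a minimal sketch of how a runtime could dispatch an optional launch-configuration callback. Everything in it is an assumption for illustration: `HypotheticalLaunchConfig`, `set_launch_config_callback` and `report_launch_config` are made-up names, and this is a generic optional-callback pattern, not the actual code in Kokkos core.

```cpp
#include <cstdint>

// HYPOTHETICAL payload: Kokkos does not define this struct today.
struct HypotheticalLaunchConfig {
  std::uint32_t grid[3];
  std::uint32_t block[3];
  std::uint64_t dynamic_shared_bytes;
};

// HYPOTHETICAL callback type that a tool library could provide.
using hypothetical_launch_config_fn =
    void (*)(std::uint64_t kernel_id, const HypotheticalLaunchConfig* config);

// Slot for the callback; it stays null when no tool has registered one.
static hypothetical_launch_config_fn launch_config_callback = nullptr;

// Register (or clear, with nullptr) the tool's callback.
void set_launch_config_callback(hypothetical_launch_config_fn fn) {
  launch_config_callback = fn;
}

// Would be called by the runtime once the configuration is known.
// Returns false, and does nothing else, when no tool is listening.
bool report_launch_config(std::uint64_t kernel_id,
                          const HypotheticalLaunchConfig& config) {
  if (launch_config_callback == nullptr) return false;
  launch_config_callback(kernel_id, &config);
  return true;
}
```

The null check keeps the cost close to zero for applications that run without a tool loaded.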

@cwpearson
Contributor

cwpearson commented Feb 29, 2024

@maartenarnst what do tools like ncu and rocprof need as inputs to extract this information (e.g. a pointer to the kernel function)? If it's runtime information like that, my first thought would be that Kokkos should pass that information to Tools through an appropriate interface and then Tools can use it as needed.

I guess one issue I see is that Core will determine things like the launch mechanism and parameters before actually launching the kernel, so we'd have to resolve how to give Tools enough information to correlate that with the following kernel launch and plumb that through Core.
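One possible way to do the correlation, sketched from the tool side: the existing `kokkosp_begin_parallel_for` callback lets the tool hand back a kernel ID, and a new callback could carry that same ID along with the configuration. The sketch assumes the begin callback fires before the configuration is reported. `kokkosp_hypothetical_launch_config` and `HypotheticalLaunchConfig` are invented names and do not exist in Kokkos Tools.

```cpp
#include <cstdint>
#include <map>
#include <string>

// HYPOTHETICAL payload: not part of the Kokkos Tools interface.
struct HypotheticalLaunchConfig {
  std::uint32_t grid[3];
  std::uint32_t block[3];
};

namespace {
std::map<std::uint64_t, std::string> label_by_id;  // kernels currently in flight
std::map<std::string, HypotheticalLaunchConfig> config_by_label;
std::uint64_t next_id = 0;
}  // namespace

// Existing callback: the tool chooses the kernel ID and remembers the label.
extern "C" void kokkosp_begin_parallel_for(const char* name,
                                           const std::uint32_t devID,
                                           std::uint64_t* kID) {
  (void)devID;  // unused in this sketch
  *kID = next_id++;
  label_by_id[*kID] = name;
}

// HYPOTHETICAL callback: Core would pass the same kID it got back above.
extern "C" void kokkosp_hypothetical_launch_config(
    const std::uint64_t kID, const HypotheticalLaunchConfig* config) {
  auto it = label_by_id.find(kID);
  if (it != label_by_id.end()) config_by_label[it->second] = *config;
}

// Existing callback: the kernel is done, so forget the in-flight entry.
extern "C" void kokkosp_end_parallel_for(const std::uint64_t kID) {
  label_by_id.erase(kID);
}
```

With this shape, Core only has to keep the kernel ID around between the begin callback and the point where the configuration is known. The maps are not thread-safe, which a real tool would have to address.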

If it's static information, perhaps it can be integrated with the PR @dalg24 referenced above.

@romintomasetti
Contributor

With @maartenarnst, we think there is a bigger picture question we should answer before we go on.

What should Kokkos Tools be able to do?

It seems that backend-specific information like launch grid, scratch size and so on can always be extracted using the backend-vendor tools (e.g. ncu for CUDA or rocprof for HIP).

So one question we have is:

What should Kokkos Tools be able to provide? Should it also be able to provide information that Kokkos has (e.g. grid size) but that can be extracted using vendor tools?

In other words:

What is the scope of Kokkos Tools? Should it collect backend-specific information that backend tools can already provide?

In other words:

Is Kokkos Tools a drop-in replacement (e.g. for easy and direct access to kernel info in preliminary benchmark studies), or just a substitute for "missing" features of vendor tools?

For instance, it seems the functor size is not easy to retrieve with ncu (because ncu only "sees" the driver), so it would make sense to provide it with Kokkos Tools. But the launch grid is easy to retrieve with vendor tools, so is it in the scope of Kokkos Tools to provide such details?

@crtrott @dalg24 @maartenarnst @masterleinad @cwpearson @vlkale

@vlkale
Contributor

vlkale commented Mar 26, 2024

@romintomasetti @ALL

tl;dr to answer @romintomasetti's question: I also think launch grid configuration is in scope, assuming a Kokkos user can gain some insight from an apples-to-apples comparison of launch configurations across different vendor tools.

I elaborate below, though we may want to move this elaboration to another Kokkos Tools GitHub issue:


You can take a look at the Kokkos Tools documentation README.md and the wiki for the scope and purpose of Kokkos Tools, but let me summarize and target it in the context of your question:

  • The single most important role of Kokkos Tools is to provide performance-portable tooling for each backend, and this complements the performance-portable programming capabilities of Kokkos core.
  • Kokkos Tools is meant to obtain information (timing) or perform some tooling operation (tuning) from the Kokkos runtime functions which sit above the backend (and that are not exposed to the Kokkos user). You can see this because any kokkosp_... function is an event callback of a Kokkos runtime function, e.g., kokkosp_begin_parallel_for(...) corresponds to the Kokkos runtime function BeginParallelFor(...). Note that a tool corresponding to any particular HPC Software, e.g., PMPI for MPI, should behave this way.
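As an illustration of the callback model described in the list above, here is a minimal sketch of a tool that accumulates the host-side time between the begin and end callbacks per kernel label. The two `kokkosp_` signatures follow the Kokkos Tools callback interface as I understand it. The bookkeeping names (`in_flight`, `seconds_by_label`) are made up for this example, and the code is not thread-safe.

```cpp
#include <chrono>
#include <cstdint>
#include <map>
#include <string>

namespace {
using Clock = std::chrono::steady_clock;

struct KernelRecord {
  std::string name;
  std::uint32_t device_id;
  Clock::time_point start;
};

std::map<std::uint64_t, KernelRecord> in_flight;  // keyed by kernel ID
std::map<std::string, double> seconds_by_label;   // accumulated host time
std::uint64_t next_id = 0;
}  // namespace

// Called by Kokkos before a parallel_for; the tool chooses the kernel ID.
extern "C" void kokkosp_begin_parallel_for(const char* name,
                                           const std::uint32_t devID,
                                           std::uint64_t* kID) {
  *kID = next_id++;
  in_flight[*kID] = {name, devID, Clock::now()};
}

// Called by Kokkos after the same parallel_for, with the ID chosen above.
extern "C" void kokkosp_end_parallel_for(const std::uint64_t kID) {
  auto it = in_flight.find(kID);
  if (it == in_flight.end()) return;  // unknown ID: ignore
  const std::chrono::duration<double> elapsed = Clock::now() - it->second.start;
  seconds_by_label[it->second.name] += elapsed.count();
  in_flight.erase(it);
}
```

Because the label comes from the Kokkos runtime rather than from the backend, the same tool can be used unchanged with the CUDA or HIP backend, for example when built as a shared library and loaded through the `KOKKOS_TOOLS_LIBS` environment variable.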

Consider the problem of demangling Kokkos function names that one would have without Kokkos Tools. The problem is not (just) that reading the function name is hard for a Kokkos user running on one particular backend. I think the more fundamental problem is portable tooling: how does one compare timings of a particular Kokkos::parallel_for run on an AMD GPU (with the HIP backend) with those on an NVIDIA GPU (with the CUDA backend)? Kokkos Tools provides an apples-to-apples (portable) comparison of a labeled Kokkos kernel across the two different vendor GPUs. Otherwise, Kokkos programmers have to spend time doing such a comparison on their own (note how this directly corresponds to the effort of programming and maintaining the CUDA and HIP backends if they didn't have Kokkos).

So, to answer the question: I think launch grid configuration is in scope, but this assumes a Kokkos user can gain some insight from an apples-to-apples comparison of launch configurations across different vendor tools. More generally, any tooling for a Kokkos program is in scope for Kokkos Tools if it has meaning across different Kokkos backends.


4 participants