Initial Kokkos Dev Documentation #3046

Merged
merged 1 commit into from
Oct 18, 2024
182 changes: 175 additions & 7 deletions components/eamxx/docs/developer/kokkos_ekat.md
@@ -1,12 +1,180 @@
# Building Blocks
# Kokkos and EKAT

Here we can discuss EKAT, Kokkos, and all of the highly-technical non-scientific
stuff that makes our heads hurt.
## Kokkos

## Kokkos Views
EAMxx uses Kokkos as a performance-portable abstraction layer for parallel execution and data management across HPC platforms, with backends including OpenMP, CUDA, HIP, and SYCL. Here we give a brief overview of the concepts needed to understand Kokkos in EAMxx. For a more in-depth description, see the [Kokkos wiki](https://kokkos.org/kokkos-core-wiki).

### Kokkos::Device

`Kokkos::Device` is a struct that contains the type definitions for two main Kokkos concepts: the execution space (`Kokkos::Device::execution_space`), the place on-node where parallel operations (for-loops, reductions, etc.) are executed, and the memory space (`Kokkos::Device::memory_space`), the memory location on-node where data is stored. Given your machine architecture, Kokkos defines a default "device" space, given by
```
Kokkos::Device<Kokkos::DefaultExecutionSpace, Kokkos::DefaultExecutionSpace::memory_space>
```
where all performance-critical code should be executed (e.g., on an NVIDIA machine, this device would be the GPU accelerators), and a default "host" space, given by
```
Kokkos::Device<Kokkos::DefaultHostExecutionSpace, Kokkos::DefaultHostExecutionSpace::memory_space>
```
where data can be accessed by the CPU cores and is necessary for I/O interfacing, for example. Currently, these default spaces are the ones used by EAMxx. On CPU-only machines, host and device represent the same space.

### Kokkos Views

The main data structure provided by Kokkos and used in EAMxx is the `Kokkos::View`. This is a multi-dimensional data array that can live in either device or host memory space. Views are necessary when running on GPU architectures, as data structures like `std::vector` and `std::array` will be unavailable on device.

Views are constructed in EAMxx most commonly with the following template and input arguments
```
Kokkos::View<DataType, LayoutType, DeviceType>(const std::string& label, int dim0, int dim1, ...)
```
where

- `DataType`: the scalar type followed by one `*` per run-time dimension. E.g., a 2D view of doubles has `DataType = double**`. Compile-time dimensions can also be defined using `[]`; see the [Kokkos wiki section on views](https://kokkos.org/kokkos-core-wiki/API/core/view/view.html).
- `LayoutType`: mapping of indices into the underlying 1D memory storage. Types are:
    - `LayoutRight` (used in EAMxx): strides increase from the right-most to the left-most dimension, so the right-most dimension is contiguous
    - `LayoutLeft`: strides increase from the left-most to the right-most dimension, so the left-most dimension is contiguous
    - `LayoutStride`: strides can be arbitrary for each dimension
- `DeviceType`: the device (execution and memory space) where the data live; defaults to the default device
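
To make the layout definitions concrete, here is a minimal plain C++ sketch (no Kokkos required) of how a 2D index `(i, j)` could map to a 1D offset under `LayoutRight` and `LayoutLeft`. The helper names `offset_layout_right` and `offset_layout_left` are hypothetical and exist only for illustration; real Views compute this mapping internally, and the sketch ignores any padding.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers for illustration only; not part of the Kokkos API.

// Offset of entry (i, j) in an n0 x n1 array stored row-major, which is the
// mapping LayoutRight describes: the right-most index j has stride 1.
inline std::size_t offset_layout_right(std::size_t i, std::size_t j,
                                       std::size_t n0, std::size_t n1) {
  assert(i < n0 && j < n1);
  return i * n1 + j;
}

// Offset of the same entry stored column-major, which is the mapping
// LayoutLeft describes: the left-most index i has stride 1.
inline std::size_t offset_layout_left(std::size_t i, std::size_t j,
                                      std::size_t n0, std::size_t n1) {
  assert(i < n0 && j < n1);
  return i + j * n0;
}
```

For a 3 x 4 array, entry `(1, 2)` sits at offset `1*4 + 2 = 6` under the `LayoutRight` mapping and at offset `1 + 2*3 = 7` under the `LayoutLeft` mapping. With `LayoutRight`, stepping the right-most index by one moves to the adjacent memory location.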

The following example defines a view "temperature" which has dimensions columns and levels:
```
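using DefaultDevice = Kokkos::Device<Kokkos::DefaultExecutionSpace, Kokkos::DefaultExecutionSpace::memory_space>;
Kokkos::View<double**, Kokkos::LayoutRight, DefaultDevice> temperature("temperature", ncols, nlevs);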
```

### Deep Copy

Kokkos provides `Kokkos::deep_copy(dst, src)`, which copies data between views of the same dimensions, or assigns a scalar value to every entry of a view. Common uses:
```
Kokkos::deep_copy(view0, view1); // Copy all data from view1 into view0
Kokkos::deep_copy(view0, 5); // Set all values of view0 to 5
```
As seen in the next section, we can use `deep_copy()` to copy data between host and device.

### Mirror Views

We will often need a memory allocation that resides on device (for computation), and then need that identical data on host (say, for output). Kokkos has a concept of mirror views, where data can be copied from host to device and vice versa.

Here is an example using the device view `temperature` from above
```
// Allocate a view on host that exactly mirrors the size and layout of the device view
auto host_temperature = Kokkos::create_mirror_view(temperature);

// Copy all data from device to host
Kokkos::deep_copy(host_temperature, temperature);
```
Kokkos also offers an all-in-one option
```
// Note: must hand a host space instance as the first argument
auto host_temperature = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), temperature);
```

### Parallel Execution

The most basic parallel execution pattern used by EAMxx is `Kokkos::parallel_for`, which defines a for-loop with completely independent iterations. `parallel_for` takes an optional label for debugging, an execution policy that defines the iteration range and the location (host or device) where the code runs, and a lambda containing the body of code to be executed. The following execution policies are used in EAMxx:

- `int count`: 1D iteration range `[0, count)`
- `RangePolicy<ExecSpace>(int beg, int end)`: 1D iteration range for indices `[beg, end)`
- `MDRangePolicy<ExecSpace, Kokkos::Rank<N>>(int[N] beg, int[N] end)`: multi-dimensional iteration range `[beg, end)`
- `TeamPolicy<ExecSpace>(int league_size, int team_size, int vector_size)`: 1D iteration over `league_size` work items, each assigned to a thread team of size `team_size`, where each thread has `vector_size` vector lanes. Both `team_size` and `vector_size` can be given as `Kokkos::AUTO` to let Kokkos choose them automatically.

If no `ExecSpace` template is given, the default execution space is used.

For lambda capture, use the `KOKKOS_LAMBDA` macro, which sets the capture automatically based on architecture.

Example using `RangePolicy` to initialize a view
```
Kokkos::View<double**, Kokkos::LayoutRight> temperature("temperature", ncols, nlevs);
Kokkos::parallel_for("Init_temp",
                     Kokkos::RangePolicy<>(0, ncols*nlevs),
                     KOKKOS_LAMBDA (const int idx) {
  // Recover the 2D indices from the flat index (ilev varies fastest)
  const int icol = idx/nlevs;
  const int ilev = idx%nlevs;

  temperature(icol, ilev) = 0;
});
```
Same example with `TeamPolicy`
```
Kokkos::parallel_for("Init_temp",
                     Kokkos::TeamPolicy<>(ncols*nlevs, Kokkos::AUTO, Kokkos::AUTO),
                     KOKKOS_LAMBDA (const Kokkos::TeamPolicy<>::member_type& team) {
  // league_rank() gives the index for this team
  const int icol = team.league_rank()/nlevs;
  const int ilev = team.league_rank()%nlevs;

  temperature(icol, ilev) = 0;
});
```
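
Both examples above recover `(icol, ilev)` from a single flat index. The following is a minimal plain C++ sketch (no Kokkos required) of that arithmetic; the `ColLev` struct and the `unflatten`/`flatten` helpers are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Hypothetical helpers for illustration only; not part of Kokkos or EAMxx.
struct ColLev {
  int icol;
  int ilev;
};

// Split a flat index in [0, ncols*nlevs) into (icol, ilev), with ilev
// varying fastest, as in the lambdas above.
inline ColLev unflatten(int idx, int nlevs) {
  assert(nlevs > 0 && idx >= 0);
  return ColLev{idx / nlevs, idx % nlevs};
}

// Inverse mapping: rebuild the flat index from (icol, ilev).
inline int flatten(ColLev cl, int nlevs) {
  assert(cl.icol >= 0 && cl.ilev >= 0 && cl.ilev < nlevs);
  return cl.icol * nlevs + cl.ilev;
}
```

With `nlevs = 4`, flat index 6 maps to `icol = 1`, `ilev = 2`. Because `ilev` varies fastest, consecutive flat indices touch adjacent entries of a `LayoutRight` view.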

### Hierarchical Parallelism

Using `TeamPolicy`, we can have up to three nested levels of parallelism: team parallelism, thread parallelism, and vector parallelism. The nested levels are invoked within the lambda body using the following execution policies:

- `TeamThreadRange(team, begin, end)`: execute over threads of a team
- `TeamVectorRange(team, begin, end)`: execute over threads and vector lanes of a team
- `ThreadVectorRange(team, begin, end)`: execute over vector lanes of a thread

An example of using these policies
```
Kokkos::View<double***> Q("tracers", ncols, ntracers, nlevs);
Kokkos::parallel_for(Kokkos::TeamPolicy<>(ncols, Kokkos::AUTO),
                     KOKKOS_LAMBDA (const Kokkos::TeamPolicy<>::member_type& team) {
  // One team per column
  const int icol = team.league_rank();

  // Single nested level: levels spread over the team's threads and vector lanes
  Kokkos::parallel_for(Kokkos::TeamVectorRange(team, nlevs), [&](const int ilev) {
    temperature(icol, ilev) = 0;
  });

  // Two nested levels: levels over threads, tracers over vector lanes
  Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nlevs), [&](const int ilev) {
    Kokkos::parallel_for(Kokkos::ThreadVectorRange(team, ntracers), [&](const int iq) {
      Q(icol, iq, ilev) = 0;
    });
  });
});
```

IMPORTANT! Nested policies cannot be used in arbitrary order. `ThreadVectorRange` must be used inside a `TeamThreadRange`, and `TeamVectorRange` must be the only level of nested parallelism.
```
Kokkos::parallel_for(TeamPolicy(...), ... {
  // OK
  Kokkos::parallel_for(TeamThreadRange, ... {

  });

  // OK
  Kokkos::parallel_for(TeamThreadRange, ... {
    Kokkos::parallel_for(ThreadVectorRange, ... {

    });
  });

  // OK
  Kokkos::parallel_for(TeamVectorRange, ... {

  });

  // WRONG, ThreadVectorRange must be nested in TeamThreadRange
  Kokkos::parallel_for(ThreadVectorRange, ... {

  });

  // WRONG, a TeamVectorRange must be the only nested level
  Kokkos::parallel_for(TeamVectorRange, ... {
    Kokkos::parallel_for(ThreadVectorRange, ... {

    });
  });
});
```
Misusing these policies can be very tricky to debug: the code will almost certainly _not_ error out, but race conditions will exist among threads.

## EKAT

### KokkosTypes

### ExeSpaceUtils

### Vectorization: Packs

### Scratch Memory: WorkspaceManager

### Algorithms

## Vectorization: Packs

## Fields and the Field Manager

### Preconditions, Postconditions, and Invariants