Skip to content

Related Benchmarks

Patrick Lavin edited this page Jan 21, 2025 · 45 revisions

Memory-Specific Benchmarks.

STREAM contains 4 kernels.

  • Copyz = y

  • Scale z = βy

  • Add z = y + x

  • Triad z = βy + x


From the authors: "STRIDE is designed to severely stress the memory subsystem on a node using a series of sequential kernels."

Stride computes various linear algebra kernels and covers a lot of the same ground as Stream, while allowing for strided accesses, and even indexed accesses in one case. In the kernels below, x, y, and z are vectors, α is a scalar computed as 1/j on the fly (j being the loop index), and β and γ are constant scalars.

  • cachedot.f computes a dot product xTy

  • striddot.f computes a dot product, but the accesses have uniform stride (for both x and y vectors)

  • cache.f/cache.c computes a triad, y=y+αx.

  • strid3.f/strid3.c computes a triad, but the accesses have uniform stride (for both input and output)

  • vecop.f/vecop.c contains 6 different kernels.

    • v1s1m3 computes a vector scale y = αx

    • v1s2m3 computes y = αy + βx

    • v1s3m3 computes y = α(βy + γx)

    • v2s2m3 computes y += α(y + γx)

    • v2s2m4 computes y += α(z + βx)

    • v1s1i3 computes y[idx] += αx[idx], where idx is an index vector that causes the accesses to be stride 4.


Apex-Max is a multi-node memory benchmark that models access patterns with two parameters, spatial and temporal locality.

The sequential kernel is:

for (i in 0..N)
    j=ind(i)
    compute(src[j:j+L])

where compute is a vector sum reduction, and ind is a precomputed index buffer, which is determined by temporal locality parameter, α. This parameter controls a power-law distribution of indices, meaning an α near 1 will approach true random accesses, and an α near 0 will cause most accesses to be near the base address. The spatial locality parameter is L, which controls the size of the block accessed.

In the parallel case, the data is block distributed across nodes and each node will shift its base address by an amount dependent on its rank, placing the base of the data buffer at its local portion of the array. This means that in the small α case, access will occur in local memory with highest probability.

Here is the parallel kernel:

for (i in 0..N)
    j=ind(i)
    if (src[j:j+L] is remote):
        get(src[j:j+L])
    compute(src[j:j+L])

  • Hammond's simd-memtest has code to tests many different AVX intrinsics, including Gather and Scatter. He has code that tests these instructions with various strides, from stride 1 through stride 64 (elements). Thus the number of testable patterns is far fewer than spatter but the number of instructions that can be tested is far higher. There is also code to test many (all?) other mov-type instructions.

The RandomAccess benchmark reports the number of giga-updates per second (GUPS) that the machine can sustain. It is a single or multi-node benchmark that ships with code for both.

The benchmark description is rather complex so I will provide what I believe is a better explanation here.

Shared Memory Kernel

First, we need to be precise about the size of the data array T[] used during the benchmark. The benchmark designers specify that it shall be size 2n, with n chosen such that an array of size 2n+1 would be strictly more than half of system memory. What they are not clear about is whether this 2n number is in bytes or elements, and the answer, gleaned from the pictures they provide, is that it refers to elements. Therefore, If you have Q bytes of memory, n should be maximized with respect to the constraint 2n*8 ≤ Q/2, as each array element is a 64-bit number.

Together with the data array T[], the benchmark requires a pseudo-random stream of 64-bit integers, A, which is of length 2n+2. Similar to their own website, we will define A[i] to be the i-th element of A, and A[i][63:64-n] to be the n most significant bits of A[i] (that is, the second index accesses the bits of A[i]).

For now, assume we have all of A as that makes describing the kernel easier. In practice, A is generated on the fly.

Now, describing the benchmark kernel is simple:

for i in len(A):
    k = A[i][63:64-n]
    T[k] = T[k] ^ A[i]

where ^ represents exclusive-or.

As you may have realized, this is extremely difficult to parallelize, as the access locations are dependent on the random stream, A. To ease pressure on the developers, the designers of the benchmark allow multiple updates to be performed at the same time (for instance with a vector instruction or unrolled loop) so long as the error rate is less than 1%. The multi-threaded case is analogous to the vectorized case - the programmer need not worry about race conditions, so long as the end result has a sufficiently lower error rate.

Distributed Memory Kernel

The benchmark designers allow for a embarrassingly parallel version of this benchmark to be run but that requires no further explanation here. The cooperative version of this benchmark is more interesting.

To leave room for optimizing the communication, they allow each processor to have a look-ahead of 1024, meaning they can view 1024 elements of A before they are required to perform local updates and send messages regarding updates that must take place on other nodes. Similarly, they can store up to 1024 received messages before they are required to begin processing them. These constraints give programmers room to optimize communication between nodes. Obviously, this can change the results. However, I do not believe there is a similar 1% error rate limit for the distributed version, as even defining such an error rate would be difficult. Thus the 1024 look-ahead limit stands in its place.

The designers provide a two MPI versions of the benchmark, neither of which attempts to optimize communication with coalescing. One version is designed for any number of nodes, and the other requires that the number of nodes be a power of 2. The latter version simplifies target address calculation.

Generating A

So how is A generated? I don't fully understand the mathematics of it, but it is a pseudo-random number generator with low computational cost. The generator they chosen can easily skip to the n-th random number, making it easy to use for a distributed or multi-threaded version of the code. The specifics of how many numbers ahead the code may generate is not specified and indeed is irrelevant. It is only important to generate the indices on the fly because otherwise you would have nowhere to store them -- they would take up 2x your system memory!

Information on generating A can be found here (page 2).


More information will be added on the kernels in these benchmarks soon.

pChase and multichase

Volkov's Kernel

  • Described in Chapter 5

Hopscotch

MEGA-STREAM

Other Benchmarks

This is a collection of benchmarks and benchmark suites that may be of interest to Spatter users. It is intended as a long list of benchmarks that have interesting memory patterns or are relevant to HPC.

Mirovia: A Benchmarking Suite for Modern Heterogeneous Computing

PolyBench: A Benchmark Suite of 30 Numerical Computations

Rodinia: Accelerating Compute-Intensive Applications with Accelerators

The Scalable Heterogeneous Computing (SHOC) benchmark suite (Github)

  • In particular, the level 1 benchmarks BFS, Reduction, Scan, Sort, SpMV, and Triad are of interest.

ECP Proxy Applications

Coral-2 Benchmarks

SpecCPU 2017

HPC Challenge (HPCC)

  • RandomAccess is described above.

Balloon Hash

  • This one isn't a benchmark but I feel it could become one.

List of benchmarks evaluated by the Army Research Labratory

  • This serves as a nice list of HPC benchmarks and benchmark suites. I will make a separate bullet point for each benchmark in this document in the future.

  • Spoiler: They chose SHOC to use for system evaluation, due to it covering a wide range of computations and being written in a single language (OpenCL).

BenchIT

ParRes: Parallel Research Kernels

RAJAPerf

Mantevo

SciML Bench

Clone this wiki locally