This repository contains a collection of heterogeneous computing benchmarks written with CUDA, HIP, SYCL/DPC++, OpenMP-4.5 target offloading, and Kokkos for studying performance, portability, and productivity.
Certain SYCL benchmarks require oneDPL, oneTBB, Syclomatic, or oneMKL interfaces.
Each benchmark falls into a single category. Although this classification is not precise, the arrangement serves as a starting point for users of the benchmark suite. Please see the Reference for more information about each benchmark.
daphne
cmembench, babelstream, memcpy, memtest, randomAccess, shmembench, triad
all-pairs-distance, bsw, ccs, cm, deredundancy, diamond, epistasis, extend2, frna, fsm, ga, logan, minibude, minimap2, nbnxm, nw, pcc, prna, sa, snake
affine, aobench, asmooth, background-subtract, bezier-surface, bilateral, bm3d, boxfilter, cbsfil, car, ced, colorwheel, convolution1D, convolutionSeperable, dct8x8, debayer, depixel, degrid, doh, dpid, egs, face, flame, gabor, gamma-correction, hogbom, mandelbrot, marchCubes, match, medianfilter, morphology, mriQ, ne, perlin, sobel, tonemapping, recursiveGaussian, resize, sad, seam-carving, spm, srad, ssim, stencil1d, stencil3d, surfel, zoom
aes, bitcracker, chacha20, columnarSolver, ecdh, keccaktreehash, merkle, present
atomicAggregate, atomicCAS, atomicCost, atomicIntrinsics, atomicPerf, atomicSystemWide, bitpacking, bscan, bwt, compute-score, contract, dxt1, filter, fpc, histogram, minmax, mpc, mtf, rle, sc, scan, scan2, scan3, segment-reduce
ans, crc64, crs, entropy, jenkins-hash, ldpc, md5hash, murmurhash3
aop, black-scholes, binomial, bonds, libor
aidw, coordinates, geodesic, hausdorff, haversine, stsg
cc, floydwarshall, floydwarshall2, gc, hbc, hungarian, mis, sssp, rsmt
aligned-types, asta, collision, concurrentKernels, conversion, copy, dispatch, ert, interleave, layout, mallocFree, maxFlops, mixbench, mkl-sgemm, nosync, openmp, overlap, p2p, pad, pitch, popcount, prefetch, reverse, ring, saxpy-ompt, shuffle, simpleMultiDevice, tensorAccessor, threadfence, vote, wordcount, zerocopy
accuracy, adam, addBiasResidualLayerNorm, attention, attentionMultiHead, backprop, bincount, bn, channelShuffle, channelSum, clink, concat, crossEntropy, dense-embedding, dropout, dwconv, expdist, flip, gd, gelu, ge-spmm, glu, gmm, gru, kalman, kmc, kmeans, knn, lda, lif, logprob, lr, lrn, mask, matern, maxpool3d, mcpr, meanshift, mf-sgd, mmcsf, mnist, mrc, multinomial, nlll, nonzero, overlay, p4, page-rank, perplexity, pointwise, pool, qtclustering, remap, relu, resnet-kernels, rowwiseMoments, sampling, scel, softmax, stddev, streamcluster, swish, unfold, vol2col, wedford, winograd, word2vec
atan2, complex, cross, determinant, divergence, dp, eigenvalue, f16max, f16sp, frechet, fresnel, fwt, gaussian, geam, gemmEx, hellinger, hmm, idivide, interval, jaccard, jacobi, kurtosis, lanczos, langford, lci, lebesgue, leukocyte, lfib4, log2, lud, michalewicz, matrix-rotate, matrixT, minkowski, mr, norm2, nqueen, ntt, phmm, pnpoly, rfs, romberg, rsc, sddmm-batch, secp256k1, simpleSpmv, slu, spd2s, spgeam, spgemm, spmm, spnnz, sps2d, spsort, sptrsv, thomas, wyllie, zeropoint
mt, permutate, qrg, rng-wallace, sobol, urng
bfs, bsearch, b+tree, grep, keogh, s8n, ss, tsp
extrema, fft, lombscargle, sosfil, zmddft
ace, adv, amgmk, axhelm, bh, bspline-vgh, burger, cooling, ccsd-trpdrv, che, chemv, chi2, clenergy, cmp, cobahh, d2q9_bgk, d3q19_bgk, damage, ddbp, dslash, easyWave, eikonal, fdtd3d, feynman-kac, fhd, fluidSim, gibbs, goulash, gpp, grrt, haccmk, halo-finder, heartwall, heat, heat2d, henry, hexicton, hotspot3D, hwt1d, hypterm, ising, iso2dfd, laplace, laplace3d, lavaMD, lid-driven-cavity, loopback, lsqt, lulesh, mcmd, md, mdh, metropolis, miniFE, minimod, minisweep, miniWeather, multimaterial, myocte, nbody, particle-diffusion, particlefilter, particles, pathfinder, pns, projectile, pso, rainflow, reaction, rsbench, rtm8, rushlarsen, s3d, su3sheath, simplemoc, slit, sparkler, sph, sw4ck, tensorT, testSNAP, tissue, tpacf, tqs, tridiagonal, tsa, vanGenuchten, vmc, wlcpow, wsm5, xlqc, xsbench
bitonic-sort, hybridsort, is, merge, quicksort, radixsort, segsort, sort, sortKV, split, warpsort
inversek2j, rodrigues
Navigate to a benchmark in CUDA (benchmark-cuda) and type
`make ARCH=sm_70 run` // run on an NVIDIA GPU device with compute capability 7.0
Navigate to a benchmark in HIP (benchmark-hip) and type
`make run`
Navigate to a benchmark in SYCL (benchmark-sycl) and type
`make CUDA=yes CUDA_ARCH=sm_70 GCC_TOOLCHAIN="" run` (targeting an NVIDIA GPU)
`make HIP=yes HIP_ARCH=gfx908 run` (targeting an AMD GPU)
`make run` or `make CC=icpx run` (targeting an Intel GPU)
NOTE: `--gcc-toolchain` may be required for the SYCL compiler to select the proper GNU toolchain; otherwise unset GCC_TOOLCHAIN
Navigate to a benchmark in OpenMP (benchmark-omp) and type
`make -f Makefile.nvc run` (targeting NVIDIA GPUs)
`make -f Makefile.aomp run` (targeting AMD GPUs)
`make run` (targeting Intel GPUs)
Users may need to set appropriate values (e.g., `sm_80`, `sm_90`, `gfx906`, `gfx1030`) for their target offloading devices, for example:
`make -f Makefile.nvc SM=cc80 run`
`make -f Makefile.aomp ARCH=gfx906 run`
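The invocations listed above can be collected in a small lookup table when scripting builds. The following is a minimal sketch, not part of the repository: the `MAKE_ARGS` table and the `make_command` helper are hypothetical names, and the architecture values are simply the examples used in this README, so they should be adjusted for the actual device.

```python
# (programming model, GPU vendor) -> extra make arguments, copied from the
# commands listed in this README. Architecture values are examples only.
MAKE_ARGS = {
    ("cuda", "nvidia"): ["ARCH=sm_70"],
    ("hip", "amd"): [],
    ("sycl", "nvidia"): ["CUDA=yes", "CUDA_ARCH=sm_70", "GCC_TOOLCHAIN="],
    ("sycl", "amd"): ["HIP=yes", "HIP_ARCH=gfx908"],
    ("sycl", "intel"): [],
    ("omp", "nvidia"): ["-f", "Makefile.nvc", "SM=cc80"],
    ("omp", "amd"): ["-f", "Makefile.aomp", "ARCH=gfx906"],
    ("omp", "intel"): [],
}


def make_command(model, vendor):
    """Return the argv list that builds and runs one benchmark."""
    return ["make", *MAKE_ARGS[(model, vendor)], "run"]


print(" ".join(make_command("omp", "amd")))
```

The returned list can be passed to `subprocess.run` from inside the benchmark directory.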
The Kokkos build is implemented with CMake. To build, pass the Kokkos installation paths to `cmake`. The generated build also provides a `run` target.
mkdir build
cd build
cmake .. -DDEVICE=ngpu -DKOKKOS_INSTALL_DIR=/opt/kokkos4.1/kokkos/cuda_install/ -DKokkos_DIR=/opt/kokkos4.1/kokkos/cuda_install/lib/cmake/Kokkos/
make
make run
Python scripts help build, run, and gather results from the benchmarks, and a basic script compares results from two different runs.
The script works with a `.json` file containing the benchmark names, a
regex to find the timings in the benchmark output, and optional arguments
that must be provided to the benchmark binary. The `subset.json` file
currently contains roughly 70 of the benchmarks for CUDA, HIP, and SYCL;
more work would be required to support the rest of the benchmarks. In
addition, if there are failing benchmarks in the `.json` list, a text
file can be provided with a list of benchmarks to skip when running all
of them. Benchmarks in the text file can still be run explicitly.
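As a rough illustration of how a per-benchmark regex can pull a timing out of program output, the sketch below parses one hypothetical entry and applies it to a sample output string. The entry layout (`[regex, arguments]`), the sample output line, and the helper name `extract_time` are assumptions made for illustration; they are not the actual schema of `subset.json` or the internals of the script.

```python
import json
import re

# Hypothetical entry: benchmark name -> [timing regex, extra binary arguments].
# This layout is an assumption, not the real subset.json schema.
CONFIG = json.loads("""
{
  "backprop": ["Total time: ([0-9.]+) s", ["65536"]]
}
""")


def extract_time(name, output):
    """Return the first timing captured by the entry's regex, or None."""
    pattern, _args = CONFIG[name]
    match = re.search(pattern, output)
    return float(match.group(1)) if match else None


sample_output = "Verification PASS\nTotal time: 0.125 s\n"
print(extract_time("backprop", sample_output))
```

Returning `None` when the regex does not match lets a caller treat a missing timing as a failed run instead of raising an exception.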
For example, to run all the SYCL benchmarks and then all the CUDA
benchmarks and compare the two:
```
./autohecbench.py sycl -o sycl.csv
./autohecbench.py cuda -o cuda.csv
./autohecbench-compare.py sycl.csv cuda.csv
```
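To make the comparison step concrete, here is a minimal sketch of comparing two result files. It assumes each CSV row has the form `name,time` with matching benchmark names in both files; the real output format of `autohecbench.py` may differ, and `load_times` and `compare` are hypothetical helpers.

```python
import csv
import io


def load_times(csv_text):
    """Map benchmark name -> time from rows of the assumed form `name,time`."""
    rows = csv.reader(io.StringIO(csv_text))
    return {row[0]: float(row[1]) for row in rows if row}


def compare(a_text, b_text):
    """Return {name: time_a / time_b} for benchmarks present in both runs."""
    a, b = load_times(a_text), load_times(b_text)
    return {name: a[name] / b[name] for name in sorted(a.keys() & b.keys())}


# Made-up sample data standing in for sycl.csv and cuda.csv.
sycl_csv = "backprop,0.50\nsobel,2.0\n"
cuda_csv = "backprop,0.25\nsobel,4.0\nmd5hash,1.0\n"
for name, ratio in compare(sycl_csv, cuda_csv).items():
    print(f"{name}: {ratio:.2f}x")
```

Benchmarks that appear in only one file are skipped, so a failed run in one programming model does not break the comparison.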
It can also be used to run a single benchmark:
```
./autohecbench.py backprop-sycl --verbose
```
By default it will run a warmup iteration before running each benchmark,
and it is possible to run the benchmarks multiple times with `-r`:
```
./autohecbench.py backprop-sycl -r 20 -o mandel.csv
```
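The warmup-then-repeat pattern can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation: `measure` is a hypothetical helper, `run_once` stands in for launching a benchmark binary and parsing its timing, and averaging the repeated runs is an assumption about how the results might be aggregated.

```python
import statistics


def measure(run_once, repeat=1, warmup=True):
    """Call run_once() `repeat` times after an optional discarded warmup run."""
    if warmup:
        run_once()  # result discarded; the warmup run is not counted
    times = [run_once() for _ in range(repeat)]
    return statistics.mean(times)


# Stand-in for real runs: the first value plays the role of the warmup run.
fake_times = iter([9.0, 1.0, 2.0, 3.0])
print(measure(lambda: next(fake_times), repeat=3))
```

Discarding the first run keeps one-time costs such as device initialization out of the reported average.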
It also has options to pick the SM version or HIP architecture, along
with a few other parameters.
For Rodinia benchmarks, please download the dataset at http://lava.cs.virginia.edu/Rodinia/download.htm
For other benchmarks, datasets are either included with the repository or can be downloaded through the links to the benchmarks
The programs have not been evaluated on Windows or macOS
The latest Intel SYCL compiler (not the Intel oneAPI toolkit) may be needed for building some SYCL programs successfully
For certain programs, kernel results do not exactly match across these programming languages on the same platform
Not all programs automate the verification of host and device results
Not all CUDA programs have SYCL, HIP or OpenMP equivalents
Not all programs have OpenMP target offloading implementations
Raw performance of any program may be suboptimal
Some programs may take a long time to complete on an integrated GPU
Some host programs contain platform-specific intrinsics, so they may cause compile errors on a PowerPC platform
When double-precision floating-point operations are not supported on certain Intel GPU devices, software FP64 emulation may be enabled
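Because results may differ slightly between programming models, a host/device check generally needs a tolerance rather than exact equality. The following is a minimal sketch of such a check; `results_match` is a hypothetical helper and the tolerance values are assumptions that would need tuning per benchmark and precision.

```python
import math


def results_match(host, device, rel_tol=1e-5, abs_tol=1e-8):
    """True if both sequences have equal length and agree within tolerance."""
    return len(host) == len(device) and all(
        math.isclose(h, d, rel_tol=rel_tol, abs_tol=abs_tol)
        for h, d in zip(host, device)
    )


# Made-up values: the device results differ from the host in the last digits.
host = [1.0, 2.0, 3.0]
device = [1.0000001, 2.0, 2.9999999]
print("PASS" if results_match(host, device) else "FAIL")
```

The absolute tolerance matters for values near zero, where a purely relative comparison would reject tiny differences.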
I appreciate your feedback when any examples don't look right.
Here are some plotted results
Accuracy of prediction (https://pytorch.org/)
Phase-field simulation of dendritic solidification (https://github.com/myousefi2016/Allen-Cahn-CUDA)
Adaptive moment estimation (https://github.com/hpcaitech/ColossalAI)
Combines the bias, residual of previous block and the computation of layer normalization (https://github.com/NVIDIA/FasterTransformer)
Advection (https://github.com/Nek5000/nekBench/tree/master/adv)
AES encrypt and decrypt (https://github.com/Multi2Sim/m2s-bench-amdsdk-2.5-src)
Affine transformation (https://github.com/Xilinx/SDAccel_Examples/tree/master/vision/affine)
Adaptive inverse distance weighting (Mei, G., Xu, N. & Xu, L. Improving GPU-accelerated adaptive IDW interpolation algorithm using fast kNN search. SpringerPlus 5, 1389 (2016))
Alignment specification for variables of structured types (http://docs.nvidia.com/cuda/cuda-samples/index.html)
All-pairs distance calculation (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2910913/)
The relax kernel in the AMGmk benchmark (https://asc.llnl.gov/CORAL-benchmarks/Micro/amgmk-v1.0.tar.gz)
Asymmetric numeral systems decoding (https://github.com/weissenberger/multians)
A lightweight ambient occlusion renderer (https://code.google.com/archive/p/aobench)
American options pricing (https://github.com/NVIDIA-developer-blog)
Adaptive smoothing (http://www.hcs.harvard.edu/admiralty/)
Array of structure of tiled array for data layout transposition (https://github.com/chai-benchmarks/chai)
Approximate the atan2 math function (https://github.com/cms-patatrack/pixeltrack-standalone)
Atomic aggregate (https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/)
Atomic add, subtract, min, max, AND, OR, XOR (http://docs.nvidia.com/cuda/cuda-samples/index.html)
64-bit atomic add, min, and max with compare and swap (https://github.com/treecode/Bonsai/blob/master/runtime/profiling/derived_atomic_functions.h)
Evaluate the cost of atomic add operations
Evaluate atomic add operations over global and shared memory (https://stackoverflow.com/questions/22367238/cuda-atomic-operation-performance-in-different-scenarios)
Integer sum reduction with atomics (https://github.com/ROCm-Developer-Tools/HIP-Examples/tree/master/reduction)
System-wide atomics (http://docs.nvidia.com/cuda/cuda-samples/index.html)
Ham, T.J., et al., 2020, February. A^3: Accelerating Attention Mechanisms in Neural Networks with Approximation. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) (pp. 328-341). IEEE.
Implementation of multi-head attention (https://github.com/IrishCoffee/cudnnMultiHeadAttention)
Helmholtz matrix-vector product (https://github.com/Nek5000/nekBench/tree/master/axhelm)
Measure memory transfer rates for copy, add, mul, triad, dot, and nstream (https://github.com/UoB-HPC/BabelStream)
Background subtraction (Alptekin Temizel et al. Experiences on Image and Video Processing with CUDA and OpenCL, In Applications of GPU Computing Series, GPU Computing Gems Emerald Edition, Morgan Kaufmann, 2011, Pages 547-567)
Backpropagation in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
The Bezier surface (https://github.com/chai-benchmarks/chai)
The breadth-first search in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Simulate the gravitational forces in a star cluster using the Barnes-Hut n-body algorithm (https://userweb.cs.txstate.edu/~burtscher/research/ECL-BH/)
Bilateral filter (https://github.com/jstraub/cudaPcl)
Count the number of values that fall into each bin (https://pytorch.org/)
Evaluate fair call price for a given set of European options under binomial model (https://docs.nvidia.com/cuda/cuda-samples/index.html)
Open-source password cracking tool for storage devices (https://github.com/e-ago/bitcracker.git)
Bitonic sorting (https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/)
A bit-level operation that aims to reduce the number of bits required to store each value (https://github.com/NVIDIA/nvcomp)
The Black-Scholes simulation (https://github.com/cavazos-lab/FinanceBench)
Block-matching and 3D filtering method for image denoising (https://github.com/DawyD/bm3d-gpu)
Bayesian network learning (https://github.com/OSU-STARLAB/UVM_benchmark/blob/master/non_UVM_benchmarks)
Fixed-rate bond with flat forward curve (https://github.com/cavazos-lab/FinanceBench)
Box filtering (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Binary scan in a block (Harris, M. and Garland, M., 2012. Optimizing parallel prefix operations for the Fermi architecture. In GPU Computing Gems Jade Edition (pp. 29-38). Morgan Kaufmann.)
Classic and vectorizable binary search algorithms (https://www.sciencedirect.com/science/article/abs/pii/S0743731517302836)
B-spline value, gradient, and Hessian (https://github.com/QMCPACK/miniqmc/blob/OMP_offload/src/OpenMP/main.cpp)
GPU accelerated Smith-Waterman for performing batch alignments (https://github.com/mgawan/ADEPT)
2D Burger's equation (https://github.com/soumyasen1809/OpenMP_C_12_steps_to_Navier_Stokes)
Burrows-Wheeler transform (https://github.com/jedbrooke/cuda_bwt)
B+Tree in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Content adaptive resampling (https://github.com/sunwj/CAR)
Cubic b-spline filtering (https://github.com/DannyRuijters/CubicInterpolationCUDA)
Connected components (https://userweb.cs.txstate.edu/~burtscher/research/ECL-CC/)
Condition-dependent Correlation Subgroups (https://github.com/abhatta3/Condition-dependent-Correlation-Subgroups-CCS)
The CCSD tengy kernel, which was converted from Fortran to C by Jeff Hammond, in NWChem (https://github.com/jeffhammond/nwchem-ccsd-trpdrv)
Canny edge detection (https://github.com/chai-benchmarks/chai)
The CFD solver in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
ChaCha20 stream cipher (https://github.com/983/ChaCha20)
Divide the channels in a tensor into groups and rearrange them (https://pytorch.org/)
Per-channel sum of values (https://pytorch.org/)
Phase-field simulation of spinodal decomposition using the Cahn-Hilliard equation (https://github.com/myousefi2016/Cahn-Hilliard-CUDA)
Complex Hermitian matrix-vector multiplication (https://repo.or.cz/ppcg.git)
The Chi-square 2-df test (https://web.njit.edu/~usman/courses/cs677_spring19/)
Direct coulomb summation kernel (http://www.ks.uiuc.edu/Training/Workshop/GPU_Aug2010/resources/clenergy.tar.gz)
Compact LSTM inference kernel (http://github.com/UCLA-VAST/CLINK)
Gene expression connectivity mapping (https://pubmed.ncbi.nlm.nih.gov/24112435/)
The constant memory microbenchmark (https://github.com/ekondis/gpumembench)
Seismic processing using the classic common midpoint (CMP) method (https://github.com/hpg-cepetro/IPDPS-CRS-CMP-code)
Simulation of Random Network of Hodgkin and Huxley Neurons with Exponential Synaptic Conductances (https://dl.acm.org/doi/10.1145/3307339.3343460)
Check collision of duplicate values (https://github.com/facebookarchive/fbcuda)
Color encoding of flow vectors (https://vision.middlebury.edu/flow/code/flow-code/colorcode.cpp)
Dimitrov, M. and Esslinger, B., 2021. CUDA Tutorial--Cryptanalysis of Classical Ciphers Using Modern GPUs and CUDA. arXiv preprint arXiv:2103.13937.
Complex number arithmetic (https://github.com/tpn/cuda-samples/blob/master/v8.0/include/cuComplex.h)
Document filtering (https://www.intel.com/content/www/us/en/programmable/support/support-resources/design-examples/design-software/opencl/compute-score.html)
Concatenation of two tensors (https://github.com/bytedance/lightseq)
Demonstrate the use of streams for concurrent execution of several kernels with dependency on a device (https://github.com/NVIDIA/cuda-samples/tree/master/Samples/0_Introduction/concurrentKernels)
Second-order tensor aggregation with an adjacency matrix (https://github.com/HyTruongSon/GraphFlow)
Conversion among common data types (intel/llvm#7195)
1D convolution (Kirk, D.B. and Wen-Mei, W.H., 2016. Programming massively parallel processors: a hands-on approach. Morgan Kaufmann)
Convolution filter of a 2D image with separable kernels (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Primordial hydrogen/helium cooling curve (https://github.com/cholla-hydro/cholla)
Coordinates (latitude and longitude) transformation using the STL transform (https://github.com/rapidsai/cuspatial)
Memory copies using direct, zero, and managed memory accesses
64-bit cyclic-redundancy check (https://xgitlab.cels.anl.gov/hfinkel/hpcrc64/-/wikis/home)
Cross product of two 2D tensors (https://pytorch.org/)
Cross entropy loss in the backward phase (intel/llvm#5969)
Cauchy Reed-Solomon encoding (https://www.comp.hkbu.edu.hk/~chxw/gcrs.html)
A lattice Boltzmann scheme with a 2D grid, 9 velocities, and Bhatnagar-Gross-Krook collision step (https://github.com/WSJHawkins/ExploringSycl)
Lattice Boltzmann simulation framework based on C++ parallel algorithms (https://gitlab.com/unigehpfs/stlbm)
The Darmstadt automotive parallel heterogeneous benchmark suite (https://github.com/esa-tu-darmstadt/daphne-benchmark)
The continuum level damage in a peridynamic body (https://github.com/alan-turing-institute/PeriPy)
Discrete Cosine Transform (DCT) and inverse DCT for 8x8 blocks (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Distance-driven backprojection (https://github.com/LAVI-USP/DBT-Reconstruction)
Convert a Bayer mosaic raw image to RGB (https://github.com/GrokImageCompression/latke)
Radio astronomy degridding (https://github.com/NVIDIA/SKA-gpu-degrid)
Dense embedding add operations (https://pytorch.org/)
Check connectivity and remove crosses in depixelization of pixel art (https://github.com/yzhwang/depixelization)
Gene sequence de-redundancy is a precise gene sequence de-redundancy software that supports heterogeneous acceleration (https://github.com/JuZhenCS/gene-sequences-de-redundancy)
Calculate the determinant of a matrix using library-based decomposition and strided reduction (https://github.com/OrangeOwlSolutions/Linear-Algebra)
Mask sequences kernel in Diamond (https://github.com/bbuchfink/diamond)
Kernel dispatch rate and latency (https://github.com/ROCm-Developer-Tools/HIP-CPU)
Barrel distortion (https://github.com/Cuda-Chen/barrel-distortion-cuda)
CPU and GPU divergence test (https://github.com/E3SM-Project/divergence_cmdvse)
Determinant of a Hessian matrix (https://github.com/rapidsai/cucim)
Dot product (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Detail-preserving image downscaling (https://github.com/mergian/dpid)
Randomly zero some elements of the input array with a probability using samples from a uniform distribution (https://github.com/pytorch/)
A Lattice QCD Dslash operator proxy application derived from MILC (https://gitlab.com/NERSC/nersc-proxies/milc-dslash)
DXT1 compression (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Depth-wise convolution (https://pytorch.org/)
Simulation of tsunami generation and propagation in the context of early warning (https://git.gfz-potsdam.de/id2/geoperil/easyWave)
Elliptic curve Diffie-Hellman key exchange (https://github.com/jaw566/ECDH)
Parallel implementation of EGSnrc's photon transport mechanism (https://jonaslippuner.com/research/cuda-egs/)
Calculate the eigenvalues of a tridiagonal symmetric matrix (https://github.com/OpenCL/AMD_APP_samples)
Fast iterative method for Eikonal equations on structured volumes (https://github.com/SCIInstitute/StructuredEikonal)
Compute the entropy for each point of a 2D matrix using a 5x5 window (https://lan-jing.github.io/parallel%20computing/system/entropy/)
Epistasis detection (https://github.com/rafatcampos/bio-epistasis-detection)
Modified microkernel in the empirical roofline tool (https://bitbucket.org/berkeleylab/cs-roofline-toolkit/src/master/)
Compute the Bhattacharya cost function (https://github.com/benvanwerkhoven/kernel_tuner)
Smith-Waterman (SW) extension in the Burrows-Wheeler Aligner for short-read alignment (https://github.com/lh3/bwa)
Find local maxima (https://github.com/rapidsai/cusignal/)
Compute the maximum of half-precision floating-point numbers using bit operations (https://x.momo86.net/en?p=113)
Half-precision scalar product (https://docs.nvidia.com/cuda/cuda-samples/index.html)
Face detection using the Viola-Jones algorithm (https://sites.google.com/site/5kk73gpu2012/assignment/viola-jones-face-detection)
FDTD-3D (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Use of Feynman-Kac algorithm to solve Poisson's equation in a 2D ellipse (https://people.sc.fsu.edu/~jburkardt/c_src/feynman_kac_2d/feynman_kac_2d.html)
A case study: advanced magnetic resonance imaging reconstruction (https://ict.senecacollege.ca/~gpu610/pages/content/cudas.html)
Filtering by a predicate (https://developer.nvidia.com/blog/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/)
FFT in the SHOC benchmark suite (https://github.com/vetter/shoc/)
Fractal flame (http://gpugems.hwu-server2.crhc.illinois.edu/)
Tensor flip (https://pytorch.org/)
Floyd-Warshall Pathfinding sample (https://github.com/ROCm-Developer-Tools/HIP-Examples/blob/master/HIP-Examples-Applications/FloydWarshall/)
Fast Floyd-Warshall for all-pairs-shortest paths (https://userweb.cs.txstate.edu/~burtscher/research/ECL-APSP/)
2D Fluid Simulation using the Lattice-Boltzmann method (https://github.com/OpenCL/AMD_APP_samples)
Frequent pattern compression (Base-delta-immediate compression: practical data compression for on-chip caches. In Proceedings of the 21st international conference on Parallel architectures and compilation techniques (pp. 377-388). ACM.)
Compute the discrete Frechet distance between two curves specified by discrete ordered points in n-dimensional space (https://github.com/mp4096/discrete-frechet-distance)
Fresnel integral (http://www.mymathlib.com/functions/fresnel_sin_cos_integrals.html)
Accelerate the fill step in predicting the lowest free energy structure and a set of suboptimal structures (http://rna.urmc.rochester.edu/Text/Fold-cuda.html)
A GPU-accelerated implementation of a genetic algorithm for finding well-performing finite-state machines for predicting binary sequences (https://userweb.cs.txstate.edu/~burtscher/research/FSM_GA/)
Fast Walsh transformation (http://docs.nvidia.com/cuda/cuda-samples/index.html)
Gene alignment (https://github.com/NUCAR-DEV/Hetero-Mark)
Gabor filter function (https://github.com/fercer/gaborfilter)
Gamma correction (https://github.com/intel/BaseKit-code-samples)
Gaussian elimination in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Graph coloring via shortcutting (https://userweb.cs.txstate.edu/~burtscher/research/ECL-GC/)
Gradient descent (https://github.com/CGudapati/BinaryClassification)
Matrix transpose using the BLAS-extension functions (https://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-geam)
Apply the Gaussian error linear units function (https://github.com/NVIDIA/FasterTransformer)
Geodesic distance (https://www.osti.gov/servlets/purl/1576565)
General-purpose sparse matrix-matrix multiplication on GPUs for graph neural networks (https://github.com/hgyhungry/ge-spmm)
General matrix-matrix multiplication on GPUs (https://godweiyang.com/2021/08/24/gemm/)
Implementation of a Gibbs-Metropolis sampling algorithm (https://github.com/arendsee/cuda-gibbs-example)
The gated linear unit function (https://pytorch.org/docs/stable/generated/torch.nn.GLU.html)
Expectation maximization with Gaussian mixture models (https://github.com/Corv/CUDA-GMM-MultiGPU)
Simulate the dynamics of a small part of a cardiac myocyte, specifically the fast sodium m-gate (https://github.com/LLNL/goulash)
General Plasmon Pole Self-Energy Simulation in the BerkeleyGW software package (https://github.com/NERSC/gpu-for-science-day-july-2019)
Regular expression matching (https://github.com/bkase/CUDA-grep)
General-relativistic radiative transfer calculations coupled with the calculation of geodesics in the Kerr spacetime (https://github.com/hungyipu/Odyssey)
Forward operations of a gated recurrent unit (https://pytorch.org/)
The HACC microkernel (https://asc.llnl.gov/CORAL-benchmarks/#haccmk)
Parallel halo finder operation (https://gem5.googlesource.com/public/gem5-resources)
Hausdorff distance (https://github.com/arohamirai/Hausdorff-Distance-Match)
Haversine distance (https://github.com/rapidsai/cuspatial)
Hybrid methods for parallel betweenness centrality (https://github.com/Adam27X/hybrid_BC)
Heart Wall in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
The heat equation solver (https://github.com/UoB-HPC/heat_sycl)
Apply the discrete 2D Laplacian operation a number of times to a given vector (https://github.com/gpucw/cuda-lapl)
Hellinger distance (https://github.com/rapidsai/raft)
Henry coefficient (https://github.com/CorySimon/HenryCoefficient)
A Portable and Scalable Solver-Framework for the Hierarchical Equations of Motion (https://github.com/noma/hexciton_benchmark)
Histogram (http://github.com/NVlabs/cub/tree/master/experimental)
Hidden Markov model (http://developer.download.nvidia.com/compute/DevZone/OpenCL/Projects/oclHiddenMarkovModel.tar.gz)
The benchmark implements the kernel of the Hogbom Clean deconvolution algorithm (https://github.com/ATNF/askap-benchmarks/)
Hotspot3D in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Fast block distributed Implementation of the Hungarian Algorithm (https://github.com/paclopes/HungarianGPU)
1D Haar wavelet transformation (https://github.com/OpenCL/AMD_APP_samples)
Hybridsort in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
A routine from the ExpCNS Compressible Navier-Stokes mini-application (https://github.com/pssrawat/ppopp-artifact)
Fast integer divide (https://github.com/milakov/int_fastdiv)
Interleaved and non-interleaved global memory accesses (Shane Cook. 2012. CUDA Programming: A Developer's Guide to Parallel Computing with GPUs (1st. ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.)
Interval arithmetic operators example (https://docs.nvidia.com/cuda/cuda-samples/index.html)
The inverse kinematics for 2-joint arm (http://axbench.org/)
Integer sort (https://github.com/GMAP/NPB-GPU)
Monte-Carlo simulations of 2D Ising Model (https://github.com/NVIDIA/ising-gpu/)
Isotropic 2-dimensional Finite Difference (https://github.com/intel/HPCKit-code-samples/)
Jaccard index for a sparse matrix (https://github.com/rapidsai/nvgraph/blob/main/cpp/src/jaccard_gpu.cu)
Jacobi relaxation (https://github.com/NVIDIA/multi-gpu-programming-models/blob/master/single_gpu/jacobi.cu)
Bob Jenkins lookup3 hash function (https://android.googlesource.com/platform/external/jenkins-hash/+/75dbeadebd95869dd623a29b720678c5c5c55630/lookup3.c)
Kalman filter (https://github.com/rapidsai/cuml/)
A Keccak tree hash function (http://sites.google.com/site/keccaktreegpu/)
Keogh's lower bound (https://github.com/gravitino/cudadtw)
Kernel matrix compute (https://github.com/MKLab-ITI/CUDA)
K-means in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
K-nearest neighbor (https://github.com/OSU-STARLAB/UVM_benchmark/blob/master/non_UVM_benchmarks)
Compute the kurtosis of two variables (https://github.com/d-d-j/ddj_store)
Lanczos tridiagonalization (https://github.com/linhr/15618)
Count planar Langford sequences (https://github.com/boris-dimitrov/z4_planar_langford_multigpu)
A Laplace solver using red-black Gauss-Seidel with SOR solver (https://github.com/kyleniemeyer/laplace_gpu)
Solve Laplace equation on a regular 3D grid (https://github.com/gpgpu-sim/ispass2009-benchmarks)
LavaMD in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
AoS and SoA comparison (https://github.com/OpenCL/AMD_APP_samples)
Landau collisional integral (https://github.com/vskokov/Landau_Collisional_Integral)
Latent Dirichlet allocation (https://github.com/js1010/cusim)
QC-LDPC decoding (https://github.com/robertwgh/cuLDPC)
Estimate the Lebesgue constant (https://people.math.sc.edu/Burkardt/c_src/lebesgue/lebesgue.html)
Leukocyte in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Marsa-LFIB4 pseudorandom number generator (https://bitbucket.org/przemstp/gpu-marsa-lfib4/src/master/)
A LIBOR market model Monte Carlo application (https://people.maths.ox.ac.uk/~gilesm/cuda_old.html)
GPU solver for a 2D lid-driven cavity problem (https://github.com/kyleniemeyer/lid-driven-cavity_gpu)
A leaky integrate-and-fire neuron model (https://github.com/e2crawfo/hrr-scaling)
A simple lock-free hash table (https://github.com/nosferalatu/SimpleGPUHashTable)
Approximate the log2 math function (https://adacenter.org/sites/default/files/milspec/Transcendentals.zip)
GPU-based X-Drop alignment (https://github.com/albertozeni/LOGAN)
Convert logits to probabilities (https://github.com/NVIDIA/FasterTransformer)
Lomb-Scargle periodogram (https://github.com/rapidsai/cusignal/)
Lookback option simulation (https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-37-efficient-random-number-generation-and-application)
Linear regression (https://github.com/ChenyangZhang-cs/iMLBench)
Local response normalization (intel/llvm#8292)
Linear scaling quantum transport (https://github.com/brucefan1983/gpuqt)
LU decomposition in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Livermore unstructured Lagrangian explicit shock hydrodynamics (https://github.com/LLNL/LULESH)
Memory allocation and deallocation samples (https://github.com/ROCm-Developer-Tools/HIP/)
The Mandelbrot set in the HPCKit code samples (https://github.com/intel/HPCKit-code-samples/)
A practical isosurfacing algorithm for large data on many-core architectures (https://github.com/LRLVEC/MarchingCubes)
Masking operators in PyTorch (https://pytorch.org/)
Compute matching scores for two 16K 128D feature points (https://github.com/Celebrandil/CudaSift)
Sum using the Matern kernel (https://tbetcke.github.io/hpc_lecture_notes/rbf_evaluation.html)
In-place matrix rotation
Matrix transposition (https://docs.nvidia.com/cuda/cuda-samples/index.html)
3D Maxpooling (https://github.com/nachiket/papaa-opencl)
Maximum floating-point operations in the SHOC benchmark suite (https://github.com/vetter/shoc/)
Monte Carlo and Molecular Dynamics Simulation Package (https://github.com/khavernathy/mcmd)
Multi-category probit regression (https://github.com/berkeley-scf/gpu-workshop-2016)
Molecular dynamics function in the SHOC benchmark suite (https://github.com/vetter/shoc/)
Simple multiple Debye-Huckel kernel in fast molecular electrostatics algorithms on GPUs (http://gpugems.hwu-server2.crhc.illinois.edu/)
MD5 hash function in the SHOC benchmark suite (https://github.com/vetter/shoc/)
Mean shift clustering (https://github.com/w00zie/mean_shift)
Two-dimensional 3x3 median filter of RGBA image (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Merkle tree construction using rescue prime hash (https://github.com/itzmeanjan/ff-gpu)
A benchmark for memory copy from a host to a device
Selected memory tests (https://github.com/ComputationalRadiationPhysics/cuda_memtest)
Merge two unsorted arrays into a sorted array (https://github.com/ogreen/MergePathGPU)
Simulation of an ensemble of replicas with Metropolis–Hastings computation in the trial run (https://github.com/crinavar/trueke)
Matrix factorization with stochastic gradient descent (https://github.com/cuMF/cumf_sgd)
Evaluate the Michalewicz function (https://www.sfu.ca/~ssurjano/michal.html)
MiniFE Mantevo mini-application (https://github.com/Mantevo/miniFE)
The core computation of the Bristol University Docking Engine (BUDE) (https://github.com/UoB-HPC/miniBUDE)
Hardware acceleration of long read pairwise overlapping in genome sequencing (https://github.com/UCLA-VAST/minimap2-acceleration)
A finite difference solver for seismic modeling (https://github.com/rsrice/gpa-minimod-artifacts)
A deterministic Sn radiation transport miniapp (https://github.com/wdj/minisweep)
A parallel programming training mini-app simulating weather-like flows (https://github.com/mrnorman/miniWeather)
Minkowski distance (https://github.com/rapidsai/raft)
Find the smallest and largest elements (https://github.com/rapidsai/cuspatial)
Maximal independent set (http://www.cs.txstate.edu/~burtscher/research/ECL-MIS/)
A read-only version of mixbench (https://github.com/ekondis/mixbench)
Single-precision floating-point matrix multiply using Intel® Math Kernel Library
MTTKRP kernel using mixed-mode CSF (https://github.com/isratnisa/MM-CSF)
Chapter 4.2: Converting CUDA CNN to HIP (https://developer.amd.com/wp-content/resources)
Morphological operators: Erosion and Dilation (https://github.com/yszheda/CUDA-Morphology)
The Miller-Rabin primality test (https://github.com/wizykowski/miller-rabin)
Computation of a matrix Q used in a 3D magnetic resonance image reconstruction (https://github.com/abduld/Parboil/blob/master/benchmarks/mri-q/DESCRIPTION)
Margin ranking criterion operation (https://pytorch.org)
Mersenne Twister (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Move-to-front transform (https://github.com/bzip2-cuda/bzip2-cuda)
Multi-material simulations (https://github.com/reguly/multimaterial)
Multinomial sampling (https://pytorch.org)
MurmurHash3 yields a 128-bit hash value (https://github.com/aappleby/smhasher/wiki/MurmurHash3)
Myocyte in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Computing non-bonded pair interactions (https://manual.gromacs.org/current/doxygen/html-full/page_nbnxm.xhtml)
N-body simulation (https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/DPC%2B%2B/N-BodyMethods/Nbody)
Normal estimation in 3D (https://github.com/PointCloudLibrary/pcl)
The negative log likelihood 2D loss reduction (https://pytorch.org/)
Work-efficient parallel non-maximum suppression kernels (https://github.com/hertasecurity/gpu-nms)
Nearest neighbor in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Return a tensor containing the indices of all non-zero elements of input (https://pytorch.org/)
Compute the Euclidean norm of a vector (https://docs.nvidia.com/cuda/cublas)
Stream synchronization in Thrust and oneDPL (https://github.com/NVIDIA/thrust/tree/main/examples)
N-Queens (https://github.com/tcarneirop/ChOp)
Number-theoretic transform (https://github.com/vernamlab/cuHE)
Needleman-Wunsch in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Multi-threading over a single device (https://docs.nvidia.com/cuda/cuda-samples/index.html)
Overlap data copies with compute kernels (https://docs.nvidia.com/cuda/cuda-samples/index.html)
Overlay grid in the DetectNet (https://github.com/dusty-nv/jetson-inference)
Simple peer-to-peer accesses (https://docs.nvidia.com/cuda/cuda-samples/index.html)
PointPillar post-processing (https://github.com/NVIDIA-AI-IOT/CUDA-PointPillars)
In-place padding (https://github.com/chai-benchmarks/chai)
PageRank (https://github.com/Sable/Ostrich/tree/master/map-reduce/page-rank)
Particle diffusion in the HPCKit code samples (https://github.com/intel/HPCKit-code-samples/)
Particle Filter in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Particles collision simulation (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
PathFinder in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Compute pairwise Pearson’s correlation coefficient (https://github.com/pcdslab/Fast-GPU-PCC)
Perlin noise generator (https://github.com/silverweed/perlin_cuda)
Parallel implementation of the permutation testing in NIST SP 800-90B (https://github.com/yeah1kim/yeah_GPU_SP800_90B_IID)
Perplexity search (https://github.com/rapidsai/cuml/)
Pair hidden Markov model (https://github.com/lienliang/Pair_HMM_forward_GPU)
Pitched memory allocation (https://docs.nvidia.com/cuda/cuda-c-programming-guide)
Solve the point-in-polygon problem using the crossing number algorithm (https://github.com/benvanwerkhoven/kernel_tuner)
Petri-net simulation (https://github.com/abduld/Parboil/)
Fused point-wise operations (https://developer.nvidia.com/blog/optimizing-recurrent-neural-networks-cudnn-5/)
Pooling layer (https://github.com/PaddlePaddle/Paddle)
Implementations of population count (Jin, Z. and Finkel, H., 2020, May. Population Count on Intel® CPU, GPU and FPGA. In 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (pp. 432-439). IEEE.)
Concurrent managed accesses (https://github.com/ROCm-Developer-Tools/HIP/)
Lightweight cryptography (https://github.com/bozhu/PRESENT-C/blob/master/present.h)
Calculate a partition function for a sequence, which can be used to predict base pair probabilities (http://rna.urmc.rochester.edu/Text/partition-cuda.html)
Projectile motion is a program that implements a ballistic equation (https://github.com/intel/BaseKit-code-samples)
A modified implementation of particle swarm optimization using Levy function (https://github.com/wiseodd/cuda-pso, https://github.com/chensohg/GPU_CUDA_PSO)
Niederreiter quasirandom number generator and Moro's Inverse Cumulative Normal Distribution generator (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Quality threshold clustering (https://github.com/vetter/shoc/)
A parallel radix sort (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
A library-based sort-by-key (https://github.com/NVIDIA/cuda-samples)
A fast rainflow cycle counting algorithm (https://github.com/carlos-souto/rainflow-cycle-counting)
Random memory access (https://icl.cs.utk.edu/projectsfiles/hpcc/RandomAccess/)
3D Gray-Scott reaction diffusion (https://github.com/ifilot/wavefuse)
2-dimensional Gaussian Blur Filter of RGBA image (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Map unique values to indices (https://pytorch.org/)
Rectified linear unit (https://github.com/tensorflow)
Resize images (https://github.com/opencv/)
ResNet kernels for inference (https://github.com/xuqiantong/CUDA-Winograd)
Reverse an input array of size 256 using shared memory
Reproducible floating sum (https://github.com/facebookarchive/fbcuda)
Non-P2P transfers in a circular manner among GPU devices
Computes a run-length encoding of a sequence (https://github.com/NVIDIA/cub)
Random number generation using the Wallace algorithm (https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-37-efficient-random-number-generation-and-application)
Rodrigues' rotation (https://github.com/DIDSR/VICTRE_MCGPU)
Romberg's method (https://github.com/SwayambhuNathRay/Parallel-Romberg-Integration)
Compute row-wise moments (https://pytorch.org/)
A proxy application for full neutron transport applications like OpenMC that supports multipole cross section representations (https://github.com/ANL-CESAR/RSBench/)
Random sample consensus based on task partitioning (https://github.com/chai-benchmarks/chai)
Rectilinear Steiner minimum tree (https://userweb.cs.txstate.edu/~burtscher/research/SFP/)
A structured-grid application in the oil and gas industry (https://github.com/ROCm-Developer-Tools/HIP-Examples/tree/master/rtm8)
An ODE solver using the Rush-Larsen scheme (https://bitbucket.org/finsberg/gotran/src/master)
Chemical rates computation used in the simulation of combustion (https://github.com/vetter/shoc/)
Stacked 8-neighborhood search finds nearest neighbors in each of the eight octants partitioned by ordering of three coordinates (https://github.com/MVIG-SJTU/pointSIFT/tree/master)
Dynamic parallel skew algorithm for suffix array on GPU (https://github.com/gmzang/Parallel-Suffix-Array-on-GPU)
Naive template matching with SAD (https://github.com/gholomia/CTMC)
Shapley sampling values explanation method (https://github.com/rapidsai/cuml)
Perform the SAXPY operation on host and device (https://github.com/pc2/OMP-Offloading)
Stream compaction (https://github.com/chai-benchmarks/chai)
Scan with bank-conflict-aware optimization (https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda)
Scan a large array (https://github.com/OpenCL/AMD_APP_samples)
Scan a large array using vendors' library (https://github.com/OpenCL/AMD_APP_samples)
Sigmoid cross-entropy with logits (https://pytorch.org/)
Find the top scores (https://github.com/opencv/)
Batched dense matrix - dense matrix multiplication into sparse matrix (https://docs.nvidia.com/cuda/cusparse/index.html#cusparsesddmm)
Part of BIP39 solver (https://github.com/johncantrell97/bip39-solver-gpu)
Seam carving (https://github.com/pauty/CUDA_seam_carving)
Segmented reduction using Thrust and oneDPL (https://github.com/c3sr/tcu_scope)
Fast segmented sort on a GPU (https://github.com/Funatiq/bb_segsort)
Plasma sheath simulation with the particle-in-cell method (https://www.particleincell.com/2016/cuda-pic/)
The shared local memory microbenchmark (https://github.com/ekondis/gpumembench)
Shuffle instructions with subgroup sizes of 8, 16, and 32 (https://github.com/cpc/hipcl/tree/master/samples/4_shfl)
Set intersection with matrix multiply (https://github.com/chribell/set_intersection)
The attenuation of neutron fluxes across an individual geometrical segment (https://github.com/ANL-CESAR/SimpleMOC-kernel)
Execute kernels on multiple devices (https://docs.nvidia.com/cuda/cuda-samples/index.html)
Simple sparse matrix vector multiply (https://github.com/passlab/CUDAMicroBench)
Slit experiment to compute diffraction patterns (https://github.com/bamaratunga/cuda_fft.git)
Sparse LU factorization (https://github.com/sheldonucr/GLU_public)
Genome pre-alignment filtering (https://github.com/CMU-SAFARI/SneakySnake)
Sobel filter (https://github.com/OpenCL/AMD_APP_samples)
Sobol quasi-random generator (https://docs.nvidia.com/cuda/cuda-samples/index.html)
The softmax function (https://github.com/pytorch/glow/tree/master/lib/Backends/OpenCL)
Radix sort in the SHOC benchmark suite (https://github.com/vetter/shoc/)
Sort by key using Thrust and oneDPL
Second-order IIR digital filtering (https://github.com/rapidsai/cusignal/)
A miniapp for the CoMet comparative genomics application (https://github.com/wdj/sparkler)
The simple n^2 SPH simulation (https://github.com/olcf/SPH_Simple)
The split operation in radix sort (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Image registration calculations for the statistical parametric mapping (SPM) system (http://mri.ee.ntust.edu.tw/cuda/)
Conversion of a dense matrix to a sparse matrix (https://docs.nvidia.com/cuda/cusparse/index.html#cusparsedensetosparse)
Library-based out-of-place transpose of a sparse matrix (https://docs.nvidia.com/cuda/cusparse/index.html, https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-dpcpp/2023-1/oneapi-mkl-sparse-omatcopy.html)
Library-based sparse matrix - dense matrix multiplication (https://docs.nvidia.com/cuda/cusparse/index.html)
Library-based sparse matrix - sparse matrix multiplication (https://docs.nvidia.com/cuda/cusparse/index.html)
Count the number of nonzero elements per row or column and the total number of nonzero elements in a dense matrix (https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-t-nnz)
Conversion of a sparse matrix to a dense matrix (https://docs.nvidia.com/cuda/cusparse/index.html#cusparsesparsetodense)
Sort a sparse matrix represented in CSR format (https://docs.nvidia.com/cuda/cusparse/index.html#cusparsexcsrsort)
A thread-level synchronization-free sparse triangular solver (https://github.com/JiyaSu/CapelliniSpTRSV)
SRAD (version 1) in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
String search (https://github.com/OpenCL/AMD_APP_samples)
Compute structural similarity index measure (https://github.com/VIDILabs/instantvnr)
The single-source shortest path (https://github.com/chai-benchmarks/chai)
Standard deviation (https://github.com/rapidsai/raft)
1D stencil (https://www.olcf.ornl.gov/wp-content/uploads/2019/12/02-CUDA-Shared-Memory.pdf)
3D stencil (https://github.com/LLNL/cardioid)
Streamcluster in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Spatial-temporal Savitzky-Golay method for reconstructing high-quality NDVI time series (https://github.com/HPSCIL/cuSTSG)
Lattice QCD SU(3) matrix-matrix multiply microbenchmark (https://gitlab.com/NERSC/nersc-proxies/su3_bench)
Surfel rendering (https://github.com/jstraub/cudaPcl)
Compute the singular value decomposition of 3x3 matrices (https://github.com/kuiwuchn/3x3_SVD_CUDA)
SW4 curvilinear kernels are five stencil kernels that account for ~50% of the solution time in SW4 (https://github.com/LLNL/SW4CK)
The Swish activation function (https://pytorch.org/)
A demo of tensor accessors in PyTorch (https://pytorch.org/)
Tensor transposition (https://github.com/Jokeren/GPA-Benchmark/tree/master/ExaTENSOR)
A proxy for the SNAP force calculation in the LAMMPS molecular dynamics package (https://github.com/FitSNAP/TestSNAP)
Solve tridiagonal systems of equations using the Thomas algorithm (https://pm.bsc.es/gitlab/run-math/cuThomasBatch/tree/master)
Memory fence function (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions)
Accumulate contributions of tissue source strengths and previous solute levels to current tissue solute levels (https://github.com/secomb/GreensTD19_GPU)
Tone mapping (https://github.com/OpenCL/AMD_APP_samples)
The 2-point correlation function (https://users.ncsa.illinois.edu/kindr/projects/hpca/index.html)
Simulation of a task queue system (https://github.com/chai-benchmarks/chai)
Triad in the SHOC benchmark suite (https://github.com/vetter/shoc/)
Matrix solvers for a large number of small independent tridiagonal linear systems (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Trotter-Suzuki approximation (https://bitbucket.org/zzzoom/trottersuzuki/src/master/)
Solving the symmetric traveling salesman problem with iterative hill climbing (https://userweb.cs.txstate.edu/~burtscher/research/TSP_GPU/)
Unfold the view of original tensor as slices (https://pytorch.org/)
Uniform random noise generator (https://github.com/OpenCL/AMD_APP_samples)
Genuchten conversion of soil moisture and pressure (https://github.com/HydroComplexity/Dhara)
Computes expectation values (6D integrals) associated with the helium atom (https://github.com/wadejong/Summer-School-Materials/tree/master/Examples/vmc)
Volume-to-column transform (https://pytorch.org/)
Demonstrate the usage of the vote intrinsics (https://github.com/NVIDIA/cuda-samples/)
Sort small numbers (https://github.com/facebookarchive/fbcuda)
Compute mean and variance using the Welford algorithm (https://github.com/hpcaitech/ColossalAI)
Winograd convolution (https://github.com/ChenyangZhang-cs/iMLBench)
Compute spring forces in a worm-like chain model with a power function (https://github.com/AnselGitAccount/USERMESO-2.0)
Implementation of word2vec with Continuous Bag-of-Words (https://github.com/cudabigdata/word2vec_cuda)
Count the number of words in a text (https://github.com/NVIDIA/thrust/blob/main/examples/)
Parallel Weather Research and Forecasting (WRF) single-moment 5-class scheme (https://github.com/gpgpu-sim/ispass2009-benchmarks/tree/master/wp)
List ranking with Wyllie's algorithm (Rehman, M., Kothapalli, K. and Narayanan, P., 2009. Fast and Scalable List Ranking on the GPU. In Proceedings of the International Conference on Supercomputing (pp. 235-243). doi:10.1145/1542275.1542311.)
Hartree-Fock self-consistent-field (SCF) calculation of H2O (https://github.com/recoli/XLQC)
A proxy application for full neutron transport applications like OpenMC (https://github.com/ANL-CESAR/XSBench/)
Kernels may read and write directly to pinned system memory from a user perspective (https://github.com/NVIDIA/cuda-samples/tree/master/Samples/0_Introduction/simpleZeroCopy)
Find zero-points and scales in quantization (https://pytorch.org/)
3D complex FFT in a 256^3 cube (https://github.com/spiral-software/fftx)
Zoom in and zoom out an image (https://github.com/rapidsai/cucim)
Authored by Youssef Faqir-Rhazoui. This work is an extension of Zheming Jin's original work.
This work has been supported by the EU (FEDER), the Spanish MINECO and CM under grants S2018/TCS-4423, PID2021-126576NB-I00 funded by MCIN/AEI/10.13039/501100011033 and by “ERDF A way of making Europe”.