MMA-izing the prolongator and restrictor kernels #1497

Merged: 88 commits merged into develop from feature/prolongator-mma on Jan 22, 2025

Conversation

@hummingtree (Member) commented Sep 27, 2024:

As the name suggests, this PR adds initial support for MMA-izing the prolongator and restrictor kernels. In addition:

  • Initial support is also added for using the tensor memory accelerator (TMA) for memory movement on Hopper and Blackwell GPUs;
  • Some general shared memory load/store patterns are optimized;
  • As a general cleanup, the MMA types can now be specified as CMake parameters, e.g.,
cmake ...
  -DQUDA_MULTIGRID_MMA_DSLASH_TYPE_HALF=2 \
  -DQUDA_MULTIGRID_MMA_PROLONGATOR_TYPE_HALF=3 \
  -DQUDA_MULTIGRID_MMA_RESTRICTOR_TYPE_HALF=3 \
  -DQUDA_MULTIGRID_MMA_SETUP_TYPE_HALF=0 \
  -DQUDA_MULTIGRID_MMA_DSLASH_TYPE_SINGLE=2 \
  -DQUDA_MULTIGRID_MMA_PROLONGATOR_TYPE_SINGLE=3 \
  -DQUDA_MULTIGRID_MMA_RESTRICTOR_TYPE_SINGLE=3 \
  -DQUDA_MULTIGRID_MMA_SETUP_TYPE_SINGLE=0 \
  ...

The encoding is the following:

    "1->SIMT; 2->SMMA; 3->1xFP16; 4->3xFP16; 5->1xTF32; 6->3xTF32; 7->3xBF16; 0->DEFAULT

The default types are:

| Kernel | half (default) | single (default) |
| --- | --- | --- |
| Setup | 3xFP16 | 3xFP16 |
| Coarse Dslash | 3xBF16 (SIMT for < SM80) | 3xTF32 (SIMT for < SM80) |
| Restrictor | 3xBF16 (SIMT for < SM80) | 3xTF32 (SIMT for < SM80) |
| Prolongator | 3xBF16 (SIMT for < SM80) | 3xTF32 (SIMT for < SM80) |
  • For the coarse dslash, prolongator, and restrictor, the code automatically finds a suitable nVec instantiation to use: e.g., if nVec = 16 and 32 are instantiated, then for nRHS = 5, nVec = 16 is picked; for nRHS = 24, nVec = 32 is picked; and for nRHS = 96, the nVec = 32 kernel is called 3 times to divide and conquer (see the sketch below).
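
A minimal sketch of that nVec selection and divide-and-conquer logic (hypothetical helper names, not QUDA's actual implementation):

```cpp
#include <algorithm>
#include <vector>

// Pick the smallest instantiated nVec that can hold n_rhs right-hand sides;
// if none can, fall back to the largest and let the caller batch the work.
// Assumes `instantiated` is sorted ascending, e.g. {16, 32}.
int pick_nvec(const std::vector<int> &instantiated, int n_rhs)
{
  for (int nvec : instantiated)
    if (n_rhs <= nvec) return nvec;
  return instantiated.back();
}

// Batch sizes for the divide-and-conquer case: nRHS = 96 with {16, 32}
// yields {32, 32, 32}, i.e. the nVec = 32 kernel is launched 3 times.
std::vector<int> rhs_batches(const std::vector<int> &instantiated, int n_rhs)
{
  const int nvec = pick_nvec(instantiated, n_rhs);
  std::vector<int> batches;
  for (int remaining = n_rhs; remaining > 0; remaining -= nvec)
    batches.push_back(std::min(remaining, nvec));
  return batches;
}
```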

Remaining to-dos:

  • Add a command line interface for toggling the MMA-ized prolongator and restrictor kernels.
  • Make the prolongator and restrictor work for staggered (the spin/parity handling is different).
  • Comprehensive testing of the MG/MMA workflows.
  • Add doxygen.
  • Apply clang-format.

[Attached performance plots: coarse_dslash_tma, transfer_mma]

hummingtree and others added 28 commits August 27, 2024 12:20
@hummingtree hummingtree requested review from a team as code owners September 27, 2024 13:57
@hummingtree hummingtree requested a review from a team as a code owner January 3, 2025 21:34
@maddyscientist (Member) left a comment:
Thanks for making the requested fixes on this, @hummingtree. Aside from a trivial comment I just made (logQuda), this is good to go as far as I am concerned.

@weinbe2 (Contributor) commented Jan 15, 2025:

Good news: this passes a visual review! Bad news: I hit an issue that's only present with --mg-dslash-use-mma enabled, single precision, Nc = 96... and it goes away if I disable auto-tuning, so I've attached the tunecache as well. This is on Hopper SXM 80GB. I'm not sure if I can trigger it with a single-GPU build, since it's tuning specific.

cmake command:

cmake -DCMAKE_BUILD_TYPE=RELEASE -DQUDA_DIRAC_DEFAULT_OFF=ON -DQUDA_DIRAC_STAGGERED=ON \
  -DQUDA_GPU_ARCH=sm_90 -DQUDA_DOWNLOAD_USQCD=ON -DQUDA_QIO=ON -DQUDA_QMP=ON \
  -DQUDA_PRECISION=4 -DQUDA_RECONSTRUCT=4 \
  -DQUDA_MULTIGRID=ON -DQUDA_MULTIGRID_NVEC_LIST="24,64,96" -DQUDA_MULTIGRID_MRHS_LIST="8,16,32" \
  /scratch/local/quda

Command: with the tunecache I have, it only triggers with single precision, --mg-dslash-use-mma 3 true, and a 4-level solve, and the issue only hits on the coarsest level; a printout of the error is below.

PREC="single"

mpirun -np 1 ./staggered_invert_test \
  --prec single --prec-sloppy single --prec-null $PREC --prec-precondition $PREC \
  --mass 0.2 --recon 18 --recon-sloppy 18 --recon-precondition 18 \
  --dim 16 16 16 16 --gridsize 1 1 1 1 \
  --dslash-type staggered --compute-fat-long false --tadpole-coeff 0.905160183 --tol 1e-10 \
  --verbosity verbose --solve-type direct --solution-type mat --inv-type gcr \
  --inv-multigrid true --mg-levels 4 --mg-coarse-solve-type 0 direct --mg-staggered-coarsen-type kd-optimized \
  --mg-block-size 0 1 1 1 1 --mg-nvec 0 3 \
  --mg-block-size 1 4 4 4 4 --mg-nvec 1 64 --mg-nvec-batch 1 32 \
  --mg-block-size 2 2 2 2 2 --mg-nvec 2 96 --mg-nvec-batch 2 32 \
  --mg-setup-tol 1 1e-5 --mg-setup-tol 2 1e-5 --mg-setup-inv 1 cgnr --mg-setup-inv 2 cgnr \
  --nsrc 32 --nsrc-tile 16 --niter 24 \
  --mg-setup-use-mma 0 true --mg-setup-use-mma 1 true --mg-setup-use-mma 2 true --mg-setup-use-mma 3 true \
  --mg-dslash-use-mma 0 true --mg-dslash-use-mma 1 true --mg-dslash-use-mma 2 true --mg-dslash-use-mma 3 true \
  --mg-transfer-use-mma 0 false --mg-transfer-use-mma 1 false --mg-transfer-use-mma 2 false --mg-transfer-use-mma 3 false \
  --mg-smoother 0 ca-gcr --mg-smoother-solve-type 0 direct  --mg-nu-pre 0 0 --mg-nu-post 0 4 \
  --mg-smoother 1 ca-gcr --mg-smoother-solve-type 1 direct --mg-nu-pre 1 0 --mg-nu-post 1 4 \
  --mg-smoother 2 ca-gcr --mg-smoother-solve-type 2 direct-pc  --mg-nu-pre 2 0 --mg-nu-post 2 4 \
  --mg-coarse-solver 1 gcr --mg-coarse-solve-type 1 direct --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-maxiter 1 16 \
  --mg-coarse-solver 2 gcr --mg-coarse-solve-type 2 direct-pc --mg-coarse-solver-tol 2 0.25 --mg-coarse-solver-maxiter 2 16 \
  --mg-coarse-solver 3 ca-gcr --mg-coarse-solve-type 3 direct-pc --mg-coarse-solver-tol 3 0.25 --mg-coarse-solver-maxiter 3 16 \
  --mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose --mg-verbosity 3 verbose

Here's the error: it's CA-GCR on the coarsest level failing very quickly, right after the first norm check after a dslash. It also breaks with any other solver, so it seems like it's the coarsest dslash itself. It's unique to a batched solve and seemingly independent of --nsrc-tile, so maybe there's some weird corner of parameter space. I can only trigger it with Nc = 96 on the coarsest level; if it's on the intermediate level (replace --mg-nvec 1 64 with 96), things are fine.

[...]
MG level 2 (GPU): GCR:     0 iterations, n = 8, <r,r> = 4.825448e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 9, <r,r> = 4.814075e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 10, <r,r> = 4.881138e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 11, <r,r> = 4.742637e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 12, <r,r> = 4.829302e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 13, <r,r> = 4.812252e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 14, <r,r> = 4.755250e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 15, <r,r> = 4.762029e+04, |r|/|b| = 1.000000e+00
MG level 3 (GPU): CA-GCR:     0 iterations, n = 0, <r,r> =       nan, |r|/|b| =       nan
MG level 3 (GPU): ERROR: Solver appears to have diverged for n = 0 (rank 0, host ipp2-0709.nvidia.com, solver.cpp:479 in void quda::Solver::PrintStats(const char*, int, quda::cvector<double>&, quda::cvector<double>&, quda::cvector<double>&)())
MG level 3 (GPU):        last kernel called was (name=cudaMemsetAsync,volume=bytes=12288,aux=zero,color_spinor_field.cpp,409)
MG level 3 (GPU): Saving 1207 sets of cached parameters to /scratch/local/build/tests/tunecache/tunecache_error.tsv

Reference tunecache: tunecache_fail.tar.gz

Commit id: 49c0a58

@weinbe2 (Contributor) commented Jan 15, 2025:

Infinitely cleaner command... thanks @hummingtree

for PREC in half single
do
  mpirun -n 1 ./multigrid_benchmark_test --test 0 --dim 2 2 2 2 --niter 10 --nsrc 8 --prec-sloppy ${PREC} --mg-nvec 0 96 --mg-dslash-use-mma 0 true
done

@hummingtree (Member, Author) commented:
> Good news: this passes a visual review! Bad news: I hit an issue that's only present with --mg-dslash-use-mma enabled... [full report quoted above]

Thanks, Evan, for the tests! This should be fixed as of e8ca869.

@weinbe2 (Contributor) left a comment:

With the recent bugfixes this is a go; the recent issue I filed is orthogonal to this work. Awesome work, @hummingtree!

@weinbe2 (Contributor) commented Jan 21, 2025:

cscs-ci run

@weinbe2 merged commit 60624bc into develop on Jan 22, 2025 (6 of 7 checks passed).
@weinbe2 deleted the feature/prolongator-mma branch on January 22, 2025 at 20:26.