MMA-izing the prolongator and restrictor kernels #1497

Merged: 88 commits merged into develop from feature/prolongator-mma on Jan 22, 2025

Conversation

@hummingtree (Member) commented Sep 27, 2024:

As the name suggests, this PR adds initial support for MMA-izing the prolongator and restrictor kernels. In addition:

  • Initial support is also added for using the tensor memory accelerator (TMA) for memory movement on Hopper and Blackwell GPUs;
  • Some general shared memory load/store patterns are optimized;
  • As a general cleanup, the MMA types can now be specified as CMake parameters, e.g.,
cmake ...
  -DQUDA_MULTIGRID_MMA_DSLASH_TYPE_HALF=2 \
  -DQUDA_MULTIGRID_MMA_PROLONGATOR_TYPE_HALF=3 \
  -DQUDA_MULTIGRID_MMA_RESTRICTOR_TYPE_HALF=3 \
  -DQUDA_MULTIGRID_MMA_SETUP_TYPE_HALF=0 \
  -DQUDA_MULTIGRID_MMA_DSLASH_TYPE_SINGLE=2 \
  -DQUDA_MULTIGRID_MMA_PROLONGATOR_TYPE_SINGLE=3 \
  -DQUDA_MULTIGRID_MMA_RESTRICTOR_TYPE_SINGLE=3 \
  -DQUDA_MULTIGRID_MMA_SETUP_TYPE_SINGLE=0 \
  ...

The encoding is the following:

    "1->SIMT; 2->SMMA; 3->1xFP16; 4->3xFP16; 5->1xTF32; 6->3xTF32; 7->3xBF16; 0->DEFAULT

The default types are:

| Kernel | half (default) | single (default) |
| --- | --- | --- |
| Setup | 3xFP16 | 3xFP16 |
| Coarse Dslash | 3xBF16 (SIMT for < SM80) | 3xTF32 (SIMT for < SM80) |
| Restrictor | 3xBF16 (SIMT for < SM80) | 3xTF32 (SIMT for < SM80) |
| Prolongator | 3xBF16 (SIMT for < SM80) | 3xTF32 (SIMT for < SM80) |
  • For the coarse dslash, prolongator, and restrictor, the code automatically finds a suitable nVec instantiation to use: e.g., if nVec = 16 and 32 are instantiated, then for nRHS = 5, nVec = 16 is picked; for nRHS = 24, nVec = 32 is picked; and for nRHS = 96, the nVec = 32 kernel is called 3 times to divide and conquer (see the sketch below).
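
A minimal sketch of that nVec selection and divide-and-conquer logic (hypothetical helper names, not QUDA's actual implementation):

```cpp
#include <algorithm>
#include <vector>

// Pick the smallest instantiated nVec that can hold n_rhs right-hand sides;
// if none can, fall back to the largest and let the caller batch the work.
// Assumes `instantiated` is sorted ascending, e.g. {16, 32}.
int pick_nvec(const std::vector<int> &instantiated, int n_rhs)
{
  for (int nvec : instantiated)
    if (n_rhs <= nvec) return nvec;
  return instantiated.back();
}

// Batch sizes for the divide-and-conquer case: nRHS = 96 with {16, 32}
// yields {32, 32, 32}, i.e. the nVec = 32 kernel is launched 3 times.
std::vector<int> rhs_batches(const std::vector<int> &instantiated, int n_rhs)
{
  const int nvec = pick_nvec(instantiated, n_rhs);
  std::vector<int> batches;
  for (int remaining = n_rhs; remaining > 0; remaining -= nvec)
    batches.push_back(std::min(remaining, nvec));
  return batches;
}
```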

Remaining to-dos:

  • Add a command line interface for toggling the MMA-ized prolongator and restrictor kernels.
  • Make the prolongator and restrictor work for staggered (the spin/parity handling is different).
  • Comprehensive testing of the MG/MMA workflows.
  • Add doxygen.
  • Apply clang-format.

[Attached performance plots: coarse_dslash_tma, transfer_mma]

hummingtree and others added 28 commits August 27, 2024 12:20
@hummingtree hummingtree requested review from a team as code owners September 27, 2024 13:57
@hummingtree hummingtree requested a review from a team as a code owner January 3, 2025 21:34
@maddyscientist (Member) left a comment:
Thanks for making the requested fixes on this, @hummingtree. Aside from a trivial comment I just made (logQuda), this is good to go as far as I am concerned.

@weinbe2 (Contributor) commented Jan 15, 2025:

Good news: this passes a visual review! Bad news: I hit an issue that's only present with --mg-dslash-use-mma enabled, single precision, Nc = 96... and it goes away if I disable auto-tuning, so I've attached the tunecache as well. This is on Hopper SXM 80GB. I'm not sure if I can trigger it with a single-GPU build, since it's tuning specific.

cmake command:

cmake -DCMAKE_BUILD_TYPE=RELEASE -DQUDA_DIRAC_DEFAULT_OFF=ON -DQUDA_DIRAC_STAGGERED=ON \
  -DQUDA_GPU_ARCH=sm_90 -DQUDA_DOWNLOAD_USQCD=ON -DQUDA_QIO=ON -DQUDA_QMP=ON \
  -DQUDA_PRECISION=4 -DQUDA_RECONSTRUCT=4 \
  -DQUDA_MULTIGRID=ON -DQUDA_MULTIGRID_NVEC_LIST="24,64,96" -DQUDA_MULTIGRID_MRHS_LIST="8,16,32" \
  /scratch/local/quda

Command: with the tunecache I have, it only triggers with single precision, --mg-dslash-use-mma 3 true, and a 4-level solve, and the issue only hits on the coarsest level; a printout of the error is below.

PREC="single"

mpirun -np 1 ./staggered_invert_test \
  --prec single --prec-sloppy single --prec-null $PREC --prec-precondition $PREC \
  --mass 0.2 --recon 18 --recon-sloppy 18 --recon-precondition 18 \
  --dim 16 16 16 16 --gridsize 1 1 1 1 \
  --dslash-type staggered --compute-fat-long false --tadpole-coeff 0.905160183 --tol 1e-10 \
  --verbosity verbose --solve-type direct --solution-type mat --inv-type gcr \
  --inv-multigrid true --mg-levels 4 --mg-coarse-solve-type 0 direct --mg-staggered-coarsen-type kd-optimized \
  --mg-block-size 0 1 1 1 1 --mg-nvec 0 3 \
  --mg-block-size 1 4 4 4 4 --mg-nvec 1 64 --mg-nvec-batch 1 32 \
  --mg-block-size 2 2 2 2 2 --mg-nvec 2 96 --mg-nvec-batch 2 32 \
  --mg-setup-tol 1 1e-5 --mg-setup-tol 2 1e-5 --mg-setup-inv 1 cgnr --mg-setup-inv 2 cgnr \
  --nsrc 32 --nsrc-tile 16 --niter 24 \
  --mg-setup-use-mma 0 true --mg-setup-use-mma 1 true --mg-setup-use-mma 2 true --mg-setup-use-mma 3 true \
  --mg-dslash-use-mma 0 true --mg-dslash-use-mma 1 true --mg-dslash-use-mma 2 true --mg-dslash-use-mma 3 true \
  --mg-transfer-use-mma 0 false --mg-transfer-use-mma 1 false --mg-transfer-use-mma 2 false --mg-transfer-use-mma 3 false \
  --mg-smoother 0 ca-gcr --mg-smoother-solve-type 0 direct  --mg-nu-pre 0 0 --mg-nu-post 0 4 \
  --mg-smoother 1 ca-gcr --mg-smoother-solve-type 1 direct --mg-nu-pre 1 0 --mg-nu-post 1 4 \
  --mg-smoother 2 ca-gcr --mg-smoother-solve-type 2 direct-pc  --mg-nu-pre 2 0 --mg-nu-post 2 4 \
  --mg-coarse-solver 1 gcr --mg-coarse-solve-type 1 direct --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-maxiter 1 16 \
  --mg-coarse-solver 2 gcr --mg-coarse-solve-type 2 direct-pc --mg-coarse-solver-tol 2 0.25 --mg-coarse-solver-maxiter 2 16 \
  --mg-coarse-solver 3 ca-gcr --mg-coarse-solve-type 3 direct-pc --mg-coarse-solver-tol 3 0.25 --mg-coarse-solver-maxiter 3 16 \
  --mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose --mg-verbosity 3 verbose

Here's the error: it's CA-GCR on the coarsest level failing very quickly, right after the first norm check after a dslash. It also breaks with any other solver, so it seems like it's the coarsest dslash itself. It's unique to a batched solve and seemingly independent of --nsrc-tile, so maybe there's some weird corner of parameter space. I can only trigger it with Nc = 96 on the coarsest level; if it's on the intermediate level (replace --mg-nvec 1 64 with 96), things are fine.

[...]
MG level 2 (GPU): GCR:     0 iterations, n = 8, <r,r> = 4.825448e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 9, <r,r> = 4.814075e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 10, <r,r> = 4.881138e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 11, <r,r> = 4.742637e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 12, <r,r> = 4.829302e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 13, <r,r> = 4.812252e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 14, <r,r> = 4.755250e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 15, <r,r> = 4.762029e+04, |r|/|b| = 1.000000e+00
MG level 3 (GPU): CA-GCR:     0 iterations, n = 0, <r,r> =       nan, |r|/|b| =       nan
MG level 3 (GPU): ERROR: Solver appears to have diverged for n = 0 (rank 0, host ipp2-0709.nvidia.com, solver.cpp:479 in void quda::Solver::PrintStats(const char*, int, quda::cvector<double>&, quda::cvector<double>&, quda::cvector<double>&)())
MG level 3 (GPU):        last kernel called was (name=cudaMemsetAsync,volume=bytes=12288,aux=zero,color_spinor_field.cpp,409)
MG level 3 (GPU): Saving 1207 sets of cached parameters to /scratch/local/build/tests/tunecache/tunecache_error.tsv

Reference tunecache: tunecache_fail.tar.gz

Commit id: 49c0a58

@weinbe2 (Contributor) commented Jan 15, 2025:

Infinitely cleaner command... thanks @hummingtree

for PREC in half single
do
  mpirun -n 1 ./multigrid_benchmark_test --test 0 --dim 2 2 2 2 --niter 10 --nsrc 8 --prec-sloppy ${PREC} --mg-nvec 0 96 --mg-dslash-use-mma 0 true
done

@hummingtree (Member, Author) commented:
> Good news: this passes a visual review! Bad news: I hit an issue that's only present with --mg-dslash-use-mma enabled... [full report quoted above]

Thanks, Evan, for the tests! This should be fixed as of e8ca869.

@weinbe2 (Contributor) left a comment:

With the recent bugfixes this is a go; the recent issue I filed is orthogonal to this work. Awesome work, @hummingtree!

@weinbe2 (Contributor) commented Jan 21, 2025:

cscs-ci run

@weinbe2 merged commit 60624bc into develop on Jan 22, 2025 (6 of 7 checks passed).
@weinbe2 deleted the feature/prolongator-mma branch on January 22, 2025 at 20:26.