GPU Path optimizations #9

lukasm91 · 2022-10-14T06:25:44Z

As discussed, I am opening this pull request with the changes we made over the past few weeks.

Let me know how we want to proceed. Note that, in principle, each individual commit compiles and should work on its own.

FTDIR should directly write into the out buffer. Truncation is now implicitly handled (G%NMEN holds truncated loop bounds)

- Verified that V100 passes
- Small fixes in returned arrays (1 element too large sometimes)

lukasm91 · 2022-10-14T06:28:39Z

AUTHORS

@@ -1,20 +1,21 @@
 Authors and Contributors
 ========================

+- P. Courtier (ECMWF)


In case you wonder: I just sorted this file when adding myself to the contributors, because the file itself says it should be sorted.

lukasm91 · 2022-10-14T06:31:00Z

src/programs/CMakeLists.txt

                                  gpu
                                  OpenACC::OpenACC_Fortran 
                                  ${LAPACK_LIBRARIES}
                                  nvhpcwrapnvtx
                          )
-	ecbuild_add_executable(TARGET  driver-spectrans-CA-${prec}
-                             SOURCES driver-spectraltransform.F90
+      set_property( TARGET driver-spectrans-CA-${prec} PROPERTY CUDA_ARCHITECTURES 70 )


The CMake changes are just to make it work for me! I don't consider myself an ecbuild expert, and I think there have been changes in master.

lukasm91 · 2022-10-14T06:32:01Z

src/programs/driver-spectraltransform.F90

@@ -243,7 +250,7 @@ PROGRAM TRANSFORM_TEST
 ! Participating processors limited by -P option

 !--------------------------
-CALL MPL_INIT()
+CALL MPL_INIT(LDENV=.false.)


I am not sure whether this is a problem for you in some cases, but propagating the full environment is a bad idea when running under nsys, because nsys sets some per-rank environment variables.

lukasm91 · 2022-10-14T06:33:42Z

src/trans/gpu/external/inv_trans.F90

-    IF(IUBOUND(1) < NPROMA) THEN
-      WRITE(NOUT,*)'INV_TRANS:FIRST DIM. OF PGP2 TOO SMALL ',IUBOUND(1),NPROMA
-      CALL ABORT_TRANS('INV_TRANS:FIRST DIMENSION OF PGP2 TOO SMALL ')
+    IF(IUBOUND(1) /= NPROMA) THEN


Just making the interface a bit stricter. This is not a problem for the full IFS, and I think in general it is good to be strict here.
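
To make the effect of the stricter test concrete, here is a minimal sketch in Python (not the Fortran of the PR). The helper `check_first_dim` is hypothetical; it only mirrors the two conditions visible in the diff, `IUBOUND(1) < NPROMA` before and `IUBOUND(1) /= NPROMA` after.

```python
def check_first_dim(first_dim: int, nproma: int, strict: bool) -> bool:
    """Return True if the first dimension of PGP2 would be accepted."""
    if strict:
        # New test: abort whenever IUBOUND(1) /= NPROMA.
        return first_dim == nproma
    # Old test: abort only when IUBOUND(1) < NPROMA.
    return first_dim >= nproma


# An oversized first dimension was accepted before and is rejected now.
old_result = check_first_dim(40, 32, strict=False)
new_result = check_first_dim(40, 32, strict=True)
```

Callers that pass an array larger than NPROMA in its first dimension are the only ones affected by the change.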

lukasm91 · 2022-10-14T06:37:23Z

src/trans/gpu/internal/allocator_mod.F90

@@ -0,0 +1,214 @@
+! (C) Copyright 2022- NVIDIA.


The allocator replaces the whole "pre-allocation" strategy that was in ectrans initially. We use double-buffering between the different steps of ectrans; the allocation sizes are computed at the beginning of a call to ectrans and checked against earlier allocations. The allocation is increased if needed.
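
The reserve-then-grow idea described above can be sketched as follows. This is an illustrative Python model, not the code in allocator_mod.F90: the class `GrowingAllocator` and the method `instantiate` are hypothetical names, and only `reserve` echoes the `RESERVE(ALLOCATOR, NELEM)` call that appears in the diff.

```python
class GrowingAllocator:
    """Hypothetical model: a fixed set of buffers reused across calls,
    enlarged only when a call needs more than was allocated earlier."""

    def __init__(self, num_buffers: int = 2):
        # Two buffers, so one step can read from one while writing the other.
        self.buffers = [bytearray(0) for _ in range(num_buffers)]
        self.required = [0] * num_buffers

    def reserve(self, slot: int, nbytes: int) -> None:
        """Record that some step of this call needs nbytes in buffer slot."""
        self.required[slot] = max(self.required[slot], nbytes)

    def instantiate(self) -> int:
        """Grow buffers that are too small; return how many were reallocated."""
        grown = 0
        for i, need in enumerate(self.required):
            if need > len(self.buffers[i]):
                self.buffers[i] = bytearray(need)
                grown += 1
        self.required = [0] * len(self.buffers)
        return grown


alloc = GrowingAllocator()
alloc.reserve(0, 1024)
alloc.reserve(1, 512)
first_call = alloc.instantiate()   # both buffers are allocated

alloc.reserve(0, 256)
alloc.reserve(1, 512)
second_call = alloc.instantiate()  # existing buffers are large enough
```

In this model a later call with equal or smaller requirements triggers no reallocation, which is the behaviour the comment describes.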

lukasm91 · 2022-10-14T06:41:05Z

src/trans/gpu/internal/dir_trans_ctl_mod.F90

-USE TPM_TRANS       ,ONLY : FOUBUF_IN, NF_SC2, NF_SC3A, NF_SC3B
-!USE TPM_DISTR
+      WRITE(NOUT,*) 'ltdir_ctl:TRLTOM_CUDAAWARE'
+      CALL TRLTOM_PACK(ALLOCATOR,HTRLTOM_PACK,PREEL_COMPLEX,FOUBUF_IN,KF_FS)


Note that I moved the pack/unpack routines here on purpose because it makes more sense. The pack/unpack routines were initially called through TRLTOM/TRMTOG, but I think that was not a good decision because these functions live in between, e.g., FTDIR and TRLTOM, without a preference for either.

lukasm91 · 2022-10-14T06:44:22Z

src/trans/gpu/internal/gstats_label_ifs.F90

-CALL GSTATS_LABEL(431,'   ','INV COPIES')
-CALL GSTATS_LABEL(440,'   ','FULL DIRTRANS')
-CALL GSTATS_LABEL(441,'   ','FULL INVTRANS')
+CALL GSTATS_LABEL(410,'   ','DIR COMPLETE')


Might need some cleanup.

lukasm91 · 2022-10-14T06:46:31Z

src/trans/gpu/internal/ledir_mod.F90

+      NS(KMLOC0) = 0
+      KS(KMLOC0) = 0
+    ENDIF
+    CALL CUDA_GEMM_BATCHED( &


The GEMM implementation is quite different from before: it also goes through CUDA graphs and works without zero padding. Note that ZAA0 and ZAS0 are still zero padded; this could still be fixed (which would free up some more memory).
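
As a rough illustration of why dropping zero padding saves memory, the Python sketch below compares a padded batch layout with a tightly packed one. It assumes a batch of matrices of different sizes; the shapes are made-up numbers and the helper names are hypothetical, not taken from ledir_mod.F90.

```python
from itertools import accumulate


def padded_elements(shapes):
    """Elements stored when every matrix is padded to the largest shape."""
    max_rows = max(rows for rows, _ in shapes)
    max_cols = max(cols for _, cols in shapes)
    return len(shapes) * max_rows * max_cols


def packed_offsets(shapes):
    """Start offset of each matrix in a tightly packed buffer, plus total size."""
    sizes = [rows * cols for rows, cols in shapes]
    offsets = [0] + list(accumulate(sizes))[:-1]
    return offsets, sum(sizes)


# Hypothetical batch in which the matrices shrink from one entry to the next.
shapes = [(4, 4), (3, 4), (2, 4), (1, 4)]
offsets, packed = packed_offsets(shapes)
padded = padded_elements(shapes)
```

With a packed layout each GEMM in the batch is addressed through its own offset and dimensions instead of a common padded stride.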

lukasm91 · 2022-10-14T06:48:34Z

src/trans/gpu/internal/tpm_fftc.F90

+  !$ACC END HOST_DATA
+END SUBROUTINE
+
+SUBROUTINE EXECUTE_INV_FFT_DOUBLE(PREEL_COMPLEX,PREEL_REAL,KFIELD,LOENS,OFFSETS)


The FFT plan management was moved to the CUDA code (because it is no longer only an FFT plan cache, but actually a CUDA graph + FFT plan cache).
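
The caching pattern mentioned here can be sketched generically. The Python below is not the CUDA code from the PR: `PlanCache`, its key layout, and the stand-in `build_plan` are hypothetical. It only shows the idea of building an expensive object once per distinct key and reusing it afterwards.

```python
from typing import Callable, Dict, Hashable, Tuple


class PlanCache:
    """Build an expensive object once per distinct key, then reuse it."""

    def __init__(self, build: Callable[..., object]):
        self._build = build
        self._plans: Dict[Tuple[Hashable, ...], object] = {}
        self.builds = 0  # counts cache misses

    def get(self, *key: Hashable) -> object:
        if key not in self._plans:
            self._plans[key] = self._build(*key)
            self.builds += 1
        return self._plans[key]


def build_plan(kfield, direction):
    # Stand-in for creating an FFT plan and capturing a graph around it.
    return {"kfield": kfield, "direction": direction}


cache = PlanCache(build_plan)
a = cache.get(8, "inverse")
b = cache.get(8, "inverse")   # same key: the cached object is returned
c = cache.get(16, "inverse")  # new key: a second object is built
```

What goes into the key in the real code (field count, latitude offsets, precision, and so on) is not stated in this thread, so the two-part key above is purely illustrative.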

lukasm91 · 2022-10-14T06:49:17Z

src/trans/gpu/internal/tpm_stats.F90

+CALL GSTATS_LABEL(KNUM,CTYPE,CDESC)
+END SUBROUTINE
+
+SUBROUTINE GSTATS_NVTX(KNUM,KSWITCH)


Not sure you actually want this. This is a wrapper around the normal gstats to enable NVTX support (Nsight Systems).
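
The wrapper idea can be modelled as below. This is a Python sketch, not the Fortran in tpm_stats.F90: the lists stand in for the regular gstats bookkeeping and for the NVTX range stack, and the switch values 0 (start) and 1 (stop) are an assumption made for this sketch.

```python
labels = {}        # KNUM -> description, filled by gstats_label
events = []        # stands in for the regular gstats bookkeeping
range_stack = []   # stands in for the profiler's range stack


def gstats_label(knum, ctype, cdesc):
    """Remember the description so the range can be named later."""
    labels[knum] = cdesc


def gstats(knum, kswitch):
    """Stand-in for the normal timer call."""
    events.append((knum, kswitch))


def gstats_nvtx(knum, kswitch):
    """Call the normal timer, then open or close a named range."""
    gstats(knum, kswitch)
    if kswitch == 0:    # assumed: 0 starts the timer
        range_stack.append(labels.get(knum, str(knum)))
    elif kswitch == 1:  # assumed: 1 stops the timer
        range_stack.pop()


gstats_label(410, "   ", "DIR COMPLETE")
gstats_nvtx(410, 0)
open_ranges = list(range_stack)
gstats_nvtx(410, 1)
```

Because the wrapper forwards every call to the normal timer, the existing statistics are unchanged and the named ranges are purely additional.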

lukasm91 · 2022-10-14T06:50:29Z

src/trans/gpu/internal/trgtol_mod.F90

+    HTRGTOL%HCOMBUFR_AND_REEL = RESERVE(ALLOCATOR, NELEM)
+  END FUNCTION
+
+  SUBROUTINE TRGTOL(ALLOCATOR,HTRGTOL,PREEL_REAL,KF_FS,KF_GP,KF_UV_G,KF_SCALARS_G,&


I dropped the non-CUDA-aware path and moved everything to OpenACC here. This is pretty much a rewrite of this function. Note that the extra routines (inigptr) have been dropped and integrated here. I did my best to comment the code properly because these memory layouts are really tough to understand. Also, all code in TRLTOG/TRGTOL is written so that the two are very clearly the "reverse" of each other.
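
The "reverse of each other" property can be illustrated with a gather/scatter pair. The Python below is a simplified model with a hypothetical `index_map`; the real routines also exchange the buffers between ranks, which is left out here.

```python
def pack(field, index_map):
    """Gather: buffer[i] = field[index_map[i]]."""
    return [field[j] for j in index_map]


def unpack(buffer, index_map, size):
    """Scatter, the exact reverse of pack: field[index_map[i]] = buffer[i]."""
    field = [None] * size
    for i, j in enumerate(index_map):
        field[j] = buffer[i]
    return field


field = [10.0, 11.0, 12.0, 13.0]
index_map = [2, 0, 3, 1]  # hypothetical layout permutation
buffer = pack(field, index_map)
restored = unpack(buffer, index_map, len(field))
```

Writing both directions against the same index map makes it easy to check that a round trip restores the original data.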

samhatfield · 2024-06-07T09:47:20Z

This PR will essentially be satisfied once redgreen-optimized is merged into develop.

lukasm91 · 2024-06-25T07:30:29Z

The remaining gap is in this draft PR: marsdeno#4 (not meant to be merged, just for documentation purposes).

Fixed a bug. No longer getting segfaults, but the norms are incorrect.

lukasm91 added 30 commits October 4, 2022 06:55

Fix openacc

ff4340a

Do not compute KMLOC0 twice

56ea3b2

Simplify final write

e5d2b06

Simplify matrix multiplications

03d9977

Format LEDIR

1f4ac75

Make output loops smaller

0f081b7

Cleanup uvtvd

d03e67b

Cleanup ldfou2

2b07cf9

Cleanup prfi2b

e415b09

simple rename

4ce3772

Improve performance for scaling/ kernel

184350b

avoid some over-computation in ftdir

6f694a7

Restructure fourier_in - little slow down but better readibility

c9bdc9e

Cleanup trgtol writes into ZGTF

784e623

Simplify truncation

66731ed

Remove redundant temporaries for ledir

5de8ed5

Move the OpenACC Updates for trltom to where they belong to

4676c9c

Various small improvements / renamings

a575ed2

tight packing in trmtol/trltom for FOUBUF/FOUBUF_IN

cb109d5

Move FOURIER_OUT into FTDIR, and cleanup

cf90e4d

FTDIR should directly write into the out buffer. Truncation is now implicitly handled (G%NMEN holds truncated loop bounds)

Directly write FOUBUF_IN

2ae19b9

assume that trgtol outputs on device

75b74c4

ZGTF is completely written!

baa42be

pin buffers

2de7adf

Add option to disable file dumps

aac7e2a

improve data regions / add some async

dda2816

add 2 more labels

7606b23

add quite some new barriers/labels

5230b16

Model FOURIER_IN according to FOURIER_OUT

18c0e99

Merge the many kernels in FSC

6416140

lukasm91 added 23 commits October 4, 2022 06:56

Minor cleanup with module use

0a1177e

Remove FSPGL_INT_MOD

bd09839

simplify control logic in in main driver routines

7f12a1c

ldenv=.false for nsys

4499c26

disable OpenMP dependent domain decomposition computation in driver

a9f8892

Change output of program driver

dd71da0

add second executable

1f747f1

Add a call to gpnorm

5b0af2a

Make dump optional

e137897

add dump directory as env variable

974069d

fix size of zinp/zout in ledir

eb7e29b

fix size of zinp/zout in leinv

187d17f

use same strategy as for other offset arrays

14b7f03

Remove direct transform

cbbb666

Add functionality to allocator to set all data to NaN

fcf48ca

- Verified that V100 passes
- Small fixes in returned arrays (1 element too large sometimes)

Add some trickery for full app

f42b15b

Fix to support different resolutions

7b7b7b3

Typo in driver

2ac1b3b

Cleanup setup_trans / do no re-allocate arrays

c771877

Tiny mix in allocator (not acually used in production)

d7f8b37

FIX: GPNORM issue when NLEV changes across calls

b2953e4

Remove redundant transfers

bcae6f9

Add missing copyrights

a017a5b

Fix install of interface

19b4d13

samhatfield closed this Jun 7, 2024

dmitrypek pushed a commit to dmitrypek/ectrans that referenced this pull request Aug 20, 2024

Merge pull request ecmwf-ifs#9 from dmitrypek/dmitry-overlap-changes

7a34e29

Fixed a bug. No longer getting segfaults, but the norms are incorrect.
