-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GPU Path optimizations #9
Conversation
FTDIR should directly write into the out buffer. Truncation is now implicitly handled (G%NMEN holds truncated loop bounds)
- Verified that V100 passes - Small fixes in returned arrays (1 element too large sometimes)
@@ -1,20 +1,21 @@ | |||
Authors and Contributors | |||
======================== | |||
|
|||
- P. Courtier (ECMWF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case you wonder - I just sorted this file when adding myself to the contributors because the file tells us it should be sorted.
gpu | ||
OpenACC::OpenACC_Fortran | ||
${LAPACK_LIBRARIES} | ||
nvhpcwrapnvtx | ||
) | ||
ecbuild_add_executable(TARGET driver-spectrans-CA-${prec} | ||
SOURCES driver-spectraltransform.F90 | ||
set_property( TARGET driver-spectrans-CA-${prec} PROPERTY CUDA_ARCHITECTURES 70 ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CMake changes are just to make it work for me! I don't see myself as a ecbuild expert; and I think there have been changes in master.
@@ -243,7 +250,7 @@ PROGRAM TRANSFORM_TEST | |||
! Participating processors limited by -P option | |||
|
|||
!-------------------------- | |||
CALL MPL_INIT() | |||
CALL MPL_INIT(LDENV=.false.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this is a problem for you in some case but propagating the full environment is a bad thing if we run with nsys (because it sets some per-rank environment variables)
IF(IUBOUND(1) < NPROMA) THEN | ||
WRITE(NOUT,*)'INV_TRANS:FIRST DIM. OF PGP2 TOO SMALL ',IUBOUND(1),NPROMA | ||
CALL ABORT_TRANS('INV_TRANS:FIRST DIMENSION OF PGP2 TOO SMALL ') | ||
IF(IUBOUND(1) /= NPROMA) THEN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just making the interface a bit more strict. This is not a problem for full IFS and I think in general it is good to make it strict.
@@ -0,0 +1,214 @@ | |||
! (C) Copyright 2022- NVIDIA. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The allocator is replacing the whole "pre-allocation" strategy that was in ectrans initially. We use double-buffering between the different steps of ectrans; the size of the allocations is done at the beginning to a call of ectrans and check against earlier allocations. The allocation is increased if needed.
USE TPM_TRANS ,ONLY : FOUBUF_IN, NF_SC2, NF_SC3A, NF_SC3B | ||
!USE TPM_DISTR | ||
WRITE(NOUT,*) 'ltdir_ctl:TRLTOM_CUDAAWARE' | ||
CALL TRLTOM_PACK(ALLOCATOR,HTRLTOM_PACK,PREEL_COMPLEX,FOUBUF_IN,KF_FS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note that I moved the pack/unpack routines here on purpose because it makes more sense. The pack/unpack were initially called through TRLTOM/TRMTOG but I think this is not a good decision because these function live in between e.g. FTDIR and TRLTOM without preference for either function.
CALL GSTATS_LABEL(431,' ','INV COPIES') | ||
CALL GSTATS_LABEL(440,' ','FULL DIRTRANS') | ||
CALL GSTATS_LABEL(441,' ','FULL INVTRANS') | ||
CALL GSTATS_LABEL(410,' ','DIR COMPLETE') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might need some useful cleanup.
NS(KMLOC0) = 0 | ||
KS(KMLOC0) = 0 | ||
ENDIF | ||
CALL CUDA_GEMM_BATCHED( & |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GEMMs implementation is a lot different than before: Also through CUDA graphs without zero padding. Note that ZAA0 and ZAS0 are still zero padded; this could still be fixed (which would free up some more memory)
!$ACC END HOST_DATA | ||
END SUBROUTINE | ||
|
||
SUBROUTINE EXECUTE_INV_FFT_DOUBLE(PREEL_COMPLEX,PREEL_REAL,KFIELD,LOENS,OFFSETS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The FFT plan mangagement was moved to the CUDA code (because it not only a FFT plan cache anymore, but actually a CUDA graph+fft plan cache)
CALL GSTATS_LABEL(KNUM,CTYPE,CDESC) | ||
END SUBROUTINE | ||
|
||
SUBROUTINE GSTATS_NVTX(KNUM,KSWITCH) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure you actually want this. This is a wrapper around the normal gstats to enable nvtx support (nsight system)
HTRGTOL%HCOMBUFR_AND_REEL = RESERVE(ALLOCATOR, NELEM) | ||
END FUNCTION | ||
|
||
SUBROUTINE TRGTOL(ALLOCATOR,HTRGTOL,PREEL_REAL,KF_FS,KF_GP,KF_UV_G,KF_SCALARS_G,& |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dropped the non-cuda aware path and moved everything to OpenACC here. This is pretty much a rewrite of this function. Note that the extra routines (inigptr) have been dropped and integrated here. I gave my best to properly comment the code because this memory layouts are really tough to undersatnd. Also all codes in TRLTOG/TRGTOL are made sure to be very clearly "reversed" each other.
This PR will essentially be satisfied once redgreen-optimized is merged into develop. |
The remaining gap is in this draft PR: marsdeno#4 (not meant to be merged, just documentation purpose) |
Fixed a bug. No longer getting segfaults, but the norms are incorrect.
As discussed, I am going to open this pull request with the changes we did in the previous weeks.
Let me know how we want to proceed. Note that in principle, each single commit can be compiled and should work as is.