Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GPU Path optimizations #9

Closed
wants to merge 263 commits into from
Closed

GPU Path optimizations #9

wants to merge 263 commits into from

Conversation

lukasm91
Copy link
Collaborator

As discussed, I am going to open this pull request with the changes we did in the previous weeks.

Let me know how we want to proceed. Note that in principle, each single commit can be compiled and should work as is.

@@ -1,20 +1,21 @@
Authors and Contributors
========================

- P. Courtier (ECMWF)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case you wonder - I just sorted this file when adding myself to the contributors because the file tells us it should be sorted.

gpu
OpenACC::OpenACC_Fortran
${LAPACK_LIBRARIES}
nvhpcwrapnvtx
)
ecbuild_add_executable(TARGET driver-spectrans-CA-${prec}
SOURCES driver-spectraltransform.F90
set_property( TARGET driver-spectrans-CA-${prec} PROPERTY CUDA_ARCHITECTURES 70 )
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake changes are just to make it work for me! I don't see myself as a ecbuild expert; and I think there have been changes in master.

@@ -243,7 +250,7 @@ PROGRAM TRANSFORM_TEST
! Participating processors limited by -P option

!--------------------------
CALL MPL_INIT()
CALL MPL_INIT(LDENV=.false.)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this is a problem for you in some case but propagating the full environment is a bad thing if we run with nsys (because it sets some per-rank environment variables)

IF(IUBOUND(1) < NPROMA) THEN
WRITE(NOUT,*)'INV_TRANS:FIRST DIM. OF PGP2 TOO SMALL ',IUBOUND(1),NPROMA
CALL ABORT_TRANS('INV_TRANS:FIRST DIMENSION OF PGP2 TOO SMALL ')
IF(IUBOUND(1) /= NPROMA) THEN
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just making the interface a bit more strict. This is not a problem for full IFS and I think in general it is good to make it strict.

@@ -0,0 +1,214 @@
! (C) Copyright 2022- NVIDIA.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The allocator is replacing the whole "pre-allocation" strategy that was in ectrans initially. We use double-buffering between the different steps of ectrans; the size of the allocations is done at the beginning to a call of ectrans and check against earlier allocations. The allocation is increased if needed.

USE TPM_TRANS ,ONLY : FOUBUF_IN, NF_SC2, NF_SC3A, NF_SC3B
!USE TPM_DISTR
WRITE(NOUT,*) 'ltdir_ctl:TRLTOM_CUDAAWARE'
CALL TRLTOM_PACK(ALLOCATOR,HTRLTOM_PACK,PREEL_COMPLEX,FOUBUF_IN,KF_FS)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that I moved the pack/unpack routines here on purpose because it makes more sense. The pack/unpack were initially called through TRLTOM/TRMTOG but I think this is not a good decision because these function live in between e.g. FTDIR and TRLTOM without preference for either function.

CALL GSTATS_LABEL(431,' ','INV COPIES')
CALL GSTATS_LABEL(440,' ','FULL DIRTRANS')
CALL GSTATS_LABEL(441,' ','FULL INVTRANS')
CALL GSTATS_LABEL(410,' ','DIR COMPLETE')
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might need some useful cleanup.

NS(KMLOC0) = 0
KS(KMLOC0) = 0
ENDIF
CALL CUDA_GEMM_BATCHED( &
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GEMMs implementation is a lot different than before: Also through CUDA graphs without zero padding. Note that ZAA0 and ZAS0 are still zero padded; this could still be fixed (which would free up some more memory)

!$ACC END HOST_DATA
END SUBROUTINE

SUBROUTINE EXECUTE_INV_FFT_DOUBLE(PREEL_COMPLEX,PREEL_REAL,KFIELD,LOENS,OFFSETS)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The FFT plan mangagement was moved to the CUDA code (because it not only a FFT plan cache anymore, but actually a CUDA graph+fft plan cache)

CALL GSTATS_LABEL(KNUM,CTYPE,CDESC)
END SUBROUTINE

SUBROUTINE GSTATS_NVTX(KNUM,KSWITCH)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure you actually want this. This is a wrapper around the normal gstats to enable nvtx support (nsight system)

HTRGTOL%HCOMBUFR_AND_REEL = RESERVE(ALLOCATOR, NELEM)
END FUNCTION

SUBROUTINE TRGTOL(ALLOCATOR,HTRGTOL,PREEL_REAL,KF_FS,KF_GP,KF_UV_G,KF_SCALARS_G,&
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dropped the non-cuda aware path and moved everything to OpenACC here. This is pretty much a rewrite of this function. Note that the extra routines (inigptr) have been dropped and integrated here. I gave my best to properly comment the code because this memory layouts are really tough to undersatnd. Also all codes in TRLTOG/TRGTOL are made sure to be very clearly "reversed" each other.

@samhatfield
Copy link
Collaborator

This PR will essentially be satisfied once redgreen-optimized is merged into develop.

@samhatfield samhatfield closed this Jun 7, 2024
@lukasm91
Copy link
Collaborator Author

The remaining gap is in this draft PR: marsdeno#4 (not meant to be merged, just documentation purpose)

dmitrypek pushed a commit to dmitrypek/ectrans that referenced this pull request Aug 20, 2024
Fixed a bug. No longer getting segfaults, but the norms are incorrect.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants