GPU Path optimizations #9

Closed
wants to merge 263 commits into from
Changes from 250 commits
263 commits
ff4340a
Fix openacc
lukasm91 May 24, 2022
56ea3b2
Do not compute KMLOC0 twice
lukasm91 May 24, 2022
e5d2b06
Simplify final write
lukasm91 May 24, 2022
03d9977
Simplify matrix multiplications
lukasm91 May 24, 2022
1f4ac75
Format LEDIR
lukasm91 May 24, 2022
0f081b7
Make output loops smaller
lukasm91 May 24, 2022
d03e67b
Cleanup uvtvd
lukasm91 May 24, 2022
2b07cf9
Cleanup ldfou2
lukasm91 May 24, 2022
e415b09
Cleanup prfi2b
lukasm91 May 24, 2022
4ce3772
simple rename
lukasm91 May 24, 2022
184350b
Improve performance for scaling/ kernel
lukasm91 May 24, 2022
6f694a7
avoid some over-computation in ftdir
lukasm91 May 24, 2022
c9bdc9e
Restructure fourier_in - little slowdown but better readability
lukasm91 May 24, 2022
784e623
Cleanup trgtol writes into ZGTF
lukasm91 May 24, 2022
66731ed
Simplify truncation
lukasm91 May 24, 2022
5de8ed5
Remove redundant temporaries for ledir
lukasm91 May 24, 2022
4676c9c
Move the OpenACC Updates for trltom to where they belong to
lukasm91 May 24, 2022
a575ed2
Various small improvements / renamings
lukasm91 May 24, 2022
cb109d5
tight packing in trmtol/trltom for FOUBUF/FOUBUF_IN
lukasm91 May 24, 2022
cf90e4d
Move FOURIER_OUT into FTDIR, and cleanup
lukasm91 May 24, 2022
2ae19b9
Directly write FOUBUF_IN
lukasm91 May 24, 2022
75b74c4
assume that trgtol outputs on device
lukasm91 May 24, 2022
baa42be
ZGTF is completely written!
lukasm91 May 24, 2022
2de7adf
pin buffers
lukasm91 May 24, 2022
aac7e2a
Add option to disable file dumps
lukasm91 May 24, 2022
dda2816
improve data regions / add some async
lukasm91 May 24, 2022
7606b23
add 2 more labels
lukasm91 May 24, 2022
5230b16
add quite some new barriers/labels
lukasm91 May 24, 2022
18c0e99
Model FOURIER_IN according to FOURIER_OUT
lukasm91 May 24, 2022
6416140
Merge the many kernels in FSC
lukasm91 May 24, 2022
77ce5c2
Adapt FSC Layout to what we have in FOURIER_IN
lukasm91 May 24, 2022
db141f8
Avoid over-computation in FSC
lukasm91 May 24, 2022
9bf4f7b
Truncation is implicitly handled because we only fill the relevant data
lukasm91 May 24, 2022
92a8131
Move ZGTF_START_INDEX to tpm_fields and initialize in ftinv
lukasm91 May 24, 2022
9217c28
Compute only the INVFFTs that are actually needed
lukasm91 May 24, 2022
5df2a7b
Cleanup FTINV_MOD.F90
lukasm91 May 24, 2022
60a0e8e
Changes for TRMTOL
lukasm91 May 24, 2022
5326ba6
Apply the usual kernel pattern to leinv
lukasm91 May 24, 2022
6a4652c
simplify recombination inv le
lukasm91 May 24, 2022
13f4c3a
First cleanup prfi1b
lukasm91 May 24, 2022
0698f15
Remove ZN
lukasm91 May 24, 2022
5afa4aa
Remove ZALPIN and ZEPSMN
lukasm91 May 24, 2022
9e2bd80
Restructure loop
lukasm91 May 24, 2022
b1ca5b9
Usual restructuring for SPNSDE
lukasm91 May 24, 2022
cdeae31
Simplify GEMMS
lukasm91 May 24, 2022
5d3c0d8
LEINV/LEDIR are more similar now and reallocate data
lukasm91 May 24, 2022
365eb74
Use same indexing and PIA/POA size.
lukasm91 May 24, 2022
afc7380
Inline asre1b into LEDIR
lukasm91 May 24, 2022
11497e6
Zero risk cleanup for ltinv and leinv
lukasm91 May 24, 2022
729ff7c
Merge write back and remove ZAOA and ZSOA
lukasm91 May 24, 2022
22b06bc
Merge FOUBUF_IN filling for LEINV
lukasm91 May 24, 2022
868d88d
Merge FOUBUF_IN reading for LEDIR
lukasm91 May 24, 2022
f7ab48f
Cleanup setup_trans and allocate less data
lukasm91 May 24, 2022
c2f828d
Document inigptr and trgtol
lukasm91 May 24, 2022
9158587
Simplify summing over blocks
lukasm91 May 24, 2022
6be9c7c
cleanup interface
lukasm91 May 24, 2022
8d63730
Simplify some index computations in TRGTOL
lukasm91 May 24, 2022
56de657
FIX: Fix index computation
lukasm91 May 24, 2022
5592d71
Some minor cleanup in TRGTOL
lukasm91 May 24, 2022
93eb16a
Split packing loop similar to self transpose
lukasm91 May 24, 2022
cbde347
Structure pack and self-send similarly
lukasm91 May 24, 2022
cd05641
Simplify GP_XXX indexing
lukasm91 May 24, 2022
4fe9543
Minor non-critical cleanup in TRGTOL
lukasm91 May 24, 2022
8c93b89
Make ZCOMBUFS/ZCOMBUFR properly sized
lukasm91 May 24, 2022
d9c28ac
Not critical: tiny cleanup
lukasm91 May 24, 2022
faea503
Simplify filling of receiver side in trgtol
lukasm91 May 24, 2022
c33af42
Merge KINDEX code (and rename)
lukasm91 May 24, 2022
a79f8fc
Not critical: bunch of renaming
lukasm91 May 24, 2022
5bdaa71
Tiny cleanups in TRGTOL as preparation for TRLTOG
lukasm91 May 24, 2022
531ba2f
Align TRLTOG with TRGTOL (huge change, but exactly reversed to TRLTOG)
lukasm91 May 24, 2022
33f2b05
Reallocate ZOA2 inside ltdir and adapt interfaces accordingly
lukasm91 May 24, 2022
3cc867c
LTINV reallocates PIA now
lukasm91 May 24, 2022
7c7300e
Fix allocations in LTINV
lukasm91 May 24, 2022
a65fbf6
LEINV can now infer # fields; and we can pass FOUBUF_IN
lukasm91 May 24, 2022
97bbf8b
Simple interface improvements
lukasm91 May 24, 2022
f0b5e5c
Improve allocation of FOURIER_IN
lukasm91 May 24, 2022
e688ab3
Non critical simple clean up work
lukasm91 May 24, 2022
d061744
Add FOUBUF to LTINV_CTL arguments
lukasm91 May 24, 2022
5154a76
Make FOUBUF allocatable and pass through
lukasm91 May 24, 2022
b8a09c6
Pass KFIELD through to Fourier Transform
lukasm91 May 24, 2022
79dc709
ZGTF is now reallocated and properly sized in inv_trans
lukasm91 May 24, 2022
77748f4
Avoid copy of ZGTF
lukasm91 May 24, 2022
6048053
Add missing synchronization in ftdir
lukasm91 May 24, 2022
0fb20ea
Make interface slightly more restrictive. If we want to have this fle…
lukasm91 May 24, 2022
ec7e47d
Slightly reduce the interfaces
lukasm91 May 24, 2022
78514a5
Typo: Wrong offsets to FSC
lukasm91 May 24, 2022
7bf22c0
Cleanup TRLTOG vertical offsets
lukasm91 May 24, 2022
f5b47a2
Explicitly pass arrays into FTDIR
lukasm91 May 24, 2022
be78530
Add back FOURIER_OUT function/file
lukasm91 May 24, 2022
e99a030
ZGTF is now a local variable
lukasm91 May 24, 2022
8a79c6a
Implement pointer swap in ftdir
lukasm91 May 24, 2022
d189d72
Non-critical: FTDIR and FTINV perfectly shadow each other now
lukasm91 May 24, 2022
29dd4c3
Minor changes to make FOURIER_IN and FOURIER_OUT more similar
lukasm91 May 24, 2022
2951caa
Pass through FOUBUF_IN
lukasm91 May 24, 2022
8b60f83
Re-allocate FOUBUF_IN in DIR_TRANS
lukasm91 May 24, 2022
4328cbf
Reallocate FOUBUF in DIR_TRANS
lukasm91 May 24, 2022
27bdcfd
Reallocate POA1 in LEDIR
lukasm91 May 24, 2022
f77daa3
Remove some allocations from setup_trans
lukasm91 May 24, 2022
fe1cee4
Remove redundant variables from fields and dir files
lukasm91 May 24, 2022
57a5e60
No more need to compute divergence if vorticity is needed
lukasm91 May 24, 2022
2a4a6fd
Remove redundant variables
lukasm91 May 24, 2022
7887b0d
Use pointers for clarity
lukasm91 May 24, 2022
3b66b49
Accidentally added too many FFTs again
lukasm91 May 24, 2022
6445bc1
Interface changes between complex/non-complex field counts
lukasm91 May 24, 2022
40d7596
Tiny cleanup in modules
lukasm91 May 24, 2022
9d2e2b3
Put copyins and copyouts at the same place for INV and DIR
lukasm91 May 24, 2022
a571122
Refactor 4XX GSTATS (NVIDIA GSTATS)
lukasm91 May 24, 2022
b28ad84
Remove barriers that are not ours
lukasm91 May 24, 2022
6000cce
Redirect some GSTATS function to add nvtx
lukasm91 May 24, 2022
d692ed5
Add missing GEMM label
lukasm91 May 24, 2022
67263d7
Increase parallelism again for some slow kernels in DIR
lukasm91 May 24, 2022
3458066
Pimp a bit the NVTX coloring
lukasm91 May 24, 2022
699cf3e
Try improve LEDIR GEMM array packing
lukasm91 May 24, 2022
7574013
Remove scalar copyins
lukasm91 May 24, 2022
8f4a7d3
clang-format
lukasm91 May 24, 2022
77b7658
CUFFT: Use workspace
lukasm91 May 24, 2022
38d98aa
The complex part of ZGTF is compact now
lukasm91 May 24, 2022
e076def
CUFFT: Fix memory layout and reduce memory overhead for dirtrans (CHA…
lukasm91 May 24, 2022
983769c
Fix memory layout and reduce memory overhead for invtrans (CHANGE2: 6)
lukasm91 May 24, 2022
cac89b3
slightly reduce data regions overlap in ftdir
lukasm91 May 24, 2022
c507767
slightly reduce data region overlap in ftinv
lukasm91 May 24, 2022
9dfdc88
Do not zero out full PREEL, but only the parts that will not be set
lukasm91 May 24, 2022
958d0f8
use proper size for FOUBUFS/R
lukasm91 May 24, 2022
8d4f385
Cleanup copyins
lukasm91 May 24, 2022
740075f
Fix intent when allocatable state is being changed
lukasm91 May 24, 2022
a6873aa
FIX: Do not over-compute ZOUT0
lukasm91 May 24, 2022
34f5c63
Compute m=0 in double precision for inverse transform (CHANGE1: 5) (C…
lukasm91 May 24, 2022
93746eb
Use cudaGraphs for FFTs
lukasm91 May 24, 2022
dc3c6f2
Linearize PREEL_XXX for dir_trans
lukasm91 May 24, 2022
16e4320
linearize large parts of preel for inv
lukasm91 May 24, 2022
cfb03c3
linearize PREEL in FOURIER_IN for INV
lukasm91 May 24, 2022
93158d1
FSC is no more pointer based (ready for transposition)
lukasm91 May 24, 2022
4f0f9b4
linearize PREEL in FSC
lukasm91 May 24, 2022
4994f31
slightly simplify offset computation in FSC
lukasm91 May 24, 2022
92e080f
Prepare for FFT transposition
lukasm91 May 24, 2022
f8938df
DIR FFT is transposed now (CHANGE1: 6) (CHANGE2: 8)
lukasm91 May 24, 2022
1182224
fft dir: transpose complex part and remove intermediate
lukasm91 May 24, 2022
d4b5a41
FFT Dir trans: move the temporary real buffer to trgtol and remove do…
lukasm91 May 24, 2022
57febf9
Re-enable GPNORM
lukasm91 Jun 2, 2022
3de130e
FFT Dir: Integrate transformed preel into trgtol
lukasm91 May 24, 2022
34c3106
INV: Prepare for FFT transposition
lukasm91 May 24, 2022
56efaf5
INV FFT is transposed now (CHANGE1: 7) (CHANGE2: 9)
lukasm91 May 24, 2022
9deea48
Redundant/temporary duplication in the fft wrappers not needed anymore
lukasm91 May 24, 2022
5eb4bc5
FFT INV: Remove double buffer for preel_real and transpose in trltog
lukasm91 May 24, 2022
41c098b
INV: Move transposition into ftinv_ctl_mod
lukasm91 May 24, 2022
b529aa3
INV: In-place FFT
lukasm91 May 24, 2022
337d453
INV: Adapt FSC to the transposed layout
lukasm91 May 24, 2022
8280e2c
INV: Fourier_in is transposed, too
lukasm91 May 24, 2022
3b498fe
INV: remove now redundant field
lukasm91 May 24, 2022
a3f7d5f
Clean up unused functions from fft wrappers
lukasm91 May 24, 2022
3d8d069
Avoid reallocating PREEL (needed for cudaGraph)
lukasm91 May 24, 2022
8f57633
TRGTOL: Use contiguous memory accesses
lukasm91 May 24, 2022
a8870cc
TRLTOG: Use contiguous memory accesses
lukasm91 May 24, 2022
65d5b30
Merge loops in Fourier_IN
lukasm91 May 24, 2022
d0eed6e
Merge loops in FOURIER_OUT
lukasm91 May 24, 2022
41a57d4
Fix memory accesses and merge loops for FSC
lukasm91 May 24, 2022
7b6af6c
Merge loops for leinv (now same as ledir)
lukasm91 May 24, 2022
b043081
INV: Fourier_in should not over compute preel
lukasm91 May 24, 2022
b93c6a1
INV: FSC should not over compute preel
lukasm91 May 24, 2022
c62aa33
Remove any extra padding in PREEL (CHANGE1: 8) (CHANGE2: 10)
lukasm91 May 24, 2022
3aef241
Improve fourier_* (tiling for transposition)
lukasm91 May 24, 2022
3a369cd
Minor cleanup in sump_trans_od
lukasm91 May 24, 2022
50952c4
Slightly simplify the foubuf indexing by storing global indices
lukasm91 May 24, 2022
3ee8454
Route all GEMMs through CUDA_GEMM_BATCHED interface
lukasm91 May 24, 2022
a3043d9
Partial cleanup in algor folder
lukasm91 May 24, 2022
930168a
Remove cuda_device_mod (use cudafor instead)
lukasm91 May 24, 2022
bf6e607
Remove unused functions from GEMM wrapper
lukasm91 May 24, 2022
e6506d0
Move cublas*gemmBatched to cublas*gemmStridedBatched
lukasm91 May 24, 2022
6089809
Cleanup GEMM interfaces
lukasm91 May 24, 2022
2d1b671
VDTUV parallelized properly
lukasm91 May 24, 2022
b05f2b7
Parallelize prfi1b properly
lukasm91 May 24, 2022
dfcc932
Parallelize spnsde properly
lukasm91 May 24, 2022
41fc8fe
Use single GEMM calls, slow because lots of syncs (CHANGE1: 9) (CHANG…
lukasm91 May 24, 2022
b44baa6
Multiple GEMM calls (CHANGE1: 10) (CHANGE2: 12)
lukasm91 May 24, 2022
96889cc
Do not synchronize after each GEMM
lukasm91 May 24, 2022
ba65812
Add grouped GEMM in LEINV
lukasm91 May 24, 2022
31dc1d0
Add grouped GEMM in LEDIR
lukasm91 May 24, 2022
7362996
remove TDZAA/TDZAS and add "strides" variables to leinv/dir
lukasm91 May 24, 2022
b3e0625
Add alignment option
lukasm91 May 24, 2022
8936ce0
add first cutlass implementation (CHANGE1: 11) (CHANGE2: 13)
lukasm91 May 24, 2022
10d0883
change grouped gemm from stride to offset based
lukasm91 May 24, 2022
5ffc23e
Skip KMLOC0 if on my proc
lukasm91 May 24, 2022
63cf518
Update arch to Sm70 for pre-Ampere archs
lukasm91 May 24, 2022
40ab86a
add cuda graphs for GEMMS (slow at this point because no caching)
lukasm91 May 24, 2022
353e6e8
re-use buffers for leinv
lukasm91 May 24, 2022
e0008ed
re-use buffers for ledir
lukasm91 May 24, 2022
832dce1
Rename reuse pointer
lukasm91 May 24, 2022
8ecc13d
Add option to use openacc streams
lukasm91 May 24, 2022
69f12ce
Add ZINPS0/ZINPA0 to have same semantics in ledir
lukasm91 May 24, 2022
6c28dd6
Merge kernels for asymm/ledir
lukasm91 May 24, 2022
29e1a45
Merge kernels for symm/ledir
lukasm91 May 24, 2022
48962f0
Run DGEMMs before SGEMMs in ledir
lukasm91 May 24, 2022
fc419af
add async statements in ledir
lukasm91 May 24, 2022
a69c969
Add ZOUTS0/ZOUTA0 to have same semantics in leinv
lukasm91 May 24, 2022
0396062
Move around kernels in leinv
lukasm91 May 24, 2022
08a53cc
Merge input kernels in leinv
lukasm91 May 24, 2022
715a1d6
Merge output kernels in leinv
lukasm91 May 24, 2022
3fd83ff
Run DGEMMs before SGEMMs in leinv
lukasm91 May 24, 2022
19d31f0
add async statements in leinv
lukasm91 May 24, 2022
43d5294
enable 3XTF32 on ampere
lukasm91 May 24, 2022
4762963
Remove unneeded zero init
lukasm91 May 24, 2022
0352bf7
Allow shortcut if only one process for trltom
lukasm91 May 24, 2022
665c81a
Move TRLTOM to dir_trans
lukasm91 May 24, 2022
547315b
Remove empty ltdir_ctl wrapper
lukasm91 May 24, 2022
171e43f
move fourier_out outwards
lukasm91 May 24, 2022
363e7da
Restructure LEDIR
lukasm91 May 24, 2022
bd5e907
Move ledir pack to ltdir
lukasm91 May 24, 2022
fec161f
Move packing for legendre transform into dir_trans
lukasm91 May 24, 2022
d2607da
move self copy in trgtol, remove trgtol (not cudaaware)
lukasm91 Jun 2, 2022
2576d41
Add allocator / incomplete cleanup but working
lukasm91 May 24, 2022
e5fbb9e
Simplify KVSET computation in dirtrans
lukasm91 May 24, 2022
c3181dd
Remove ftdir_ctl wrapper
lukasm91 May 24, 2022
0475c8f
simplify ledir a bit
lukasm91 May 24, 2022
30c9ece
share reuse_ptr
lukasm91 May 24, 2022
36581de
Merge fourier_out and pack_buffs into new file
lukasm91 May 24, 2022
057c4c3
remove ltinv_ctl_mod calls
lukasm91 May 24, 2022
6d4ccc2
merge ftinv_ctl into inv_trans temporarily
lukasm91 May 24, 2022
1a779dc
Inverse transform: Add empty handles
lukasm91 May 24, 2022
e9dde3b
Initial allocation buffering implementation for inv trans
lukasm91 May 24, 2022
4e379be
Split leinv into leinv and leinv_pack
lukasm91 May 24, 2022
8869d26
Finish split leinv and leinv_pack
lukasm91 May 24, 2022
bd03377
Move index computations into TRLTOG
lukasm91 May 24, 2022
4c3fa0e
Fix alignment in allocator and add implicit none
lukasm91 May 24, 2022
3aa054c
Reformat some files
lukasm91 May 24, 2022
de1ef2e
Re-enable gpnorm after breaking in 'Add allocator / incomplete cleanu…
lukasm91 Jun 2, 2022
2d30553
Remove adjoint functions (we should write them if needed)
lukasm91 May 24, 2022
0a1177e
Minor cleanup with module use
lukasm91 May 24, 2022
bd09839
Remove FSPGL_INT_MOD
lukasm91 May 24, 2022
7f12a1c
simplify control logic in main driver routines
lukasm91 May 24, 2022
4499c26
ldenv=.false for nsys
lukasm91 May 24, 2022
a9f8892
disable OpenMP dependent domain decomposition computation in driver
lukasm91 May 24, 2022
dd71da0
Change output of program driver
lukasm91 May 24, 2022
1f747f1
add second executable
lukasm91 May 24, 2022
5b0af2a
Add a call to gpnorm
lukasm91 Jun 2, 2022
e137897
Make dump optional
lukasm91 Jul 27, 2022
974069d
add dump directory as env variable
lukasm91 Jul 29, 2022
eb7e29b
fix size of zinp/zout in ledir
lukasm91 Jul 29, 2022
187d17f
fix size of zinp/zout in leinv
lukasm91 Jul 29, 2022
14b7f03
use same strategy as for other offset arrays
lukasm91 Jul 29, 2022
cbbb666
Remove direct transform
lukasm91 Jul 29, 2022
fcf48ca
Add functionality to allocator to set all data to NaN
lukasm91 Aug 9, 2022
f42b15b
Add some trickery for full app
lukasm91 Aug 10, 2022
7b7b7b3
Fix to support different resolutions
lukasm91 Aug 11, 2022
2ac1b3b
Typo in driver
lukasm91 Aug 11, 2022
c771877
Cleanup setup_trans / do not re-allocate arrays
lukasm91 Aug 11, 2022
d7f8b37
Tiny mix in allocator (not actually used in production)
lukasm91 Aug 30, 2022
b2953e4
FIX: GPNORM issue when NLEV changes across calls
lukasm91 Sep 1, 2022
bcae6f9
Remove redundant transfers
lukasm91 Sep 2, 2022
a017a5b
Add missing copyrights
lukasm91 Oct 4, 2022
19b4d13
Fix install of interface
lukasm91 Oct 18, 2022
11 changes: 6 additions & 5 deletions AUTHORS
@@ -1,20 +1,21 @@
Authors and Contributors
========================

- P. Courtier (ECMWF)
Collaborator Author

In case you wonder - I just sorted this file when adding myself to the contributors because the file tells us it should be sorted.

- W. Deconinck (ECMWF)
- D. Degrauwe (RMI)
- D. Dent (ECMWF)
- P. Dueben (ECMWF)
- R. El Khatib (Meteo France)
- D. Giard (Meteo France)
- J. Hague (ECMWF)
- M. Hamrud (ECMWF)
- M. Hortal (ECMWF)
- L. Isaksen (ECMWF)
- G. Mozdzynski (ECMWF)
- P. Marguinaud (Meteo France)
- L. Mosimann (NVIDIA)
- G. Mozdzynski (ECMWF)
- A. Mueller (ECMWF)
- M. Hortal (ECMWF)
- P. Courtier (ECMWF)
- D. Degrauwe (RMI)
- D. Giard (Meteo France)
- G. Radnoti (ECMWF)
- D. Salmond (ECMWF)
- Y. Seity (Meteo France)
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -9,7 +9,7 @@
cmake_minimum_required( VERSION 3.12 FATAL_ERROR )
find_package( ecbuild 3.4 REQUIRED HINTS ${CMAKE_CURRENT_SOURCE_DIR} ${CMAKE_CURRENT_SOURCE_DIR}/../ecbuild )

project( ectrans LANGUAGES C Fortran )
project( ectrans LANGUAGES C CXX Fortran )
include( ectrans_macros )

ecbuild_enable_fortran( REQUIRED NO_MODULE_DIRECTORY )
31 changes: 12 additions & 19 deletions src/programs/CMakeLists.txt
@@ -1,4 +1,5 @@
# (C) Copyright 2020- ECMWF.
# (C) Copyright 2022- NVIDIA.
#
# This software is licensed under the terms of the Apache Licence Version 2.0
# which can be obtained at http://www.apache.org/licenses/LICENSE-2.0.
@@ -25,20 +26,16 @@ if( HAVE_TOOLS AND TARGET eccodes_f90 )
LIBS ${trans} eccodes_f90
LINKER_LANGUAGE Fortran
DEFINITIONS ECTRANS_TOOLS_RTABLE_PATH="${ECTRANS_TOOLS_RTABLE_PATH}" )

endforeach()


endif()


set( HAVE_dp ${HAVE_DOUBLE_PRECISION} )
set( HAVE_sp ${HAVE_SINGLE_PRECISION} )

if( HAVE_GPU )
foreach( prec sp dp )
if( HAVE_${prec} )
ecbuild_add_executable(TARGET driver-spectrans-${prec}
ecbuild_add_executable(TARGET driver-spectrans-CA-${prec}
SOURCES driver-spectraltransform.F90
INCLUDES
${MPI_Fortran_INCLUDE_PATH}
@@ -47,14 +44,18 @@ if( HAVE_GPU )
fiat parkind_${prec}
eccodes_f90 eccodes_memfs
${MPI_Fortran_LIBRARIES}
trans_gpu_static_${prec}
trans_gpu_static_CA_${prec}
gpu
OpenACC::OpenACC_Fortran
${LAPACK_LIBRARIES}
nvhpcwrapnvtx
)
ecbuild_add_executable(TARGET driver-spectrans-CA-${prec}
SOURCES driver-spectraltransform.F90
set_property( TARGET driver-spectrans-CA-${prec} PROPERTY CUDA_ARCHITECTURES 70 )
Collaborator Author

The CMake changes are just to make it work for me! I don't see myself as an ecbuild expert, and I think there have been changes in master.

target_compile_options( driver-spectrans-CA-${prec} PRIVATE $<$<COMPILE_LANGUAGE:Fortran>:-g -acc -Minfo=acc -gpu=cc70,lineinfo,deepcopy,fastmath,nordc,pinned -cudalib=cufft,cublas -fpic> )
set_target_properties(driver-spectrans-CA-${prec} PROPERTIES LINK_FLAGS "-acc -cudalib=cufft,cublas -fpic -gpu=cc70,pinned")

ecbuild_add_executable(TARGET driver-spectrans-CA-${prec}-indiv
SOURCES driver-spectraltransform_indiv.F90
INCLUDES
${MPI_Fortran_INCLUDE_PATH}
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/../trans/gpu/include/ectrans>
@@ -68,17 +69,9 @@ if( HAVE_GPU )
${LAPACK_LIBRARIES}
nvhpcwrapnvtx
)
#trans_gpu_static_${prec}
#gpu
#${CMAKE_BINARY_DIR}/lib/libtrans_gpu_static_${prec}.a
#${CMAKE_BINARY_DIR}/lib/libgpu.a
#target_link_libraries( driver-spectrans PRIVATE OpenACC::OpenACC_Fortran )
set_property( TARGET driver-spectrans-${prec} PROPERTY CUDA_ARCHITECTURES 70 )
set_property( TARGET driver-spectrans-CA-${prec} PROPERTY CUDA_ARCHITECTURES 70 )
target_compile_options( driver-spectrans-${prec} PRIVATE $<$<COMPILE_LANGUAGE:Fortran>:-g -acc -Minfo=acc -gpu=cc70,lineinfo,deepcopy,fastmath,nordc -cudalib=cufft,cublas -fpic> )
target_compile_options( driver-spectrans-CA-${prec} PRIVATE $<$<COMPILE_LANGUAGE:Fortran>:-g -acc -Minfo=acc -gpu=cc70,lineinfo,deepcopy,fastmath,nordc -cudalib=cufft,cublas -fpic> )
set_target_properties(driver-spectrans-${prec} PROPERTIES LINK_FLAGS "-acc -cudalib=cufft,cublas -fpic")
set_target_properties(driver-spectrans-CA-${prec} PROPERTIES LINK_FLAGS "-acc -cudalib=cufft,cublas -fpic")
set_property( TARGET driver-spectrans-CA-${prec}-indiv PROPERTY CUDA_ARCHITECTURES 70 )
target_compile_options( driver-spectrans-CA-${prec}-indiv PRIVATE $<$<COMPILE_LANGUAGE:Fortran>:-g -acc -Minfo=acc -gpu=cc70,lineinfo,deepcopy,fastmath,nordc,pinned -cudalib=cufft,cublas -fpic> )
set_target_properties(driver-spectrans-CA-${prec}-indiv PROPERTIES LINK_FLAGS "-acc -cudalib=cufft,cublas -fpic -gpu=cc70,pinned")
message("Building ${prec} GPU driver")
endif()
endforeach()
121 changes: 107 additions & 14 deletions src/programs/driver-spectraltransform.F90
@@ -1,4 +1,5 @@
! (C) Copyright 2014- ECMWF.
! (C) Copyright 2022- NVIDIA.
!
! This software is licensed under the terms of the Apache Licence Version 2.0
! which can be obtained at http://www.apache.org/licenses/LICENSE-2.0.
@@ -76,13 +77,17 @@ PROGRAM TRANSFORM_TEST
REAL(KIND=JPRB), POINTER :: ZT(:,:,:) => NULL()
REAL(KIND=JPRB), ALLOCATABLE :: ZSP(:,:)

REAL(KIND=JPRB),ALLOCATABLE :: PAVE(:)
REAL(KIND=JPRB),ALLOCATABLE :: PMIN(:)
REAL(KIND=JPRB),ALLOCATABLE :: PMAX(:)

LOGICAL :: LSTACK
LOGICAL :: LDONE,LSTDEV
LOGICAL :: LUSERPNM, LKEEPRPNM, LUSEFLT
LOGICAL :: LTRACE_STATS,LSTATS_OMP, LSTATS_COMMS, LSTATS_MPL
LOGICAL :: LSTATS,LBARRIER_STATS, LBARRIER_STATS2, LDETAILED_STATS
LOGICAL :: LSTATS_ALLOC, LSYNCSTATS, LSTATSCPU, LSTATS_MEM
LOGICAL :: LXML_STATS
LOGICAL :: LXML_STATS, LDUMP
LOGICAL :: LFFTW
INTEGER(KIND=JPIM) :: NSTATS_MEM, NTRACE_STATS, NPRNT_STATS
! 0 - no output, 1 - init and final result, 2 - every timestep
@@ -140,14 +145,15 @@ PROGRAM TRANSFORM_TEST
& LUSERPNM, LKEEPRPNM, LUSEFLT, NQ, NLIN, IMAX_FLDS_IN, &
& NPRINTNORMS, ITERS, ZMAXERR_CHECK, NPROMA, NPROMATR, LEQ_REGIONS, &
& NPRINTLEV, NPRTRW, NPRTRV, NSPECRESMIN, NFLEVG, MBX_SIZE, LSTACK, &
& LFFTW
& LFFTW, LDUMP

! ------------------------------------------------------------------

#include "setup_trans0.h"
#include "setup_trans.h"
#include "inv_trans.h"
#include "dir_trans.h"
#include "gpnorm_trans.h"
#include "dist_spec.h"
#include "gath_grid.h"
#include "trans_inq.h"
@@ -200,6 +206,7 @@ PROGRAM TRANSFORM_TEST
LBARRIER_STATS2=.FALSE.
LSTATSCPU=.FALSE.
LSYNCSTATS=.FALSE.
LDUMP=.TRUE.
LXML_STATS=.FALSE.
LTRACE_STATS=.FALSE.
NSTATS_MEM=0
@@ -243,7 +250,7 @@ PROGRAM TRANSFORM_TEST
! Participating processors limited by -P option

!--------------------------
CALL MPL_INIT()
CALL MPL_INIT(LDENV=.false.)
Collaborator Author

I am not sure whether this is a problem for you in some cases, but propagating the full environment is a bad thing when we run with nsys, because nsys sets some per-rank environment variables.

!IF( LSTATS ) CALL GSTATS(0,0)
ZTINIT=TIMEF()

@@ -308,7 +315,11 @@ PROGRAM TRANSFORM_TEST
IF( NPRTRV*NPRTRW /= NPROC ) CYCLE
IF( NPRTRV > NPRTRW ) EXIT
IF( NPRTRW > NSPECRESMIN ) CYCLE
! With CUDA-aware MPI we don't need any OpenMP, so there is no need for this. It is even
! undesirable, because it may trigger different domain decompositions on different machines for no reason.
#ifndef USE_CUDA_AWARE_MPI_FT
IF( NPRTRW <= NSPECRESMIN/(2*OML_MAX_THREADS()) ) EXIT
#endif
ENDDO
! GO FOR APPROX SQUARE PARTITION FOR BACKUP
IF( NPRTRV*NPRTRW /= NPROC .OR. NPRTRW > NSPECRESMIN .OR. NPRTRV > NPRTRW ) THEN
@@ -771,6 +782,10 @@ PROGRAM TRANSFORM_TEST
ALLOCATE(ZGMV(NPROMA,NFLEVG,NDIMGMV,NGPBLKS))
ALLOCATE(ZGMVS(NPROMA,NDIMGMVS,NGPBLKS))

ALLOCATE(PMIN(NFLEVG))
ALLOCATE(PMAX(NFLEVG))
ALLOCATE(PAVE(NFLEVG))

ALLOCATE(ZNORMSP(1))
ALLOCATE(ZNORMSP1(1))
ALLOCATE(ZNORMVOR(NFLEVG))
@@ -857,10 +872,9 @@ PROGRAM TRANSFORM_TEST
ZTSTEP1(JSTEP)=(TIMEF()-ZTSTEP1(JSTEP))/1000.0_JPRD

! Dump a field to a binary file
CALL DUMP_GRIDPOINT_FIELD(JSTEP, MYPROC, NPROMA, NGPBLKS, ZGMVS(:,1,:), 'S', NOUTDUMP)
CALL DUMP_GRIDPOINT_FIELD(JSTEP, MYPROC, NPROMA, NGPBLKS, ZWINDS(:,NFLEVG,3,:), 'U', NOUTDUMP)
CALL DUMP_GRIDPOINT_FIELD(JSTEP, MYPROC, NPROMA, NGPBLKS, ZWINDS(:,NFLEVG,4,:), 'V', NOUTDUMP)
CALL DUMP_GRIDPOINT_FIELD(JSTEP, MYPROC, NPROMA, NGPBLKS, ZGMV(:,NFLEVG,5,:), 'T', NOUTDUMP)
IF (LDUMP) CALL DUMP_GRIDPOINT_FIELD_3D(JSTEP, MYPROC, ZGMVS(:,:,:), 'S', NOUTDUMP)
IF (LDUMP) CALL DUMP_GRIDPOINT_FIELD_4D(JSTEP, MYPROC, ZWINDS(:,:,:,:), 'W', NOUTDUMP)
IF (LDUMP) CALL DUMP_GRIDPOINT_FIELD_4D(JSTEP, MYPROC, ZGMV(:,:,:,:), 'M', NOUTDUMP)

ZTSTEP2(JSTEP)=TIMEF()
CALL DIR_TRANS(PSPVOR=ZVOR,PSPDIV=ZDIV,&
@@ -871,6 +885,12 @@ PROGRAM TRANSFORM_TEST
& PGP3A=ZGMV(:,:,5:5,:))
ZTSTEP2(JSTEP)=(TIMEF()-ZTSTEP2(JSTEP))/1000.0_JPRD

! Dump a field to a binary file
IF (LDUMP) CALL DUMP_GRIDPOINT_FIELD_2D(JSTEP, MYPROC, ZVOR(:,:), 'V', NOUTDUMP)
IF (LDUMP) CALL DUMP_GRIDPOINT_FIELD_2D(JSTEP, MYPROC, ZDIV(:,:), 'D', NOUTDUMP)
IF (LDUMP) CALL DUMP_GRIDPOINT_FIELD_2D(JSTEP, MYPROC, ZSP(:,:), 'P', NOUTDUMP)
IF (LDUMP) CALL DUMP_GRIDPOINT_FIELD_3D(JSTEP, MYPROC, ZT(:,:,:), 'T', NOUTDUMP)

ZTSTEP(JSTEP)=(TIMEF()-ZTSTEP(JSTEP))/1000.0_JPRD

ZTSTEPAVG=ZTSTEPAVG+ZTSTEP(JSTEP)
@@ -921,8 +941,32 @@ PROGRAM TRANSFORM_TEST
ELSE
WRITE(NOUT,'("time step ",I6," took", F8.4)') JSTEP,ZTSTEP(JSTEP)
ENDIF
flush(nout)
! call acc_present_dump()
! print *, "going to free in 3 seconds"
! call sleep (1)
! print *, "going to free in 2 seconds"
! call sleep (1)
! print *, "going to free in 1 seconds"
! call sleep (1)
! !call acc_clear_freelists()
! call sleep (5)
! !call acc_present_dump()
! !call sleep (10000)
ENDDO

CALL GPNORM_TRANS(ZWINDS(:,:,2,:),NFLEVG,KPROMA=NPROMA,PAVE=PAVE,PMIN=PMIN,PMAX=PMAX,LDAVE_ONLY=.false.,KRESOL=1)
if (myproc == 1) then
OPEN(800+myproc, FORM="UNFORMATTED")
write(800+myproc) "pave", sum(pave)/size(pave)
write(800+myproc) "pmin", sum(pmin)/size(pmin)
write(800+myproc) "pmax", sum(pmax)/size(pmax)
close(800+myproc)
print *, "pave", sum(pave)/size(pave)
print *, "pmin", sum(pmin)/size(pmin)
print *, "pmax", sum(pmax)/size(pmax)
endif

ZTLOOP=(TIMEF()-ZTLOOP)/1000.0_JPRD

WRITE(NOUT,'(" ")')
@@ -1266,28 +1310,77 @@ SUBROUTINE SORT(A, N)

! ------------------------------------------------------------------

SUBROUTINE DUMP_GRIDPOINT_FIELD(JSTEP, MYPROC, NPROMA, NGPBLKS, FLD, FLDCHAR, NOUTDUMP)
SUBROUTINE DUMP_GRIDPOINT_FIELD_2D(JSTEP, MYPROC, FLD, FLDCHAR, NOUTDUMP)

! Dump a 2D field to a binary file.

INTEGER(KIND=JPIM), INTENT(IN) :: JSTEP ! Time step, used for naming file
INTEGER(KIND=JPIM), INTENT(IN) :: MYPROC ! MPI rank, used for naming file
INTEGER(KIND=JPIM), INTENT(IN) :: NPROMA ! Size of NPROMA
INTEGER(KIND=JPIM), INTENT(IN) :: NGPBLKS ! Number of NPROMA blocks
REAL(KIND=JPRB) , INTENT(IN) :: FLD(NPROMA,NGPBLKS) ! 2D field
REAL(KIND=JPRB) , INTENT(IN) :: FLD(:,:) ! 2D field
CHARACTER , INTENT(IN) :: FLDCHAR ! Single character field identifier
INTEGER(KIND=JPIM), INTENT(IN) :: NOUTDUMP ! Unit number for output file

CHARACTER(LEN=14) :: FILENAME = "X.XXX.XXXX.dat"
CHARACTER(LEN=60) :: DUMP_DIR

WRITE(FILENAME(1:1),'(A1)') FLDCHAR
WRITE(FILENAME(3:5),'(I3.3)') JSTEP
WRITE(FILENAME(7:10),'(I4.4)') MYPROC

CALL GETENV("DUMP_DIR", DUMP_DIR)
IF (TRIM(DUMP_DIR) == "") CALL GETCWD(DUMP_DIR)
OPEN(NOUTDUMP, FILE=TRIM(DUMP_DIR)//'/'//FILENAME, FORM="UNFORMATTED")
WRITE(NOUTDUMP) FLD
CLOSE(NOUTDUMP)

END SUBROUTINE DUMP_GRIDPOINT_FIELD_2D
SUBROUTINE DUMP_GRIDPOINT_FIELD_3D(JSTEP, MYPROC, FLD, FLDCHAR, NOUTDUMP)

! Dump a 3D field to a binary file.

INTEGER(KIND=JPIM), INTENT(IN) :: JSTEP ! Time step, used for naming file
INTEGER(KIND=JPIM), INTENT(IN) :: MYPROC ! MPI rank, used for naming file
REAL(KIND=JPRB) , INTENT(IN) :: FLD(:,:,:) ! 3D field
CHARACTER , INTENT(IN) :: FLDCHAR ! Single character field identifier
INTEGER(KIND=JPIM), INTENT(IN) :: NOUTDUMP ! Unit number for output file

CHARACTER(LEN=14) :: FILENAME = "X.XXX.XXXX.dat"
CHARACTER(LEN=60) :: DUMP_DIR

WRITE(FILENAME(1:1),'(A1)') FLDCHAR
WRITE(FILENAME(3:5),'(I3.3)') JSTEP
WRITE(FILENAME(7:10),'(I4.4)') MYPROC

CALL GETENV("DUMP_DIR", DUMP_DIR)
IF (TRIM(DUMP_DIR) == "") CALL GETCWD(DUMP_DIR)
OPEN(NOUTDUMP, FILE=TRIM(DUMP_DIR)//'/'//FILENAME, FORM="UNFORMATTED")
WRITE(NOUTDUMP) FLD
CLOSE(NOUTDUMP)

END SUBROUTINE DUMP_GRIDPOINT_FIELD_3D
SUBROUTINE DUMP_GRIDPOINT_FIELD_4D(JSTEP, MYPROC, FLD, FLDCHAR, NOUTDUMP)

! Dump a 4D field to a binary file.

INTEGER(KIND=JPIM), INTENT(IN) :: JSTEP ! Time step, used for naming file
INTEGER(KIND=JPIM), INTENT(IN) :: MYPROC ! MPI rank, used for naming file
REAL(KIND=JPRB) , INTENT(IN) :: FLD(:,:,:,:) ! 4D field
CHARACTER , INTENT(IN) :: FLDCHAR ! Single character field identifier
INTEGER(KIND=JPIM), INTENT(IN) :: NOUTDUMP ! Unit number for output file

CHARACTER(LEN=14) :: FILENAME = "X.XXX.XXXX.dat"
CHARACTER(LEN=60) :: DUMP_DIR

WRITE(FILENAME(1:1),'(A1)') FLDCHAR
WRITE(FILENAME(3:5),'(I3.3)') JSTEP
WRITE(FILENAME(7:10),'(I4.4)') MYPROC

OPEN(NOUTDUMP, FILE=FILENAME, FORM="UNFORMATTED")
WRITE(NOUTDUMP) RESHAPE(FLD, (/ NPROMA*NGPBLKS /))
CALL GETENV("DUMP_DIR", DUMP_DIR)
IF (TRIM(DUMP_DIR) == "") CALL GETCWD(DUMP_DIR)
OPEN(NOUTDUMP, FILE=TRIM(DUMP_DIR)//'/'//FILENAME, FORM="UNFORMATTED")
WRITE(NOUTDUMP) FLD
CLOSE(NOUTDUMP)

END SUBROUTINE DUMP_GRIDPOINT_FIELD
END SUBROUTINE DUMP_GRIDPOINT_FIELD_4D

END PROGRAM TRANSFORM_TEST
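
The three DUMP_GRIDPOINT_FIELD_2D/3D/4D routines above differ only in the rank of FLD. As a rough sketch of one way they could share code (this is not what the PR does), a generic interface can select the routine by rank while a single helper builds the file path. The names DUMP_SKETCH_MOD, DUMP_FIELD and BUILD_PATH are hypothetical, JPRB is a local stand-in for the driver's working precision, and the 4D variant is left out for brevity. The sketch also assumes that swapping the GETENV/GETCWD compiler extensions for the standard GET_ENVIRONMENT_VARIABLE intrinsic is acceptable, that a fallback to "." is equivalent to the current working directory for this purpose, and that a deferred-length string is preferable to the fixed LEN=60 buffer, which would truncate longer paths.

```fortran
! Sketch only. DUMP_SKETCH_MOD, DUMP_FIELD and BUILD_PATH are hypothetical
! names, and JPRB here is a local stand-in for the driver's working precision.
MODULE DUMP_SKETCH_MOD
  IMPLICIT NONE
  PRIVATE
  PUBLIC :: DUMP_FIELD, JPRB

  INTEGER, PARAMETER :: JPRB = SELECTED_REAL_KIND(13, 300)

  ! One generic name; the compiler picks the specific routine from the rank of PFLD.
  INTERFACE DUMP_FIELD
    MODULE PROCEDURE DUMP_FIELD_2D, DUMP_FIELD_3D
  END INTERFACE DUMP_FIELD

CONTAINS

  FUNCTION BUILD_PATH(KSTEP, KPROC, CDFLD) RESULT(CLPATH)
    INTEGER,          INTENT(IN) :: KSTEP, KPROC
    CHARACTER(LEN=1), INTENT(IN) :: CDFLD
    CHARACTER(LEN=:), ALLOCATABLE :: CLPATH, CLDIR
    CHARACTER(LEN=14) :: CLNAME
    INTEGER :: ILEN, ISTAT

    ! Same "X.XXX.XXXX.dat" naming pattern as in the driver.
    WRITE(CLNAME, '(A1,".",I3.3,".",I4.4,".dat")') CDFLD, KSTEP, KPROC

    ! First call only queries the length, so the value is never truncated.
    CALL GET_ENVIRONMENT_VARIABLE("DUMP_DIR", LENGTH=ILEN, STATUS=ISTAT)
    IF (ISTAT == 0 .AND. ILEN > 0) THEN
      ALLOCATE(CHARACTER(LEN=ILEN) :: CLDIR)
      CALL GET_ENVIRONMENT_VARIABLE("DUMP_DIR", VALUE=CLDIR)
    ELSE
      CLDIR = "."   ! assumption: relative to the current working directory
    END IF

    CLPATH = CLDIR // "/" // CLNAME
  END FUNCTION BUILD_PATH

  SUBROUTINE DUMP_FIELD_2D(KSTEP, KPROC, PFLD, CDFLD, KUNIT)
    INTEGER,          INTENT(IN) :: KSTEP, KPROC, KUNIT
    REAL(KIND=JPRB),  INTENT(IN) :: PFLD(:,:)
    CHARACTER(LEN=1), INTENT(IN) :: CDFLD
    CHARACTER(LEN=:), ALLOCATABLE :: CLPATH

    CLPATH = BUILD_PATH(KSTEP, KPROC, CDFLD)
    OPEN(KUNIT, FILE=CLPATH, FORM="UNFORMATTED")
    WRITE(KUNIT) PFLD   ! whole array written as one sequential record
    CLOSE(KUNIT)
  END SUBROUTINE DUMP_FIELD_2D

  SUBROUTINE DUMP_FIELD_3D(KSTEP, KPROC, PFLD, CDFLD, KUNIT)
    INTEGER,          INTENT(IN) :: KSTEP, KPROC, KUNIT
    REAL(KIND=JPRB),  INTENT(IN) :: PFLD(:,:,:)
    CHARACTER(LEN=1), INTENT(IN) :: CDFLD
    CHARACTER(LEN=:), ALLOCATABLE :: CLPATH

    CLPATH = BUILD_PATH(KSTEP, KPROC, CDFLD)
    OPEN(KUNIT, FILE=CLPATH, FORM="UNFORMATTED")
    WRITE(KUNIT) PFLD
    CLOSE(KUNIT)
  END SUBROUTINE DUMP_FIELD_3D

END MODULE DUMP_SKETCH_MOD

PROGRAM DUMP_SKETCH
  USE DUMP_SKETCH_MOD, ONLY: DUMP_FIELD, JPRB
  IMPLICIT NONE
  REAL(KIND=JPRB) :: ZA(4,3), ZB(4,3,2)   ! small hypothetical test fields

  ZA = 1.0_JPRB
  ZB = 2.0_JPRB
  CALL DUMP_FIELD(1, 1, ZA, 'S', 99)   ! resolves to DUMP_FIELD_2D
  CALL DUMP_FIELD(1, 1, ZB, 'W', 99)   ! resolves to DUMP_FIELD_3D
END PROGRAM DUMP_SKETCH
```

The argument order (step, rank, field, field character, unit) mirrors the calls in the driver, so call sites would only change the routine name. As in the driver, the record holds only the array values, so whatever reads the dump back needs the field shape from elsewhere.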