Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(0.88.0) MPI communication and computation overlap in the HydrostaticFreeSurfaceModel and NonhydrostaticModel #3125

Merged
merged 576 commits into from
Sep 19, 2023

Conversation

simone-silvestri
Copy link
Collaborator

@simone-silvestri simone-silvestri commented Jun 1, 2023

Overlapping MPI communication and computation in the HydrostaticFreeSurfaceModel and the NonhydrostaticModel.

In particular, this PR introduces two keyword arguments for the fill_halo_regions! function, active only in case of distributed halo passing boundary conditions:

  • theasync::Bool keyword argument that allows launching MPI operations without waiting for the communication to complete.
  • the only_local_halos::Bool keyword argument, which fills only the halo in case of a local (i.e., Flux, Value, Gradient, Periodic, and, temporarily, MultiRegionCommunication) boundary condition. This is required for having explicit boundary conditions (like Value or Flux) for turbulent diffusivities (we directly calculate diffusivities in the halos in the case of distributed BC).

This PR allows hiding MPI passing of barotropic auxiliary variables behind the implicit vertical solver and prognostic variables behind the tendency calculations. The latter is done by splitting the tendency kernels into an interior kernel that calculates tendencies between, e.g., i = Hx and i = Nx - Hx, and a boundary kernel, executed once communication is complete, that calculates tendencies adjacent to boundaries.
Before computing tendencies near the boundaries, boundary-adjacent auxiliary diagnostic variables are recalculated (hydrostatic pressure, vertical velocity and diffusivities for the hydrostatic model, hydrostatic pressure and diffusivities for the non-hydrostatic model)

Also, this PR introduces

  • 8-way halo passing for distributed grids (west, east, north, south, southwest, southeast, northwest, northeast), something that we will want to implement also for MultiRegion)
  • Non uniform partitioning for distributed grids
  • the KernelParameters(size, offset) to be passed to the launch! function to start a kernel of size size::Tuple offset by offset::Tuple

todo:

  • Implement Offsets in KernelAbstractions (at the moment implemented a fix in src/Utils/kernel_launching.jl to remove when offsets will be implemented)
  • Adapt that implementation in this PR
  • Generalize the current implementation of kernel offsets for Diffusivities, Pressures and W-velocity
  • Find a general way to tag MPI requests to allow large number of cores (currently Nranks < 100)
  • Remove views halo passing

API changes

  • When calling RectilinearGrid(arch::DistributedArch, size = (Nx, Ny, Nz), ....), (Nx, Ny, Nz) are the per-rank local sizes, not the global size to be divided (easy way to specify non-uniform partitioning, see validation/distributed/mpi_geostrophic_adjustment.jl)
  • added the enable_overlapped_communication keyword to DistributedArch (defaults to true)
  • removed the use_buffers keyword to DistributedArch (always use buffers, as views did not give significant speedup to justify maintaining two implementations)
  • added the keyword argument active_cells_map::Bool = false to ImmersedBoundaryGrid (ex: ImmersedBoundaryGrid(grid, ib, active_cells_map = true)
  • added a required_halo_size::Int keyword argument to ScalarDiffusivity (defaults to 1) and ScalarBiharmonicDiffusivity (defaults to 2) to be specified by the user which sets the required halo for the specific ν or κ function (closures have now an explicitly required number of halos)

Major internals change

  1. The tendencies are calculated at the end of a time step. Therefore at the end of a simulation model.timestepper will hold tendencies for the last time step completed

  2. Removed fill_halo_regions! for hydrostatic pressure in both the non-hydrostatic and the hydrostatic model and for w-velocity in the hydrostatic model. The halos are filled by enlarging the size of the kernels in update_hydrostatic_pressure! and compute_w_from_continuity! to incorporate the needed ghost points.

  3. Removed fill_halo_regions! for diffusivities only for halo-passing BC; the halo calculation is now performed by launching the calculate_diffusivity! kernel inside the ghost nodes before recomputing the tendencies. This requires knowing how many halos each closure requires.

  4. Added a required_halo parameter to AbstractTurbulenceClosure. This means that each parameterization will have to specify explicitly the number of halos required to calculate the diffusivity:
    e.g

abstract type AbstractTurbulenceClosure{TimeDiscretization, BoundaryBuffer} end
abstract type AbstractScalarDiffusivity{TD, F, N} <: AbstractTurbulenceClosure{TD, N} end
struct TwoDimensionalLeith{FT, CR, GM, M} <: AbstractScalarDiffusivity{ExplicitTimeDiscretization, ThreeDimensionalFormulation, 2}

Where Leith closure requires 2 halos (one for the vorticity calculation and an additional one for the vorticity derivative)

Minor internals change

removed the general calculate_nonlinear_viscosity! and calculate_nonlinear_diffusivity! kernels (to each turbulence closure their own kernel)

Requires JuliaGPU/KernelAbstractions.jl#399 # On hold at the moment
Closes #615
Closes #1882
Closes #3067
Closes #3068
Supersedes #2953

@simone-silvestri simone-silvestri marked this pull request as draft June 1, 2023 15:05
@navidcy
Copy link
Collaborator

navidcy commented Jun 3, 2023

definitely bump a release (minor or patch; whatever more appropriate) before this is merged :)

@navidcy navidcy added performance 🏍️ So we can get the wrong answer even faster GPU 👾 Where Oceananigans gets its powers from distributed 🕸️ Our plan for total cluster domination labels Jun 3, 2023
@simone-silvestri simone-silvestri marked this pull request as ready for review June 11, 2023 23:02
@navidcy navidcy changed the title MPI communication and computation overlap in the HydrostaticFreeSurfaceModel Take 2 MPI communication and computation overlap in the HydrostaticFreeSurfaceModel Jun 12, 2023
@navidcy
Copy link
Collaborator

navidcy commented Jun 12, 2023

Perhaps we wait for JuliaGPU/KernelAbstractions.jl#399 to be merged and tag a patch release for KA before we merge this? @vchuravy?

@simone-silvestri
Copy link
Collaborator Author

Sure! Anyways, this PR should be in its final form, so ready for review :) (will still wait for KA changes before merging)

@simone-silvestri
Copy link
Collaborator Author

So, it looks like docs take more to build in this PR when compared to main (4:00 hrs vs 3:40 hrs). Probably good to benchmark a bit. I ll try some benchmarking on GPU and CPU

@navidcy
Copy link
Collaborator

navidcy commented Jun 13, 2023

benchmarking sounds good!

but I thought this PR only made changes to the distributed grids, no?

@simone-silvestri
Copy link
Collaborator Author

simone-silvestri commented Jun 13, 2023

It also rearranges how time-stepping works to allow

  • minimal fill_halo_regions
  • overlapped communication and computation

and removes fill_halo_regions! for computation of w velocity, hydrostatic pressure and diffusivities by enlarging the kernels to include the halo points

Manifest.toml Outdated Show resolved Hide resolved
Copy link
Collaborator

@navidcy navidcy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a few beautification remarks

src/Distributed/interleave_comm_and_comp.jl Outdated Show resolved Hide resolved
src/Distributed/interleave_comm_and_comp.jl Outdated Show resolved Hide resolved
src/Distributed/multi_architectures.jl Outdated Show resolved Hide resolved
src/Distributed/multi_architectures.jl Outdated Show resolved Hide resolved
src/Distributed/partition_assemble.jl Outdated Show resolved Hide resolved
src/ImmersedBoundaries/ImmersedBoundaries.jl Outdated Show resolved Hide resolved
src/ImmersedBoundaries/active_cells_map.jl Outdated Show resolved Hide resolved
@navidcy
Copy link
Collaborator

navidcy commented Sep 19, 2023

oh dear, my PR created merge conflicts......

@navidcy
Copy link
Collaborator

navidcy commented Sep 19, 2023

@simone-silvestri I think I resolve the conflicts alright but it wouldn't hurt if you have a look at b51e681...

@simone-silvestri simone-silvestri merged commit a2e83df into main Sep 19, 2023
@simone-silvestri simone-silvestri deleted the ss/load-balance-and-corners branch September 19, 2023 21:54
navidcy added a commit that referenced this pull request Sep 20, 2023
…cFreeSurfaceModel` and `NonhydrostaticModel` (#3125)

* comment

* fixed tag problems

* bugfix

* Update scalar_biharmonic_diffusivity.jl

* Update src/Distributed/multi_architectures.jl

Co-authored-by: Navid C. Constantinou <[email protected]>

* Update src/Distributed/partition_assemble.jl

Co-authored-by: Navid C. Constantinou <[email protected]>

* Update src/ImmersedBoundaries/ImmersedBoundaries.jl

Co-authored-by: Navid C. Constantinou <[email protected]>

* Update src/ImmersedBoundaries/active_cells_map.jl

Co-authored-by: Navid C. Constantinou <[email protected]>

* Update src/Distributed/interleave_comm_and_comp.jl

Co-authored-by: Navid C. Constantinou <[email protected]>

* Clean up batched tridiagonal solver and vertically implicit solver

* Fix bug in batched tridiagonal solver

* bugfix

* Try to fix multi region immersed boundary issue

* Hopefully fix immersed boundary grid constructor

* Another fix

* fixed project and manifest

* convert instead of FT

* export KernelParameters

* remove FT

* removed useless where FT

* small bugfix

* update manifest

* remove unbuffered communication

* little bit of a cleanup

* removed `views` comment

* couple of bugfixes

* fixed tests

* probably done

* same thing for nonhydrostatic model

* include file

* bugfix

* prepare for nonhydrostatic multiregion

* also here

* bugfix

* other bugfix

* fix closures

* bugfix

* simplify

* 2D leith requires 2 halos!

* AMD and Smag require 1 halo!

* wrong order

* correct halo handling for diffusivities

* correct Leith formulation + fixes

* `only_local_halos` kwarg in `fill_halo_regions!`

* bugfix

* FT on GPU

* bugfix

* bugfix

* last bugfix?

* removed all offsets from kernels + fixed all tests

* fix `_compute!`

* finished

* fixed broken tests

* fixed docs

* miscellaneous changes

* bugfix

* removed tests for vertical subdivision

* test corner passing

* correction

* retry

* fixed all problems

* Added a validation example

* fixed tests

* try new test

* fill send buffers in the correct place

* fixed comments

* define async

* pass the grid

* bugfix

* fix show method

* RefValue for mpi_tag

* comment

* add catke preprint

* remove warning; add ref to catke preprint

* some code cleanup

* correct the example

* Update src/TurbulenceClosures/vertically_implicit_diffusion_solver.jl

Co-authored-by: Navid C. Constantinou <[email protected]>

* bugfix

* Refactor unit tests

* Generalize regridding for lat-lon

* bugfix

* Add newline

* small correction

* new tests

* bugfix

* bugfix

* back for testing

* update manifest

* more options

* more

* finished

* test hypothesis

* fixed bug - correct speed now

* add space

* bugfix

* test

* more info

* removed left-right connected computation

* bugfix

* remove info

* improve

* typo

* bugfix

* bugfix

* correct comments

* bugfix

* bugfix prescribed velocities

* fixes

* ok on mac

* bugfix

* bug fixed

* bugfixxed

* new default

* bugfix

* remove <<<<HEAD

* bugfix PrescribedVelocityFields

* default in another PR

* bugfix

* flat grids only in Grids

* last bugfix

* bugfix

* try partial cells

* bugfix

* bugfix

* Update test_turbulence_closures.jl

* small fixes

* rework IBG and MRG

* Update src/TurbulenceClosures/vertically_implicit_diffusion_solver.jl

* small bugfix

* remove multiregion ibg with arrays for the moment

* bugfix

* little cleaner

* fixed tests

* see what the error is

* allow changing halos from checkpointer

* test it

* finally fixed it

* better naming

* bugfix

* bugfix

* bugfix

* bugfix

* removed useless tendency

* small fix

* dummy commit

* fix active cell map

* comment

* bugfix

* bugfix

* removed useless tendency

* maybe just keep it does not harm too much

* should have fixed it?

* let's go now

* done

* bugfix

* no need for this

* convert Δt in time stepping

* maximum

* minimum substeps

* more flexibility

* bugfix

* mutlidimensional

* fallback methods

* test a thing

* change

* chnage

* change

* change

* update

* update

* new offsets + return to previous KA

* bugfix

* bugfixxed

* remove debugging

* going back

* more robus partitioning

* quite new

* bugfix

* updated Manifest

* build with 1.9.3

* switch boundary_buffer to required_halo_size

* bugfix

* Update src/Models/HydrostaticFreeSurfaceModels/single_column_model_mode.jl

Co-authored-by: Gregory L. Wagner <[email protected]>

* Update src/Models/HydrostaticFreeSurfaceModels/update_hydrostatic_free_surface_model_state.jl

Co-authored-by: Gregory L. Wagner <[email protected]>

* bugfix

* biharmonic requires 2 halos

* buggfix

* compute_auxiliaries!

* bugfix

* fixed it

* little change

* some changes

* bugfix

* bugfix

* bugfixxed

* another bugfix

* compute_diffusivities!

* required halo size

* all fixed

* shorten line

* fix comment

* remove abbreviation

* remove unused functions

* better explanation of the MPI tag

* Update src/ImmersedBoundaries/active_cells_map.jl

Co-authored-by: Navid C. Constantinou <[email protected]>

* Update src/Solvers/batched_tridiagonal_solver.jl

Co-authored-by: Navid C. Constantinou <[email protected]>

* change name

* docstring

* name change on rank

* interior active cells

* calculate -> compute

* fixed tests

* do not compute momentum in prescribed velocities

* DistributedComputations

* DistributedComputations part #2

* bugfix

* fixed the docs

---------

Co-authored-by: Navid C. Constantinou <[email protected]>
Co-authored-by: Gregory L. Wagner <[email protected]>
@glwagner
Copy link
Member

do we use arch = Distributed() for distributed grids now? Might be good to summarize in the top description

@tomchor
Copy link
Collaborator

tomchor commented Sep 21, 2023

Requires JuliaGPU/KernelAbstractions.jl#399 # On hold at the moment

At the top comment it says this PR depends on another KA PR that isn't merged. However, this PR is merged (and tagged). Does this not depend on JuliaGPU/KernelAbstractions.jl#399 anymore?

@navidcy
Copy link
Collaborator

navidcy commented Sep 21, 2023

Requires JuliaGPU/KernelAbstractions.jl#399 # On hold at the moment

At the top comment it says this PR depends on another KA PR that isn't merged. However, this PR is merged (and tagged). Does this not depend on JuliaGPU/KernelAbstractions.jl#399 anymore?

Yeap, it doesn't. I edited the first post ;)

@PetrKryslUCSD
Copy link

How do you guys handle MPI tracing? (A la MPE, for instance...)

@glwagner
Copy link
Member

glwagner commented Oct 3, 2024

@PetrKryslUCSD I think we use nvtx, but @simone-silvestri can say more.

@simone-silvestri
Copy link
Collaborator Author

Using nsys it is possible to trace MPI with --trace=mpi, see:
https://docs.nvidia.com/nsight-systems/UserGuide/index.html

@PetrKryslUCSD
Copy link

PetrKryslUCSD commented Oct 7, 2024 via email

@simone-silvestri
Copy link
Collaborator Author

simone-silvestri commented Oct 7, 2024

Yep, we used nsys because our primary objective is to trace GPU execution. I think it will work also on CPU programs.
There is nothing really specific about profiling julia with nsys, provided that MPI is correctly configured (i.e. your script works with MPI already). An example of a batch script that traces MPI calls is

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=16
#SBATCH --mem=500GB
#SBATCH --time 24:00:00
#SBATCH --gres=gpus:4

cat > launch.sh << EoF_s
#! /bin/sh
export CUDA_VISIBLE_DEVICES=0,1,2,3
exec \$*
EoF_s
chmod +x launch.sh

srun nsys profile --trace=nvtx,cuda,mpi --output=report_%q{SLURM_PROCID} ./launch.sh julia --check-bounds=no --project scaling_experiments.jl 

Here, nsys will produce one report per processor. You can use mpirun or mpiexec instead of srun.
If you want to insert GC (garbage collection) annotations in the report through nvtx you need to add the environment variable (ref)

export JULIA_NVTX_CALLBACKS=gc

@PetrKryslUCSD
Copy link

PetrKryslUCSD commented Oct 7, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
distributed 🕸️ Our plan for total cluster domination GPU 👾 Where Oceananigans gets its powers from performance 🏍️ So we can get the wrong answer even faster
Projects
None yet
5 participants