Solving the 3D Navier-Cauchy equations for elastic waves on XPU using ParallelStencil.jl.
As a starting point, we used `acoustic_2D_elast3.jl` from the final task of exercise 3 in lecture 7. The file already implemented a 2D Navier-Cauchy wave equation. The 2D (and 3D) Navier-Cauchy equations are best described by the following set of equations, taken from the lecture:
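Written here in the notation of the kernels below (a reconstruction from the code; the lecture presents the same system, possibly in slightly different notation):

$$
\frac{\partial P}{\partial t} = -K\,\nabla\cdot\boldsymbol{V},\qquad
\frac{\partial \boldsymbol{\tau}}{\partial t} = \mu\left(\nabla\boldsymbol{V} + (\nabla\boldsymbol{V})^{\mathsf{T}}\right) - \frac{1}{3}\left(\nabla\cdot\boldsymbol{V}\right)\mathbf{I},\qquad
\rho\,\frac{\partial \boldsymbol{V}}{\partial t} = -\nabla P + \nabla\cdot\boldsymbol{\tau},
$$

with pressure $P$, velocity $\boldsymbol{V}$, deviatoric stress $\boldsymbol{\tau}$, bulk modulus $K$, shear modulus $\mu$, and density $\rho$.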
To add the third dimension, we added one new normal-stress component (`τzz`) and two new shear-stress components (`τyz`, `τzx`):
```julia
@all(τzz) = @all(τzz) + dt*(2.0*μ*(@d_zi(Vz)/dz) - 1.0/3.0*@inn_xy(∇V))
@all(τyz) = @all(τyz) + dt*μ*(@d_zi(Vy)/dz + @d_yi(Vz)/dy)
@all(τzx) = @all(τzx) + dt*μ*(@d_xi(Vz)/dx + @d_zi(Vx)/dz)
```
All the velocity calculations also had to be updated to incorporate the new components of the stress tensor:
```julia
@inn(Vx) = @inn(Vx) - dt/ρ*(@d_xi(P)/dx - @d_xa(τxx)/dx - @d_ya(τxy)/dy - @d_za(τzx)/dz)
@inn(Vy) = @inn(Vy) - dt/ρ*(@d_yi(P)/dy - @d_ya(τyy)/dy - @d_za(τyz)/dz - @d_xa(τxy)/dx)
@inn(Vz) = @inn(Vz) - dt/ρ*(@d_zi(P)/dz - @d_za(τzz)/dz - @d_xa(τzx)/dx - @d_ya(τyz)/dy)
```
The only thing left is to update `∇V`:

```julia
@all(∇V) = @d_xa(Vx)/dx + @d_ya(Vy)/dy + @d_za(Vz)/dz
```
After the solver was up and running, we noticed some things that could be optimized.
Temporary arrays that are used only once (`dVxdt`, `dVydt`, `dVzdt`, `dPdt`) were removed to reduce the memory footprint and memory access time, without increasing the amount of computational work.
`∇V`, although also a temporary array, was not inlined: recomputing the divergence at every point where it is used would significantly increase the computational work and could make the kernels compute-bound rather than memory-bound.
Constant divisions were also replaced by multiplications with precomputed inverses, and consecutive multiplications by constants were merged into single factors to reduce the computational work.
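The merged constants appearing in the kernels below (`dt2μ_dx`, `dtμ_dy`, `dt_ρ_dx`, ...) are set up in the main script; the following is a sketch of what they amount to, inferred from the kernels (the values shown are illustrative, not the ones used in the runs):

```julia
# Illustrative values; the actual ones are set in scripts-part2/elastic_wave_3D.jl.
dt, dx, dy, dz = 1e-3, 0.1, 0.1, 0.1
ρ, μ, K        = 1.0, 1.0, 1.0

# Merged constants, inferred from the optimized kernels below.
_dx, _dy, _dz = 1.0/dx, 1.0/dy, 1.0/dz   # inverse grid spacings
dtK     = dt*K                           # pressure update
dt1_3   = dt/3.0                         # dt * 1/3   (divergence term in normal stresses)
dt2μ_dx = dt*2.0*μ/dx                    # dt * 2μ/dx (analogously dt2μ_dy, dt2μ_dz)
dtμ_dx  = dt*μ/dx                        # dt * μ/dx  (shear stresses; also dtμ_dy, dtμ_dz)
dt_ρ_dx = dt/ρ/dx                        # dt/(ρ dx)  (velocity updates; also dt_ρ_dy, dt_ρ_dz)
```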
The resulting kernels after optimizations:
```julia
@all(τxx) = @all(τxx) + @d_xi(Vx)*dt2μ_dx - dt1_3*@inn_yz(∇V)
@all(τyy) = @all(τyy) + @d_yi(Vy)*dt2μ_dy - dt1_3*@inn_xz(∇V)
@all(τzz) = @all(τzz) + @d_zi(Vz)*dt2μ_dz - dt1_3*@inn_xy(∇V)
@all(τxy) = @all(τxy) + @d_yi(Vx)*dtμ_dy + @d_xi(Vy)*dtμ_dx
@all(τyz) = @all(τyz) + @d_zi(Vy)*dtμ_dz + @d_yi(Vz)*dtμ_dy
@all(τzx) = @all(τzx) + @d_xi(Vz)*dtμ_dx + @d_zi(Vx)*dtμ_dz
@inn(Vx) = @inn(Vx) - (@d_xi(P)-@d_xa(τxx))*dt_ρ_dx + @d_ya(τxy)*dt_ρ_dy + @d_za(τzx)*dt_ρ_dz
@inn(Vy) = @inn(Vy) - (@d_yi(P)-@d_ya(τyy))*dt_ρ_dy + @d_za(τyz)*dt_ρ_dz + @d_xa(τxy)*dt_ρ_dx
@inn(Vz) = @inn(Vz) - (@d_zi(P)-@d_za(τzz))*dt_ρ_dz + @d_xa(τzx)*dt_ρ_dx + @d_ya(τyz)*dt_ρ_dy
@all(∇V) = @d_xa(Vx)*_dx + @d_ya(Vy)*_dy + @d_za(Vz)*_dz
@all(P) = @all(P) - dtK*@all(∇V)
```
Most modern hardware can execute many FP64 operations in the time it takes to perform a single main-memory access. Since the optimized kernels perform roughly as many memory accesses as floating-point operations, we expect this problem to be memory-bound.
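As a rough estimate rather than a measured roofline, the machine balance of the RTX 3060 used for the GPU benchmark below [2] is

$$
\frac{199\ \mathrm{GFLOP/s}}{360\ \mathrm{GB/s}\,/\,8\ \mathrm{B}} \approx 4.4\ \text{FP64 operations per double transferred},
$$

whereas the kernels perform on the order of one operation per double accessed, so memory traffic dominates.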
The relevant files for this part are listed in the tree below.
```
.
├── docs
│   ├── img                                # Automatically-generated plots and animations are saved here
│   └── part2.md                           # You are here :)
├── scripts-part2
│   ├── elastic_wave_3D.jl                 # Main file, contains the solver code and physical parameters
│   ├── elastic_wave_3D_benchmark_cpu.jl   # CPU performance benchmark (calls the main file)
│   ├── elastic_wave_3D_benchmark_gpu.jl   # GPU performance benchmark (calls the main file)
│   ├── elastic_wave_3D_testing.jl         # File used for reference testing (calls the main file)
│   └── elastic_wave_3D_visualize.jl       # Render the simulation (calls the main file)
└── test
    ├── part2.jl                           # Reference tests and unit tests
    ├── reftest-files
    │   └── test_2.bson
    └── runtests.jl
```
All commands to replicate the results below should be run from the repository's top-level directory.
For visualization, we took a 2D slice of the 3D pressure field `P` at `z = Lz/2`. A resolution of 128×128×128 was used for the visualization, with physical properties ρ = 1.0, μ = 1.0, and K = 1.0. To replicate:
```sh
julia --project -O3 --check-bounds=no scripts-part2/elastic_wave_3D_visualize.jl
```
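A minimal sketch of how such a slice can be extracted and plotted (the array and variable names are illustrative assumptions, not necessarily those used in `elastic_wave_3D_visualize.jl`):

```julia
using Plots

# Sketch: `P` is the nx×ny×nz pressure field (copied to the host with `Array`
# when running on a GPU); `xc`, `yc` are cell-centre coordinate vectors.
P_h     = Array(P)
p_slice = P_h[:, :, size(P_h, 3) ÷ 2]      # 2D slice at z ≈ Lz/2
heatmap(xc, yc, p_slice'; xlabel="x", ylabel="y",
        title="P at z = Lz/2", aspect_ratio=1)
```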
We are using the performance metric proposed in the ParallelStencil.jl library.
In our case, the `A_eff` metric was calculated as follows:
```julia
# Effective main memory access per iteration [GB]
A_eff = 1e-9 * sizeof(Float64) * (
    2 * (length(τxx) + length(τyy) + length(τzz) + length(τxy) + length(τyz) + length(τzx))
    + 2 * (length(Vx) + length(Vy) + length(Vz))
    + 2 * (length(P))
)
```
`∇V` isn't included in the calculation as it is only used as a temporary.
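From `A_eff`, the effective memory throughput follows ParallelStencil's definition, `T_eff = A_eff / t_it`. A minimal sketch (the timing variable names are assumptions, not necessarily those used in the benchmark scripts):

```julia
# `time_s` is the measured wall time of `niter` timed iterations (warm-up excluded).
t_it  = time_s / niter   # wall time per iteration [s]
T_eff = A_eff / t_it     # effective memory throughput [GB/s]
```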
An AMD Ryzen™ 5 5600G (6C/12T, peak memory bandwidth T_peak = 47.68 GB/s [1]) was used for the CPU performance benchmark.
To run the CPU performance benchmark:
```sh
export JULIA_NUM_THREADS=<num_threads>
julia --project -O3 --check-bounds=no scripts-part2/elastic_wave_3D_benchmark_cpu.jl
```
Replace `<num_threads>` with the number of physical cores in your CPU.
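If you are unsure of that number, one way to query it (assuming the Hwloc.jl package is installed; it is not necessarily part of this project's environment):

```julia
using Hwloc
println(Hwloc.num_physical_cores())   # physical cores, excluding SMT threads
```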
An NVIDIA GeForce RTX™ 3060 (peak memory bandwidth T_peak = 360 GB/s, 199 GFLOPS FP64 [2]) was used for the GPU performance benchmark.
A resolution of 512×512×512 wasn't tested because it doesn't fit in the RTX 3060's 12 GB of VRAM.
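A back-of-the-envelope estimate (counting only the eleven field arrays used by the kernels above: six stresses, three velocities, `P`, and `∇V`) shows why:

$$
512^3 \times 8\,\mathrm{B} \approx 1.07\,\mathrm{GB\ per\ array},\qquad 11 \times 1.07\,\mathrm{GB} \approx 11.8\,\mathrm{GB},
$$

which leaves essentially no headroom in 12 GB once auxiliary allocations are included.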
To run the GPU performance benchmark:
```sh
julia --project -O3 --check-bounds=no scripts-part2/elastic_wave_3D_benchmark_gpu.jl
```
[1] https://en.wikichip.org/wiki/amd/ryzen_5/5600g
[2] https://www.techpowerup.com/gpu-specs/geforce-rtx-3060.c3682