WIP: updated figures, speedup figures and values, formatting. Bit of rearranging
RSE-Sheffield · Aug 8, 2023 · 66dafa1 · 66dafa1
1 parent 82a8720
commit 66dafa1

Showing 10 changed files with 73 additions and 36 deletions.
diff --git a/_posts/2023-07-21-flamegpu2-h100-a100-v100-benchmarking.md b/_posts/2023-07-21-flamegpu2-h100-a100-v100-benchmarking.md
@@ -22,27 +22,15 @@ Todo list:
 + [ ] @todo - refine / finish text.
 + [ ] @todo - speed up values / table
 + [ ] @todo - V100 CUDA 11.8 results (and associated changes)
-+ [ ] @todo - fix image paths so they work in the website rather than standalone markdown file. I.e. ../assets/ to /assets/
 + [ ] @todo - rebase / squash history to avoid repo bloat
 -->
 
-Within the RSE group, a number of staff have been involved in developing [FLAME GPU 2][flamegpu-website]...
-
-[FLAME GPU 2][flamegpu2-repo] is an open-source GPU accelerated simulator for domain independent complex systems simulations using an agent-based modelling approach.
-Models are implemented using CUDA C++ or Python3, describing the behaviours of individuals within the simulation, and observing the emergent outcomes from these behaviours.
-<!-- Using agent-based modelling providing CUDA C++ and Python3 interfaces, abstracting the complexities of GPU programming away from the modeller. -->
-The underlying use of GPUs allows for much larger scale Agent Based Simulations than traditional CPU-based ABM frameworks with high levels of performance.
-
-Our [recent publication][doi.org/10.1002/spe.3207]: Richmond, P, Chisholm, R, Heywood, P, Chimeh, MK, Leach, M. FLAME GPU 2: A framework for flexible and performant agent based simulation on GPUs. Softw: Pract Exper. 2023; 53(8): 1659–1680. doi: 10.1002/spe.3207, provides a more in-depth look at FLAME GPU 2 including a broader range of benchmarking on V100 GPUs.
-
-We have previously 
-@todo - What is benchmarking, why 
-
 ## H100 GPUs now available at the University of Sheffield
 
 At the University of Sheffield, students and researchers have access to a number of GPU resources, available in local (Tier 3) and affiliated regional (Tier 2) HPC systems.
 
-As of the 7th of August 2023, [12 new H100 GPUs][h100-live-email] (6 nodes, 2 GPUs per node) have been added to the Stanage Tier 3 HPC facility and are available for all users. See the [Stanage HPC Documentation][stanage-using-gpus] for how to use these GPUs.
+**As of the 7th of August 2023, [12 new H100 GPUs][h100-live-email] (6 nodes, 2 GPUs per node) have been added to the Stanage Tier 3 HPC facility and are available for all users**. See the [Stanage HPC Documentation][stanage-using-gpus] for how to use these GPUs.
+This means that currently the following GPUs are available, free at the point of use, for members of the University:
 
 + [Stanage][stanage-gpus] (Tier 3, The University of Sheffield):
   + 60 public NVIDIA A100 SXM4 80GB GPUs
@@ -61,6 +49,19 @@ This means that for some workloads, the higher power A100 SXM4 GPUs may offer hi
 
 [Machine Learning focussed benchmarking of the H100 and A100 GPUs in Stanage][h100-rcg-ml-benchmark] has been carried out by Carl Kennedy and Nicholas Musembi of the Research and Innovation team in IT Services.
 
+## FLAME GPU 2
+
+Within the RSE group, a number of staff have been involved in developing [FLAME GPU 2][flamegpu-website]...
+
+[FLAME GPU 2][flamegpu2-repo] is an open-source GPU accelerated simulator for domain independent complex systems simulations using an agent-based modelling approach.
+Models are implemented using CUDA C++ or Python3, describing the behaviours of individuals within the simulation, and observing the emergent outcomes from these behaviours.
+<!-- Using agent-based modelling providing CUDA C++ and Python3 interfaces, abstracting the complexities of GPU programming away from the modeller. -->
+The underlying use of GPUs allows for much larger scale Agent Based Simulations than traditional CPU-based ABM frameworks with high levels of performance.
+
+Our [recent publication][doi.org/10.1002/spe.3207]: Richmond, P, Chisholm, R, Heywood, P, Chimeh, MK, Leach, M. FLAME GPU 2: A framework for flexible and performant agent based simulation on GPUs. Softw: Pract Exper. 2023; 53(8): 1659–1680. doi: 10.1002/spe.3207, provides a more in-depth look at FLAME GPU 2 including a broader range of benchmarking on V100 GPUs.
+
+To understand how FLAME GPU 2 performs across the range of GPUs available at the University, and to further guide the development of FLAME GPU 2, we have benchmarked the available GPUs using a synthetic agent-based model implemented in FLAME GPU 2.
+
 ## FLAME GPU 2 Circles Benchmark
 
 @todo - Circles model benchmark details, cross ref the recent pub.
@@ -69,7 +70,7 @@ https://github.com/FLAMEGPU/FLAMEGPU2-circles-benchmark
 
 Using `v2.0.0-rc`. 
 
-![FLAME GPU 2 Circles Benchmark visualisation screenshots](../assets/images/2023-07-21-flamegpu2-h100-a100-v100-benchmarking/flamegpu2-circles-progression-1800-1200.png)
+![Figure 1: FLAME GPU 2 Circles Benchmark visualisation screenshots](/assets/images/2023-07-21-flamegpu2-h100-a100-v100-benchmarking/flamegpu2-circles-progression-1800-1200.png)
 
 By default, FLAME GPU 2 is configured to build for all major GPU architectures known by the current version of NVCC.
 However, in the interest of reduced compilation time and binary file size, the benchmark repository was configured with `-DCMAKE_CUDA_ARCHITECTURES=70` for the V100 GPUs in Bessemer, and using `-DCMAKE_CUDA_ARCHITECTURES="80;90"` for the A100 and H100 GPUs in Stanage.
@@ -81,49 +82,85 @@ However, in the interest of reduced compilation time and binary file size, the b
 | H100 PCIe 80G | AMD EPYC 7413        | 11.8 | `80;90`                  |
 | A100 SXM4 80G | AMD EPYC 7413        | 11.8 | `80;90`                  |
 | V100 SXM2 32G | Intel Xeon Gold 6138 | 11.0 | `70`                     |
+{:.table.table-bordered.table-striped.table-hovered}
+
 
 Additionally, the benchmarks were configured with `-DFLAMEGPU_SEATBELTS=OFF` for increased performance, at the cost of much less helpful error messages.
 
+I.e. for the H100 runs:
+
+```bash
+# Clone the benchmark repository and cd into it
+git clone git@github.com:FLAMEGPU/FLAMEGPU2-circles-benchmark
+cd FLAMEGPU2-circles-benchmark
+# Create a build directory
+mkdir -p build && cd build
+# Configure CMake
+cmake .. -DCMAKE_CUDA_ARCHITECTURES="80;90" -DFLAMEGPU_SEATBELTS=OFF
+# Compile the binary
+cmake --build . -j `nproc`
+```
+
+```bash
+# Ensure run time compilation can find the correct include directory, ideally this wouldn't be required
+export FLAMEGPU2_INC_DIR=_deps/flamegpu2-src/include
+# Run the benchmark experiment, outputting csv files into the working directory
+./bin/Release/circles-benchmark
+```
+
 The generated binary runs multiple benchmark experiments to evaluate the performance of the FLAME GPU 2 Circles model.
 The most interesting of the benchmarks carried out scales the size of the 3D environment, while maintaining a consistent initial density of circle agents.
 The sweep parameters chosen resulted in simulations with between X and Y agents being benchmarked.
 
 Additionally, 4 versions of this benchmark were performed, using different communication techniques (Bruteforce, and Spatial 3D communication), and different optimisation techniques (offline compilation, and run-time compilation (RTC)).
 
-Each simulation ran N simulation steps, and was repeated M times.
+Each simulation ran 200 simulation steps, and was repeated 3 times to produce mean simulation runtimes.
 
 ## Benchmark Results
 
-The following figures show the mean simulation run time in seconds, for each of the described benchmarks.
+### Brute Force Communication
+
+Using the brute-force communication strategy:
+
+![Figure 2: Circles Bruteforce - Mean Simulation Time (s) against Population Size](/assets/images/2023-07-21-flamegpu2-h100-a100-v100-benchmarking/plot-h100-a100-v100-fixed-density-circles_bruteforce.png)
+
+![Figure 3: Circles Bruteforce RTC - Mean Simulation Time (s) against Population Size](/assets/images/2023-07-21-flamegpu2-h100-a100-v100-benchmarking/plot-h100-a100-v100-fixed-density-circles_bruteforce_rtc.png)
+
+@todo - fluffy description of the general shapes. Bigger gaps between GPUs are bigger scales. for brute
+
+### Spatial 3D Communication
+
+Using the much more work efficient Spatial 3D communication strategy:
 
-![Figure 1: Circles Bruteforce](../assets/images/2023-07-21-flamegpu2-h100-a100-v100-benchmarking/plot-h100-a100-v100-circles_bruteforce.png)
+![Figure 4: Circles Spatial3D - Mean Simulation Time (s) against Population Size](/assets/images/2023-07-21-flamegpu2-h100-a100-v100-benchmarking/plot-h100-a100-v100-fixed-density-circles_spatial3D.png)
 
-![Figure 2: Circles Spatial3D](../assets/images/2023-07-21-flamegpu2-h100-a100-v100-benchmarking/plot-h100-a100-v100-circles_spatial3D.png)
+![Figure 5: Circles Spatial3D RTC - Mean Simulation Time (s) against Population Size](/assets/images/2023-07-21-flamegpu2-h100-a100-v100-benchmarking/plot-h100-a100-v100-fixed-density-circles_spatial3D_rtc.png)
 
-![Figure 1: Circles Bruteforce RTC](../assets/images/2023-07-21-flamegpu2-h100-a100-v100-benchmarking/plot-h100-a100-v100-circles_bruteforce_rtc.png)
+@todo - fluffy description of the general shapes. Bigger gaps between GPUs are bigger scales. for spatial3D
 
-![Figure 2: Circles Spatial3D RTC](../assets/images/2023-07-21-flamegpu2-h100-a100-v100-benchmarking/plot-h100-a100-v100-circles_spatial3D_rtc.png)
+### Relative Performance against V100 SXM2 and CUDA 11.0
 
+For simulations at the largest scale benchmarked, containing 1 million agents, the A100 GPU was between X and Y times faster than the V100 GPU, and the H100 showed relative performance speedup of @todo and @todo, as shown in the following figure and tables:
 
-@todo - fluffy description of the general shapes. Bigger gaps between GPUs are bigger scales.
 
-@todo - Speedupl plots? 
+![Figure 6: Circles Benchmark relative Speedup against V100 SXM2 CUDA 11.0](/assets/images/2023-07-21-flamegpu2-h100-a100-v100-benchmarking/plot-speedup-v100-fixed-density-max-pop-V100_SXM2_CUDA_11.0.png)
 
-For simulations at the largest scale benchmarked, containing 1 million agents, the A100 GPU was between X and Y times faster than the V100 GPU, and the H100 showed relative performance speedup of @todo and @todo, as shown in the following tables:
 
-| Benchmark      | V100 (s) | A100 (s) | H100 (s) |
-|:---------------|---------:|---------:|---------:|
-| Bruteforce     | | | |
-| Spatial3D      | | | |
-| Bruteforce RTC | | | |
-| Spatial3D RTC  | | | |
+| Benchmark              |   V100 SXM2 CUDA 11.0 |   A100 SXM4 CUDA 11.8 |   H100 PCIe CUDA 11.8 |
+|:-----------------------|----------------------:|----------------------:|----------------------:|
+| circles_bruteforce     |              1320.742 |              1071.347 |               944.221 |
+| circles_bruteforce_rtc |               693.069 |               648.341 |               481.789 |
+| circles_spatial3D      |                 1.300 |                 0.685 |                 0.551 |
+| circles_spatial3D_rtc  |                 0.757 |                 0.544 |                 0.428 |
+{:.table.table-bordered.table-striped.table-hovered}
 
-| Benchmark      | V100 Relative Speedup | A100 Relative Speedup | H100 Relative Speedup |
-|:---------------|---------:|---------:|---------:|
-| Bruteforce     | 1.00 | | |
-| Spatial3D      | 1.00 | | |
-| Bruteforce RTC | 1.00 | | |
-| Spatial3D RTC  | 1.00 | | |
+| Benchmark              |   V100 SXM2 CUDA 11.0 |   A100 SXM4 CUDA 11.8 |   H100 PCIe CUDA 11.8 |
+|:-----------------------|----------------------:|----------------------:|----------------------:|
+| circles_bruteforce     |                 1.000 |                 1.233 |                 1.399 |
+| circles_bruteforce_rtc |                 1.000 |                 1.069 |                 1.439 |
+| circles_spatial3D      |                 1.000 |                 1.898 |                 2.362 |
+| circles_spatial3D_rtc  |                 1.000 |                 1.393 |                 1.770 |
+{:.table.table-bordered.table-striped.table-hovered}
 
 ## Summary
 

diff --git a/...lamegpu2-h100-a100-v100-benchmarking/plot-h100-a100-v100-circles_bruteforce.png b/...lamegpu2-h100-a100-v100-benchmarking/plot-h100-a100-v100-circles_bruteforce.png
diff --git a/...gpu2-h100-a100-v100-benchmarking/plot-h100-a100-v100-circles_bruteforce_rtc.png b/...gpu2-h100-a100-v100-benchmarking/plot-h100-a100-v100-circles_bruteforce_rtc.png
diff --git a/...flamegpu2-h100-a100-v100-benchmarking/plot-h100-a100-v100-circles_spatial3D.png b/...flamegpu2-h100-a100-v100-benchmarking/plot-h100-a100-v100-circles_spatial3D.png
diff --git a/...egpu2-h100-a100-v100-benchmarking/plot-h100-a100-v100-circles_spatial3D_rtc.png b/...egpu2-h100-a100-v100-benchmarking/plot-h100-a100-v100-circles_spatial3D_rtc.png
diff --git a/...a100-v100-benchmarking/plot-h100-a100-v100-fixed-density-circles_bruteforce.png b/...a100-v100-benchmarking/plot-h100-a100-v100-fixed-density-circles_bruteforce.png
diff --git a/...-v100-benchmarking/plot-h100-a100-v100-fixed-density-circles_bruteforce_rtc.png b/...-v100-benchmarking/plot-h100-a100-v100-fixed-density-circles_bruteforce_rtc.png
diff --git a/...-a100-v100-benchmarking/plot-h100-a100-v100-fixed-density-circles_spatial3D.png b/...-a100-v100-benchmarking/plot-h100-a100-v100-fixed-density-circles_spatial3D.png
diff --git a/...0-v100-benchmarking/plot-h100-a100-v100-fixed-density-circles_spatial3D_rtc.png b/...0-v100-benchmarking/plot-h100-a100-v100-fixed-density-circles_spatial3D_rtc.png
diff --git a/...00-benchmarking/plot-speedup-v100-fixed-density-max-pop-V100_SXM2_CUDA_11.0.png b/...00-benchmarking/plot-speedup-v100-fixed-density-max-pop-V100_SXM2_CUDA_11.0.png
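
Reviewer note: the relative speedup table added in this commit can be cross-checked against the mean runtime table by dividing each V100 time by the corresponding A100 or H100 time. The sketch below is a minimal illustration of that arithmetic, not the script used to generate the post's tables; the dictionary layout and the `relative_speedup` helper are assumptions made for this example. Because it starts from the rounded runtimes printed in the post, its results can differ from the published speedups in the third decimal place.

```python
# Mean simulation times in seconds at the largest population size,
# copied from the runtime table added in this commit.
mean_runtime_s = {
    "circles_bruteforce":     {"V100": 1320.742, "A100": 1071.347, "H100": 944.221},
    "circles_bruteforce_rtc": {"V100": 693.069,  "A100": 648.341,  "H100": 481.789},
    "circles_spatial3D":      {"V100": 1.300,    "A100": 0.685,    "H100": 0.551},
    "circles_spatial3D_rtc":  {"V100": 0.757,    "A100": 0.544,    "H100": 0.428},
}


def relative_speedup(runtimes, baseline="V100"):
    """Hypothetical helper: baseline time divided by each device's time."""
    base = runtimes[baseline]
    return {gpu: base / seconds for gpu, seconds in runtimes.items()}


# One row of speedups per benchmark, relative to the V100 result.
speedups = {name: relative_speedup(row) for name, row in mean_runtime_s.items()}

for name, row in speedups.items():
    cells = "".join(f"{row[gpu]:>8.3f}" for gpu in ("V100", "A100", "H100"))
    print(f"{name:<24}{cells}")
```

A value above 1 means the device finished faster than the V100 baseline, so the baseline column is always 1.000.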