Spatter Benchmark Workflow
This document provides an overview of the process to perform memory performance analysis using the Spatter benchmark. The goal is to provide a series of steps for getting a new user up and running with Spatter quickly.
Memory analysis of an application typically involves identifying the areas within the application where memory operations are the biggest contributors to application latency. For the purposes of Spatter benchmarking, these are areas that perform a significant number of sparse memory accesses via gather and scatter instructions. We can use Spatter to determine how the gather/scatter instructions in these hot spots are affected by changes to the application code, compiler, memory hierarchy, or CPU or GPU architecture. To do this, we will use Spatter to extract memory bandwidth metrics and compare them across platforms as we make these changes.
- The first step in this process is therefore to identify application hotspots containing gather and/or scatter instructions.
- Secondly, generate memory traces which capture these accesses.
- Thirdly, we will use these traces to generate concise gather/scatter patterns representing these accesses.
- Fourthly, we will pass the associated application patterns to Spatter, which will compute bandwidth metrics.
- And finally, we will compare the metrics across Spatter runs for the application as we change the platform.
In order to identify memory hot spots, we will need to profile an execution of the target application. This can be done with a performance analyzer such as Intel’s VTune.
Intel’s VTune can be downloaded from: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler-download.html
It can be used to perform a Hot Spot analysis of a given execution to identify areas of interest. The Hot Spots will isolate the top contributors to application latency. For memory bound applications this will identify the key functions whose memory accesses need to be traced.
Please see Intel VTune documentation here: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler-documentation.html for more details.
Once we have identified the functions within the target application that we want to analyze using Spatter, we next need to extract the raw memory traces from an execution of these areas/hot spots. To do this, we will need to instrument these functions during an execution of the application. The execution should be done on the platform we choose as the baseline for any comparisons. It should also exercise gather and scatter instructions so that we can capture these in the trace. This can be done with Intel Pin, located here: https://www.intel.com/content/www/us/en/developer/articles/tool/pin-a-binary-instrumentation-tool-downloads.html
Pin is an instrumentation tool that can be built on various platforms and allows us to instrument parts of an application at runtime. For gather and scatter analysis, we are interested in memory instrumentation, which can be performed using the ImemROI Pintools available from the gs_patterns repository on GitHub at: https://github.com/lanl/gs_patterns. The ImemROI Pintools sources need to be added to the Pin source tree (under source/tools) and to the Pin makefile so that they are built along with Pin. Pin then needs to reference the resulting ImemROI shared libraries in order to trace memory accesses for gather and scatter instructions.
Once we have Pin and the ImemROI Pintools built, we need to provide the names of the functions we wish to trace in a plain text file called “roi_funcs.txt” (one function per line). This file should be located in the directory where we will be running Pin.
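As a minimal sketch of the step above, the function list can be written with printf, one name per line. The two function names are the ones traced in the example Pin run shown in this document and stand in for your own hot spot functions. The file name follows the roi_funcs.txt spelling used later in this document; treat it as an assumption and adjust the variable if your ImemROI build expects a different name.

```shell
# Name of the function list file (assumption: adjust to what your Pintool reads).
roi_file=roi_funcs.txt

# Write one function name per line. These names are placeholders taken from
# the example Pin output in this document; replace them with your hot spots.
printf '%s\n' gather_smallbuf_serial scatter_smallbuf_serial > "$roi_file"

# Show the result so it can be checked before running Pin.
cat "$roi_file"
```

Run this in the directory from which Pin will be launched, since that is where the file is expected to be.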
We can then execute Pin with ImemROIThreads.so as follows:
$PIN_DIR/pin -t $PIN_DIR/source/tools/ImemROI/obj-intel64/ImemROIThreads.so -- <application> <args>
Where PIN_DIR is set to the root of the Pin source tree.
Similarly, we can also run the application under Pin with the ImemROIThreadsRange.so and ImemInscount.so.
ImemROIThreadsRange.so requires a text file called “roi_range.txt” which contains a starting point (start instruction count), an end point (end instruction count), and a total instruction count. The total, which can be taken from a run of ImemInscount, is used to provide feedback on the percentage complete.
ImemInscount.so provides a count of memory instructions and a count of total instructions, both of which can be helpful. The total instruction count can be provided to ImemROIThreadsRange as mentioned above. It is also written out to a file called “inscount.out” located in the current directory. The total memory instruction count can be used to independently identify the application paths which exhibit the most significant number of memory accesses (for example, in the absence of VTune) or as a sanity check on the Pin instrumentation.
Further details are located in the readme provided here: https://github.com/lanl/gs_patterns/blob/main/pin_tracing/README.md
Example Pin execution using ImemROIThreads Pintool:
PIN Tool: ImemThreads.so
PIN -- ROI_FUNC[ 0]: gather_smallbuf_serial
PIN -- ROI_FUNC[ 1]: scatter_smallbuf_serial
< === Application output removed for brevity === >
PIN -- *** gather_smallbuf_serial ***
PIN -- ROI Instrs 754864
PIN -- ROI MemInstrs 270424
PIN -- ROI G/S instrs 0
PIN -- TraceFile roitrace.00.gather_smallbuf_serial.bin
PIN --
PIN -- *** scatter_smallbuf_serial ***
PIN -- ROI Instrs 0
PIN -- ROI MemInstrs 0
PIN -- ROI G/S instrs 0
PIN -- TraceFile roitrace.01.scatter_smallbuf_serial.bin
PIN --
Note: In the above Pin output, “ROI G/S instrs” is 0 because the ImemROI Pintools may not accurately detect gather/scatter instructions. However, the trace files do preserve the details, so gs_patterns (Section 4) can extract gather/scatter calls from these files.
After executing the trace, Pin will produce a set of files, one for each function we provided in roi_funcs.txt. These files will be named roitrace.##.<function>.bin (for example: roitrace.01.myfunction.bin) and will contain the traces of memory accesses performed by the given function during the application execution. Traces are written in the DynamoRIO trace format, which is a binary representation. More details on DynamoRIO are available here: https://dynamorio.org/
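To check which trace files a run produced, the file names can be split back into their index and function name. The helper below is a hypothetical convenience, not part of Pin or gs_patterns, and it assumes the naming seen in the example Pin output (roitrace.00.gather_smallbuf_serial.bin).

```shell
# list_traces [DIR]: print "index<TAB>function<TAB>path" for each trace file
# in DIR (default: current directory). Hypothetical helper for illustration.
list_traces() {
    for f in "${1:-.}"/roitrace.*.bin; do
        [ -e "$f" ] || continue    # the glob matched nothing
        name=${f##*/}              # strip the directory part
        rest=${name#roitrace.}     # e.g. "00.gather_smallbuf_serial.bin"
        idx=${rest%%.*}            # e.g. "00"
        func=${rest#*.}            # e.g. "gather_smallbuf_serial.bin"
        func=${func%.bin}          # e.g. "gather_smallbuf_serial"
        printf '%s\t%s\t%s\n' "$idx" "$func" "$f"
    done
}
```

Calling list_traces with no argument in the directory where Pin was run lists every trace found there. Already compressed .bin.gz files are skipped because they do not match the glob.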
In order to utilize Spatter, we will need to provide it with a description of the memory access patterns of our application. We can do this manually using the Spatter command line, or we can use gs_patterns to generate patterns from application traces. This section describes the gs_patterns approach. Please refer to Section 5 for details on how to provide patterns to Spatter using the command line.
As mentioned above, the gs_patterns tool, which can be downloaded from https://github.com/lanl/gs_patterns, can generate a pattern specification compatible with Spatter from DynamoRIO trace files such as those produced by Intel’s Pin.
To generate a Spatter pattern specification using gs_patterns, we first need to compress the DynamoRIO trace file generated by Pin (in Section 3) using gzip. We can then execute gs_patterns as follows:
gs_patterns <roitrace.##.func.bin.gz> <binary>
PLEASE NOTE: gs_patterns requires the application binary to contain debug symbols. The output of gs_patterns is a pattern specification in the form of a JSON file. The JSON file is named according to the provided trace file but with a “.json” extension. The JSON file can be passed directly to Spatter in order to report bandwidth (see Section 5).
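The compression step mentioned above can be sketched as a small wrapper around gzip. The function name compress_trace is hypothetical. Keeping the uncompressed original is a choice made here for convenience and is not something gs_patterns requires.

```shell
# compress_trace FILE: write a gzip-compressed copy of FILE to FILE.gz and
# leave the original in place. Hypothetical helper for illustration.
compress_trace() {
    gzip -c "$1" > "$1.gz"
}

# Example (file name taken from the Pin output shown earlier):
#   compress_trace roitrace.00.gather_smallbuf_serial.bin
```

The resulting .bin.gz file is what is passed as the first argument to gs_patterns.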
To run Spatter, we need to provide a specification of the application’s gather/scatter access patterns. This can be provided either directly through command line arguments or in a JSON file which contains the patterns. For cases where we have generated the patterns via gs_patterns, e.g. from application traces, we will typically use the latter approach. More details on the command line arguments are provided here: https://github.com/hpcgarage/spatter
Having generated a pattern file using gs_patterns (Section 4), we can pass this JSON file to Spatter as follows:
spatter -pFILE=<pattern_file.json> … <other spatter args>
At this point Spatter will generate gather and scatter memory access requests according to the pattern specified and output the results. The results will include the pattern specification as well as the bytes read or written, the elapsed time and the computed bandwidth. Spatter will also provide some statistics such as min, max, average bandwidth across the various patterns, and the standard error.
Example Spatter output:
Reading patterns from roitrace.02.gather_smallbuf_serial.bin.json.
argv[3]: --pattern=
Running Spatter version 1.0
Compiler: GNU ver. 8.5.0
Compiler Location: /usr/bin/gcc
Backend: Aggregate Results? YES
Run Configurations
[ {'name':'CUSTOM', 'kernel':'Gather', 'pattern':[0,2,4,6,8,10,12,14… ], 'pattern_gather':[], 'pattern_scatter':[], 'delta':8, 'deltas_gather':[], 'deltas_scatter':[], 'length':1, 'agg':10, 'wrap':1, } ]
config bytes time(s) bw(MB/s)
0 720896 5.651e-05 12756.963369
Min 25% Med 75% Max
12757 12757 12757 12757 12757
H.Mean H.StdErr
12757 0
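As a sanity check on how the columns relate, the reported bandwidth can be recomputed from the bytes and time columns. The sketch below assumes that 1 MB means 10^6 bytes, which is inferred from the numbers in the example output rather than stated by Spatter. The function name bandwidth_mbs is hypothetical.

```shell
# bandwidth_mbs BYTES SECONDS: print BYTES / SECONDS in MB/s,
# assuming 1 MB = 10^6 bytes (an inference from the example output).
bandwidth_mbs() {
    awk -v b="$1" -v t="$2" 'BEGIN { printf "%.2f\n", b / t / 1e6 }'
}

# Values copied from the config 0 row of the example output above.
bandwidth_mbs 720896 5.651e-05    # prints 12756.96
```

The result agrees with the 12756.963369 MB/s that Spatter printed for this configuration.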
The workflow has mainly focused on the serial (non-multi-threaded) backend; however, Spatter also supports benchmarking other backends, such as CUDA and OpenMP, which are multi-threaded.
These backends can be included when building Spatter and selected at runtime with the --device argument.
TODO: Document changes to the workflow related to these other backends. (Pin for example selects 1 thread for instrumentation)
Once we have Spatter results for our baseline platform, we can run the same memory access patterns through Spatter on other platforms or architectures and compare just the memory bandwidths between them. This ensures that the comparison reflects only memory bandwidth performance, isolated from application-related timing and other variances which could skew the results.
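One simple way to express such a comparison is the ratio of the bandwidth on the new platform to the bandwidth on the baseline. The sketch below is only an illustration: relative_bw is a hypothetical helper, the baseline value is the H.Mean from the example output earlier in this document, and the second value is a made-up placeholder rather than a measurement.

```shell
# relative_bw BASELINE OTHER: print OTHER / BASELINE to two decimals.
# Both arguments are bandwidths in the same unit (for example MB/s).
relative_bw() {
    awk -v base="$1" -v other="$2" 'BEGIN { printf "%.2f\n", other / base }'
}

# 12757 is the H.Mean from the example Spatter output in this document.
# 25514 is a placeholder for illustration only, not a measured result.
relative_bw 12757 25514    # prints 2.00
```

A ratio above 1 means the second platform sustained a higher bandwidth for the same pattern, and a ratio below 1 means it sustained a lower one.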
TODO: Discuss comparisons to the STREAM benchmark. https://www.cs.virginia.edu/stream/
To identify hotspots in CUDA-based applications, we can use the Nsight Compute tool from NVIDIA, located at https://developer.nvidia.com/nsight-compute. This tool provides many important metrics, including Occupancy. Low Occupancy can commonly indicate memory-bound areas, or areas where frequent access to global memory negatively impacts performance. Once the target kernel has been identified, the next step is to trace its execution, similar to what is done for the CPU use case.
To trace a GPU kernel, we can use the gsnv_trace.so NVBit tool provided in the gs_patterns repo at https://github.com/hpcgarage/gs_patterns. This will also generate the patterns output as a JSON file.
The gsnv_trace.so tool can be configured by setting the GSNV_CONFIG_FILE environment variable to the path of a config file.
The config file should have 1 configuration setting per line. Configuration settings take the form "<CONFIG_ITEM> <CONFIG_VALUE>" where there is a space between the config item and its value.
To target a specific kernel, we can set GSNV_TARGET_KERNEL to the name of the kernel we identified as the memory hotspot. NOTE: If no kernel is provided, gsnv_trace.so will instrument and trace all kernels.
Example:
echo "GSNV_LOG_LEVEL 1" > ./gsnv_config.txt
echo "GSNV_TRACE_OUT_FILE trace_file.nvbit.bin" >> ./gsnv_config.txt
echo "GSNV_TARGET_KERNEL SweepUCBxyzKernel" >> ./gsnv_config.txt
echo "GSNV_FILE_PREFIX trace_file" >> ./gsnv_config.txt
export GSNV_CONFIG_FILE=./gsnv_config.txt
LD_PRELOAD=$NVBIT_DIR/tools/gsnv_trace/gsnv_trace.so <application> <application options>
gzip trace_file.nvbit.bin
If a previous tracing of a kernel has already yielded an NVBit trace file, this file can be passed to the gs_patterns binary to generate the memory pattern output.
export GSNV_CONFIG_FILE=./gsnv_config.txt
gs_patterns trace.nvbit.bin.gz -nv
To run Spatter for a GPU-based pattern, we can use the pattern file generated by gs_patterns in the previous steps and simply run Spatter with the GPU backend.
NOTE: Spatter must have been built with the CUDA backend and must be given a reasonable local-work-size (the default is 1024).
spatter -pFILE=<pattern_file.json> -b CUDA … <other spatter args>