GPU performance timings
Outstanding performance issues
- I could look into how much GPU memory I am actually using while running my code. It's not clear to me that it should be much, and yet I seem to have seen signs that it may be many GB. Is it just an issue with garbage collection, or what? (See the memory-query sketch after this list.)
- If I run with only 2 parallel volumes to reconstruct, it takes 1.2s instead of 5.8s for x30. That 1.2s looks like death by a thousand cuts: lots of Python overheads eating into the run time. I could consider trying to improve that, but ideally I'd rather just be able to say that it's not an issue for big problem sizes. It could be relevant for my PIV scenario, though!
- I need to check how the GPU scales to larger problems. If needed, I think I could do things one plane at a time (doing the iFFT and copying back to RAM) without incurring a huge overhead. (See the plane-at-a-time sketch after this list.)
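A minimal sketch of the check I have in mind for the first point, assuming the code allocates through cupy's default memory pool (report_gpu_memory is a hypothetical helper, not something already in my code):

    import cupy as cp

    def report_gpu_memory(label=""):
        # Bytes in live allocations held by cupy's default memory pool, plus the total the
        # pool is holding on to (including cached blocks that have been freed but not yet
        # returned to the driver).
        pool = cp.get_default_memory_pool()
        used = pool.used_bytes()
        held = pool.total_bytes()
        # Free/total device memory as reported by the CUDA runtime for the whole GPU.
        free_dev, total_dev = cp.cuda.runtime.memGetInfo()
        print(f"{label}: pool used {used/2**30:.2f} GB, pool held {held/2**30:.2f} GB, "
              f"device free {free_dev/2**30:.2f} of {total_dev/2**30:.2f} GB")
        # If 'held' is many GB but 'used' is small, the memory is just cupy's cache rather
        # than a leak; pool.free_all_blocks() would release it back to the driver.

Calling this before and after a reconstruction should distinguish genuine usage from cached, already-garbage-collected blocks.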
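And a sketch of the plane-at-a-time idea from the third point: do the inverse FFT for one z-plane at a time on the GPU and copy the result straight back to RAM, so only one plane of output lives in GPU memory at once (the function and argument names here are illustrative, not the real ones in my code):

    import numpy as np
    import cupy as cp

    def reconstruct_plane_by_plane(fourier_planes):
        # fourier_planes: iterable of 2D cupy arrays, one per z-plane (illustrative layout).
        planes = []
        for plane_ft in fourier_planes:
            plane = cp.fft.ifft2(plane_ft)           # inverse FFT for this plane only
            planes.append(cp.asnumpy(plane.real))    # copy back to host RAM; GPU buffer can be reused
        return np.stack(planes)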
Notes on GPU performance
For the benchmark work task (5.6s) the breakdown is (times in seconds; nested entries are included in their parents' totals):
0.4   InverseTransformBackwardProjection
4.9   skeleton::ProjectForZ
1.4     PrecalculateFFTArray
0.05    PrecalculateFH
2.9     base::ProjectForZ
2.5       special_fftconvolve
1.4         special_fftconvolve2_expand
0.5         special_fftconvolve2_mirror
0.5         special_fftconvolve2_nomirror
However, these numbers seem to be misleading. I can only assume things are running very asynchronously on the GPU and time is being charged to the wrong function (see the timing sketch below):
- I don't understand what's happening with the time in expand:
  - If I run convolve twice, or comment it out, the run time is affected a lot. The implication is that there is a lot of genuine work in there.
  - But if I run expand twice, the run time is barely affected.
These two things seem to contradict each other. Requires further investigation. I should start by repeating these tests to check I didn't do something silly. I don't think that running twice (or never) should have unintended consequences on the logic, but I suppose it's possible…
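Since kernel launches return to Python before the GPU has finished, a host-side timer can charge the wait to whichever later call happens to block. A sketch of how I could time one stage honestly, using CUDA events on the default stream via cupy (time_gpu is a hypothetical helper):

    import cupy as cp

    def time_gpu(fn, *args, n_repeats=10):
        fn(*args)                                 # warm-up: compilation, pool allocation
        start, stop = cp.cuda.Event(), cp.cuda.Event()
        start.record()
        for _ in range(n_repeats):
            fn(*args)
        stop.record()
        stop.synchronize()                        # wait until the GPU has actually finished
        return cp.cuda.get_elapsed_time(start, stop) / n_repeats   # milliseconds per call

Bracketing special_fftconvolve2_expand and special_fftconvolve like this should show whether the 1.4s charged to expand is genuine work or time inherited from earlier asynchronous launches.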
Observations
- In terms of blocks and alignment, apparently blocks should definitely be multiples of 32 threads, ideally 1024. Signed int types are slightly preferred. (See the launch-configuration sketch after this list.)
- I have disabled calls to cupy.cuda.runtime.deviceSynchronize, because that has a big impact on the run time. I am not completely sure how CUDA is smart enough to work out the dependencies between my kernels, but it really does seem to be giving correct results in the end…
- Blocking turned out to be really important for performance (especially in the x dimension), so I padded my arrays out to multiples of 8 (i.e. 32 bytes), which maintains good block sizes, good cache locality, and good alignment of subsequent rows. (Also shown in the launch-configuration sketch.)
- Python code can definitely execute in parallel with GPU code. My improvement to use stride_tricks in PrecalculateFFTArray() knocked a significant amount of time off the Python profile of that function, but didn't change the overall run time. Presumably that is because the limiting factor was the run time of the GPU code that was already in the pipeline at that point. (See the stride_tricks sketch after this list.)
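A sketch of the launch configuration and padding the first and third points imply, assuming float32 data and a cupy RawKernel (the kernel itself is omitted, so the launch lines are only indicative):

    import cupy as cp

    def padded_width(nx, multiple=8):
        # Round the x dimension up to a multiple of 8 elements (32 bytes for float32), so each
        # row starts on a 32-byte boundary and block sizes divide cleanly into 32-thread warps.
        return ((nx + multiple - 1) // multiple) * multiple

    ny, nx = 300, 250
    a = cp.zeros((ny, padded_width(nx)), dtype=cp.float32)    # padded working array (300 x 256)

    threads_per_block = 1024                                  # multiple of 32, ideally 1024
    blocks = (a.size + threads_per_block - 1) // threads_per_block
    # kernel = cp.RawKernel(kernel_source, "my_kernel")       # hypothetical kernel source/name
    # kernel((blocks,), (threads_per_block,), (a, cp.int32(a.size)))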
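For reference, the stride_tricks pattern from the last point builds a broadcast view instead of materialising a tiled copy; this is a generic illustration, not the actual PrecalculateFFTArray() code:

    import cupy as cp
    from cupy.lib.stride_tricks import as_strided

    factors = cp.arange(256).astype(cp.complex64)   # per-column factors (illustrative values)
    n_rows = 512
    # A (512, 256) view over the same 256 values: stride 0 along rows means no copy is made.
    tiled_view = as_strided(factors, shape=(n_rows, factors.size), strides=(0, factors.strides[0]))
    # Multiplying an FFT array by tiled_view avoids allocating and filling a full tiled array,
    # which is where the Python-side saving came from; the GPU pipeline remained the bottleneck.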
Notes on different GPUs
Google colab combinations I have encountered include:
- 2048 threads x 56 processors; clock speed 1.3285 GHz; mem speed 0.715 GHz x 512 B = 366.08 GB/s; L2 4.19 MB; total GPU RAM 17.07 GB
- 2048 threads x 13 processors; clock speed 0.8235 GHz; mem speed 2.505 GHz x 48 B = 120.24 GB/s; L2 1.57 MB
Optic-tectum:
- 2048 threads x 20 processors; clock speed 1.7335 GHz; mem speed 5.005 GHz x 32 B = 160.16 GB/s; L2 2.10 MB; total GPU RAM 8.51 GB