
This page collects the results from the different implementations on different systems, allowing everyone to see how they compare.

Results

To standardise the results we have chosen the Gray-Scott system, with toroidal (wrap-around) topology. The figures are in million cell-generations per second (Mcgs).
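For reference, these are the standard Gray-Scott equations being timed, where u and v are the two chemical concentrations, D_u and D_v their diffusion rates, F the feed rate and k the kill rate:

```latex
\frac{\partial u}{\partial t} = D_u \nabla^2 u - u v^2 + F(1-u)
\qquad
\frac{\partial v}{\partial t} = D_v \nabla^2 v + u v^2 - (F+k)\,v
```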

| Implementation | System 1 | System 2 | System 3 | System 3b | System 4 | System 5 | System 6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GrayScott | 84 | 117 | 67 | | 126 | 14 | 9 |
| GrayScott_double | 87 | 115 | 69 | | 127 | 10 | 9 |
| GrayScott_OpenCV | 97 | 117 | | | | | |
| GrayScott_OpenMP | 250 | 163 | 98 | | 250 | 10 | 8 |
| GrayScott_SSE | 540 | 450 | 315 | | 490 | 18 | 27 |
| GrayScott_SSE_OpenMP | 1000 | 160-230 [1] | 64-79 [1] | | 230-728 [1] | 23 | 24 |
| GrayScott_HWIVector (1 thread) | 540 | 455 | 285 | | 478 | 26 | 27 |
| GrayScott_HWIVector (2 threads) | 950 | 750 | 478 | | 838 | 23 | 27 |
| GrayScott_HWIVector (all threads) | 1710 | 825 | 675 | | 1268 | | |
| GrayScott_OpenCL | 3400 | 137 | 201 | 27 | 1760 | - | - |
| GrayScott_OpenCL_Local | 3380 | 54 | 76 | 37 | 1400 | - | - |
| GrayScott_OpenCL_2x2 | 2880 | 339 | 423 | 112 | 1860 | - | - |
| GrayScott_OpenCL_Image | 2300 | [2] | [2] | 324 | [2] | - | - |
| GrayScott_OpenCL_Image_2x2 | 4400 | [2] | [2] | 437 | [2] | - | - |

Other systems: please add your results.

System 1: Visual Studio 2008 (32-bit), Windows 7 (64-bit), Intel i7-2600 (4 cores, 8 threads) @ 3.4 GHz, NVIDIA GeForce GTX 460 (962 MB global memory, 48 KB local memory, local memory type: local) (Tim's desktop)

System 2: CMake version 2.8.5, GCC 4.2.1, Mac OS 10.6.8 (64-bit), Intel Core i7-2720QM (4 cores, 8 threads) @ 2.2 GHz, AMD Radeon HD 6750 QM with 1024 MB VRAM (Robert's mobile)

System 3: CMake version 2.8.5, GCC 4.2.1, Mac OS 10.6.8 (64-bit), Dual Intel Xeon E5520 (8 cores, 16 threads) @ 2.27 GHz, ATI Radeon HD 5770 with 1024 MB VRAM (Robert's desktop)

System 3b is the same as System 3 but using its second graphics card: an NVIDIA GeForce GT 120 with 512 MB VRAM.

System 4: CMake version 2.8.5, GCC 4.2.1, Mac OS 10.6.8 (64-bit), Intel Core i5 (4 cores, 4 threads) @ 3.1 GHz, AMD Radeon HD 6970M with 1024 MB VRAM (Andrew)

System 5: GCC 4.6.1, Debian Linux (32-bit), Intel Pentium 4 HT @ 3.2 GHz, AMD Radeon Mobility 9100 IGP (Tim's old laptop)

System 6: Visual Studio 2008, Windows XP (32-bit), Intel Pentium 4 HT @ 3.2 GHz, AMD Radeon Mobility 9100 IGP (Tim's old laptop)

Notes:

1. Speeds up as the spots fill the grid: while the grid is mostly empty, the values there decay into denormals, a known performance issue on Intel processors.

2. Does not run: the OS and/or GPU driver does not support a required OpenCL feature.

Issues:

  • Not all implementations currently support the toroidal topology.
  • The grid size affects the speed, and 256x256 isn't big enough for some implementations. For example, at 1024x1024 the _OpenCL_Image_2x2 version reaches 4200 fps, which processes as many cells per second as 67,000 fps would on a 256x256 grid (16x the area), yet on an actual 256x256 grid it only achieves 36,000 fps.

History

Our first implementation (GrayScott) suffered from variable speed: different patterns would run at different rates, despite the fact that nothing in the code depended on the values of the floats being manipulated.

We tried using doubles instead of floats (GrayScott_double) and this helped enormously. There was still a speed variation, but it was much smaller.

Eventually we worked out that the problem was denormals: float values in the empty spaces, where no Gray-Scott spots were found, were becoming smaller and smaller until they fell into the denormal range, which many CPUs handle very slowly. Adding a tiny constant got rid of the problem, giving us a speed of 84 Mcgs on System 1. Now the float and the double versions run at about the same speed.
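A minimal sketch of the fix, assuming a per-cell update function; the names and the exact constant are illustrative rather than the project's code:

```cpp
// Sketch of the denormal fix: adding a tiny constant after each update
// stops u from decaying into the denormal range in empty regions.
inline float update_u(float u, float v, float laplacian_u,
                      float D_u, float F, float dt)
{
    const float BIAS = 1e-10f;  // tiny compared to real concentrations (illustrative value)
    float du = D_u * laplacian_u - u * v * v + F * (1.0f - u);
    return u + dt * du + BIAS;  // BIAS keeps u away from denormal territory
}
```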

The most obvious way to speed up reaction-diffusion (RD) code is to parallelise it. Using OpenMP resulted in a 3x speedup on System 1. Not brilliant, but it's very simple: just add a single #pragma line before a for loop, see GrayScott_OpenMP and the sketch below.
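The change really is just one line. A minimal sketch, with illustrative names and a diffusion-only update for brevity:

```cpp
#include <omp.h>

// The single pragma is the whole parallelisation: rows are shared out
// across the available hardware threads.
void update(const float* in, float* out, int width, int height)
{
    #pragma omp parallel for
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // toroidal (wrap-around) neighbour indices
            int xm = (x - 1 + width) % width,   xp = (x + 1) % width;
            int ym = (y - 1 + height) % height, yp = (y + 1) % height;
            float lap = in[y * width + xm] + in[y * width + xp]
                      + in[ym * width + x] + in[yp * width + x]
                      - 4.0f * in[y * width + x];
            out[y * width + x] = in[y * width + x] + 0.1f * lap;
        }
    }
}
```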

OpenCL is perhaps the most promising way to get RD to run fast, using the many cores on a graphics card. Our initial implementation, GrayScott_OpenCL, gave a 10x speedup over the single-core version on System 1. (Initially we thought it was more, because of the denormals issue.)

But a couple of people (Tom and Robert) suggested exploring SSE first, rather than jumping straight to OpenCL. This turned out to be good advice: our GrayScott_SSE version is 6x faster than the base version on System 1, still running on a single core. Putting OpenMP on top took us to 12x faster than the base version, faster than our OpenCL implementation! The inner loop looks roughly like the sketch below.
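A sketch of the SSE approach, processing four adjacent floats per instruction (illustrative names, diffusion-only update, interior cells only; not the exact GrayScott_SSE code):

```cpp
#include <xmmintrin.h>

// Diffusion-only stand-in: one row of the grid, 4 cells per iteration.
// Assumes width is a multiple of 4; edge wrap-around handled elsewhere.
void diffuse_row_sse(const float* up, const float* row, const float* down,
                     float* out, int width)
{
    const __m128 four = _mm_set1_ps(4.0f);
    const __m128 rate = _mm_set1_ps(0.1f);
    for (int x = 4; x < width - 4; x += 4) {
        __m128 c = _mm_loadu_ps(row + x);
        __m128 lap = _mm_add_ps(
            _mm_add_ps(_mm_loadu_ps(row + x - 1), _mm_loadu_ps(row + x + 1)),
            _mm_add_ps(_mm_loadu_ps(up + x),      _mm_loadu_ps(down + x)));
        lap = _mm_sub_ps(lap, _mm_mul_ps(four, c));
        _mm_storeu_ps(out + x, _mm_add_ps(c, _mm_mul_ps(rate, lap)));
    }
}
```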

In GrayScott_HWIVector we have wrapped all the SSE calls in a set of macros, so that the code can also run, via scalar emulation, on CPUs that don't support SSE. We also wrap the threading, so that it can be used on different platforms. Together these give great performance: 1.7x that of GrayScott_SSE_OpenMP on System 1. The wrapping idea is sketched below.
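An illustrative sketch of the macro-wrapping idea; the real HWIVector macro and type names may differ:

```cpp
// Each macro maps to an SSE intrinsic when SSE is available, and to a
// plain scalar loop otherwise, so the same kernel code compiles everywhere.
#ifdef __SSE__
  #include <xmmintrin.h>
  typedef __m128 V4F;
  #define V4F_ADD(a, b) _mm_add_ps((a), (b))
  #define V4F_MUL(a, b) _mm_mul_ps((a), (b))
#else
  struct V4F { float v[4]; };  // scalar emulation fallback
  static inline V4F V4F_ADD(V4F a, V4F b) {
      for (int i = 0; i < 4; ++i) a.v[i] += b.v[i];
      return a;
  }
  static inline V4F V4F_MUL(V4F a, V4F b) {
      for (int i = 0; i < 4; ++i) a.v[i] *= b.v[i];
      return a;
  }
#endif
```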

Returning to OpenCL, the main optimisation advice is to look at cache hits: how to ensure that the data you want is available in the fastest memory. Our GrayScott_OpenCL version uses an NDRange with local size (8,8), which gives a big speed improvement over local(1,1), presumably because neighbouring work-items then share cached data. We're still learning about these issues, so all help is welcome! We tried manually caching data but this didn't help. The host-side call is sketched below.
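A host-side sketch of how the local work-group size is passed (assuming an already-built queue and kernel; error checking omitted):

```cpp
#include <CL/cl.h>

// An 8x8 local work-group keeps each tile's neighbouring cells on the
// same compute unit, which is what gives the big win over local(1,1).
void enqueue_update(cl_command_queue queue, cl_kernel kernel,
                    size_t width, size_t height)
{
    size_t global[2] = { width, height };  // must be multiples of the local sizes
    size_t local[2]  = { 8, 8 };
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);
}
```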

OpenCL also has an image2d_t type that is already optimized for 2D caching, since reads go through the GPU's texture hardware. We found that GrayScott_OpenCL_Image was 3x faster than our float*-based GrayScott_OpenCL on System 1.
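A rough sketch of a kernel using image2d_t (illustrative, not the project's exact kernel: it uses a diffusion-only update and clamped edges for brevity, where the real code wants toroidal wrap-around):

```c
__kernel void diffuse(__read_only image2d_t in, __write_only image2d_t out)
{
    const sampler_t samp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 p = (int2)(get_global_id(0), get_global_id(1));
    // neighbour fetches are cheap because they hit the 2D texture cache
    float4 c = read_imagef(in, samp, p);
    float4 lap = read_imagef(in, samp, p + (int2)( 1, 0))
               + read_imagef(in, samp, p + (int2)(-1, 0))
               + read_imagef(in, samp, p + (int2)( 0, 1))
               + read_imagef(in, samp, p + (int2)( 0,-1))
               - 4.0f * c;
    write_imagef(out, p, c + 0.1f * lap);
}
```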

Our final optimization to date uses OpenCL's own version of SSE-like processing: float4s operate on 4 values at once, for the maths as well as the reads and writes. GrayScott_OpenCL_Image_2x2 gave another 2x speedup on top of the image version. The idea is sketched below.
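A sketch of the float4 idea on a plain buffer (the real _2x2 kernel combines this with image2d_t; the names and the 1D diffusion stand-in are illustrative, and edge wrap-around is omitted):

```c
// Each work-item owns four horizontally adjacent cells, so arithmetic,
// reads and writes all happen four values at a time. Per-lane neighbours
// that straddle the float4 boundary are rebuilt with swizzles.
__kernel void update4(__global const float4* in, __global float4* out, int w4)
{
    int x = get_global_id(0), y = get_global_id(1);
    int i = y * w4 + x;
    float4 c = in[i];
    float4 l = in[i - 1];                  // float4 to the left
    float4 r = in[i + 1];                  // float4 to the right
    float4 left  = (float4)(l.w,   c.xyz); // left neighbour of each lane
    float4 right = (float4)(c.yzw, r.x);   // right neighbour of each lane
    float4 lap = left + right - 2.0f * c;  // 1D diffusion stand-in
    out[i] = c + 0.1f * lap;
}
```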

Outside help

Tim started a thread at Khronos.org about this: http://www.khronos.org/message_boards/viewtopic.php?f=28&t=4455
