bennylp/saxpy-benchmark

SAXPY CPU and GPGPU Benchmarks

Benchmarks

The following benchmarks have been implemented:

C++ Bulk [gpu] Bulk is yet another parallel algorithms library built on top of CUDA. It claims to have better scalability than Thrust.
C++ CUDA [gpu] NVidia CUDA toolkit is the base library for accessing GPUs.
C++ OCL [cpu] OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators.
C++ OCL [gpu] OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators.
C++ OMP [cpu] OpenMP is an API specification for parallel programming.
C++ TensorFlow [gpu] TensorFlow is a deep learning library from Google.
C++ Thrust [gpu] NVidia Thrust is a parallel algorithms library which resembles the C++ Standard Template Library (STL). Thrust is included with CUDA toolkit.
C++ cuBLAS [gpu] NVidia cuBLAS is a fast GPU-accelerated implementation of the standard basic linear algebra subroutines (BLAS).
C++ loop [cpu] Plain C++ for loop
Java loop [cpu] Plain Java loop
Julia (loop) [cpu] SIMD optimized Julia loop.
Julia (vec) [cpu] With Julia array operation.
Octave [cpu] GNU Octave is a high-level language primarily intended for numerical computations.
Py CNTK [cpu] CNTK is a deep learning library.
Py CNTK [gpu] CNTK is a deep learning library.
Py MXNet [cpu] MXNet is a deep learning library.
Py MXNet [gpu] MXNet is a deep learning library.
Py Numpy [cpu] With Python Numpy array.
Py Pandas [cpu] With Python Pandas dataframe.
Py TensorFlow [cpu] TensorFlow is a deep learning library.
Py TensorFlow [gpu] TensorFlow is a deep learning library.
PyCUDA [gpu] PyCUDA is a Python wrapper for CUDA.
PyOCL [cpu] PyOpenCL is a Python wrapper for OpenCL.
PyOCL [gpu] PyOpenCL is a Python wrapper for OpenCL.
Python loop [cpu] Simple Python for loop.
R (array) [cpu] With array in R, a free software environment for statistical computing and graphics.
R (data.frame) [cpu] With data.frame in R, a free software environment for statistical computing and graphics.
R (data.table) [cpu] With data.table in R, a free software environment for statistical computing and graphics.
R (loop) [cpu] Simple loop in R, a free software environment for statistical computing and graphics.
R (matrix) [cpu] With matrix in R, a free software environment for statistical computing and graphics.
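
For readers new to the term: SAXPY is the BLAS operation "single-precision a times x plus y", which updates a vector y element by element as y[i] = a * x[i] + y[i]. The sketch below shows that operation as a plain Python loop. It is an illustration only, not the code used in this repository; the function name `saxpy` and the sample values are assumptions, and Python floats are double precision rather than the single precision that the "S" implies.

```python
def saxpy(a, x, y):
    """Update y in place so that y[i] becomes a * x[i] + y[i]."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y


# Tiny hypothetical input, chosen so the result is easy to check by hand.
x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
saxpy(2.0, x, y)
print(y)  # [12.0, 24.0, 36.0]
```

Every benchmark listed above computes this same update; they differ only in the language, library, and device that carry it out.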

Results

Python: Loop vs Numpy (CPU)

Comparison between a simple Python loop and Numpy

results/charts-en/python-loop-vs-numpy-linux-cpu.png
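
A comparison like this one can be sketched roughly as follows. This is a minimal sketch, not the repository's benchmark script: the vector length `n`, the repeat count, and the helper names `time_call`, `saxpy_loop`, and `saxpy_numpy` are all assumptions made for illustration. The Numpy part only runs if Numpy is installed.

```python
import time


def saxpy_loop(a, x, y):
    # Plain Python loop: one interpreted iteration per element.
    for i in range(len(x)):
        y[i] += a * x[i]


def time_call(fn, *args, repeats=3):
    # Return the best (smallest) wall-clock time over a few repeats.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


n = 100_000  # hypothetical size, kept small so the loop finishes quickly
x = [1.0] * n
y = [2.0] * n
loop_s = time_call(saxpy_loop, 2.0, x, y)
print(f"Python loop: {loop_s:.6f} s")

try:
    import numpy as np
except ImportError:
    np = None  # Numpy not available; skip the vectorized variant

if np is not None:
    xv = np.ones(n, dtype=np.float32)
    yv = np.full(n, 2.0, dtype=np.float32)

    def saxpy_numpy(a, xs, ys):
        # Vectorized update; the per-element work runs in compiled code.
        ys += a * xs

    numpy_s = time_call(saxpy_numpy, np.float32(2.0), xv, yv)
    print(f"Numpy:       {numpy_s:.6f} s")
```

Because each repeat updates `y` in place, the final values in `y` reflect all repeats, not a single SAXPY. That is harmless for timing purposes but should be kept in mind if the result is also checked for correctness.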

Python: Loop vs Numpy 2 (CPU)

Same as above, on both Linux and Windows

results/charts-en/python-loop-vs-numpy-cpu.png

R: Loop vs Vectorized (CPU)

Benchmarking various vectorization methods in R (array, matrix, data.frame, data.table) vs plain loop

results/charts-en/r-loop-vs-vec.png

Python: Loop vs Numpy vs Pandas (CPU)

Benchmarking the performance of Numpy vs Pandas (vs plain Python loop)

results/charts-en/python-loop-vs-numpy-vs-pandas-cpu.png

Julia: Loop vs Vector (CPU)

Comparing the performance of Julia loop vs Julia vector/array (vs C++)

results/charts-en/julia-loop-vs-vector.png

Numpy vs Octave vs R vs Java vs Julia vs C++ (CPU)

Comparing the performance of SAXPY in different programming languages

results/charts-en/script-vs-script-vs-java-vs-c++-cpu.png

Python Vectorization: Numpy vs Deep Learning Frameworks (CPU)

SAXPY array operation in Numpy vs machine learning frameworks such as TensorFlow, MXNet, and CNTK. Only tested on Linux.

Note: CNTK result is way off, not sure why. Please have a look at the source code, maybe I did something wrong.

results/charts-en/vectorized-numpy-vs-frameworks-cpu.png

Numpy vs Deep Learning Frameworks (GPU and CPU)

Same as above, but on GPU as well

results/charts-en/vectorized-numpy-vs-frameworks-gpu.png

Deep Learning Frameworks GPU vs Loop CPU

Comparing frameworks running on GPU with naive C++ loop running on CPU.

results/charts-en/frameworks-gpu-vs-c++-cpu.png

C++ Parallel APIs (CPU)

Comparing naive C++ loop with several parallel programming APIs (OpenCL and OpenMP) on CPU.

results/charts-en/parallel-c++-cpu.png

C++ GPU (vs CPU)

Comparing various C++ GPU libraries (CUDA, OpenCL, Thrust, Bulk, cuBLAS)

results/charts-en/c++-cpu-vs-gpu.png

OpenCL vs PyOpenCL (CPU & GPU)

Comparing C++ OpenCL with PyOpenCL, the OpenCL Python wrapper.

results/charts-en/pyopencl-vs-opencl.png

PyCUDA vs C++ (GPU)

Comparing PyCUDA (Python CUDA wrapper) with native C++ CUDA GPU

results/charts-en/pycuda-vs-c++.png

Tensorflow: Python vs C++ (GPU)

Comparing Tensorflow C++ and Python performance

results/charts-en/tensorflow-python-vs-c++.png

GPU Conclusion

Benchmarking various GPU APIs (only on Linux since it has the most APIs)

Excluded from this chart:

results/charts-en/conclusion-gpus.png

Linux Conclusion

Excluded from this chart:

results/charts-en/conclusion-linux.png

Windows Conclusion

Excluded from this chart:

results/charts-en/conclusion-windows.png

Conclusion

Excluded from this chart:

results/charts-en/conclusion.png

Machine Specifications

Ubuntu 16.04, NVidia GTX 1080

Note: same machine as Windows below (dual-boot)

System Intel i7-6700 CPU @ 3.40GHz 16GB RAM 4x2 cores (HT)
OS Ubuntu Linux 16.04 64bit
GPU NVidia GeForce GTX 1080 8GB
C++ Compiler g++ 5.4.0
Python3 3.5.2 64bit
TensorFlow TensorFlow 1.4 (GPU)
CUDA CUDA 9.0.61, CudNN7
OpenCL Khronos OpenCL header 1.2; Intel OpenCL driver 16.1.1; NVidia OpenCL 1.2 driver
PyOpenCL version 2015.1
Octave version 4.0.0 64bit
R version 3.2.3 64bit
MXNet mxnet-cu90 (0.12.1)
CNTK CNTK 2.3.1 (CUDA-8, CudNN6)

Windows 10, NVidia GTX 1080

Note: same machine as Linux above (dual-boot)

System Intel i7-6700 CPU @ 3.40GHz 16GB RAM 4x2 cores (HT)
OS Windows 10 64bit
GPU NVidia GeForce GTX 1080 8GB
C++ Compiler Visual Studio 2015 C++ compiler 64bit version
Python 2.7.12 64bit
Python3 3.5.3 64bit
TensorFlow TensorFlow 1.4 (GPU)
CUDA Version 8.0.61
OpenCL Intel OpenCL SDK Version 7.0.0.2519; OpenCL from CUDA SDK
PyOpenCL version 2017.2
Octave version 4.2.1 64bit
R version 3.4.2 64bit