Table of Contents:
- Benchmarks
- Results
- Python: Loop vs Numpy (CPU)
- Python: Loop vs Numpy 2 (CPU)
- R: Loop vs Vectorized (CPU)
- Python: Loop vs Numpy vs Pandas (CPU)
- Julia: Loop vs Vector (CPU)
- Numpy vs Octave vs R vs Java vs Julia vs C++ (CPU)
- Python Vectorization: Numpy vs Deep Learning Frameworks (CPU)
- Numpy vs Deep Learning Frameworks (GPU and CPU)
- Deep Learning Frameworks GPU vs Loop CPU
- C++ Parallel APIs (CPU)
- C++ GPU (vs CPU)
- OpenCL vs PyOpenCL (CPU & GPU)
- PyCUDA vs C++ (GPU)
- Tensorflow: Python vs C++ (GPU)
- GPU Conclusion
- Linux Conclusion
- Windows Conclusion
- Conclusion
- Machine Specifications
The following benchmarks have been implemented:
C++ Bulk [gpu] | Bulk is yet another parallel algorithms library built on top of CUDA. It claims to have better scalability than Thrust. |
C++ CUDA [gpu] | NVidia CUDA toolkit is the base library for accessing GPUs. |
C++ OCL [cpu] | OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. |
C++ OCL [gpu] | OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. |
C++ OMP [cpu] | OpenMP is an API specification for parallel programming. |
C++ TensorFlow [gpu] | TensorFlow is a deep learning library from Google. |
C++ Thrust [gpu] | NVidia Thrust is a parallel algorithms library which resembles the C++ Standard Template Library (STL). Thrust is included with the CUDA toolkit. |
C++ cuBLAS [gpu] | NVidia cuBLAS is a fast GPU-accelerated implementation of the standard basic linear algebra subroutines (BLAS). |
C++ loop [cpu] | Plain C++ for loop |
Java loop [cpu] | Plain Java loop |
Julia (loop) [cpu] | SIMD optimized Julia loop. |
Julia (vec) [cpu] | With Julia array operation. |
Octave [cpu] | GNU Octave is a high-level language primarily intended for numerical computations. |
Py CNTK [cpu] | CNTK is a deep learning library. |
Py CNTK [gpu] | CNTK is a deep learning library. |
Py MXNet [cpu] | MXNet is a deep learning library. |
Py MXNet [gpu] | MXNet is a deep learning library. |
Py Numpy [cpu] | With Python Numpy array. |
Py Pandas [cpu] | With Python Pandas dataframe. |
Py TensorFlow [cpu] | TensorFlow is a deep learning library. |
Py TensorFlow [gpu] | TensorFlow is a deep learning library. |
PyCUDA [gpu] | PyCUDA is a Python wrapper for CUDA. |
PyOCL [cpu] | PyOpenCL is a Python wrapper for OpenCL. |
PyOCL [gpu] | PyOpenCL is a Python wrapper for OpenCL. |
Python loop [cpu] | Simple Python for loop. |
R (array) [cpu] | With array in R, a free software environment for statistical computing and graphics. |
R (data.frame) [cpu] | With data.frame in R, a free software environment for statistical computing and graphics. |
R (data.table) [cpu] | With data.table in R, a free software environment for statistical computing and graphics. |
R (loop) [cpu] | Simple loop in R, a free software environment for statistical computing and graphics. |
R (matrix) [cpu] | With matrix in R, a free software environment for statistical computing and graphics. |
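Judging by the source file names, every benchmark in the table above runs the same kernel: SAXPY, the BLAS level 1 routine "single-precision a times x plus y", which overwrites y with a*x + y. The sketch below only illustrates that operation in plain Python and is not the code in src/saxpy_loop.py. The function name `saxpy`, the sample values and the use of the standard-library `array` module (typecode "f" for single precision) are assumptions made for this example.

```python
from array import array


def saxpy(a, x, y):
    """Update y in place so that y[i] = a * x[i] + y[i]."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y


# Typecode "f" stores 32-bit floats, matching the "S" (single) in SAXPY.
x = array("f", [1.0, 2.0, 3.0])
y = array("f", [10.0, 20.0, 30.0])

saxpy(2.0, x, y)
print(list(y))  # [12.0, 24.0, 36.0]
```

The in-place update mirrors the BLAS convention; a benchmark that allocates a fresh result array instead measures allocation cost as well.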
Comparison between simple Python loop and Numpy
- Py Numpy [cpu] (src/saxpy_numpy.py)
- Python loop [cpu] (src/saxpy_loop.py)
Same as above, on both Linux and Windows
- Py Numpy [cpu] (src/saxpy_numpy.py)
- Python loop [cpu] (src/saxpy_loop.py)
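The two variants compared in the Python charts above differ roughly as sketched here: an explicit per-element loop versus a single whole-array Numpy expression. This is a minimal sketch, not the contents of src/saxpy_loop.py or src/saxpy_numpy.py. The helper `time_it`, the array size `N`, the scalar `A` and the best-of-three timing are hypothetical choices for illustration, and the Numpy part is skipped when Numpy is not installed.

```python
import time

try:
    import numpy as np  # optional: only the vectorized variant needs it
except ImportError:
    np = None


def saxpy_loop(a, x, y):
    """Plain Python loop: one multiply-add per iteration."""
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out


def saxpy_numpy(a, x, y):
    """Vectorized: Numpy evaluates the whole-array expression in compiled code."""
    return a * x + y


def time_it(fn, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs (assumed protocol)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


N = 100000  # hypothetical problem size, not the one used for the charts
A = 2.0
xs = [float(i) for i in range(N)]
ys = [1.0] * N

loop_seconds = time_it(saxpy_loop, A, xs, ys)
print("Python loop: {:.6f} s".format(loop_seconds))

if np is not None:
    x_np = np.asarray(xs, dtype=np.float32)
    y_np = np.asarray(ys, dtype=np.float32)
    numpy_seconds = time_it(saxpy_numpy, np.float32(A), x_np, y_np)
    print("Numpy:       {:.6f} s".format(numpy_seconds))
```

Both functions here return a new result rather than updating y in place, so the timings include allocating the output.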
Benchmarking various vectorization methods in R (array, matrix, data.frame, data.table) vs plain loop
- R (array) [cpu] (src/saxpy_array.R)
- R (data.frame) [cpu] (src/saxpy_dataframe.R)
- R (data.table) [cpu] (src/saxpy_datatable.R)
- R (loop) [cpu] (src/saxpy_loop.R)
- R (matrix) [cpu] (src/saxpy_matrix.R)
Benchmarking the performance of Numpy vs Pandas (vs plain Python loop)
- Py Numpy [cpu] (src/saxpy_numpy.py)
- Py Pandas [cpu] (src/saxpy_pandas.py)
- Python loop [cpu] (src/saxpy_loop.py)
Comparing the performance of Julia loop vs Julia vector/array (vs C++)
- C++ loop [cpu] (src/saxpy_cpu.cpp)
- Julia (loop) [cpu] (src/saxpy_loop.jl)
- Julia (vec) [cpu] (src/saxpy_array.jl)
Comparing the performance of SAXPY in different programming languages
- C++ loop [cpu] (src/saxpy_cpu.cpp)
- Java loop [cpu] (src/SaxpyLoop.java)
- Julia (loop) [cpu] (src/saxpy_loop.jl)
- Julia (vec) [cpu] (src/saxpy_array.jl)
- Octave [cpu] (src/saxpy.m)
- Py Numpy [cpu] (src/saxpy_numpy.py)
- R (array) [cpu] (src/saxpy_array.R)
SAXPY array operation in Numpy vs machine learning frameworks such as TensorFlow, MXNet, and CNTK. Only tested on Linux.
Note: the CNTK result is way off and I am not sure why. Please have a look at the source code; I may have done something wrong.
- Py CNTK [cpu] (src/saxpy_cntk.py)
- Py MXNet [cpu] (src/saxpy_mxnet.py)
- Py Numpy [cpu] (src/saxpy_numpy.py)
- Py TensorFlow [cpu] (src/saxpy_tf.py)
Same as above, but on GPU as well
- Py CNTK [cpu] (src/saxpy_cntk.py)
- Py CNTK [gpu] (src/saxpy_cntk.py)
- Py MXNet [cpu] (src/saxpy_mxnet.py)
- Py MXNet [gpu] (src/saxpy_mxnet.py)
- Py Numpy [cpu] (src/saxpy_numpy.py)
- Py TensorFlow [cpu] (src/saxpy_tf.py)
- Py TensorFlow [gpu] (src/saxpy_tf.py)
Comparing frameworks running on GPU with naive C++ loop running on CPU.
- C++ loop [cpu] (src/saxpy_cpu.cpp)
- Py CNTK [gpu] (src/saxpy_cntk.py)
- Py MXNet [gpu] (src/saxpy_mxnet.py)
- Py TensorFlow [gpu] (src/saxpy_tf.py)
Comparing naive C++ loop with several parallel programming APIs (OpenCL and OpenMP) on CPU.
- C++ OCL [cpu] (src/saxpy_ocl1.cpp)
- C++ OMP [cpu] (src/saxpy_omp.cpp)
- C++ loop [cpu] (src/saxpy_cpu.cpp)
Comparing various C++ GPU libraries (CUDA, OpenCL, Thrust, Bulk, cuBLAS)
- C++ Bulk [gpu] (src/saxpy_bulk.cpp)
- C++ CUDA [gpu] (src/saxpy_cuda.cpp)
- C++ OCL [gpu] (src/saxpy_ocl1.cpp)
- C++ Thrust [gpu] (src/saxpy_trust.cpp)
- C++ cuBLAS [gpu] (src/saxpy_cublas.cpp)
- C++ loop [cpu] (src/saxpy_cpu.cpp)
Comparing C++ OpenCL with PyOpenCL, the OpenCL Python wrapper.
- C++ OCL [cpu] (src/saxpy_ocl1.cpp)
- C++ OCL [gpu] (src/saxpy_ocl1.cpp)
- PyOCL [cpu] (src/saxpy_pyocl.py)
- PyOCL [gpu] (src/saxpy_pyocl.py)
Comparing PyCUDA (Python CUDA wrapper) with native C++ CUDA on the GPU
- C++ CUDA [gpu] (src/saxpy_cuda.cpp)
- PyCUDA [gpu] (src/saxpy_pycuda.py)
Comparing TensorFlow C++ and Python performance
- C++ TensorFlow [gpu] (src/saxpy_tf.cc)
- Py TensorFlow [gpu] (src/saxpy_tf.py)
Benchmarking various GPU APIs (only on Linux since it has the most APIs)
Excluded from this chart:
Excluded from this chart:
- Python loop [cpu] (src/saxpy_loop.py)
- R (loop) [cpu] (src/saxpy_loop.R)
Excluded from this chart:
- Python loop [cpu] (src/saxpy_loop.py)
- R (loop) [cpu] (src/saxpy_loop.R)
- C++ TensorFlow [gpu] (src/saxpy_tf.cc)
- Py CNTK [gpu] (src/saxpy_cntk.py)
- Py CNTK [cpu] (src/saxpy_cntk.py)
Excluded from this chart:
- Python loop [cpu] (src/saxpy_loop.py)
- R (loop) [cpu] (src/saxpy_loop.R)
- C++ TensorFlow [gpu] (src/saxpy_tf.cc)
- Py CNTK [gpu] (src/saxpy_cntk.py)
- Py CNTK [cpu] (src/saxpy_cntk.py)
Note: same machine as Windows below (dual-boot)
System | Intel i7-6700 CPU @ 3.40GHz, 16GB RAM, 4 cores / 8 threads (HT) |
OS | Ubuntu Linux 16.04 64bit |
GPU | NVidia GeForce GTX 1080 8GB |
C++ Compiler | g++ 5.4.0 |
Python3 | 3.5.2 64bit |
TensorFlow | TensorFlow 1.4 (GPU) |
CUDA | CUDA 9.0.61, CudNN7 |
OpenCL | Khronos OpenCL header 1.2; Intel OpenCL driver 16.1.1; NVidia OpenCL 1.2 driver |
PyOpenCL | version 2015.1 |
Octave | version 4.0.0 64bit |
R | version 3.2.3 64bit |
MXNet | mxnet-cu90 (0.12.1) |
CNTK | CNTK 2.3.1 (CUDA-8, CudNN6) |
Note: same machine as Linux above (dual-boot)
System | Intel i7-6700 CPU @ 3.40GHz, 16GB RAM, 4 cores / 8 threads (HT) |
OS | Windows 10 64bit |
GPU | NVidia GeForce GTX 1080 8GB |
C++ Compiler | Visual Studio 2015 C++ compiler 64bit version |
Python | 2.7.12 64bit |
Python3 | 3.5.3 64bit |
TensorFlow | TensorFlow 1.4 (GPU) |
CUDA | Version 8.0.61 |
OpenCL | Intel OpenCL SDK Version 7.0.0.2519; OpenCL from CUDA SDK |
PyOpenCL | version 2017.2 |
Octave | version 4.2.1 64bit |
R | version 3.4.2 64bit |