This repo contains 5 matrix multiplication algorithms in C++ and CUDA:
- Baseline single-threaded matrix multiplication.
- Tiled single-threaded matrix multiplication.
- Multithreaded matrix multiplication.
- CUDA kernel for matrix multiplication.
- CUDA kernel for tiled matrix multiplication.

The purpose of this repo is to compare their implementations and performance.
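
To make the difference between the two single-threaded variants concrete, here is a minimal sketch of a baseline and a tiled multiplication. It is an illustration, not necessarily the code in this repo: the `Matrix` alias, the function names `matmul_baseline` and `matmul_tiled`, the row-major flat layout, square matrices, and the default tile size of 32 are all assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Assumed layout: square n x n matrices stored row-major in a flat vector,
// so element (i, j) lives at index i * n + j.
using Matrix = std::vector<float>;

// Baseline: the classic triple loop, one dot product per output element.
void matmul_baseline(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

// Tiled: the same arithmetic, but the loops walk tile x tile blocks so that
// the parts of a, b, and c being worked on are reused while still in cache.
void matmul_tiled(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n,
                  std::size_t tile = 32) {
    std::fill(c.begin(), c.end(), 0.0f);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        for (std::size_t kk = 0; kk < n; kk += tile) {
            for (std::size_t jj = 0; jj < n; jj += tile) {
                const std::size_t i_end = std::min(ii + tile, n);
                const std::size_t k_end = std::min(kk + tile, n);
                const std::size_t j_end = std::min(jj + tile, n);
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const float a_ik = a[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j) {
                            c[i * n + j] += a_ik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

Both functions expect `c` to already hold `n * n` elements. The best tile size depends on the cache sizes of the target machine, so 32 is only a starting point for experiments.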
For an input size of 1024, the algorithms take the following times to execute, in seconds (measured on an AWS EC2 g4dn instance):
Multithreading offers an order-of-magnitude speedup over the single-threaded baseline, and tiling offers a further 20-30% improvement due to better cache locality.
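
As a rough illustration of where the multithreaded speedup comes from, the sketch below splits the output rows across `std::thread` workers. Each worker writes a disjoint set of rows, so no locking is needed. The function name `matmul_threaded`, the row-wise partitioning, and the row-major flat layout are assumptions made for this example and may differ from the implementation in this repo.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed layout: square n x n matrices stored row-major in flat vectors.
// c must already hold n * n elements.
void matmul_threaded(const std::vector<float>& a, const std::vector<float>& b,
                     std::vector<float>& c, std::size_t n,
                     unsigned num_threads = std::thread::hardware_concurrency()) {
    // hardware_concurrency() may return 0 when the value is unknown.
    const std::size_t workers =
        std::max<std::size_t>(1, std::min<std::size_t>(num_threads, n));
    const std::size_t rows_per_worker = (n + workers - 1) / workers;

    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t row_begin = std::min(n, w * rows_per_worker);
        const std::size_t row_end = std::min(n, row_begin + rows_per_worker);
        if (row_begin == row_end) break;  // no rows left to hand out

        // Each thread computes its own contiguous band of output rows.
        pool.emplace_back([&a, &b, &c, n, row_begin, row_end] {
            for (std::size_t i = row_begin; i < row_end; ++i) {
                for (std::size_t j = 0; j < n; ++j) {
                    float sum = 0.0f;
                    for (std::size_t k = 0; k < n; ++k) {
                        sum += a[i * n + k] * b[k * n + j];
                    }
                    c[i * n + j] = sum;
                }
            }
        });
    }
    for (auto& t : pool) t.join();
}
```

On Linux this typically needs to be built with thread support enabled, for example `g++ -O2 -pthread`. The achievable speedup is bounded by the number of available cores.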