Parallel Matmul

Description

This repo contains 5 matrix multiplication algorithms in C++ and CUDA:

The purpose of this repo is to compare their implementation and performance.

With an input size of 1024, the algorithms take the following times to execute, in seconds (measured on an AWS EC2 g4dn instance):

Multithreading clearly offers an order-of-magnitude performance improvement, and tiling offers a further 20-30% speedup thanks to better spatial locality.
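To illustrate why tiling helps with spatial locality, here is a minimal CPU-side sketch in C++. It is not the code from the `algorithms` directory: the names `Matrix`, `matmul_naive`, and `matmul_tiled`, the flat row-major layout, and the default tile size of 32 are all assumptions made for illustration. The idea is that the tiled version works on small blocks of the matrices at a time, so the elements it touches are more likely to still be in cache when they are reused.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical representation: an n x n matrix stored row-major in a flat vector.
using Matrix = std::vector<float>;

// Baseline triple loop, C = A * B. The inner loop walks down a column of B,
// striding n floats per step, which uses each fetched cache line poorly.
Matrix matmul_naive(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    return c;
}

// Tiled (blocked) version. The three outer loops pick a tile; the three inner
// loops multiply within it. The tile size of 32 is an assumed default, not a
// tuned value; std::min handles n that is not a multiple of the tile size.
Matrix matmul_tiled(const Matrix& a, const Matrix& b, std::size_t n,
                    std::size_t tile = 32) {
    assert(a.size() == n * n && b.size() == n * n);
    assert(tile > 0);
    Matrix c(n * n, 0.0f);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        const std::size_t i_end = std::min(ii + tile, n);
        for (std::size_t kk = 0; kk < n; kk += tile) {
            const std::size_t k_end = std::min(kk + tile, n);
            for (std::size_t jj = 0; jj < n; jj += tile) {
                const std::size_t j_end = std::min(jj + tile, n);
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const float a_ik = a[i * n + k];
                        // Contiguous accesses to rows of B and C.
                        for (std::size_t j = jj; j < j_end; ++j) {
                            c[i * n + j] += a_ik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
    return c;
}
```

Both functions return the product as a new matrix, so `matmul_tiled(a, b, n)` can be swapped in for `matmul_naive(a, b, n)` when timing them. The two versions accumulate in a different order, so for general floating-point inputs their results may differ by small rounding errors.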
