This repository contains the code and lab work from a semester-long project focused on performance analysis and optimization techniques for machine learning workloads, all implemented in C and run on CPUs.
The project involves building a foundational machine learning library and systematically optimizing it through several techniques, culminating in a GPT-2 implementation for CPU.
- Initial Implementation
  - Developed a foundational machine learning library in C for the forward pass of a small CNN.
- Performance Analysis
  - Conducted bottleneck analysis and profiled code performance.
- Optimizations Applied
  - Tiling and Blocking
  - Sparse Matrix Multiplication
  - Multithreading using pthreads and OpenMP
- Final Model
  - Implemented GPT-2 optimized to run efficiently on CPU.
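
As an illustration of the sparse matrix multiplication stage listed above, here is a minimal sketch of multiplying a sparse matrix by a dense one. It assumes the sparse operand is stored in CSR (compressed sparse row) form; the `csr_matrix` struct and `csr_matmul` function are hypothetical names chosen for this example and are not taken from the repository.

```c
#include <stddef.h>

/* Hypothetical CSR container (illustrative only, not the repository's type). */
typedef struct {
    size_t rows;            /* number of rows in the sparse matrix */
    size_t cols;            /* number of columns in the sparse matrix */
    const size_t *row_ptr;  /* rows + 1 offsets into col_idx / values */
    const size_t *col_idx;  /* column index of each stored nonzero */
    const float *values;    /* value of each stored nonzero */
} csr_matrix;

/* C = A * B, where A is sparse (CSR, rows x cols), B is dense row-major
 * (cols x n) and C is dense row-major (rows x n).
 * Only the stored nonzeros of A are visited. */
void csr_matmul(const csr_matrix *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < a->rows; i++) {
        /* Clear the output row before accumulating into it. */
        for (size_t j = 0; j < n; j++)
            c[i * n + j] = 0.0f;

        /* Each nonzero A[i][k] adds a scaled copy of row k of B to row i of C. */
        for (size_t p = a->row_ptr[i]; p < a->row_ptr[i + 1]; p++) {
            float v = a->values[p];
            const float *b_row = b + a->col_idx[p] * n;
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += v * b_row[j];
        }
    }
}
```

The work done is proportional to the number of stored nonzeros times `n`, so the benefit over a dense multiply depends on how sparse the matrix actually is.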
Branches
Each branch corresponds to a specific stage of the project:
- base: Initial CNN forward-pass implementation
- tiling-blocking: Optimized with tiling and blocking
- sparse-matrix-mul: Sparse matrix multiplication
- multithreading: Multithreaded implementation
- gpt2-cpu: GPT-2 final optimized implementation
git clone https://github.com/rachnaumesh/ML-Perf-Labs.git
cd ML-Perf-Labs
To check out a specific stage:
git checkout <branch-name>
All projects include a Makefile.
To build:
make
To run:
./<executable-name>
- gprof
- perf
- toplev from pmu-tools
- Blocking
- Tiling
- Multithreading
  - pthreads
  - OpenMP
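
To make the techniques listed above concrete, the following is a minimal sketch of a tiled (blocked) matrix multiply whose outer tile loop is parallelized with OpenMP. It is an assumption-laden illustration rather than the repository's code: the function name `matmul_tiled`, the helper `min_size`, the tile size of 32, and the square row-major layout are all chosen for this example.

```c
#include <stddef.h>

#define TILE 32  /* assumed tile edge; tune it to the cache of the target CPU */

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* C = A * B for row-major n x n matrices, computed tile by tile so that
 * each block of A, B and C is reused while it is still in cache. */
void matmul_tiled(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0f;

    /* Each thread owns whole row tiles of C, so no two threads ever
     * write the same output element. */
    #pragma omp parallel for schedule(static)
    for (size_t ii = 0; ii < n; ii += TILE) {
        size_t i_end = min_size(ii + TILE, n);
        for (size_t kk = 0; kk < n; kk += TILE) {
            size_t k_end = min_size(kk + TILE, n);
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t j_end = min_size(jj + TILE, n);
                /* Multiply one tile of A by one tile of B. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        float a_ik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += a_ik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

With GCC or Clang, OpenMP is enabled by compiling with `-fopenmp`; without that flag the pragma is ignored and the same code runs on a single thread. A pthreads version would follow the same idea, with each thread handed an explicit range of row tiles.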