We thoroughly investigated the t-SNE algorithm and the open-source implementations using the Barnes-Hut approximation from https://github.com/lvdmaaten/bhtsne/ and https://github.com/DmitryUlyanov/Multicore-TSNE. We converted the entire codebase from `double` to `float` to enable better vectorization, since 32-bit floats double the number of SIMD lanes per vector register.
We created a testbench using Python and NumPy. The testbench includes five test fixtures of various sizes, created from MNIST (https://www.kaggle.com/datasets/oddrationale/mnist-in-csv) and CLIP (https://github.com/openai/CLIP) model embeddings, so we can quickly test our implementation and benchmark its performance.
| Test case | Source | Number of points | Point dimension |
|---|---|---|---|
| simple_200x64 | scikit-learn digits data | 200 | 64 |
| easy_1797x64 | scikit-learn digits data | 1797 | 64 |
| medium_10000x768 | MNIST (https://www.kaggle.com/datasets/oddrationale/mnist-in-csv) | 10000 | 768 |
| hard_60000x768 | MNIST (https://www.kaggle.com/datasets/oddrationale/mnist-in-csv) | 60000 | 768 |
| hard_62656x512 | CLIP (https://github.com/openai/CLIP) image embeddings of a custom dataset | 62656 | 512 |
The testbench also defines a single-precision floating-point matrix serialization format, with readers and writers implemented in both C++ and Python, so we can easily create new test cases and load computed t-SNE embeddings into Python for visualization and verification.
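For illustration, here is a minimal sketch of what the C++ side of such a reader/writer can look like, assuming a simple layout of two little-endian `int32` dimensions followed by row-major `float32` data (the actual on-disk format we use may differ in details):

```cpp
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Assumed layout: [int32 rows][int32 cols][rows*cols float32, row-major].
void write_matrix(const std::string& path, const std::vector<float>& data,
                  int32_t rows, int32_t cols) {
    std::ofstream out(path, std::ios::binary);
    if (!out) throw std::runtime_error("cannot open " + path);
    out.write(reinterpret_cast<const char*>(&rows), sizeof(rows));
    out.write(reinterpret_cast<const char*>(&cols), sizeof(cols));
    out.write(reinterpret_cast<const char*>(data.data()),
              sizeof(float) * data.size());
}

std::vector<float> read_matrix(const std::string& path,
                               int32_t& rows, int32_t& cols) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    in.read(reinterpret_cast<char*>(&rows), sizeof(rows));
    in.read(reinterpret_cast<char*>(&cols), sizeof(cols));
    std::vector<float> data(static_cast<size_t>(rows) * cols);
    in.read(reinterpret_cast<char*>(data.data()), sizeof(float) * data.size());
    return data;
}
```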
We reviewed the codebase and profiled the sequential version of the program to identify parallelizable parts. Using `perf`, we identified two computation-heavy loops suitable for parallelization: (1) computing pairwise distances for the k-nearest-neighbor search on a ball tree, and (2) the gradient computation. By parallelizing these two tasks using OpenMP and ISPC, we achieved significant speedup; preliminary results are discussed in a later section.
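As a sketch of the pattern we applied (variable and function names here are illustrative, not the actual bhtsne identifiers): the distances from one query point to all candidate points are independent, so the loop parallelizes with a single OpenMP pragma, and the gradient loop is parallelized the same way across points.

```cpp
#include <cstddef>
#include <vector>

// Squared Euclidean distances from one query point to all n candidates.
// Each iteration is independent, so a single pragma distributes the
// outer loop across threads; the gradient loop follows the same pattern.
std::vector<float> squared_distances(const std::vector<float>& points, // n x d, row-major
                                     const std::vector<float>& query,  // length d
                                     std::size_t n, std::size_t d) {
    std::vector<float> dist(n);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < static_cast<long>(n); ++i) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < d; ++j) {
            float diff = points[static_cast<std::size_t>(i) * d + j] - query[j];
            acc += diff * diff;
        }
        dist[i] = acc;
    }
    return dist;
}
```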
We are following our proposed schedule so far: we have finished most of the OpenMP and ISPC implementation work. The next step is to produce benchmark results and start working on CUDA.
Date | Plan | Assigned to |
---|---|---|
Apr 9 - Apr 13 | Benchmark OpenMP and ISPC on PSC machine | Yue |
Apr 13 - Apr 15 | Identify computation for CUDA acceleration | Zhe, Yue |
Apr 16 - Apr 22 | Implement CUDA acceleration | Zhe, Yue |
Apr 23 - Apr 25 | Benchmark CUDA on PSC | Yue |
Apr 25 - Apr 28 | Evaluation; try ISPC+OpenMP+CUDA together | Zhe |
Apr 29 - May 5 | Report and presentation | Zhe, Yue |
We plan to have the following content at the poster session: problem description, parallelization approaches (OpenMP, ISPC, and CUDA), and results and evaluation.
Demo: we are not sure if there will be a demo at this stage. We might include a video demo of the embeddings during gradient descent.
We will include figures showing the results of our parallelization approaches.
OpenMP speedup

| | 2 threads | 4 threads | 8 threads |
|---|---|---|---|
| Step 1 (k-nearest-neighbor search on ball tree), 47.9 s with 1 thread | 1.99x | 4x | 6.6x |
| Step 2 (gradient descent), 45.6 s with 1 thread | 1.53x | 2.25x | 2.8x |
| Total, 93.7 s with 1 thread | 1.74x | 2.8x | 4x |
ISPC-vectorized Euclidean distance computation

| | AVX1 (8-wide) | AVX2 (8-wide) | AVX2 (16-wide) |
|---|---|---|---|
| Step 1 (k-nearest-neighbor search on ball tree), 47.9 s with 1 thread and no vectorization | 3.30x | 3.80x | 4.02x |
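To illustrate what is being vectorized (this is not our actual ISPC kernel, which lives in a separate `.ispc` file compiled for the targets above), the hot computation is a reduction over the point dimension; the same idea expressed in C++ with `#pragma omp simd` looks like this:

```cpp
#include <cstddef>

// C++ stand-in for the vectorized distance kernel: the reduction over
// the point dimension is mapped onto SIMD lanes. Our real kernel is
// written in ISPC rather than with this pragma.
float squared_distance(const float* a, const float* b, std::size_t d) {
    float acc = 0.0f;
    #pragma omp simd reduction(+:acc)
    for (std::size_t j = 0; j < d; ++j) {
        float diff = a[j] - b[j];
        acc += diff * diff;
    }
    return acc;
}
```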
We have two major concerns:
- Our general direction so far has been to locate potentially parallelizable code in the existing implementation and pursue speedup using the available parallelization tools. Do we need to research and implement more parallel-friendly algorithms to replace existing code, such as a better kNN search?
- The CUDA version of this program could be challenging to implement. For example, we understand that tree structures do not map well onto CUDA's execution model. We wonder if there are any good resources for us to learn more about CUDA.