Here's what we did in Project 4:
Our approach to Task 1 was to get basic matrix operations working first so that we could move on to writing the setup file and the Python-C interface. We wrote our get and set methods first, figuring that they would be used in most of our other functions. This was an early mistake, as we did not anticipate the extent to which these functions would have to be rewritten in Task 4. After set and get, we moved on to the easier arithmetic functions such as sub, add, and abs. We had very little trouble with these, but we hit some bumps writing mul. We eventually worked through them and had a working matrix.c that passed all the basic unit tests.

At this point, we moved on to Task 2, which confused us at the beginning. We spent a lot of time on Piazza looking for answers about what the parameters inside the functions in setup.py meant. Eventually we had filled out setup.py, but we were getting errors when running make install, so we went to office hours. There we got help filling in the right parameters, and it ended up being much simpler than we had initially expected.

We then moved on to Task 3, which took us a long time to get started on. We did not realize the extent to which we had to use the built-in Python C API functions that wrap or operate on PyObjects. For example, our first versions of many of the functions did not return the correct types: we originally decomposed the arguments, called the functions we wrote in matrix.c, and returned the raw int or double directly. It took us a while to realize that we had to use functions like PyLong_FromLong to convert C values into Python objects, which led to us changing a lot of our code (a sketch of this pattern follows below). After that we had a lot of basic functionality down, but some of our Task 3 functions still failed the unit tests for an unknown reason, even though we had tested them extensively locally. We then realized that this could be because subscript was not fully implemented, something we believe should have been made clearer. Implementing subscript took much more time than we had expected, stretching Task 3 out by a few days. Even then we were not passing the unit tests, and we realized that set_subscript also had to be fully implemented, which again took longer than expected. This seemed to somewhat defeat the purpose of a unit test, because the most important methods in Task 3 had to be implemented correctly before any other parts could be tested (the second sketch below shows why everything routes through these two methods). So we wrote our own test files for almost all the methods, and only used the unit-testing framework once we were completely sure that subscript and set_subscript were implemented in their entirety. After we had these implemented, it was much easier to debug the rest of our Task 3 functions, and we found that the lack of clarity around the unit-testing framework cost us a lot of time in finishing Task 3.
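To make the return-type issue concrete, here is a minimal sketch of the pattern we ended up using. This is not our exact code: the struct layouts are assumed stand-ins for the project skeleton, and error handling is simplified. The key point is that a C double must be boxed with PyFloat_FromDouble (or a C long with PyLong_FromLong) before it can be returned to Python.

```c
#include <Python.h>

/* Assumed stand-ins for the project's types (for illustration only). */
typedef struct { double **data; int rows, cols; } matrix;
typedef struct { PyObject_HEAD matrix *mat; } Matrix61c;

static double get(matrix *m, int row, int col) { return m->data[row][col]; }

static PyObject *Matrix61c_get(Matrix61c *self, PyObject *args) {
    int row, col;
    /* Unpack the Python-level arguments into C ints. */
    if (!PyArg_ParseTuple(args, "ii", &row, &col))
        return NULL;  /* TypeError already set by ParseTuple */
    if (row < 0 || row >= self->mat->rows || col < 0 || col >= self->mat->cols) {
        PyErr_SetString(PyExc_IndexError, "index out of range");
        return NULL;
    }
    /* Returning &val or the raw double produces garbage at the Python
     * level; the value must be boxed as a PyObject first. */
    return PyFloat_FromDouble(get(self->mat, row, col));
}
```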
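And to show why the unit tests could not exercise anything until subscript and set_subscript existed: the PyMappingMethods table sketched below is what makes mat[i] and mat[i] = v dispatch into the C code at all, so any test that reads or writes a matrix entry goes through these two slots. The bodies here are stubs for illustration; the real versions must handle ints, slices, and tuples of both.

```c
#include <Python.h>

static PyObject *Matrix61c_subscript(PyObject *self, PyObject *key) {
    PyErr_SetString(PyExc_NotImplementedError, "stub");  /* real body omitted */
    return NULL;
}

static int Matrix61c_set_subscript(PyObject *self, PyObject *key, PyObject *value) {
    PyErr_SetString(PyExc_NotImplementedError, "stub");  /* real body omitted */
    return -1;
}

/* Wires mat[key] and mat[key] = value to the functions above. */
static PyMappingMethods Matrix61c_mapping = {
    .mp_subscript = Matrix61c_subscript,
    .mp_ass_subscript = Matrix61c_set_subscript,
};
```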
After finally finishing Task 3, which culminated in completing the Matrix61c_get and Matrix61c_set methods, we were ready to improve the speed of our operations in Task 4. We broke our approach into a few phases: applying OpenMP to parallelize computation, implementing SIMD instructions, and attempting loop unrolling and blocking. (Minimal sketches of each phase follow at the end of this section.)

We started by trying OpenMP. Knowing that optimizing matrix multiply would be a challenge due to the method's complexity, and that it was a large part of the overall project grade, we first applied simple OpenMP parallelization such as "#pragma omp parallel for" to the matrix multiplication. As expected, this only marginally improved performance.

We then moved on to SIMD instructions to operate on multiple data points at once. This was something we struggled mightily to understand, but we used the resources at our disposal to forge a path forward. Past CS61C lecture slides, specifically Prof. Weaver's slides from Spring 2019, were particularly helpful in showing how to take our initial algorithm and iteratively make it faster. Through these resources we implemented SIMD functionality with the _mm256_loadu_pd, _mm256_broadcast_sd, and _mm256_fmadd_pd intrinsics. However, we still did not see the significant speedup we were expecting. We realized that this was because our SIMD instructions all still depended on each other sequentially: our SIMD multiply-add could not complete until our SIMD load and broadcast were done.

Thus we had to do another level of optimization: loop unrolling. We did our best to unroll our inner loops to reduce dependencies between our SIMD operations, and it was at this point that we saw the most dramatic increase in our matrix operation speed.

We attempted one last measure to get multiply over the hump: blocking, which we also learned about from the aforementioned CS61C lectures. The idea was to take advantage of cache locality, and at first it worked; after implementing blocking our speedup once again increased significantly. However, we were not able to figure out how to deal with the tail cases of loop unrolling after blocking. Since we had already seen a significant speedup on mul_matrix, we tabled this issue and moved on to optimizing the rest of our code. Having already implemented SIMD and OpenMP parallelization on mul_matrix, it was relatively easy to apply them to our other methods, namely add_matrix, sub_matrix, neg_matrix, and abs_matrix. This let us gain more points by speeding up a wider array of commonly used methods.

Finally, we modified allocate_matrix to allocate contiguous memory. Our initial implementation allocated memory row by row, preventing us from taking advantage of the cache during matrix operations. We hoped that this change would speed things up by reducing the number of trips to memory.
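A minimal sketch of the first phase, assuming row-major, contiguous storage (see the allocation sketch at the end) and an output matrix C that is zeroed beforehand. This is the naive triple loop with the single pragma we started from, not our final code; it needs -fopenmp to compile with OpenMP enabled.

```c
#include <omp.h>

/* Phase 1: parallelize the outer loop so each thread owns a set of
 * output rows. The i-k-j loop order keeps the inner accesses to B and
 * C sequential in memory. */
void mul_matrix_omp(double *C, const double *A, const double *B, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            double a = A[i * n + k];
            for (int j = 0; j < n; j++) {
                C[i * n + j] += a * B[k * n + j];
            }
        }
    }
}
```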
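A sketch of the second phase using the three intrinsics named above (compile with -mavx -mfma, or -march=native). For clarity it assumes n is a multiple of 4 and C is zeroed; real code needs a scalar tail loop for leftover columns.

```c
#include <immintrin.h>

/* Phase 2: broadcast one entry of A, then multiply-accumulate it
 * against four contiguous doubles of B and C at a time. */
void mul_matrix_simd(double *C, const double *A, const double *B, int n) {
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            __m256d a = _mm256_broadcast_sd(&A[i * n + k]);
            for (int j = 0; j < n; j += 4) {
                __m256d c = _mm256_loadu_pd(&C[i * n + j]);
                __m256d b = _mm256_loadu_pd(&B[k * n + j]);
                c = _mm256_fmadd_pd(a, b, c);  /* c = a * b + c, fused */
                _mm256_storeu_pd(&C[i * n + j], c);
            }
        }
    }
}
```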
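The dependency problem and the unrolling fix, sketched: each iteration above chains load, fmadd, and store on a single vector, so one fmadd waits on the previous. Unrolling the j loop into four independent accumulators lets the fmadds pipeline. Again, this assumes n is a multiple of 16, with tail handling omitted.

```c
#include <immintrin.h>

/* Phase 3: four independent accumulators per iteration, so no fmadd
 * depends on the result of another. */
void mul_matrix_simd_unrolled(double *C, const double *A, const double *B, int n) {
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            __m256d a = _mm256_broadcast_sd(&A[i * n + k]);
            for (int j = 0; j < n; j += 16) {
                const double *b = &B[k * n + j];
                double *c = &C[i * n + j];
                __m256d c0 = _mm256_fmadd_pd(a, _mm256_loadu_pd(b),      _mm256_loadu_pd(c));
                __m256d c1 = _mm256_fmadd_pd(a, _mm256_loadu_pd(b + 4),  _mm256_loadu_pd(c + 4));
                __m256d c2 = _mm256_fmadd_pd(a, _mm256_loadu_pd(b + 8),  _mm256_loadu_pd(c + 8));
                __m256d c3 = _mm256_fmadd_pd(a, _mm256_loadu_pd(b + 12), _mm256_loadu_pd(c + 12));
                _mm256_storeu_pd(c,      c0);
                _mm256_storeu_pd(c + 4,  c1);
                _mm256_storeu_pd(c + 8,  c2);
                _mm256_storeu_pd(c + 12, c3);
            }
        }
    }
}
```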
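Blocking, shown in scalar form for clarity (combining it with the unrolled SIMD tail cases is the part we never resolved). The loops are tiled so a BLOCK-by-BLOCK working set of B and C stays in cache; BLOCK is a tuning parameter, and 32 here is only an assumed example value.

```c
#define BLOCK 32

static int min_int(int a, int b) { return a < b ? a : b; }

/* Phase 4: tile the loops so each tile of B and C is reused from
 * cache instead of being re-fetched from memory. */
void mul_matrix_blocked(double *C, const double *A, const double *B, int n) {
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int jj = 0; jj < n; jj += BLOCK)
                for (int i = ii; i < min_int(ii + BLOCK, n); i++)
                    for (int k = kk; k < min_int(kk + BLOCK, n); k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < min_int(jj + BLOCK, n); j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```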
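And the allocation change, sketched with an assumed struct layout: one calloc for all the entries, with the row-pointer array pointing into that single block instead of a separate malloc per row. Row i then starts at offset i * cols, so the whole matrix is contiguous and friendly to both the cache and the SIMD loads above.

```c
#include <stdlib.h>

/* Assumed stand-in for the project's matrix struct. */
typedef struct { double **data; int rows, cols; } matrix;

/* One contiguous, zero-initialized block for all entries. */
int allocate_matrix_contiguous(matrix **mat, int rows, int cols) {
    matrix *m = malloc(sizeof(matrix));
    double *block = calloc((size_t)rows * cols, sizeof(double));
    double **row_ptrs = malloc(rows * sizeof(double *));
    if (m == NULL || block == NULL || row_ptrs == NULL) {
        free(m); free(block); free(row_ptrs);  /* free(NULL) is a no-op */
        return -1;
    }
    for (int i = 0; i < rows; i++)
        row_ptrs[i] = block + (size_t)i * cols;  /* row i starts at offset i*cols */
    m->data = row_ptrs;
    m->rows = rows;
    m->cols = cols;
    *mat = m;
    return 0;
}
```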
Overall, we enjoyed the experience of directly modifying our code to better interact with the hardware we learned about in this class, such as caches, cores, and processors. This hands-on work made us more comfortable with code optimization and managing parallelism, and we hope to carry these skills into future work, keeping all of these different ways of optimizing a program in our arsenal.