Welcome to the repository for the Machine Learning Systems course (INFR11269) for the 2024/2025 academic year. This course focuses on building and deploying machine learning systems, with hands-on programming tasks, paper writing, and peer reviews.
The full course schedule, assessments, and additional details are available on the official course page:
Machine Learning Systems - 2024/2025
- Task-1: Part 1, Implementing machine learning operators with GPU programming.
- Task-2: Part 2, Integrating the operator into a distributed ML system (ServerlessLLM + RAG).
- Resources: Slides and reading materials related to the course.
- [11/02/2025] Updated the instructions for the PyTorch demo. If you encounter a `No disk space` error, try logging into Interactive mode first and installing the environment on the node.
- [05/02/2025] We have uploaded the code template for the first part of the assessment into the `task-1` folder. Additionally, we have relocated the `pytorch-demo` to the `resources` directory and have included materials for `gpu-programming` in the same directory. The Part 1 specification is under the `Assessment` section on Learn.
- Implement an ML operator using Triton/CuPy.
- Learn about performance optimization and profiling.
- Integrate your Task 1 operator into a distributed ML system using ServerlessLLM and RAG.
- Write a paper documenting your work on both tasks in the format of a NeurIPS or ICML paper.
The course consists of 10 weeks of lectures and Q&A sessions. Each week has the following structure:
- Lectures: Core topics presented by the primary and guest lecturers.
- Q&A Sessions: Focused on solving problems, demos, and discussing task-related questions.
The full course schedule is available here.