A 120-day CUDA learning plan covering daily concepts, exercises, pitfalls, and references (including “Programming Massively Parallel Processors”). Features six capstone projects to solidify GPU parallel programming, memory management, and performance optimization skills.
A structured, day-by-day plan to master NVIDIA CUDA programming over 120 days. Each day includes:
- Core Topic
- Practical Exercise / Mini-Project
- Debugging Pitfalls
- Resource Suggestions
Six Capstone Projects are spread out at Days 20, 40, 60, 80, 100, and 120 to synthesize the skills acquired.
Day | Core Topic | Practical Exercise / Mini-Project | Debugging Pitfalls | Resource Suggestions |
---|---|---|---|---|
1 | [Introduction to GPU Computing & CUDA](daily-updates/day-01-intro-cuda.md) | Write a simple “Hello GPU” kernel that prints a message from one thread. | Missing `<cuda_runtime.h>` or incorrect compiler flags. | CUDA C Programming Guide, Introduction |
2 | Setting Up the Development Environment | Install CUDA Toolkit & drivers; compile and run sample codes. | Incompatible driver/toolkit versions. | Official “Getting Started” Guide |
3 | GPU vs. CPU Architecture Foundations | Compare GPU SMs and CPU cores; discuss throughput vs. latency. | Assigning latency-sensitive serial work to the GPU, or throughput-heavy parallel work to the CPU, leads to inefficient designs. | CUDA C Programming Guide, “Hardware Model Overview” |
4 | Thread Hierarchy: Grids & Blocks | Launch a kernel using different grid/block dimensions, experiment with thread indexing. | Off-by-one errors in thread indexing. | CUDA C Programming Guide, “Thread Hierarchy” |
5 | Thread Hierarchy: Warps (Intro) | Inspect warp size and how threads are grouped (no direct warp programming yet). | Divergence issues in the same warp. | CUDA C Best Practices Guide, “Warps and SIMT Model” |
6 | Basic Kernel Launch & Execution | Write a kernel for element-wise addition on a small array. | Mixing up `cudaMemcpyHostToDevice` vs. `cudaMemcpyDeviceToHost`. | CUDA Samples, Vector Addition |
7 | Memory Model (Global Memory Basics) | Transfer data to global memory, run a simple computation, transfer results back. | Mismatched data sizes or forgetting to free GPU memory. | CUDA C Programming Guide, “Global Memory Access” |
8 | Memory Allocation & Pointers | Use `cudaMalloc`/`cudaFree`; practice error checking. | Memory leaks if forgetting to free device pointers. | CUDA C Programming Guide, “Memory Allocation” |
9 | Memory Alignment & Coalescing | Benchmark coalesced vs. non-coalesced memory accesses in a kernel. | Non-contiguous access patterns hurt performance. | CUDA C Best Practices Guide, “Memory Coalescing” |
10 | Shared Memory Fundamentals | Implement tile-based matrix multiplication using shared memory. | Race conditions without thread sync. | CUDA C Programming Guide, “Shared Memory” |
11 | Thread Synchronization (`__syncthreads()`) | Extend tile-based multiplication with sync calls; measure performance. | Missing `__syncthreads()` => partial updates. | CUDA C Programming Guide, “Synchronization” |
12 | Bank Conflicts in Shared Memory | Test an access pattern that causes bank conflicts; measure performance impact. | Overlooking bank conflict in shared memory accesses. | CUDA C Programming Guide, “Shared Memory Bank Conflicts” |
13 | Basic Atomic Operations | Use `atomicAdd` to sum an array in parallel. | Large-scale sums using atomics can degrade performance. | CUDA C Programming Guide, “Atomic Functions” |
14 | Progress Checkpoint | Quick recap or quiz: global vs. shared memory usage. | Not consolidating learning => repeated mistakes. | Review Days 1–13 notes; create a “lessons learned” list |
15 | Advanced Atomic Operations | Experiment with `atomicCAS`, `atomicExch`, etc. | Overusing atomics => performance bottlenecks. | CUDA C Programming Guide, “Atomic Functions” |
16 | Kernel Configuration Tuning | Adjust block sizes for the same kernel; observe occupancy changes. | Non-multiples of warp size can lower occupancy. | CUDA Occupancy Calculator (within Nsight Compute) |
17 | Host-Device Synchronization Patterns | Use `cudaDeviceSynchronize()` for timing; measure kernel durations. | Missing sync => partial results readback. | CUDA C Programming Guide, “Device Synchronization” |
18 | Error Handling & `cudaGetErrorString()` | Implement robust error checks after each CUDA call. | Ignoring errors => hard-to-trace bugs. | CUDA Error Handling |
19 | Unified Memory (UM) Intro | Use `cudaMallocManaged`; run simple vector addition. | Forgetting that UM still migrates data. | CUDA C Programming Guide, “Unified Memory” |
20 | Capstone Project #1 | Image Convolution Pipeline: 2D convolution (e.g., edge detection) on GPU. | – | – |
21 | Streams & Concurrency (Basics) | Launch two kernels in different streams; check overlap. | Work in the legacy default stream synchronizes with other blocking streams, so kernels launched there do not overlap. | CUDA C Programming Guide, “Streams” |
22 | Events & Timing | Use CUDA events for precise kernel timing. | Misplacing events => skewed time measurements. | CUDA C Programming Guide, “Events” |
23 | Asynchronous Memory Copy | Copy data using streams asynchronously. | Not using pinned host memory => limited async. | CUDA C Programming Guide, “Asynchronous Transfers” |
24 | Pinned (Page-Locked) Memory | Compare pinned vs. pageable host memory transfers. | Excessive pinned memory => system performance hit. | CUDA C Programming Guide, “Pinned Memory” |
25 | Double Buffering Technique | Implement a two-buffer pipeline to overlap compute and transfer. | Improper sync => reading incomplete data. | CUDA Concurrent Kernels Sample |
26 | Constant Memory | Use constant memory for read-only data (e.g., coefficients). | Over the constant memory limit or forgetting read-only usage. | CUDA C Programming Guide, “Constant Memory” |
27 | Texture & Surface Memory (Intro) | Sample a small 2D texture; compare vs. global memory fetch. | Missing texture binding/unbinding steps. | CUDA C Programming Guide, “Texture Memory” |
28 | Progress Checkpoint | Recap concurrency & memory (short quiz or multi-topic mini-project). | Gaps in concurrency knowledge => suboptimal performance. | Review Days 15–27 notes |
29 | Texture Memory (Practical) | Implement image-processing kernel (e.g., grayscale) using textures. | Incorrect normalized coordinates => invalid fetch. | CUDA C Programming Guide, “Texture Reference” |
30 | Surface Memory | Write operations using surfaces (e.g., output image buffer). | Coordinate boundary checks are critical. | CUDA C Programming Guide, “Surface Reference” |
31 | Unified Memory Deep Dive | Use `cudaMallocManaged` with multiple kernels; measure page-fault overhead. | Poor data locality => excessive page migration. | CUDA Unified Memory |
32 | Stream Sync & Dependencies | Enforce execution order with events or `cudaStreamWaitEvent()`. | Forgetting to record an event => faulty sync. | CUDA C Programming Guide, “Stream Synchronization” |
33 | Intro to CUDA Graphs | Convert a kernel sequence into a CUDA graph; measure performance. | Mistakes in graph capture => unexpected results. | CUDA C Programming Guide, “Graphs” |
34 | Nsight Systems / Nsight Compute | Profile a small app to find bottlenecks; read kernel timelines. | Misreading profiling data => false conclusions. | Nsight Systems / Nsight Compute |
35 | Occupancy & Launch Config Tuning | Use the Occupancy Calculator to refine block size for better SM use. | Overlooking shared memory => reduced occupancy. | CUDA C Best Practices Guide, “Occupancy” |
36 | Profiling & Bottleneck Analysis | Profile matrix multiplication or similar; identify memory vs. compute limits. | Only measuring compute => ignoring memory throughput (or vice versa). | Nsight Compute Metrics |
37 | Intro to Warp-Level Primitives | Use warp shuffle instructions for a small parallel reduce. | Divergence or incorrect shuffle masks can ruin results. | CUDA C Programming Guide, “Warp Intrinsics” |
38 | Warp Divergence | Write a kernel with branching; measure performance difference. | Excessive branching => warp serialization. | CUDA C Best Practices Guide, “Minimizing Divergence” |
39 | Dynamic Parallelism | Launch kernels from within a kernel to handle subdivided tasks. | Over-launching can deplete resources quickly. | CUDA C Programming Guide, “Dynamic Parallelism” |
40 | Capstone Project #2 | Sparse Matrix-Vector Multiplication for large sparse data sets. | – | – |
41 | Advanced Streams & Multi-Stream Concurrency | Launch multiple kernels in parallel using multiple streams. | Oversubscribing GPU with too many streams. | CUDA C Programming Guide, “Streams” |
42 | Progress Checkpoint | Recap concurrency, warp ops, dynamic parallelism. | Missing these => issues in advanced topics later. | Short quiz or combined mini-project for Days 29–41 |
43 | Efficient Data Transfers & Zero-Copy | Map host memory into device space (zero-copy); measure overhead vs. pinned. | Improper use => performance losses. | CUDA C Programming Guide, “Zero-Copy” |
44 | Advanced Warp Intrinsics (Scan, etc.) | Implement a warp-wide prefix sum with `__shfl_up_sync`. | Misaligned warp sync => incorrect partial sums. | CUDA C Programming Guide, “Shuffle and Cooperative Groups” |
45 | Cooperative Groups (Intro) | Use cooperative groups for flexible synchronization within blocks or grids. | Not all GPUs support advanced groups fully. | CUDA C Programming Guide, “Cooperative Groups” |
46 | Peer-to-Peer Communication (Multi-GPU) | Enable P2P for direct data transfers (if you have multiple GPUs). | GPU compatibility issues block P2P. | CUDA C Programming Guide, “Peer-to-Peer Memory Access” |
47 | Intermediate Debugging & Profiling Tools | Use `cuda-gdb` or Nsight Eclipse for step-by-step debugging. | Missing sync can make debugging inconsistent. | CUDA Debugger Docs |
48 | Memory Footprint Optimization | Reduce shared memory or register usage; measure occupancy. | Overuse => occupancy drop. | CUDA C Best Practices Guide, “Memory & Register Optimization” |
49 | Thrust for High-Level Operations | Replace custom loops with Thrust transforms, sorts, reductions. | Excessive device-host transfers if not structured well. | Thrust Library |
50 | Intro to cuBLAS | Perform basic vector/matrix ops with cuBLAS, compare to custom kernels. | Type mismatches or library error handling. | cuBLAS Library User Guide |
51 | Intro to cuFFT | Implement a simple 1D FFT on the GPU; measure performance. | Incorrect sizing, padding => wrong results. | cuFFT Library User Guide |
52 | Code Optimization (Part 1) | Loop unrolling or register usage tweaks; measure improvements. | Over-optimizing => diminishing returns. | CUDA C Best Practices Guide, “Kernel Optimization” |
53 | Code Optimization (Part 2) | Analyze PTX, apply instruction-level optimizations. | PTX can vary across CUDA toolkit versions. | PTX tools: `ptxas`, `cuobjdump` |
54 | Nsight Compute: Kernel Analysis | Examine occupancy, memory throughput, and instruction mix. | Ignoring memory metrics => incomplete optimization approach. | Nsight Compute Docs |
55 | Intro to Device Libraries (cuRAND, etc.) | Generate random numbers (cuRAND); run a Monte Carlo simulation. | Misusing seeds or distribution params. | cuRAND Library User Guide |
56 | Progress Checkpoint | Recap concurrency (multi-stream), libraries, optimization. | Fragmented knowledge => missed performance. | Summarize library-based vs. custom kernels |
57 | Robust Error Handling & Debugging | Expand error checking macros; advanced debugging with `cuda-gdb`. | Release builds can hide errors if not checked. | CUDA Runtime Error Handling |
58 | Handling Large Data Sets | Chunk large arrays with streaming techniques. | Possible out-of-range indices in chunk loops. | CUDA C Best Practices Guide, “Large Data Handling” |
59 | MPS (Multi-Process Service) | Enable MPS for sharing GPU among multiple processes (if supported). | Oversubscription => resource contention. | NVIDIA MPS Documentation |
60 | Capstone Project #3 | Multi-Stream Data Processing: Overlap transfers & kernels for real-time feeds. | – | – |
61 | GPU-Accelerated Sorting | Use Thrust’s `sort`; compare vs. CPU for large data. | Memory bandwidth constraints. | Thrust Sort Examples |
62 | Stream Compaction & Parallel Patterns | Implement parallel compaction (remove zeros) via Thrust or custom. | Prefix sums must be correct or compaction fails. | Thrust, Parallel Patterns |
63 | Concurrency Patterns (Producer-Consumer) | Pipeline kernels: one generating data, one consuming it. | Missing sync can drop data. | CUDA C Programming Guide, “Streams & Concurrency” |
64 | Pinned + Unified Memory Hybrid | Use pinned memory for input streaming, unified memory for intermediate results. | Over-complicating memory usage => confusion. | Unified vs. Pinned Memory Best Practices |
65 | Collaborative Grouping Techniques | Use cooperative groups for advanced reductions. | Not all GPUs support grid-level groups. | CUDA C Programming Guide, “Cooperative Groups” |
66 | Peer-to-Peer (P2P) & Multi-GPU Scaling | Split data across multiple GPUs if available. | Mismatched driver/device IDs block P2P. | Multi-GPU Programming |
67 | GPU-Accelerated Graph Analytics (Intro) | Simple BFS or PageRank with adjacency lists on the GPU. | Random memory access => poor coalescing. | NVIDIA Developer Blog (search for graph analytics) |
68 | Memory Pool & Custom Allocators | Reuse device memory with a custom allocator to reduce `cudaMalloc` overhead. | Not thread-safe under dynamic parallelism. | `cudaMallocAsync` Docs |
69 | Occupancy-Based Tuning for Large Problems | Maximize occupancy on a large matrix multiplication. | Register usage can limit occupancy. | Nsight Compute Occupancy Analysis |
70 | Progress Checkpoint | Recap concurrency patterns, advanced memory, multi-GPU. | Small concurrency oversights => big issues. | Review key lessons from Days 57–69 |
71 | Advanced Streams & Overlapping | Overlap multiple kernels, data transfers, and CPU tasks. | Missing sync => race conditions. | Nsight Systems Docs |
72 | CUDA Graphs: Complex Workflows | Merge dependent kernels & copies into one CUDA graph. | Missing finalization => incomplete runs. | CUDA C Programming Guide, “Graph Best Practices” |
73 | Dynamic Graph Launches | Build and launch graphs at runtime based on conditions. | Too many small graphs => overhead. | Nsight Documentation |
74 | Multi-GPU Programming (Deeper Exploration) | Distribute workload across two GPUs if hardware supports. | Load imbalance if data not split evenly. | Multi-GPU Programming |
75 | Performance Metrics & Roofline Analysis | Collect memory throughput, FLOPS, chart on a roofline. | Misreading memory/compute-bound => wrong optimization focus. | NVIDIA Roofline Analysis Blog |
76 | Mixed Precision & Tensor Cores (If Supported) | Implement half-precision (FP16) matrix multiply on Tensor Cores. | Precision trade-offs can break accuracy. | CUDA C Programming Guide, “Tensor Core Programming” |
77 | UM Advanced Topics (Prefetch, Advise) | Use `cudaMemAdvise`; prefetch data to specific devices. | Incorrect prefetch sizes => performance drop. | CUDA Unified Memory |
78 | Large-Scale Projects: Modular Kernel Design | Split large kernels into smaller, manageable modules. | Over-fragmentation => overhead from multiple launches. | CUDA C Best Practices Guide, “Modular Kernel Design” |
79 | Portability & Scalability Best Practices | Adjust code for various GPU architectures (SM versions). | Hardcoding for a single GPU => poor portability. | CUDA Toolkit Release Notes |
80 | Capstone Project #4 | Multi-GPU Matrix Multiply: Split large matrix across 2 GPUs. | – | – |
81 | Cooperative Groups: Advanced Patterns | Try a grid-level cooperative kernel needing all blocks to sync. | Not all GPUs support grid-level sync. | CUDA C Programming Guide, “Cooperative Launch” |
82 | Large-Scale Batch Processing | Use batched operations (cuBLAS batched GEMM) for efficiency. | Wrong batch sizes => idle SMs. | cuBLAS Documentation |
83 | External Libraries (cuDNN, etc.) | Integrate a small NN layer using cuDNN if possible. | Version mismatches => runtime errors. | cuDNN Library User Guide |
84 | Progress Checkpoint | Reflect on concurrency, multi-GPU, libraries. | Missing synergy => suboptimal solutions. | Short recap or quiz for Days 71–83 |
85 | Instruction Throughput Profiling | Use Nsight Compute to track instruction throughput for tight kernels. | Ignoring special function units or double precision differences. | Nsight Compute Docs |
86 | Occupancy vs. ILP | Compare effects of occupancy vs. ILP (Instruction-Level Parallelism). | Forcing ILP might drop occupancy. | CUDA C Best Practices Guide |
87 | Custom Memory Allocators | Extend your memory pool design with stream-ordered allocations. | Race conditions if multiple streams allocate at once. | cudaMallocAsync Docs |
88 | Kernel Fusion & Loop Fusion | Merge multiple small kernels into a single kernel to reduce launch overhead. | Oversized kernels can exceed register/shared limits. | CUDA C Best Practices Guide, “Kernel Fusion” |
89 | Algorithmic Optimizations (Tiling, Blocking) | Refine tiling or blocking for matrix multiply, convolution, etc. | Over-tiling => overhead > gains. | GPU Gems, HPC papers on tiling strategies |
90 | Minimizing Data Transfers | Use pinned memory, async transfers, or kernel-side generation to limit PCIe overhead. | Large single transfers might block concurrency. | CUDA C Best Practices Guide, “Data Transfer” |
91 | Enterprise-Level Code Structure | Explore multi-file, multi-module approach with separate compilation. | Linker errors if not using correct NVCC flags. | CUDA C Programming Guide, “Separate Compilation and Linking” |
92 | Advanced Debugging (Races & Deadlocks) | Diagnose a race or deadlock in a complex multi-stream or multi-block scenario. | Relying on `__syncthreads()` incorrectly => deadlock. | CUDA Debugger Docs |
93 | Real-Time GPU Computing Techniques | If real-time constraints exist, explore low-latency execution patterns. | GPU scheduling unpredictability if shared environment. | NVIDIA Documentation on Real-Time GPU |
94 | Host Multithreading + GPU Coordination | Use multiple CPU threads to launch kernels/manage streams concurrently. | CPU thread contention can throttle the GPU. | CUDA C Best Practices Guide, “Multi-threaded Host Code” |
95 | CUDA Graph Updates & Reusability | Dynamically update parts of a CUDA graph without a full rebuild. | Mishandling graph node handles => crashes. | CUDA Graphs, advanced topics |
96 | Precision & Numerical Stability | Examine rounding, float vs. double, iterative error accumulation. | Small errors compound over iterations. | Kahan Summation, HPC references |
97 | Advanced P2P & Clustering (If Possible) | Use GPU-GPU RDMA or multi-node scaling in a cluster environment. | Not all hardware supports GPUDirect RDMA. | GPUDirect RDMA Docs |
98 | Progress Checkpoint | Recap advanced debugging, multi-threaded host, graphs, precision. | Missing these => major issues in large apps. | Summarize Days 85–97 |
99 | Graph API for Complex DAG Workloads | Build a multi-kernel DAG with conditional branches/loops using CUDA Graphs. | Overcomplicated graph => overhead. | CUDA Graph Examples |
100 | Capstone Project #5 | CUDA Graph-Optimized Workload: Merge multiple kernels + copies into one graph. | – | – |
101 | GPU-Accelerated ML Frameworks (Intro) | If possible, integrate a custom kernel/layer into TensorFlow or PyTorch. | Version mismatches => build errors. | PyTorch CUDA Extensions, TensorFlow Custom Ops |
102 | CUDA + Other Parallel Frameworks | Explore hybrid CPU/GPU parallelism (OpenMP, MPI). | Partitioning complexities, oversubscription risk. | MPI + CUDA Resources, HPC references |
103 | Tuning GPU-Accelerated ML Ops | Profile a small neural net or inference pipeline; identify GPU hotspots. | Overlooking CPU-GPU sync or data pipeline bottlenecks. | Nsight Systems for ML Frameworks |
104 | Multi-GPU Scaling in ML | Distribute training across multiple GPUs or data parallel approach. | Communication overhead if poorly balanced. | NCCL Library, multi-GPU references |
105 | HPC: Memory Throughput & Computation | Review HPC patterns (PDE solvers, climate modeling) for GPU acceleration. | Memory-bound kernels for large stencils. | AMGX, HPC domain-specific papers |
106 | HPC: Precision & Mixed Precision | Use half or custom data types for HPC kernels if feasible. | Must confirm numerical stability. | HPC papers on mixed precision, e.g. SC conference proceedings |
107 | Advanced Debugging Tools (cuda-memcheck, etc.) | Use `cuda-memcheck` (or Compute Sanitizer, its replacement in recent toolkits) for memory leak/race detection in a bigger scenario. | Errors may appear asynchronously. | cuda-memcheck Docs |
108 | Graphics Interop (OpenGL/DX) | If relevant, share buffers between CUDA and graphics APIs. | Handling context incorrectly => errors. | CUDA Graphics Interop |
109 | Large-Scale Code, Maintainability | Organize your code into modules/libraries; consider CMake for builds. | Linker errors if ignoring separate compilation steps. | Separate Compilation Guide |
110 | HPC Tools & Libraries (MAGMA, etc.) | Try MAGMA for advanced linear algebra on GPU. | Library/hardware version mismatches. | MAGMA |
111 | Testing & Validation Strategies | Implement unit tests for GPU kernels using CPU reference checks. | Race conditions might appear only at high thread counts. | Google Test examples with CUDA |
112 | Progress Checkpoint | Reflect on HPC/ML techniques, debugging, multi-GPU scaling. | Missing HPC/ML concepts => domain inefficiency. | Recap Days 99–111; compile best practices |
113 | Revisiting Key Optimizations | Identify top 3 bottlenecks in your main code; systematically address them. | Missing “low-hanging fruit” => suboptimal gains. | Re-profile with Nsight, confirm improvements |
114 | GPU Scheduling & `CUcontext` Exploration | Investigate multiple contexts/users sharing GPU resources. | Misusing contexts => driver crashes. | CUDA Driver API, “Context Management” |
115 | Final Performance Tweaks & Fine-Tuning | Adjust L1/Shared memory config if your GPU allows; fine-tune block dimensions. | Over-optimizing => not portable across GPU models. | Architecture-specific best practices (e.g., Turing, Ampere) |
116 | Memory Hierarchy Mastery | Create a reference diagram of global, shared, local, constant, texture, etc. | Overlooking local memory spills into global. | CUDA C Programming Guide, memory hierarchy sections |
117 | Detailed Profiling Recap | Re-profile older mini-projects; apply new knowledge for more gains. | Inconsistent metrics => unclear improvements. | Nsight Systems, Nsight Compute |
118 | Review of Common Pitfalls | Make a checklist of frequent issues: out-of-bounds, race conditions, divergence, etc. | Forgetting final checks => final project problems. | Personal “CUDA Pitfalls & Solutions” doc |
119 | Prep for Final Capstone | Check environment, references, library versions; plan scope carefully. | Over-scoping => incomplete final project. | Double-check docs & notes |
120 | Capstone Project #6 | Final Project: End-to-End HPC or ML Application. | – | – |
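
Days 6 through 8 and Day 18 add up to one round trip: allocate device buffers, copy inputs over, launch a kernel, check for errors, copy the result back, and free. The sketch below is one possible way to combine those steps, not a prescribed solution. `CUDA_CHECK`, `vecAdd`, and `vecAddOnDevice` are hypothetical names chosen for illustration, and the code assumes both input vectors have the same length.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper macro: print the CUDA error with file/line and abort.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// One thread per element; the bounds check guards the last partial block.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical host wrapper; assumes a.size() == b.size().
std::vector<float> vecAddOnDevice(const std::vector<float>& a,
                                  const std::vector<float>& b) {
    const int n = static_cast<int>(a.size());
    std::vector<float> c(n);
    if (n == 0) return c;  // a zero-sized grid is an invalid launch
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    CUDA_CHECK(cudaMalloc(&dA, bytes));
    CUDA_CHECK(cudaMalloc(&dB, bytes));
    CUDA_CHECK(cudaMalloc(&dC, bytes));
    CUDA_CHECK(cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // round up
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);
    CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // surfaces errors raised during execution

    CUDA_CHECK(cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dA));
    CUDA_CHECK(cudaFree(dB));
    CUDA_CHECK(cudaFree(dC));
    return c;
}
```

Calling `vecAddOnDevice` from a `main` function with two small vectors and comparing against a CPU loop is a quick way to confirm the toolchain and driver are working before moving on to shared memory.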
Capstone | Project & Description | Core Skills | Implementation & Optimization Tips |
---|---|---|---|
1 (Day 20) | Image Convolution Pipeline: Implement a 2D convolution (edge detection) on the GPU. | Grid/Block config; Shared memory tiling; Basic concurrency | Optimize shared memory usage; Use events for performance measurement; Validate vs. CPU reference |
2 (Day 40) | Sparse Matrix-Vector Multiplication: Handle large sparse matrices; measure speed-up vs. CPU. | Memory coalescing; Atomic add for scatter; Stream concurrency | Use CSR format; Handle boundary conditions carefully; Profile with Nsight for memory/compute bottlenecks |
3 (Day 60) | Multi-Stream Data Processing: Overlap data transfers & kernel execution for real-time feeds. | Stream usage; Async transfers; Concurrency patterns | Pin host memory for async; Keep multiple streams busy; Monitor concurrency in Nsight |
4 (Day 80) | Multi-GPU Matrix Multiply: Split a large matrix across two GPUs if possible. | Multi-GPU partitioning; Peer-to-peer (if hardware allows); Advanced sync | Ensure load balance; Use P2P for direct memory copying; Measure scaling efficiency |
5 (Day 100) | CUDA Graph-Optimized Workload: Convert a pipeline of kernels & copies into a single CUDA graph. | CUDA Graph creation; Graph instantiation & reuse; Profiling graph approach | Identify common subgraphs; Update or re-launch partial graphs if workflow changes; Compare timeline vs. standard stream |
6 (Day 120) | Final Project: End-to-End Application. HPC or ML (e.g., PDE solver or small model training) on the GPU. | Concurrency, memory mgmt, multi-GPU or libraries, debugging, final optimizations | Plan data management; Thorough accuracy validation; Use Nsight, `cuda-gdb`, timing events; Aim for stable, well-commented, high-performance code |
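
Two of the tips for Capstone 1, timing with events and validating against a CPU reference, can be sketched together. The example below is a deliberately simple starting point under stated assumptions: it uses a global-memory 3x3 filter (no shared-memory tiling yet), a synthetic input image, and a Sobel-x kernel. The names `conv3x3`, `conv3x3Cpu`, and `convMaxError` are hypothetical, and error checks are left out to keep it short.

```cuda
#include <algorithm>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// One thread per output pixel; pixels outside the image count as zero.
// The filter is applied without flipping (cross-correlation), the usual
// convention for image filters such as Sobel.
__global__ void conv3x3(const float* in, const float* k, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float acc = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int sx = x + dx, sy = y + dy;
            if (sx >= 0 && sx < w && sy >= 0 && sy < h)
                acc += in[sy * w + sx] * k[(dy + 1) * 3 + (dx + 1)];
        }
    out[y * w + x] = acc;
}

// CPU reference with the same loop order as the kernel.
void conv3x3Cpu(const std::vector<float>& in, const float* k,
                std::vector<float>& out, int w, int h) {
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    int sx = x + dx, sy = y + dy;
                    if (sx >= 0 && sx < w && sy >= 0 && sy < h)
                        acc += in[sy * w + sx] * k[(dy + 1) * 3 + (dx + 1)];
                }
            out[y * w + x] = acc;
        }
}

// Hypothetical driver: runs the kernel once, writes the elapsed kernel time
// in milliseconds to *ms, and returns the largest absolute GPU/CPU difference.
float convMaxError(int w, int h, float* ms) {
    const size_t count = static_cast<size_t>(w) * h;
    const size_t bytes = count * sizeof(float);
    std::vector<float> in(count), gpu(count), cpu(count);
    for (size_t i = 0; i < count; ++i) in[i] = static_cast<float>(i % 7);  // synthetic image
    const float k[9] = {-1.0f, 0.0f, 1.0f, -2.0f, 0.0f, 2.0f, -1.0f, 0.0f, 1.0f};  // Sobel x

    float *dIn = nullptr, *dK = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dK, sizeof(k));
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dK, k, sizeof(k), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    cudaEventRecord(start);               // events bracket only the kernel
    conv3x3<<<grid, block>>>(dIn, dK, dOut, w, h);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);           // wait until the stop event has occurred
    cudaEventElapsedTime(ms, start, stop);

    cudaMemcpy(gpu.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    conv3x3Cpu(in, k, cpu, w, h);
    float maxErr = 0.0f;
    for (size_t i = 0; i < count; ++i)
        maxErr = std::max(maxErr, std::fabs(gpu[i] - cpu[i]));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dIn);
    cudaFree(dK);
    cudaFree(dOut);
    return maxErr;
}
```

Recording the events immediately around the launch keeps transfer time out of the measurement; a shared-memory tiled version can then be timed the same way and checked against the same CPU reference.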
- Official Documentation
- Keep Track of Your Environment
  - Note your CUDA version, driver version, and GPU architecture (compute capability).
  - Certain features (e.g., `cudaMallocAsync`, cooperative groups) require specific toolkit versions or newer GPUs.
- Profile Early and Often
  - Nsight Systems for system-wide, high-level analysis.
  - Nsight Compute for detailed kernel-level performance metrics.
- Maintain a Personal Cookbook
  - Collect error-handling macros, common patterns (e.g., parallel reductions, prefix sums, warp shuffles).
  - Keep a record of lessons learned, pitfalls, and solutions.
- Test Frequently
  - Write small reference tests or CPU comparisons to validate correctness.
  - For performance improvements, record baseline metrics before changes and compare after.
- Explore Community & Research
  - NVIDIA Developer Forums
  - GPU Gems
  - Various HPC and parallel computing conference papers for domain-specific insights.
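
As an example of the kind of entry a personal cookbook might hold, here is a minimal sketch of a parallel sum built from a warp-shuffle reduction and one `atomicAdd` per warp. It is an illustration, not a tuned implementation: `warpReduceSum`, `sumKernel`, and `sumOnDevice` are hypothetical names, error checks are omitted, and the block size is assumed to be a multiple of 32 so the full shuffle mask is valid.

```cuda
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// Sum across one warp; after the loop, lane 0 holds the warp total.
__inline__ __device__ float warpReduceSum(float v) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

__global__ void sumKernel(const float* in, float* out, int n) {
    float v = 0.0f;
    // Grid-stride loop: each thread accumulates a private partial sum.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        v += in[i];
    // Every thread reaches this point, so all 32 lanes take part in the shuffle.
    v = warpReduceSum(v);
    if (threadIdx.x % warpSize == 0) atomicAdd(out, v);  // one atomic per warp
}

// Hypothetical host wrapper: returns the sum of `data` computed on the GPU.
float sumOnDevice(const std::vector<float>& data) {
    const int n = static_cast<int>(data.size());
    if (n == 0) return 0.0f;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, data.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, sizeof(float));  // all-zero bytes represent 0.0f

    const int threads = 256;  // multiple of 32, as assumed above
    const int blocks = std::min((n + threads - 1) / threads, 1024);
    sumKernel<<<blocks, threads>>>(dIn, dOut, n);

    float result = 0.0f;
    // A blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Because floating-point addition is not associative, the result can differ slightly from a sequential CPU sum on general inputs, so comparisons should use a tolerance rather than exact equality.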
Happy coding, and best of luck on your 120-Day CUDA Journey!
For the latest repository activity, see Repository Activity Log.
- Day 1: Introduction to GPU Computing & CUDA
- Day 3: GPU vs. CPU Architecture Foundations
- Day 4: Thread Hierarchy: Grids & Blocks
- Day 8: Memory Allocation & Pointers
- Day 9: Memory Alignment & Coalescing
- Day 10: Shared Memory Fundamentals
- Day 11: Thread Synchronization (__syncthreads())
- Day 12: Bank Conflicts in Shared Memory
- Day 13: Basic Atomic Operations
- Day 14: Progress Checkpoint
- Day 15: Advanced Atomic Operations
- Day 16: Kernel Configuration Tuning
- Day 17: Host-Device Synchronization Patterns
- Day 18: Error Handling & cudaGetErrorString()
- Day 19: Unified Memory (UM) Intro
- Day 20: Capstone Project #1
- Day 21: Streams & Concurrency (Basics)
- Day 22: Events & Timing
- Day 23: Asynchronous Memory Copy
- Day 24: Pinned (Page-Locked) Memory
- Day 25: Double Buffering Technique
- Day 26: Constant Memory
- Day 27: Texture & Surface Memory (Intro)
- Day 28: Progress Checkpoint
- Day 29: Texture Memory (Practical)
- Day 30: Surface Memory