Cuda-120-Days-Challenge

A 120-day CUDA learning plan covering daily concepts, exercises, pitfalls, and references (including “Programming Massively Parallel Processors”). Features six capstone projects to solidify GPU parallel programming, memory management, and performance optimization skills.

120-Day CUDA Curriculum

A structured, day-by-day plan to master NVIDIA CUDA programming over 120 days. Each day includes:

  • Core Topic
  • Practical Exercise / Mini-Project
  • Debugging Pitfalls
  • Resource Suggestions

Six Capstone Projects are scheduled for Days 20, 40, 60, 80, 100, and 120 to synthesize the skills acquired along the way.


Day-by-Day CUDA Curriculum

| Day | Core Topic | Practical Exercise / Mini-Project | Debugging Pitfalls | Resource Suggestions |
| --- | --- | --- | --- | --- |
| 1 | [Introduction to GPU Computing & CUDA](daily-updates/day-01-intro-cuda.md) | Write a simple “Hello GPU” kernel that prints a message from one thread. | Missing `<cuda_runtime.h>` or incorrect compiler flags. | CUDA C Programming Guide, Introduction |
| 2 | Setting Up the Development Environment | Install CUDA Toolkit & drivers; compile and run sample codes. | Incompatible driver/toolkit versions. | Official “Getting Started” Guide |
| 3 | GPU vs. CPU Architecture Foundations | Compare GPU SMs and CPU cores; discuss throughput vs. latency. | Mixing up CPU and GPU roles can lead to design inefficiencies. | CUDA C Programming Guide, “Hardware Model Overview” |
| 4 | Thread Hierarchy: Grids & Blocks | Launch a kernel using different grid/block dimensions; experiment with thread indexing. | Off-by-one errors in thread indexing. | CUDA C Programming Guide, “Thread Hierarchy” |
| 5 | Thread Hierarchy: Warps (Intro) | Inspect warp size and how threads are grouped (no direct warp programming yet). | Divergence issues in the same warp. | CUDA C Best Practices Guide, “Warps and SIMT Model” |
| 6 | Basic Kernel Launch & Execution | Write a kernel for element-wise addition on a small array (see the first sketch after this table). | Mixing up `cudaMemcpyHostToDevice` vs. `cudaMemcpyDeviceToHost`. | CUDA Samples, Vector Addition |
| 7 | Memory Model (Global Memory Basics) | Transfer data to global memory, run a simple computation, transfer results back. | Mismatched data sizes or forgetting to free GPU memory. | CUDA C Programming Guide, “Global Memory Access” |
| 8 | Memory Allocation & Pointers | Use `cudaMalloc`/`cudaFree`; practice error checking. | Memory leaks if forgetting to free device pointers. | CUDA C Programming Guide, “Memory Allocation” |
| 9 | Memory Alignment & Coalescing | Benchmark coalesced vs. non-coalesced memory accesses in a kernel. | Non-contiguous access patterns hurt performance. | CUDA C Best Practices Guide, “Memory Coalescing” |
| 10 | Shared Memory Fundamentals | Implement tile-based matrix multiplication using shared memory (see the tiled sketch after this table). | Race conditions without thread sync. | CUDA C Programming Guide, “Shared Memory” |
| 11 | Thread Synchronization (`__syncthreads()`) | Extend tile-based multiplication with sync calls; measure performance. | Missing `__syncthreads()` => partial updates. | CUDA C Programming Guide, “Synchronization” |
| 12 | Bank Conflicts in Shared Memory | Test an access pattern that causes bank conflicts; measure performance impact. | Overlooking bank conflicts in shared memory accesses. | CUDA C Programming Guide, “Shared Memory Bank Conflicts” |
| 13 | Basic Atomic Operations | Use `atomicAdd` to sum an array in parallel. | Large-scale sums using atomics can degrade performance. | CUDA C Programming Guide, “Atomic Functions” |
| 14 | Progress Checkpoint | Quick recap or quiz: global vs. shared memory usage. | Not consolidating learning => repeated mistakes. | Review Days 1–13 notes; create a “lessons learned” list |
| 15 | Advanced Atomic Operations | Experiment with `atomicCAS`, `atomicExch`, etc. | Overusing atomics => performance bottlenecks. | CUDA C Programming Guide, “Atomic Functions” |
| 16 | Kernel Configuration Tuning | Adjust block sizes for the same kernel; observe occupancy changes. | Non-multiples of warp size can lower occupancy. | CUDA Occupancy Calculator (within Nsight Compute) |
| 17 | Host-Device Synchronization Patterns | Use `cudaDeviceSynchronize()` for timing; measure kernel durations. | Missing sync => partial results readback. | CUDA C Programming Guide, “Device Synchronization” |
| 18 | Error Handling & `cudaGetErrorString()` | Implement robust error checks after each CUDA call. | Ignoring errors => hard-to-trace bugs. | CUDA Error Handling |
| 19 | Unified Memory (UM) Intro | Use `cudaMallocManaged`; run simple vector addition. | Forgetting that UM still migrates data. | CUDA C Programming Guide, “Unified Memory” |
| 20 | Capstone Project #1 | Image Convolution Pipeline: 2D convolution (e.g., edge detection) on GPU. | | |
| 21 | Streams & Concurrency (Basics) | Launch two kernels in different streams; check overlap. | Default stream is blocking if concurrency not enabled. | CUDA C Programming Guide, “Streams” |
| 22 | Events & Timing | Use CUDA events for precise kernel timing. | Misplacing events => skewed time measurements. | CUDA C Programming Guide, “Events” |
| 23 | Asynchronous Memory Copy | Copy data using streams asynchronously. | Not using pinned host memory => limited async. | CUDA C Programming Guide, “Asynchronous Transfers” |
| 24 | Pinned (Page-Locked) Memory | Compare pinned vs. pageable host memory transfers. | Excessive pinned memory => system performance hit. | CUDA C Programming Guide, “Pinned Memory” |
| 25 | Double Buffering Technique | Implement a two-buffer pipeline to overlap compute and transfer. | Improper sync => reading incomplete data. | CUDA Concurrent Kernels Sample |
| 26 | Constant Memory | Use constant memory for read-only data (e.g., coefficients). | Exceeding the constant memory limit or forgetting read-only usage. | CUDA C Programming Guide, “Constant Memory” |
| 27 | Texture & Surface Memory (Intro) | Sample a small 2D texture; compare vs. global memory fetch. | Missing texture binding/unbinding steps. | CUDA C Programming Guide, “Texture Memory” |
| 28 | Progress Checkpoint | Recap concurrency & memory (short quiz or multi-topic mini-project). | Gaps in concurrency knowledge => suboptimal performance. | Review Days 15–27 notes |
| 29 | Texture Memory (Practical) | Implement an image-processing kernel (e.g., grayscale) using textures. | Incorrect normalized coordinates => invalid fetch. | CUDA C Programming Guide, “Texture Reference” |
| 30 | Surface Memory | Write operations using surfaces (e.g., output image buffer). | Coordinate boundary checks are critical. | CUDA C Programming Guide, “Surface Reference” |
| 31 | Unified Memory Deep Dive | Use `cudaMallocManaged` with multiple kernels; measure page-fault overhead. | Poor data locality => excessive page migration. | CUDA Unified Memory |
| 32 | Stream Sync & Dependencies | Enforce execution order with events or `cudaStreamWaitEvent()`. | Forgetting to record an event => faulty sync. | CUDA C Programming Guide, “Stream Synchronization” |
| 33 | Intro to CUDA Graphs | Convert a kernel sequence into a CUDA graph; measure performance. | Mistakes in graph capture => unexpected results. | CUDA C Programming Guide, “Graphs” |
| 34 | Nsight Systems / Nsight Compute | Profile a small app to find bottlenecks; read kernel timelines. | Misreading profiling data => false conclusions. | Nsight Systems / Nsight Compute |
| 35 | Occupancy & Launch Config Tuning | Use the Occupancy Calculator to refine block size for better SM use. | Overlooking shared memory => reduced occupancy. | CUDA C Best Practices Guide, “Occupancy” |
| 36 | Profiling & Bottleneck Analysis | Profile matrix multiplication or similar; identify memory vs. compute limits. | Only measuring compute => ignoring memory throughput (or vice versa). | Nsight Compute Metrics |
| 37 | Intro to Warp-Level Primitives | Use warp shuffle instructions for a small parallel reduction. | Divergence or incorrect shuffle masks can ruin results. | CUDA C Programming Guide, “Warp Intrinsics” |
| 38 | Warp Divergence | Write a kernel with branching; measure performance difference. | Excessive branching => warp serialization. | CUDA C Best Practices Guide, “Minimizing Divergence” |
| 39 | Dynamic Parallelism | Launch kernels from within a kernel to handle subdivided tasks. | Over-launching can deplete resources quickly. | CUDA C Programming Guide, “Dynamic Parallelism” |
| 40 | Capstone Project #2 | Sparse Matrix-Vector Multiplication for large sparse data sets. | | |
| 41 | Advanced Streams & Multi-Stream Concurrency | Launch multiple kernels in parallel using multiple streams. | Oversubscribing the GPU with too many streams. | CUDA C Programming Guide, “Streams” |
| 42 | Progress Checkpoint | Recap concurrency, warp ops, dynamic parallelism. | Missing these => issues in advanced topics later. | Short quiz or combined mini-project for Days 29–41 |
| 43 | Efficient Data Transfers & Zero-Copy | Map host memory into device space (zero-copy); measure overhead vs. pinned. | Improper use => performance losses. | CUDA C Programming Guide, “Zero-Copy” |
| 44 | Advanced Warp Intrinsics (Scan, etc.) | Implement a warp-wide prefix sum with `__shfl_down_sync`. | Misaligned warp sync => incorrect partial sums. | CUDA C Programming Guide, “Shuffle and Cooperative Groups” |
| 45 | Cooperative Groups (Intro) | Use cooperative groups for flexible synchronization within blocks or grids. | Not all GPUs support advanced groups fully. | CUDA C Programming Guide, “Cooperative Groups” |
| 46 | Peer-to-Peer Communication (Multi-GPU) | Enable P2P for direct data transfers (if you have multiple GPUs). | GPU compatibility issues block P2P. | CUDA C Programming Guide, “Peer-to-Peer Memory Access” |
| 47 | Intermediate Debugging & Profiling Tools | Use `cuda-gdb` or Nsight Eclipse for step-by-step debugging. | Missing sync can make debugging inconsistent. | CUDA Debugger Docs |
| 48 | Memory Footprint Optimization | Reduce shared memory or register usage; measure occupancy. | Overuse => occupancy drop. | CUDA C Best Practices Guide, “Memory & Register Optimization” |
| 49 | Thrust for High-Level Operations | Replace custom loops with Thrust transforms, sorts, reductions. | Excessive device-host transfers if not structured well. | Thrust Library |
| 50 | Intro to cuBLAS | Perform basic vector/matrix ops with cuBLAS; compare to custom kernels. | Type mismatches or library error handling. | cuBLAS Library User Guide |
| 51 | Intro to cuFFT | Implement a simple 1D FFT on the GPU; measure performance. | Incorrect sizing, padding => wrong results. | cuFFT Library User Guide |
| 52 | Code Optimization (Part 1) | Loop unrolling or register usage tweaks; measure improvements. | Over-optimizing => diminishing returns. | CUDA C Best Practices Guide, “Kernel Optimization” |
| 53 | Code Optimization (Part 2) | Analyze PTX; apply instruction-level optimizations. | PTX can vary across CUDA toolkit versions. | PTX Tools, `ptxas`, `cuobjdump` |
| 54 | Nsight Compute: Kernel Analysis | Examine occupancy, memory throughput, and instruction mix. | Ignoring memory metrics => incomplete optimization approach. | Nsight Compute Docs |
| 55 | Intro to Device Libraries (cuRAND, etc.) | Generate random numbers (cuRAND); run a Monte Carlo simulation. | Misusing seeds or distribution params. | cuRAND Library User Guide |
| 56 | Progress Checkpoint | Recap concurrency (multi-stream), libraries, optimization. | Fragmented knowledge => missed performance. | Summarize library-based vs. custom kernels |
| 57 | Robust Error Handling & Debugging | Expand error-checking macros; advanced debugging with `cuda-gdb`. | Release builds can hide errors if not checked. | CUDA Runtime Error Handling |
| 58 | Handling Large Data Sets | Chunk large arrays with streaming techniques. | Possible out-of-range indices in chunk loops. | CUDA C Best Practices Guide, “Large Data Handling” |
| 59 | MPS (Multi-Process Service) | Enable MPS for sharing the GPU among multiple processes (if supported). | Oversubscription => resource contention. | NVIDIA MPS Documentation |
| 60 | Capstone Project #3 | Multi-Stream Data Processing: overlap transfers & kernels for real-time feeds. | | |
| 61 | GPU-Accelerated Sorting | Use Thrust’s sort; compare vs. CPU for large data. | Memory bandwidth constraints. | Thrust Sort Examples |
| 62 | Stream Compaction & Parallel Patterns | Implement parallel compaction (remove zeros) via Thrust or custom kernels. | Prefix sums must be correct or compaction fails. | Thrust, Parallel Patterns |
| 63 | Concurrency Patterns (Producer-Consumer) | Pipeline kernels: one generating data, one consuming it. | Missing sync can drop data. | CUDA C Programming Guide, “Streams & Concurrency” |
| 64 | Pinned + Unified Memory Hybrid | Use pinned memory for input streaming, unified memory for intermediate results. | Over-complicating memory usage => confusion. | Unified vs. Pinned Memory Best Practices |
| 65 | Collaborative Grouping Techniques | Use cooperative groups for advanced reductions. | Not all GPUs support grid-level groups. | CUDA C Programming Guide, “Cooperative Groups” |
| 66 | Peer-to-Peer (P2P) & Multi-GPU Scaling | Split data across multiple GPUs if available. | Mismatched driver/device IDs block P2P. | Multi-GPU Programming |
| 67 | GPU-Accelerated Graph Analytics (Intro) | Simple BFS or PageRank with adjacency lists on the GPU. | Random memory access => poor coalescing. | NVIDIA Developer Blog (search for graph analytics) |
| 68 | Memory Pool & Custom Allocators | Reuse device memory with a custom allocator to reduce `cudaMalloc` overhead. | Not thread-safe under dynamic parallelism. | `cudaMallocAsync` Docs |
| 69 | Occupancy-Based Tuning for Large Problems | Maximize occupancy on a large matrix multiplication. | Register usage can limit occupancy. | Nsight Compute Occupancy Analysis |
| 70 | Progress Checkpoint | Recap concurrency patterns, advanced memory, multi-GPU. | Small concurrency oversights => big issues. | Review key lessons from Days 57–69 |
| 71 | Advanced Streams & Overlapping | Overlap multiple kernels, data transfers, and CPU tasks. | Missing sync => race conditions. | Nsight Systems Docs |
| 72 | CUDA Graphs: Complex Workflows | Merge dependent kernels & copies into one CUDA graph. | Missing finalization => incomplete runs. | CUDA C Programming Guide, “Graph Best Practices” |
| 73 | Dynamic Graph Launches | Build and launch graphs at runtime based on conditions. | Too many small graphs => overhead. | Nsight Documentation |
| 74 | Multi-GPU Programming (Deeper Exploration) | Distribute the workload across two GPUs if hardware supports it. | Load imbalance if data not split evenly. | Multi-GPU Programming |
| 75 | Performance Metrics & Roofline Analysis | Collect memory throughput and FLOPS; chart them on a roofline. | Misreading memory/compute-bound => wrong optimization focus. | NVIDIA Roofline Analysis Blog |
| 76 | Mixed Precision & Tensor Cores (If Supported) | Implement half-precision (FP16) matrix multiply on Tensor Cores. | Precision trade-offs can break accuracy. | CUDA C Programming Guide, “Tensor Core Programming” |
| 77 | UM Advanced Topics (Prefetch, Advise) | Use `cudaMemAdvise`; prefetch data to specific devices. | Incorrect prefetch sizes => performance drop. | CUDA Unified Memory |
| 78 | Large-Scale Projects: Modular Kernel Design | Split large kernels into smaller, manageable modules. | Over-fragmentation => overhead from multiple launches. | CUDA C Best Practices Guide, “Modular Kernel Design” |
| 79 | Portability & Scalability Best Practices | Adjust code for various GPU architectures (SM versions). | Hardcoding for a single GPU => poor portability. | CUDA Toolkit Release Notes |
| 80 | Capstone Project #4 | Multi-GPU Matrix Multiply: split a large matrix across 2 GPUs. | | |
| 81 | Cooperative Groups: Advanced Patterns | Try a grid-level cooperative kernel needing all blocks to sync. | Not all GPUs support grid-level sync. | CUDA C Programming Guide, “Cooperative Launch” |
| 82 | Large-Scale Batch Processing | Use batched operations (cuBLAS batched GEMM) for efficiency. | Wrong batch sizes => idle SMs. | cuBLAS Documentation |
| 83 | External Libraries (cuDNN, etc.) | Integrate a small NN layer using cuDNN if possible. | Version mismatches => runtime errors. | cuDNN Library User Guide |
| 84 | Progress Checkpoint | Reflect on concurrency, multi-GPU, libraries. | Missing synergy => suboptimal solutions. | Short recap or quiz for Days 71–83 |
| 85 | Instruction Throughput Profiling | Use Nsight Compute to track instruction throughput for tight kernels. | Ignoring special function units or double-precision differences. | Nsight Compute Docs |
| 86 | Occupancy vs. ILP | Compare the effects of occupancy vs. ILP (instruction-level parallelism). | Forcing ILP might drop occupancy. | CUDA C Best Practices Guide |
| 87 | Custom Memory Allocators | Extend your memory pool design with stream-ordered allocations. | Race conditions if multiple streams allocate at once. | `cudaMallocAsync` Docs |
| 88 | Kernel Fusion & Loop Fusion | Merge multiple small kernels into a single kernel to reduce launch overhead. | Oversized kernels can exceed register/shared limits. | CUDA C Best Practices Guide, “Kernel Fusion” |
| 89 | Algorithmic Optimizations (Tiling, Blocking) | Refine tiling or blocking for matrix multiply, convolution, etc. | Over-tiling => overhead > gains. | GPU Gems, HPC papers on tiling strategies |
| 90 | Minimizing Data Transfers | Use pinned memory, async transfers, or kernel-side generation to limit PCIe overhead. | Large single transfers might block concurrency. | CUDA C Best Practices Guide, “Data Transfer” |
| 91 | Enterprise-Level Code Structure | Explore a multi-file, multi-module approach with separate compilation. | Linker errors if not using correct NVCC flags. | CUDA C Programming Guide, “Separate Compilation and Linking” |
| 92 | Advanced Debugging (Races & Deadlocks) | Diagnose a race or deadlock in a complex multi-stream or multi-block scenario. | Relying on `__syncthreads()` incorrectly => deadlock. | CUDA Debugger Docs |
| 93 | Real-Time GPU Computing Techniques | If real-time constraints exist, explore low-latency execution patterns. | GPU scheduling unpredictability in a shared environment. | NVIDIA Documentation on Real-Time GPU |
| 94 | Host Multithreading + GPU Coordination | Use multiple CPU threads to launch kernels and manage streams concurrently. | CPU thread contention can throttle the GPU. | CUDA C Best Practices Guide, “Multi-threaded Host Code” |
| 95 | CUDA Graph Updates & Reusability | Dynamically update parts of a CUDA graph without a full rebuild. | Mishandling graph node handles => crashes. | CUDA Graphs, advanced topics |
| 96 | Precision & Numerical Stability | Examine rounding, float vs. double, iterative error accumulation. | Small errors compound over iterations. | Kahan Summation, HPC references |
| 97 | Advanced P2P & Clustering (If Possible) | Use GPU-GPU RDMA or multi-node scaling in a cluster environment. | Not all hardware supports GPUDirect RDMA. | GPUDirect RDMA Docs |
| 98 | Progress Checkpoint | Recap advanced debugging, multi-threaded host, graphs, precision. | Missing these => major issues in large apps. | Summarize Days 85–97 |
| 99 | Graph API for Complex DAG Workloads | Build a multi-kernel DAG with conditional branches/loops using CUDA Graphs. | Overcomplicated graph => overhead. | CUDA Graph Examples |
| 100 | Capstone Project #5 | CUDA Graph-Optimized Workload: merge multiple kernels + copies into one graph. | | |
| 101 | GPU-Accelerated ML Frameworks (Intro) | If possible, integrate a custom kernel/layer into TensorFlow or PyTorch. | Version mismatches => build errors. | PyTorch CUDA Extensions, TensorFlow Custom Ops |
| 102 | CUDA + Other Parallel Frameworks | Explore hybrid CPU/GPU parallelism (OpenMP, MPI). | Partitioning complexities, oversubscription risk. | MPI + CUDA Resources, HPC references |
| 103 | Tuning GPU-Accelerated ML Ops | Profile a small neural net or inference pipeline; identify GPU hotspots. | Overlooking CPU-GPU sync or data pipeline bottlenecks. | Nsight Systems for ML Frameworks |
| 104 | Multi-GPU Scaling in ML | Distribute training across multiple GPUs or use a data-parallel approach. | Communication overhead if poorly balanced. | NCCL Library, multi-GPU references |
| 105 | HPC: Memory Throughput & Computation | Review HPC patterns (PDE solvers, climate modeling) for GPU acceleration. | Memory-bound kernels for large stencils. | AMGX, HPC domain-specific papers |
| 106 | HPC: Precision & Mixed Precision | Use half or custom data types for HPC kernels if feasible. | Must confirm numerical stability. | HPC papers on mixed precision, e.g., SC conference proceedings |
| 107 | Advanced Debugging Tools (cuda-memcheck, etc.) | Use `cuda-memcheck` for memory leak/race detection in a bigger scenario. | Errors may appear asynchronously. | cuda-memcheck Docs |
| 108 | Graphics Interop (OpenGL/DX) | If relevant, share buffers between CUDA and graphics APIs. | Handling the context incorrectly => errors. | CUDA Graphics Interop |
| 109 | Large-Scale Code, Maintainability | Organize your code into modules/libraries; consider CMake for builds. | Linker errors if ignoring separate compilation steps. | Separate Compilation Guide |
| 110 | HPC Tools & Libraries (MAGMA, etc.) | Try MAGMA for advanced linear algebra on the GPU. | Library/hardware version mismatches. | MAGMA |
| 111 | Testing & Validation Strategies | Implement unit tests for GPU kernels using CPU reference checks. | Race conditions might appear only at high thread counts. | Google Test examples with CUDA |
| 112 | Progress Checkpoint | Reflect on HPC/ML techniques, debugging, multi-GPU scaling. | Missing HPC/ML concepts => domain inefficiency. | Recap Days 99–111; compile best practices |
| 113 | Revisiting Key Optimizations | Identify the top 3 bottlenecks in your main code; systematically address them. | Missing “low-hanging fruit” => suboptimal gains. | Re-profile with Nsight; confirm improvements |
| 114 | GPU Scheduling & CUcontext Exploration | Investigate multiple contexts/users sharing GPU resources. | Misusing contexts => driver crashes. | CUDA Driver API, “Context Management” |
| 115 | Final Performance Tweaks & Fine-Tuning | Adjust the L1/shared memory config if your GPU allows; fine-tune block dimensions. | Over-optimizing => not portable across GPU models. | Architecture-specific best practices (e.g., Turing, Ampere) |
| 116 | Memory Hierarchy Mastery | Create a reference diagram of global, shared, local, constant, texture, etc. | Overlooking local memory spills into global. | CUDA C Programming Guide, memory hierarchy sections |
| 117 | Detailed Profiling Recap | Re-profile older mini-projects; apply new knowledge for more gains. | Inconsistent metrics => unclear improvements. | Nsight Systems, Nsight Compute |
| 118 | Review of Common Pitfalls | Make a checklist of frequent issues: out-of-bounds, race conditions, divergence, etc. | Forgetting final checks => final project problems. | Personal “CUDA Pitfalls & Solutions” doc |
| 119 | Prep for Final Capstone | Check environment, references, library versions; plan scope carefully. | Over-scoping => incomplete final project. | Double-check docs & notes |
| 120 | Capstone Project #6 | Final Project: End-to-End HPC or ML Application. | | |
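
For reference, here is a minimal, self-contained sketch of the kind of code the early exercises aim for: an element-wise vector addition (Day 6) with per-call error checking (Day 18) and event-based timing (Day 22). The `CUDA_CHECK` macro name and the array size are illustrative choices, not something the plan prescribes.

```cpp
// Sketch: error-checked vector addition timed with CUDA events (Days 6, 18, 22).
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // bounds check avoids out-of-range writes
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes), *hC = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;
    CUDA_CHECK(cudaMalloc(&dA, bytes));
    CUDA_CHECK(cudaMalloc(&dB, bytes));
    CUDA_CHECK(cudaMalloc(&dC, bytes));
    CUDA_CHECK(cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    int block = 256, grid = (n + block - 1) / block;
    CUDA_CHECK(cudaEventRecord(start));
    vecAdd<<<grid, block>>>(dA, dB, dC, n);
    CUDA_CHECK(cudaGetLastError());            // catches launch-configuration errors
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
    CUDA_CHECK(cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost));
    printf("vecAdd took %.3f ms, c[0] = %.1f\n", ms, hC[0]);

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(dA)); CUDA_CHECK(cudaFree(dB)); CUDA_CHECK(cudaFree(dC));
    free(hA); free(hB); free(hC);
    return 0;
}
```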

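Likewise, a minimal sketch of the shared-memory tiling exercise from Days 10–11 (revisited on Days 35 and 69). The tile width of 16 and the square, row-major matrices are assumptions made for brevity, not requirements of the plan.

```cpp
// Sketch: tiled matrix multiplication with shared memory and __syncthreads() (Days 10-11).
#include <cuda_runtime.h>

#define TILE 16   // arbitrary tile width chosen for illustration

__global__ void matMulTiled(const float* A, const float* B, float* C, int n) {
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;

        // Stage one tile of A and B into shared memory (guard the matrix edges).
        sA[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        sB[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();                       // tile fully loaded before use

        for (int k = 0; k < TILE; ++k)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();                       // done with this tile before reloading
    }

    if (row < n && col < n)
        C[row * n + col] = acc;
}

// Example host-side launch:
//   dim3 block(TILE, TILE);
//   dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
//   matMulTiled<<<grid, block>>>(dA, dB, dC, n);
```
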
Capstone Projects Summary

| Capstone Project & Description | Core Skills | Implementation & Optimization Tips |
| --- | --- | --- |
| 1 (Day 20) Image Convolution Pipeline: implement a 2D convolution (edge detection) on the GPU. | Grid/block config<br>Shared memory tiling<br>Basic concurrency | Optimize shared memory usage<br>Use events for performance measurement<br>Validate vs. CPU reference |
| 2 (Day 40) Sparse Matrix-Vector Multiplication: handle large sparse matrices; measure speed-up vs. CPU. | Memory coalescing<br>Atomic add for scatter<br>Stream concurrency | Use CSR format<br>Handle boundary conditions carefully<br>Profile with Nsight for memory/compute bottlenecks |
| 3 (Day 60) Multi-Stream Data Processing: overlap data transfers & kernel execution for real-time feeds (see the sketch after this table). | Stream usage<br>Async transfers<br>Concurrency patterns | Pin host memory for async<br>Keep multiple streams busy<br>Monitor concurrency in Nsight |
| 4 (Day 80) Multi-GPU Matrix Multiply: split a large matrix across two GPUs if possible. | Multi-GPU partitioning<br>Peer-to-peer (if hardware allows)<br>Advanced sync | Ensure load balance<br>Use P2P for direct memory copying<br>Measure scaling efficiency |
| 5 (Day 100) CUDA Graph-Optimized Workload: convert a pipeline of kernels & copies into a single CUDA graph. | CUDA Graph creation<br>Graph instantiation & reuse<br>Profiling the graph approach | Identify common subgraphs<br>Update or re-launch partial graphs if the workflow changes<br>Compare timeline vs. standard streams |
| 6 (Day 120) Final Project: end-to-end HPC or ML application (e.g., PDE solver or small model training) on the GPU. | Concurrency, memory mgmt, multi-GPU or libraries, debugging, final optimizations | Plan data management<br>Thorough accuracy validation<br>Use Nsight, cuda-gdb, timing events<br>Aim for stable, well-commented, high-performance code |
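
As a starting point for Capstone #3 (and Days 21–25), here is a minimal sketch of overlapping asynchronous copies and kernel work using two streams and pinned host memory. Error checking is omitted for brevity, and the chunk size and `scaleKernel` are illustrative placeholders rather than part of the project specification. Profiled in Nsight Systems, the copies and kernels issued to the two streams should overlap once the host buffer is pinned.

```cpp
// Sketch: two-stream double buffering with pinned host memory (Days 21-25, Capstone #3).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scaleKernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int CHUNK = 1 << 20, N_CHUNKS = 8;
    const size_t chunkBytes = CHUNK * sizeof(float);

    float* hData;                                   // pinned host buffer: required for truly async copies
    cudaMallocHost(&hData, N_CHUNKS * chunkBytes);
    for (int i = 0; i < N_CHUNKS * CHUNK; ++i) hData[i] = 1.0f;

    float* dBuf[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) {
        cudaMalloc(&dBuf[s], chunkBytes);
        cudaStreamCreate(&stream[s]);
    }

    int block = 256, grid = (CHUNK + block - 1) / block;
    for (int c = 0; c < N_CHUNKS; ++c) {
        int s = c % 2;                              // alternate streams (double buffering)
        float* hChunk = hData + (size_t)c * CHUNK;
        cudaMemcpyAsync(dBuf[s], hChunk, chunkBytes, cudaMemcpyHostToDevice, stream[s]);
        scaleKernel<<<grid, block, 0, stream[s]>>>(dBuf[s], CHUNK, 2.0f);
        cudaMemcpyAsync(hChunk, dBuf[s], chunkBytes, cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();                        // wait for all chunks before using results

    printf("first element after scaling: %.1f\n", hData[0]);

    for (int s = 0; s < 2; ++s) { cudaFree(dBuf[s]); cudaStreamDestroy(stream[s]); }
    cudaFreeHost(hData);
    return 0;
}
```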

Additional Resources

  1. Official Documentation

  2. Keep Track of Your Environment

    • Note your CUDA version, driver version, and GPU architecture (compute capability).
    • Certain features (e.g., cudaMallocAsync, cooperative groups) require specific toolkit versions or newer GPUs.
  3. Profile Early and Often

  4. Maintain a Personal Cookbook

    • Collect error-handling macros, common patterns (e.g., parallel reductions, prefix sums, warp shuffles).
    • Keep a record of lessons learned, pitfalls, and solutions (a starter sketch appears at the end of this section).
  5. Test Frequently

    • Write small reference tests or CPU comparisons to validate correctness.
    • For performance improvements, record baseline metrics before changes and compare after.
  6. Explore Community & Research
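
As an example of what the “personal cookbook” and testing advice above might contain, here is a minimal sketch of a warp-shuffle reduction (Days 37 and 44) validated against a CPU reference (Day 111). Names such as `reduceSum` and the pass/fail tolerance are illustrative assumptions, not part of the plan.

```cpp
// Sketch: warp-shuffle block reduction validated against a CPU reference.
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

__inline__ __device__ float warpReduceSum(float val) {
    // Each step halves the number of lanes contributing a partial sum.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

__global__ void reduceSum(const float* in, float* out, int n) {
    __shared__ float warpSums[32];                  // one slot per warp in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (i < n) ? in[i] : 0.0f;

    val = warpReduceSum(val);
    if ((threadIdx.x & 31) == 0) warpSums[threadIdx.x >> 5] = val;
    __syncthreads();

    if (threadIdx.x < 32) {                         // first warp reduces the per-warp sums
        int nWarps = (blockDim.x + 31) >> 5;
        val = (threadIdx.x < nWarps) ? warpSums[threadIdx.x] : 0.0f;
        val = warpReduceSum(val);
        if (threadIdx.x == 0) atomicAdd(out, val);  // one atomic per block keeps contention low
    }
}

int main() {
    const int n = 1 << 20;
    float* hIn = new float[n];
    double cpuSum = 0.0;                            // CPU reference accumulated in double
    for (int i = 0; i < n; ++i) { hIn[i] = 0.001f * (i % 7); cpuSum += hIn[i]; }

    float *dIn, *dOut, gpuSum = 0.0f;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, sizeof(float));

    int block = 256, grid = (n + block - 1) / block;
    reduceSum<<<grid, block>>>(dIn, dOut, n);
    cudaMemcpy(&gpuSum, dOut, sizeof(float), cudaMemcpyDeviceToHost);

    // Compare with a relative tolerance; float summation order differs on the GPU,
    // so small differences are expected.
    double relErr = fabs(gpuSum - cpuSum) / fabs(cpuSum);
    printf("cpu=%.3f gpu=%.3f relErr=%.2e -> %s\n",
           cpuSum, (double)gpuSum, relErr, relErr < 1e-3 ? "PASS" : "FAIL");

    cudaFree(dIn); cudaFree(dOut); delete[] hIn;
    return 0;
}
```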


Happy coding, and best of luck on your 120-Day CUDA Journey!

Repository Activity

For the latest repository activity, see Repository Activity Log.

Daily Updates
