Cuda-120-Days-Challenge

A 120-day CUDA learning plan covering daily concepts, exercises, pitfalls, and references (including “Programming Massively Parallel Processors”). Features six capstone projects to solidify GPU parallel programming, memory management, and performance optimization skills.

120-Day CUDA Curriculum

A structured, day-by-day plan to master NVIDIA CUDA programming over 120 days. Each day includes:

  • Core Topic
  • Practical Exercise / Mini-Project
  • Debugging Pitfalls
  • Resource Suggestions

Six Capstone Projects are scheduled for Days 20, 40, 60, 80, 100, and 120 to synthesize the skills acquired along the way.


Day-by-Day CUDA Curriculum

| Day | Core Topic | Practical Exercise / Mini-Project | Debugging Pitfalls | Resource Suggestions |
| --- | --- | --- | --- | --- |
| 1 | [Introduction to GPU Computing & CUDA](daily-updates/day-01-intro-cuda.md) | Write a simple “Hello GPU” kernel that prints a message from one thread. | Missing `<cuda_runtime.h>` or incorrect compiler flags. | CUDA C Programming Guide, Introduction |
| 2 | Setting Up the Development Environment | Install CUDA Toolkit & drivers; compile and run sample codes. | Incompatible driver/toolkit versions. | Official “Getting Started” Guide |
| 3 | GPU vs. CPU Architecture Foundations | Compare GPU SMs and CPU cores; discuss throughput vs. latency. | Mixing up CPU and GPU roles can lead to design inefficiencies. | CUDA C Programming Guide, “Hardware Model Overview” |
| 4 | Thread Hierarchy: Grids & Blocks | Launch a kernel using different grid/block dimensions; experiment with thread indexing. | Off-by-one errors in thread indexing. | CUDA C Programming Guide, “Thread Hierarchy” |
| 5 | Thread Hierarchy: Warps (Intro) | Inspect warp size and how threads are grouped (no direct warp programming yet). | Divergence issues in the same warp. | CUDA C Best Practices Guide, “Warps and SIMT Model” |
| 6 | Basic Kernel Launch & Execution | Write a kernel for element-wise addition on a small array (see the first sketch after this table). | Mixing up `cudaMemcpyHostToDevice` vs. `cudaMemcpyDeviceToHost`. | CUDA Samples, Vector Addition |
| 7 | Memory Model (Global Memory Basics) | Transfer data to global memory, run a simple computation, transfer results back. | Mismatched data sizes or forgetting to free GPU memory. | CUDA C Programming Guide, “Global Memory Access” |
| 8 | Memory Allocation & Pointers | Use `cudaMalloc`/`cudaFree`; practice error checking. | Memory leaks if forgetting to free device pointers. | CUDA C Programming Guide, “Memory Allocation” |
| 9 | Memory Alignment & Coalescing | Benchmark coalesced vs. non-coalesced memory accesses in a kernel. | Non-contiguous access patterns hurt performance. | CUDA C Best Practices Guide, “Memory Coalescing” |
| 10 | Shared Memory Fundamentals | Implement tile-based matrix multiplication using shared memory (see the tiled sketch after this table). | Race conditions without thread sync. | CUDA C Programming Guide, “Shared Memory” |
| 11 | Thread Synchronization (`__syncthreads()`) | Extend tile-based multiplication with sync calls; measure performance. | Missing `__syncthreads()` => partial updates. | CUDA C Programming Guide, “Synchronization” |
| 12 | Bank Conflicts in Shared Memory | Test an access pattern that causes bank conflicts; measure performance impact. | Overlooking bank conflicts in shared memory accesses. | CUDA C Programming Guide, “Shared Memory Bank Conflicts” |
| 13 | Basic Atomic Operations | Use `atomicAdd` to sum an array in parallel. | Large-scale sums using atomics can degrade performance. | CUDA C Programming Guide, “Atomic Functions” |
| 14 | Progress Checkpoint | Quick recap or quiz: global vs. shared memory usage. | Not consolidating learning => repeated mistakes. | Review Days 1–13 notes; create a “lessons learned” list |
| 15 | Advanced Atomic Operations | Experiment with `atomicCAS`, `atomicExch`, etc. | Overusing atomics => performance bottlenecks. | CUDA C Programming Guide, “Atomic Functions” |
| 16 | Kernel Configuration Tuning | Adjust block sizes for the same kernel; observe occupancy changes. | Non-multiples of warp size can lower occupancy. | CUDA Occupancy Calculator (within Nsight Compute) |
| 17 | Host-Device Synchronization Patterns | Use `cudaDeviceSynchronize()` for timing; measure kernel durations. | Missing sync => partial results readback. | CUDA C Programming Guide, “Device Synchronization” |
| 18 | Error Handling & `cudaGetErrorString()` | Implement robust error checks after each CUDA call. | Ignoring errors => hard-to-trace bugs. | CUDA Error Handling |
| 19 | Unified Memory (UM) Intro | Use `cudaMallocManaged`; run simple vector addition. | Forgetting that UM still migrates data. | CUDA C Programming Guide, “Unified Memory” |
| 20 | Capstone Project #1 | Image Convolution Pipeline: 2D convolution (e.g., edge detection) on GPU. | | |
| 21 | Streams & Concurrency (Basics) | Launch two kernels in different streams; check overlap. | Default stream is blocking if concurrency not enabled. | CUDA C Programming Guide, “Streams” |
| 22 | Events & Timing | Use CUDA events for precise kernel timing. | Misplacing events => skewed time measurements. | CUDA C Programming Guide, “Events” |
| 23 | Asynchronous Memory Copy | Copy data using streams asynchronously. | Not using pinned host memory => limited async. | CUDA C Programming Guide, “Asynchronous Transfers” |
| 24 | Pinned (Page-Locked) Memory | Compare pinned vs. pageable host memory transfers. | Excessive pinned memory => system performance hit. | CUDA C Programming Guide, “Pinned Memory” |
| 25 | Double Buffering Technique | Implement a two-buffer pipeline to overlap compute and transfer. | Improper sync => reading incomplete data. | CUDA Concurrent Kernels Sample |
| 26 | Constant Memory | Use constant memory for read-only data (e.g., coefficients). | Exceeding the constant memory limit or forgetting read-only usage. | CUDA C Programming Guide, “Constant Memory” |
| 27 | Texture & Surface Memory (Intro) | Sample a small 2D texture; compare vs. global memory fetch. | Missing texture binding/unbinding steps. | CUDA C Programming Guide, “Texture Memory” |
| 28 | Progress Checkpoint | Recap concurrency & memory (short quiz or multi-topic mini-project). | Gaps in concurrency knowledge => suboptimal performance. | Review Days 15–27 notes |
| 29 | Texture Memory (Practical) | Implement an image-processing kernel (e.g., grayscale) using textures. | Incorrect normalized coordinates => invalid fetch. | CUDA C Programming Guide, “Texture Reference” |
| 30 | Surface Memory | Write operations using surfaces (e.g., output image buffer). | Coordinate boundary checks are critical. | CUDA C Programming Guide, “Surface Reference” |
| 31 | Unified Memory Deep Dive | Use `cudaMallocManaged` with multiple kernels; measure page-fault overhead. | Poor data locality => excessive page migration. | CUDA Unified Memory |
| 32 | Stream Sync & Dependencies | Enforce execution order with events or `cudaStreamWaitEvent()`. | Forgetting to record an event => faulty sync. | CUDA C Programming Guide, “Stream Synchronization” |
| 33 | Intro to CUDA Graphs | Convert a kernel sequence into a CUDA graph; measure performance. | Mistakes in graph capture => unexpected results. | CUDA C Programming Guide, “Graphs” |
| 34 | Nsight Systems / Nsight Compute | Profile a small app to find bottlenecks; read kernel timelines. | Misreading profiling data => false conclusions. | Nsight Systems / Nsight Compute |
| 35 | Occupancy & Launch Config Tuning | Use the Occupancy Calculator to refine block size for better SM use. | Overlooking shared memory => reduced occupancy. | CUDA C Best Practices Guide, “Occupancy” |
| 36 | Profiling & Bottleneck Analysis | Profile matrix multiplication or similar; identify memory vs. compute limits. | Only measuring compute => ignoring memory throughput (or vice versa). | Nsight Compute Metrics |
| 37 | Intro to Warp-Level Primitives | Use warp shuffle instructions for a small parallel reduction. | Divergence or incorrect shuffle masks can ruin results. | CUDA C Programming Guide, “Warp Intrinsics” |
| 38 | Warp Divergence | Write a kernel with branching; measure performance difference. | Excessive branching => warp serialization. | CUDA C Best Practices Guide, “Minimizing Divergence” |
| 39 | Dynamic Parallelism | Launch kernels from within a kernel to handle subdivided tasks. | Over-launching can deplete resources quickly. | CUDA C Programming Guide, “Dynamic Parallelism” |
| 40 | Capstone Project #2 | Sparse Matrix-Vector Multiplication for large sparse data sets. | | |
| 41 | Advanced Streams & Multi-Stream Concurrency | Launch multiple kernels in parallel using multiple streams. | Oversubscribing the GPU with too many streams. | CUDA C Programming Guide, “Streams” |
| 42 | Progress Checkpoint | Recap concurrency, warp ops, dynamic parallelism. | Missing these => issues in advanced topics later. | Short quiz or combined mini-project for Days 29–41 |
| 43 | Efficient Data Transfers & Zero-Copy | Map host memory into device space (zero-copy); measure overhead vs. pinned. | Improper use => performance losses. | CUDA C Programming Guide, “Zero-Copy” |
| 44 | Advanced Warp Intrinsics (Scan, etc.) | Implement a warp-wide prefix sum with `__shfl_down_sync`. | Misaligned warp sync => incorrect partial sums. | CUDA C Programming Guide, “Shuffle and Cooperative Groups” |
| 45 | Cooperative Groups (Intro) | Use cooperative groups for flexible synchronization within blocks or grids. | Not all GPUs support advanced groups fully. | CUDA C Programming Guide, “Cooperative Groups” |
| 46 | Peer-to-Peer Communication (Multi-GPU) | Enable P2P for direct data transfers (if you have multiple GPUs). | GPU compatibility issues block P2P. | CUDA C Programming Guide, “Peer-to-Peer Memory Access” |
| 47 | Intermediate Debugging & Profiling Tools | Use `cuda-gdb` or Nsight Eclipse for step-by-step debugging. | Missing sync can make debugging inconsistent. | CUDA Debugger Docs |
| 48 | Memory Footprint Optimization | Reduce shared memory or register usage; measure occupancy. | Overuse => occupancy drop. | CUDA C Best Practices Guide, “Memory & Register Optimization” |
| 49 | Thrust for High-Level Operations | Replace custom loops with Thrust transforms, sorts, reductions. | Excessive device-host transfers if not structured well. | Thrust Library |
| 50 | Intro to cuBLAS | Perform basic vector/matrix ops with cuBLAS; compare to custom kernels. | Type mismatches or library error handling. | cuBLAS Library User Guide |
| 51 | Intro to cuFFT | Implement a simple 1D FFT on the GPU; measure performance. | Incorrect sizing, padding => wrong results. | cuFFT Library User Guide |
| 52 | Code Optimization (Part 1) | Loop unrolling or register usage tweaks; measure improvements. | Over-optimizing => diminishing returns. | CUDA C Best Practices Guide, “Kernel Optimization” |
| 53 | Code Optimization (Part 2) | Analyze PTX; apply instruction-level optimizations. | PTX can vary across CUDA toolkit versions. | PTX Tools, `ptxas`, `cuobjdump` |
| 54 | Nsight Compute: Kernel Analysis | Examine occupancy, memory throughput, and instruction mix. | Ignoring memory metrics => incomplete optimization approach. | Nsight Compute Docs |
| 55 | Intro to Device Libraries (cuRAND, etc.) | Generate random numbers (cuRAND); run a Monte Carlo simulation. | Misusing seeds or distribution params. | cuRAND Library User Guide |
| 56 | Progress Checkpoint | Recap concurrency (multi-stream), libraries, optimization. | Fragmented knowledge => missed performance. | Summarize library-based vs. custom kernels |
| 57 | Robust Error Handling & Debugging | Expand error-checking macros; advanced debugging with `cuda-gdb`. | Release builds can hide errors if not checked. | CUDA Runtime Error Handling |
| 58 | Handling Large Data Sets | Chunk large arrays with streaming techniques. | Possible out-of-range indices in chunk loops. | CUDA C Best Practices Guide, “Large Data Handling” |
| 59 | MPS (Multi-Process Service) | Enable MPS for sharing the GPU among multiple processes (if supported). | Oversubscription => resource contention. | NVIDIA MPS Documentation |
| 60 | Capstone Project #3 | Multi-Stream Data Processing: overlap transfers & kernels for real-time feeds. | | |
| 61 | GPU-Accelerated Sorting | Use Thrust’s sort; compare vs. CPU for large data. | Memory bandwidth constraints. | Thrust Sort Examples |
| 62 | Stream Compaction & Parallel Patterns | Implement parallel compaction (remove zeros) via Thrust or custom kernels. | Prefix sums must be correct or compaction fails. | Thrust, Parallel Patterns |
| 63 | Concurrency Patterns (Producer-Consumer) | Pipeline kernels: one generating data, one consuming it. | Missing sync can drop data. | CUDA C Programming Guide, “Streams & Concurrency” |
| 64 | Pinned + Unified Memory Hybrid | Use pinned memory for input streaming, unified memory for intermediate results. | Over-complicating memory usage => confusion. | Unified vs. Pinned Memory Best Practices |
| 65 | Collaborative Grouping Techniques | Use cooperative groups for advanced reductions. | Not all GPUs support grid-level groups. | CUDA C Programming Guide, “Cooperative Groups” |
| 66 | Peer-to-Peer (P2P) & Multi-GPU Scaling | Split data across multiple GPUs if available. | Mismatched driver/device IDs block P2P. | Multi-GPU Programming |
| 67 | GPU-Accelerated Graph Analytics (Intro) | Simple BFS or PageRank with adjacency lists on the GPU. | Random memory access => poor coalescing. | NVIDIA Developer Blog (search for graph analytics) |
| 68 | Memory Pool & Custom Allocators | Reuse device memory with a custom allocator to reduce `cudaMalloc` overhead. | Not thread-safe under dynamic parallelism. | `cudaMallocAsync` Docs |
| 69 | Occupancy-Based Tuning for Large Problems | Maximize occupancy on a large matrix multiplication. | Register usage can limit occupancy. | Nsight Compute Occupancy Analysis |
| 70 | Progress Checkpoint | Recap concurrency patterns, advanced memory, multi-GPU. | Small concurrency oversights => big issues. | Review key lessons from Days 57–69 |
| 71 | Advanced Streams & Overlapping | Overlap multiple kernels, data transfers, and CPU tasks. | Missing sync => race conditions. | Nsight Systems Docs |
| 72 | CUDA Graphs: Complex Workflows | Merge dependent kernels & copies into one CUDA graph. | Missing finalization => incomplete runs. | CUDA C Programming Guide, “Graph Best Practices” |
| 73 | Dynamic Graph Launches | Build and launch graphs at runtime based on conditions. | Too many small graphs => overhead. | Nsight Documentation |
| 74 | Multi-GPU Programming (Deeper Exploration) | Distribute the workload across two GPUs if hardware supports it. | Load imbalance if data not split evenly. | Multi-GPU Programming |
| 75 | Performance Metrics & Roofline Analysis | Collect memory throughput and FLOPS; chart them on a roofline. | Misreading memory/compute-bound => wrong optimization focus. | NVIDIA Roofline Analysis Blog |
| 76 | Mixed Precision & Tensor Cores (If Supported) | Implement half-precision (FP16) matrix multiply on Tensor Cores. | Precision trade-offs can break accuracy. | CUDA C Programming Guide, “Tensor Core Programming” |
| 77 | UM Advanced Topics (Prefetch, Advise) | Use `cudaMemAdvise`; prefetch data to specific devices. | Incorrect prefetch sizes => performance drop. | CUDA Unified Memory |
| 78 | Large-Scale Projects: Modular Kernel Design | Split large kernels into smaller, manageable modules. | Over-fragmentation => overhead from multiple launches. | CUDA C Best Practices Guide, “Modular Kernel Design” |
| 79 | Portability & Scalability Best Practices | Adjust code for various GPU architectures (SM versions). | Hardcoding for a single GPU => poor portability. | CUDA Toolkit Release Notes |
| 80 | Capstone Project #4 | Multi-GPU Matrix Multiply: split a large matrix across 2 GPUs. | | |
| 81 | Cooperative Groups: Advanced Patterns | Try a grid-level cooperative kernel needing all blocks to sync. | Not all GPUs support grid-level sync. | CUDA C Programming Guide, “Cooperative Launch” |
| 82 | Large-Scale Batch Processing | Use batched operations (cuBLAS batched GEMM) for efficiency. | Wrong batch sizes => idle SMs. | cuBLAS Documentation |
| 83 | External Libraries (cuDNN, etc.) | Integrate a small NN layer using cuDNN if possible. | Version mismatches => runtime errors. | cuDNN Library User Guide |
| 84 | Progress Checkpoint | Reflect on concurrency, multi-GPU, libraries. | Missing synergy => suboptimal solutions. | Short recap or quiz for Days 71–83 |
| 85 | Instruction Throughput Profiling | Use Nsight Compute to track instruction throughput for tight kernels. | Ignoring special function units or double-precision differences. | Nsight Compute Docs |
| 86 | Occupancy vs. ILP | Compare the effects of occupancy vs. ILP (instruction-level parallelism). | Forcing ILP might drop occupancy. | CUDA C Best Practices Guide |
| 87 | Custom Memory Allocators | Extend your memory pool design with stream-ordered allocations. | Race conditions if multiple streams allocate at once. | `cudaMallocAsync` Docs |
| 88 | Kernel Fusion & Loop Fusion | Merge multiple small kernels into a single kernel to reduce launch overhead. | Oversized kernels can exceed register/shared limits. | CUDA C Best Practices Guide, “Kernel Fusion” |
| 89 | Algorithmic Optimizations (Tiling, Blocking) | Refine tiling or blocking for matrix multiply, convolution, etc. | Over-tiling => overhead > gains. | GPU Gems, HPC papers on tiling strategies |
| 90 | Minimizing Data Transfers | Use pinned memory, async transfers, or kernel-side generation to limit PCIe overhead. | Large single transfers might block concurrency. | CUDA C Best Practices Guide, “Data Transfer” |
| 91 | Enterprise-Level Code Structure | Explore a multi-file, multi-module approach with separate compilation. | Linker errors if not using correct NVCC flags. | CUDA C Programming Guide, “Separate Compilation and Linking” |
| 92 | Advanced Debugging (Races & Deadlocks) | Diagnose a race or deadlock in a complex multi-stream or multi-block scenario. | Relying on `__syncthreads()` incorrectly => deadlock. | CUDA Debugger Docs |
| 93 | Real-Time GPU Computing Techniques | If real-time constraints exist, explore low-latency execution patterns. | GPU scheduling unpredictability in a shared environment. | NVIDIA Documentation on Real-Time GPU |
| 94 | Host Multithreading + GPU Coordination | Use multiple CPU threads to launch kernels and manage streams concurrently. | CPU thread contention can throttle the GPU. | CUDA C Best Practices Guide, “Multi-threaded Host Code” |
| 95 | CUDA Graph Updates & Reusability | Dynamically update parts of a CUDA graph without a full rebuild. | Mishandling graph node handles => crashes. | CUDA Graphs, advanced topics |
| 96 | Precision & Numerical Stability | Examine rounding, float vs. double, iterative error accumulation. | Small errors compound over iterations. | Kahan Summation, HPC references |
| 97 | Advanced P2P & Clustering (If Possible) | Use GPU-GPU RDMA or multi-node scaling in a cluster environment. | Not all hardware supports GPUDirect RDMA. | GPUDirect RDMA Docs |
| 98 | Progress Checkpoint | Recap advanced debugging, multi-threaded host, graphs, precision. | Missing these => major issues in large apps. | Summarize Days 85–97 |
| 99 | Graph API for Complex DAG Workloads | Build a multi-kernel DAG with conditional branches/loops using CUDA Graphs. | Overcomplicated graph => overhead. | CUDA Graph Examples |
| 100 | Capstone Project #5 | CUDA Graph-Optimized Workload: merge multiple kernels + copies into one graph. | | |
| 101 | GPU-Accelerated ML Frameworks (Intro) | If possible, integrate a custom kernel/layer into TensorFlow or PyTorch. | Version mismatches => build errors. | PyTorch CUDA Extensions, TensorFlow Custom Ops |
| 102 | CUDA + Other Parallel Frameworks | Explore hybrid CPU/GPU parallelism (OpenMP, MPI). | Partitioning complexities, oversubscription risk. | MPI + CUDA Resources, HPC references |
| 103 | Tuning GPU-Accelerated ML Ops | Profile a small neural net or inference pipeline; identify GPU hotspots. | Overlooking CPU-GPU sync or data pipeline bottlenecks. | Nsight Systems for ML Frameworks |
| 104 | Multi-GPU Scaling in ML | Distribute training across multiple GPUs or use a data-parallel approach. | Communication overhead if poorly balanced. | NCCL Library, multi-GPU references |
| 105 | HPC: Memory Throughput & Computation | Review HPC patterns (PDE solvers, climate modeling) for GPU acceleration. | Memory-bound kernels for large stencils. | AMGX, HPC domain-specific papers |
| 106 | HPC: Precision & Mixed Precision | Use half or custom data types for HPC kernels if feasible. | Must confirm numerical stability. | HPC papers on mixed precision, e.g., SC conference proceedings |
| 107 | Advanced Debugging Tools (cuda-memcheck, etc.) | Use `cuda-memcheck` for memory leak/race detection in a bigger scenario. | Errors may appear asynchronously. | cuda-memcheck Docs |
| 108 | Graphics Interop (OpenGL/DX) | If relevant, share buffers between CUDA and graphics APIs. | Handling the context incorrectly => errors. | CUDA Graphics Interop |
| 109 | Large-Scale Code, Maintainability | Organize your code into modules/libraries; consider CMake for builds. | Linker errors if ignoring separate compilation steps. | Separate Compilation Guide |
| 110 | HPC Tools & Libraries (MAGMA, etc.) | Try MAGMA for advanced linear algebra on the GPU. | Library/hardware version mismatches. | MAGMA |
| 111 | Testing & Validation Strategies | Implement unit tests for GPU kernels using CPU reference checks. | Race conditions might appear only at high thread counts. | Google Test examples with CUDA |
| 112 | Progress Checkpoint | Reflect on HPC/ML techniques, debugging, multi-GPU scaling. | Missing HPC/ML concepts => domain inefficiency. | Recap Days 99–111; compile best practices |
| 113 | Revisiting Key Optimizations | Identify the top 3 bottlenecks in your main code; systematically address them. | Missing “low-hanging fruit” => suboptimal gains. | Re-profile with Nsight; confirm improvements |
| 114 | GPU Scheduling & CUcontext Exploration | Investigate multiple contexts/users sharing GPU resources. | Misusing contexts => driver crashes. | CUDA Driver API, “Context Management” |
| 115 | Final Performance Tweaks & Fine-Tuning | Adjust the L1/shared memory config if your GPU allows; fine-tune block dimensions. | Over-optimizing => not portable across GPU models. | Architecture-specific best practices (e.g., Turing, Ampere) |
| 116 | Memory Hierarchy Mastery | Create a reference diagram of global, shared, local, constant, texture, etc. | Overlooking local memory spills into global. | CUDA C Programming Guide, memory hierarchy sections |
| 117 | Detailed Profiling Recap | Re-profile older mini-projects; apply new knowledge for more gains. | Inconsistent metrics => unclear improvements. | Nsight Systems, Nsight Compute |
| 118 | Review of Common Pitfalls | Make a checklist of frequent issues: out-of-bounds, race conditions, divergence, etc. | Forgetting final checks => final project problems. | Personal “CUDA Pitfalls & Solutions” doc |
| 119 | Prep for Final Capstone | Check environment, references, library versions; plan scope carefully. | Over-scoping => incomplete final project. | Double-check docs & notes |
| 120 | Capstone Project #6 | Final Project: End-to-End HPC or ML Application. | | |
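
For reference, here is a minimal, self-contained sketch of the kind of code the early exercises aim for: an element-wise vector addition (Day 6) with per-call error checking (Day 18) and event-based timing (Day 22). The `CUDA_CHECK` macro name and the array size are illustrative choices, not something the plan prescribes.

```cpp
// Sketch: error-checked vector addition timed with CUDA events (Days 6, 18, 22).
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // bounds check avoids out-of-range writes
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes), *hC = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;
    CUDA_CHECK(cudaMalloc(&dA, bytes));
    CUDA_CHECK(cudaMalloc(&dB, bytes));
    CUDA_CHECK(cudaMalloc(&dC, bytes));
    CUDA_CHECK(cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    int block = 256, grid = (n + block - 1) / block;
    CUDA_CHECK(cudaEventRecord(start));
    vecAdd<<<grid, block>>>(dA, dB, dC, n);
    CUDA_CHECK(cudaGetLastError());            // catches launch-configuration errors
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
    CUDA_CHECK(cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost));
    printf("vecAdd took %.3f ms, c[0] = %.1f\n", ms, hC[0]);

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(dA)); CUDA_CHECK(cudaFree(dB)); CUDA_CHECK(cudaFree(dC));
    free(hA); free(hB); free(hC);
    return 0;
}
```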

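Likewise, a minimal sketch of the shared-memory tiling exercise from Days 10–11 (revisited on Days 35 and 69). The tile width of 16 and the square, row-major matrices are assumptions made for brevity, not requirements of the plan.

```cpp
// Sketch: tiled matrix multiplication with shared memory and __syncthreads() (Days 10-11).
#include <cuda_runtime.h>

#define TILE 16   // arbitrary tile width chosen for illustration

__global__ void matMulTiled(const float* A, const float* B, float* C, int n) {
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;

        // Stage one tile of A and B into shared memory (guard the matrix edges).
        sA[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        sB[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();                       // tile fully loaded before use

        for (int k = 0; k < TILE; ++k)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();                       // done with this tile before reloading
    }

    if (row < n && col < n)
        C[row * n + col] = acc;
}

// Example host-side launch:
//   dim3 block(TILE, TILE);
//   dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
//   matMulTiled<<<grid, block>>>(dA, dB, dC, n);
```
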
Capstone Projects Summary

| Capstone Project & Description | Core Skills | Implementation & Optimization Tips |
| --- | --- | --- |
| 1 (Day 20) Image Convolution Pipeline: implement a 2D convolution (edge detection) on the GPU. | Grid/block config<br>Shared memory tiling<br>Basic concurrency | Optimize shared memory usage<br>Use events for performance measurement<br>Validate vs. CPU reference |
| 2 (Day 40) Sparse Matrix-Vector Multiplication: handle large sparse matrices; measure speed-up vs. CPU. | Memory coalescing<br>Atomic add for scatter<br>Stream concurrency | Use CSR format<br>Handle boundary conditions carefully<br>Profile with Nsight for memory/compute bottlenecks |
| 3 (Day 60) Multi-Stream Data Processing: overlap data transfers & kernel execution for real-time feeds (see the sketch after this table). | Stream usage<br>Async transfers<br>Concurrency patterns | Pin host memory for async<br>Keep multiple streams busy<br>Monitor concurrency in Nsight |
| 4 (Day 80) Multi-GPU Matrix Multiply: split a large matrix across two GPUs if possible. | Multi-GPU partitioning<br>Peer-to-peer (if hardware allows)<br>Advanced sync | Ensure load balance<br>Use P2P for direct memory copying<br>Measure scaling efficiency |
| 5 (Day 100) CUDA Graph-Optimized Workload: convert a pipeline of kernels & copies into a single CUDA graph. | CUDA Graph creation<br>Graph instantiation & reuse<br>Profiling the graph approach | Identify common subgraphs<br>Update or re-launch partial graphs if the workflow changes<br>Compare timeline vs. standard streams |
| 6 (Day 120) Final Project: end-to-end HPC or ML application (e.g., PDE solver or small model training) on the GPU. | Concurrency, memory mgmt, multi-GPU or libraries, debugging, final optimizations | Plan data management<br>Thorough accuracy validation<br>Use Nsight, cuda-gdb, timing events<br>Aim for stable, well-commented, high-performance code |
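
As a starting point for Capstone #3 (and Days 21–25), here is a minimal sketch of overlapping asynchronous copies and kernel work using two streams and pinned host memory. Error checking is omitted for brevity, and the chunk size and `scaleKernel` are illustrative placeholders rather than part of the project specification. Profiled in Nsight Systems, the copies and kernels issued to the two streams should overlap once the host buffer is pinned.

```cpp
// Sketch: two-stream double buffering with pinned host memory (Days 21-25, Capstone #3).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scaleKernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int CHUNK = 1 << 20, N_CHUNKS = 8;
    const size_t chunkBytes = CHUNK * sizeof(float);

    float* hData;                                   // pinned host buffer: required for truly async copies
    cudaMallocHost(&hData, N_CHUNKS * chunkBytes);
    for (int i = 0; i < N_CHUNKS * CHUNK; ++i) hData[i] = 1.0f;

    float* dBuf[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) {
        cudaMalloc(&dBuf[s], chunkBytes);
        cudaStreamCreate(&stream[s]);
    }

    int block = 256, grid = (CHUNK + block - 1) / block;
    for (int c = 0; c < N_CHUNKS; ++c) {
        int s = c % 2;                              // alternate streams (double buffering)
        float* hChunk = hData + (size_t)c * CHUNK;
        cudaMemcpyAsync(dBuf[s], hChunk, chunkBytes, cudaMemcpyHostToDevice, stream[s]);
        scaleKernel<<<grid, block, 0, stream[s]>>>(dBuf[s], CHUNK, 2.0f);
        cudaMemcpyAsync(hChunk, dBuf[s], chunkBytes, cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();                        // wait for all chunks before using results

    printf("first element after scaling: %.1f\n", hData[0]);

    for (int s = 0; s < 2; ++s) { cudaFree(dBuf[s]); cudaStreamDestroy(stream[s]); }
    cudaFreeHost(hData);
    return 0;
}
```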

Additional Resources

  1. Official Documentation

  2. Keep Track of Your Environment

    • Note your CUDA version, driver version, and GPU architecture (compute capability).
    • Certain features (e.g., cudaMallocAsync, cooperative groups) require specific toolkit versions or newer GPUs.
  3. Profile Early and Often

  4. Maintain a Personal Cookbook

    • Collect error-handling macros, common patterns (e.g., parallel reductions, prefix sums, warp shuffles).
    • Keep a record of lessons learned, pitfalls, and solutions (a starter sketch appears at the end of this section).
  5. Test Frequently

    • Write small reference tests or CPU comparisons to validate correctness.
    • For performance improvements, record baseline metrics before changes and compare after.
  6. Explore Community & Research
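
As an example of what the “personal cookbook” and testing advice above might contain, here is a minimal sketch of a warp-shuffle reduction (Days 37 and 44) validated against a CPU reference (Day 111). Names such as `reduceSum` and the pass/fail tolerance are illustrative assumptions, not part of the plan.

```cpp
// Sketch: warp-shuffle block reduction validated against a CPU reference.
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

__inline__ __device__ float warpReduceSum(float val) {
    // Each step halves the number of lanes contributing a partial sum.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

__global__ void reduceSum(const float* in, float* out, int n) {
    __shared__ float warpSums[32];                  // one slot per warp in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (i < n) ? in[i] : 0.0f;

    val = warpReduceSum(val);
    if ((threadIdx.x & 31) == 0) warpSums[threadIdx.x >> 5] = val;
    __syncthreads();

    if (threadIdx.x < 32) {                         // first warp reduces the per-warp sums
        int nWarps = (blockDim.x + 31) >> 5;
        val = (threadIdx.x < nWarps) ? warpSums[threadIdx.x] : 0.0f;
        val = warpReduceSum(val);
        if (threadIdx.x == 0) atomicAdd(out, val);  // one atomic per block keeps contention low
    }
}

int main() {
    const int n = 1 << 20;
    float* hIn = new float[n];
    double cpuSum = 0.0;                            // CPU reference accumulated in double
    for (int i = 0; i < n; ++i) { hIn[i] = 0.001f * (i % 7); cpuSum += hIn[i]; }

    float *dIn, *dOut, gpuSum = 0.0f;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, sizeof(float));

    int block = 256, grid = (n + block - 1) / block;
    reduceSum<<<grid, block>>>(dIn, dOut, n);
    cudaMemcpy(&gpuSum, dOut, sizeof(float), cudaMemcpyDeviceToHost);

    // Compare with a relative tolerance; float summation order differs on the GPU,
    // so small differences are expected.
    double relErr = fabs(gpuSum - cpuSum) / fabs(cpuSum);
    printf("cpu=%.3f gpu=%.3f relErr=%.2e -> %s\n",
           cpuSum, (double)gpuSum, relErr, relErr < 1e-3 ? "PASS" : "FAIL");

    cudaFree(dIn); cudaFree(dOut); delete[] hIn;
    return 0;
}
```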


Happy coding, and best of luck on your 120-Day CUDA Journey!

Repository Activity

For the latest repository activity, see Repository Activity Log.

Daily Updates
