Welcome to the CUDA Kernels repository! This project is a comprehensive collection of CUDA implementations ranging from fundamental concepts to advanced mathematical operations. It is designed for both beginners starting their CUDA journey and professionals looking for reference implementations.
The repository is organized into modules of increasing complexity:
- 01_kernel_basics: Introduction to writing your first CUDA kernel.
- 02_grid_block: Understanding the Grid-Block-Thread hierarchy.
- 03_hardware: Querying GPU device properties and capabilities.
- 01_vector_ops: Standard vector operations (Add, Sub, Mul, Dot) demonstrating global memory usage.
- 02_vector_dot: Dot product implementation using atomic operations.
- 03_constant_memory: Optimization using Constant Memory for read-only data.
- 04_unified_memory: Simplifies memory management using `cudaMallocManaged` and `cudaMemPrefetchAsync`.
- 01_matrix_vector_ops: High-performance Matrix-Vector multiplication (Standard, Banded, Symmetric, Triangular) and Rank-1/Rank-2 updates. Includes CPU verification.
- 02_fft: Fast Fourier Transform implementations (Radix-2 and Stockham algorithms).
- 01_tiled_matmul: Tiled Matrix-Matrix multiplication using Shared Memory, often considered the "Holy Grail" of CUDA optimizations.
- 02_reduction: Highly optimized parallel reduction (Sum) using Warp Shuffle instructions.
- 01_streams: Demonstrates maximizing GPU throughput by overlapping Compute with Memory Transfers using CUDA Streams.
- 01_histogram: Optimized frequency counting using Privatized Shared Memory Atomics to reduce Global Memory contention.
- 01_cublas: Industry-standard Matrix Multiplication using NVIDIA's hand-tuned `cuBLAS` library.
- 02_thrust: High-level C++ template library ("STL for CUDA") for Sorting and Reducing without writing kernels.
- 03_curand: Parallel Random Number Generation (Monte Carlo Pi Estimation).
- 04_cusparse: Sparse Matrix-Vector multiplication using Compressed Sparse Row (CSR) format.
- 05_cusolver: Dense Cholesky Decomposition ($A = L L^T$).
- 06_nvtx: Profiling range markers for Nsight Systems.
- 07_dynamic_parallelism: Child kernel launches from the GPU (CDP).
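
As a rough illustration of the shared-memory tiling idea behind `01_tiled_matmul`, here is a minimal sketch. It is not the repository's kernel: the names `TILE`, `tiledMatMul`, and `matMulOnGpu` are hypothetical, the matrices are assumed to be square and row-major, and CUDA error checking is left out for brevity.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;  // hypothetical tile width, not taken from the repository

// Computes C = A * B for square n x n row-major matrices.
__global__ void tiledMatMul(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Each thread stages one element of A and one of B; out-of-range slots become 0.
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // the tile must be fully loaded before any thread reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // all reads must finish before the next iteration overwrites the tile
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host wrapper; CUDA error checking is omitted for brevity.
std::vector<float> matMulOnGpu(const std::vector<float>& A,
                               const std::vector<float>& B, int n) {
    assert(n > 0 && A.size() == size_t(n) * n && B.size() == size_t(n) * n);
    size_t bytes = size_t(n) * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    tiledMatMul<<<grid, block>>>(dA, dB, dC, n);

    std::vector<float> C(size_t(n) * n);
    // A blocking copy on the default stream also waits for the kernel to finish.
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Each element of A and B is read from global memory once per tile rather than once per multiply-add, which is where the bandwidth saving comes from. Calling `matMulOnGpu(A, B, n)` from a `main` function and comparing against a CPU triple loop is one way to try it out.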
- NVIDIA GPU: Compute Capability 5.0 or higher recommended.
- CUDA Toolkit: Version 10.0 or higher.
- Compiler: `nvcc` (bundled with CUDA Toolkit).
- Build Tool: `make` (or `nmake` on Windows, though headers are set up for typical `make`).
Each module contains a Makefile. To build a specific module, navigate to its directory and run make.
Example: Running the Matrix-Vector Operations Demo

    cd modules/03_advanced_math/01_matrix_vector_ops
    make
    ./mv_app.exe

For a detailed learning path, check out A_BEGINNERS_GUIDE.md.
For a guide on the broader ecosystem (cuBLAS, Thrust, TensorRT, etc.), read CUDA_ECOSYSTEM_GUIDE.md.
All modules are self-contained and include verification mechanisms (comparing GPU results against CPU reference implementations) to ensure correctness.
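
The verification pattern described above (GPU result compared against a CPU reference) might look roughly like the following sketch. The function names `vecAdd`, `vecAddCpu`, `vecAddGpu`, and `allClose`, and the tolerance value, are assumptions made for illustration and are not taken from the modules themselves. CUDA error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cmath>
#include <cuda_runtime.h>
#include <vector>

// One thread per element; the bounds check covers the partially filled last block.
__global__ void vecAdd(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

// CPU reference implementation used as ground truth.
std::vector<float> vecAddCpu(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    for (size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
    return out;
}

// GPU path: copy inputs in, launch, copy the result back.
std::vector<float> vecAddGpu(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    int n = static_cast<int>(a.size());
    std::vector<float> out(a.size());
    if (n == 0) return out;

    size_t bytes = a.size() * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dOut = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    vecAdd<<<(n + threads - 1) / threads, threads>>>(dA, dB, dOut, n);

    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dOut);
    return out;
}

// Element-wise comparison with a mixed absolute/relative tolerance (assumed value).
bool allClose(const std::vector<float>& got, const std::vector<float>& want,
              float tol = 1e-5f) {
    if (got.size() != want.size()) return false;
    for (size_t i = 0; i < got.size(); ++i)
        if (std::fabs(got[i] - want[i]) > tol * (1.0f + std::fabs(want[i]))) return false;
    return true;
}
```

A tolerance-based comparison is used instead of exact equality because more involved kernels (reductions, matrix products) can accumulate floating-point sums in a different order on the GPU than on the CPU.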
To verify the entire repository at once, run the included PowerShell script:

    ./scripts/verify_all.ps1

Contributions are welcome! Please ensure code is formatted and includes verification logic.
MIT License
- GPU Architecture & Execution Model
  - Concepts: Grid/Block/Thread hierarchy, SIMT (Single Instruction, Multiple Threads) architecture, Warp divergence, and SM (Streaming Multiprocessor) occupancy.
  - Demonstrated in: `modules/01_fundamentals`
- CUDA Memory Hierarchy & Management
  - Concepts: Global, Shared, Constant, and Unified Memory (`cudaMallocManaged`). Understanding memory coalescing and minimizing global memory latency.
  - Demonstrated in: `modules/02_memory_management`, `modules/04_optimizations`
- Advanced Parallel Algorithms & Optimization Patterns
  - Tiled Matrix Multiplication: Utilizing Shared Memory to drastically reduce global memory bandwidth requirements (often considered the "Holy Grail" of CUDA optimization).
  - Parallel Reduction: Highly optimized sum reductions using low-level Warp Shuffle (`__shfl_down_sync`) instructions.
  - Histograms & Atomics: Privatized shared memory atomics to reduce global memory contention.
  - Demonstrated in: `modules/04_optimizations`, `modules/06_advanced_algorithms`
- Concurrency & Stream Management
  - Concepts: Overlapping kernel execution with asynchronous memory transfers (`cudaMemcpyAsync`) using CUDA Streams to maximize total GPU utilization.
  - Demonstrated in: `modules/05_concurrency`
- NVIDIA Ecosystem Integration
  - Concepts: Utilizing industry-standard, highly-tuned libraries instead of reinventing the wheel for production systems.
  - Demonstrated in: `modules/07_ecosystem` (cuBLAS, cuSPARSE, cuSOLVER, cuRAND, Thrust)
- Profiling, Debugging & Verification
  - Concepts: NVTX (NVIDIA Tools Extension) markers for performance profiling in Nsight Systems. Establishing reliable CPU reference implementations for numerical validation.
  - Demonstrated in: `modules/07_ecosystem/06_nvtx`, `scripts/verify_all.ps1`
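
To make the Parallel Reduction item above more concrete, here is a minimal sketch of a sum reduction built on `__shfl_down_sync`. It is an illustration under assumptions, not the code in the repository: `warpReduceSum`, `reduceSum`, and `sumOnGpu` are hypothetical names, the block size is assumed to be a multiple of the warp size, and per-warp partial sums are combined with a simple `atomicAdd` rather than a second shared-memory stage. CUDA error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Sums a value across the 32 lanes of a warp; lane 0 ends up with the total.
__inline__ __device__ float warpReduceSum(float v) {
    for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

__global__ void reduceSum(const float* in, float* out, int n) {
    float v = 0.0f;
    // Grid-stride loop: every thread accumulates a private partial sum.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        v += in[i];

    // All threads reach this point, so the full-warp mask is valid
    // as long as blockDim.x is a multiple of warpSize.
    v = warpReduceSum(v);
    if ((threadIdx.x & (warpSize - 1)) == 0) atomicAdd(out, v);
}

// Hypothetical host wrapper returning the sum of `data`.
float sumOnGpu(const std::vector<float>& data) {
    assert(data.size() <= (size_t(1) << 30));  // keep the element count within int range
    int n = static_cast<int>(data.size());
    if (n == 0) return 0.0f;

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, sizeof(float));  // all-zero bytes represent 0.0f

    const int threads = 256;  // multiple of the warp size
    int blocks = (n + threads - 1) / threads;
    if (blocks > 1024) blocks = 1024;  // the grid-stride loop covers the remainder
    reduceSum<<<blocks, threads>>>(dIn, dOut, n);

    float result = 0.0f;
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Because floating-point addition is not associative and the order of the `atomicAdd` calls is not fixed, the low-order bits of the result can vary between runs for general inputs, so a comparison against a CPU reference should use a tolerance.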
