CUDA Kernels Repository

Welcome to the CUDA Kernels repository! This project is a comprehensive collection of CUDA implementations ranging from fundamental concepts to advanced mathematical operations. It is designed for both beginners starting their CUDA journey and professionals looking for reference implementations.

📂 Project Structure

The repository is organized into modules of increasing complexity:

1. Fundamentals (`modules/01_fundamentals`)

- `01_kernel_basics`: Introduction to writing your first CUDA kernel.
- `02_grid_block`: Understanding the Grid-Block-Thread hierarchy.
- `03_hardware`: Querying GPU device properties and capabilities.
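
As a minimal sketch of the ideas in this module (not the repository's actual code), the kernel below has every thread compute its global index from `blockIdx`, `blockDim`, and `threadIdx`, and a host helper prints a few device properties. The names `write_global_index`, `run_index_demo`, and `print_device_summary` are hypothetical, most error checking is omitted for brevity, and `n` is assumed to be positive.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread writes its own global index into out[], showing how
// blockIdx, blockDim, and threadIdx combine into a unique 1D index.
__global__ void write_global_index(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the last block may be only partially used
        out[i] = i;
    }
}

// Hypothetical host helper: launches the kernel and copies the result back.
std::vector<int> run_index_demo(int n, int threads_per_block = 128) {
    std::vector<int> host(n, -1);
    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));
    // Ceiling division so that blocks * threads_per_block >= n.
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    write_global_index<<<blocks, threads_per_block>>>(dev, n);
    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}

// Hypothetical host helper: prints a few properties of device 0.
void print_device_summary() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("No CUDA device found\n");
        return;
    }
    std::printf("%s: compute capability %d.%d, %d SMs, warp size %d\n",
                prop.name, prop.major, prop.minor,
                prop.multiProcessorCount, prop.warpSize);
}
```

Calling `run_index_demo(1000)` should return a vector in which element `i` holds `i`, which confirms that the launch covered every index exactly once.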

2. Memory Management (`modules/02_memory_management`)

- `01_vector_ops`: Standard vector operations (Add, Sub, Mul, Dot) demonstrating global memory usage.
- `02_vector_dot`: Dot product implementation using atomic operations.
- `03_constant_memory`: Optimization using Constant Memory for read-only data.
- `04_unified_memory`: Simplified memory management using `cudaMallocManaged` and `cudaMemPrefetchAsync`.
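
To illustrate two of the entries above together, here is a hedged sketch of a dot product that combines per-thread partial sums with `atomicAdd` and allocates its buffers with `cudaMallocManaged`. It is not taken from the modules: `dot_atomic` and `dot_managed` are hypothetical names, the input values are made up, and the `cudaMemPrefetchAsync` hint is deliberately left out because its signature differs between toolkit versions.

```cuda
#include <cuda_runtime.h>

// Each thread accumulates a private partial sum over a grid-stride loop,
// then adds it to the single global result with one atomicAdd.
__global__ void dot_atomic(const float *a, const float *b, float *result, int n) {
    float partial = 0.0f;
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        partial += a[i] * b[i];
    }
    atomicAdd(result, partial);
}

// Hypothetical host helper: fills a with 1.0 and b with 2.0 in managed
// memory, runs the kernel, and returns the dot product.
float dot_managed(int n) {
    float *a = nullptr, *b = nullptr, *result = nullptr;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&result, sizeof(float));

    // Managed pointers are written directly from the host; no cudaMemcpy.
    for (int i = 0; i < n; ++i) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }
    *result = 0.0f;

    dot_atomic<<<64, 256>>>(a, b, result, n);
    // The host must wait for the kernel before reading managed memory.
    cudaDeviceSynchronize();

    float value = *result;
    cudaFree(a);
    cudaFree(b);
    cudaFree(result);
    return value;
}
```

One atomic per thread keeps the kernel short but still serializes on a single address; reducing within a block or warp first would lower that contention.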

3. Advanced Math (`modules/03_advanced_math`)

- `01_matrix_vector_ops`: High-performance Matrix-Vector multiplication (Standard, Banded, Symmetric, Triangular) and Rank-1/Rank-2 updates. Includes CPU verification.
- `02_fft`: Fast Fourier Transform implementations (Radix-2 and Stockham algorithms).

4. Optimizations (`modules/04_optimizations`)

- `01_tiled_matmul`: Tiled Matrix-Matrix multiplication that stages sub-blocks of the inputs in Shared Memory to cut Global Memory traffic. Often called the "Holy Grail" of CUDA optimizations.
- `02_reduction`: Parallel reduction (Sum) optimized with Warp Shuffle instructions.
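
The tiling idea can be sketched as follows. This is an illustrative version under simple assumptions (square row-major matrices, a fixed 16x16 tile, no error checking), not the module's implementation; `TILE`, `matmul_tiled`, and `matmul_on_gpu` are hypothetical names.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // assumed tile width; one thread per tile element

// C = A * B for row-major n x n matrices. Each block computes one
// TILE x TILE tile of C, loading matching tiles of A and B into shared
// memory so every global element is read once per tile, not once per thread.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n) {
    __shared__ float tileA[TILE][TILE];
    __shared__ float tileB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        // Out-of-range elements are padded with zero so edge tiles work.
        tileA[threadIdx.y][threadIdx.x] =
            (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] =
            (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // both tiles fully loaded before use

        for (int k = 0; k < TILE; ++k) {
            acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }
        __syncthreads();  // finish reading before the next load overwrites
    }
    if (row < n && col < n) {
        C[row * n + col] = acc;
    }
}

// Hypothetical host helper: copies inputs over, launches, copies C back.
std::vector<float> matmul_on_gpu(const std::vector<float> &A,
                                 const std::vector<float> &B, int n) {
    size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, n);

    std::vector<float> C(static_cast<size_t>(n) * n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

With this layout each element of A and B is fetched from global memory roughly `n / TILE` times instead of `n` times, which is where the bandwidth saving comes from.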

5. Concurrency (`modules/05_concurrency`)

- `01_streams`: Demonstrates maximizing GPU throughput by overlapping Compute with Memory Transfers using CUDA Streams.
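
A rough sketch of the overlap pattern, assuming the work can be split into independent chunks: each chunk gets its own stream, and the copy-in, kernel, and copy-out for that chunk are queued on it. The names `scale` and `run_streamed_scale` are hypothetical, error checking is omitted, and whether transfers actually overlap with compute depends on the device.

```cuda
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// Multiplies every element by factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical host helper: doubles n floats in chunks, one stream per
// chunk, and returns true if every result matches the expected value.
bool run_streamed_scale(int n, int num_streams = 4) {
    float *host = nullptr, *dev = nullptr;
    // Pinned host memory is needed for copies to be truly asynchronous.
    cudaMallocHost(&host, n * sizeof(float));
    cudaMalloc(&dev, n * sizeof(float));
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    std::vector<cudaStream_t> streams(num_streams);
    for (auto &s : streams) cudaStreamCreate(&s);

    int chunk = (n + num_streams - 1) / num_streams;
    for (int s = 0; s < num_streams; ++s) {
        int offset = s * chunk;
        int count = std::min(chunk, n - offset);
        if (count <= 0) break;
        size_t bytes = count * sizeof(float);
        // Work queued on one stream runs in order; different streams may overlap.
        cudaMemcpyAsync(dev + offset, host + offset, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(count + 255) / 256, 256, 0, streams[s]>>>(dev + offset, count, 2.0f);
        cudaMemcpyAsync(host + offset, dev + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();  // wait for all streams to drain

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        ok = ok && (host[i] == 2.0f * static_cast<float>(i));
    }
    for (auto &s : streams) cudaStreamDestroy(s);
    cudaFree(dev);
    cudaFreeHost(host);
    return ok;
}
```

A timeline view in Nsight Systems is the usual way to confirm that the copies and kernels from different streams really run concurrently.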

6. Advanced Algorithms (`modules/06_advanced_algorithms`)

- `01_histogram`: Optimized frequency counting using Privatized Shared Memory Atomics to reduce Global Memory contention.
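
Privatization can be sketched like this: each block counts into its own shared-memory copy of the histogram and merges it into the global bins once at the end. The 256-bin byte histogram, the launch configuration, and the names `NUM_BINS`, `histogram_privatized`, and `histogram_on_gpu` are assumptions made for the example, not details from the module.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int NUM_BINS = 256;  // assumed: one bin per possible byte value

__global__ void histogram_privatized(const unsigned char *data, int n,
                                     unsigned int *bins) {
    // Private per-block histogram in shared memory.
    __shared__ unsigned int local[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        local[b] = 0;
    }
    __syncthreads();

    // Shared-memory atomics only contend within the block.
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        atomicAdd(&local[data[i]], 1u);
    }
    __syncthreads();

    // One global atomic per non-empty bin per block, instead of one per element.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        if (local[b] != 0) {
            atomicAdd(&bins[b], local[b]);
        }
    }
}

// Hypothetical host helper: returns the 256-bin histogram of input.
std::vector<unsigned int> histogram_on_gpu(const std::vector<unsigned char> &input) {
    int n = static_cast<int>(input.size());
    unsigned char *d_data = nullptr;
    unsigned int *d_bins = nullptr;
    cudaMalloc(&d_data, input.size());
    cudaMalloc(&d_bins, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, input.data(), input.size(), cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, NUM_BINS * sizeof(unsigned int));

    histogram_privatized<<<32, 256>>>(d_data, n, d_bins);

    std::vector<unsigned int> bins(NUM_BINS);
    cudaMemcpy(bins.data(), d_bins, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_bins);
    return bins;
}
```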

7. The Ecosystem (`modules/07_ecosystem`)

- `01_cublas`: Industry-standard Matrix Multiplication using NVIDIA's hand-tuned cuBLAS library.
- `02_thrust`: High-level C++ template library ("STL for CUDA") for Sorting and Reducing without writing kernels.
- `03_curand`: Parallel Random Number Generation (Monte Carlo Pi Estimation).
- `04_cusparse`: Sparse Matrix-Vector multiplication using Compressed Sparse Row (CSR) format.
- `05_cusolver`: Dense Cholesky Decomposition ($A = L L^T$).
- `06_nvtx`: Profiling range markers for Nsight Systems.
- `07_dynamic_parallelism`: Child kernel launches from the GPU (CDP).
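
Of the libraries listed above, Thrust is the quickest to show in a few lines. The sketch below sorts and sums a vector on the GPU without any hand-written kernel; `SortedSum` and `sort_and_sum` are hypothetical names chosen for this example.

```cuda
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>

// Hypothetical result type for the example.
struct SortedSum {
    std::vector<int> sorted;
    int sum;
};

// Sorts the input on the GPU and also returns the sum of its elements.
SortedSum sort_and_sum(const std::vector<int> &input) {
    // Constructing a device_vector from host iterators copies to the GPU.
    thrust::device_vector<int> d(input.begin(), input.end());

    thrust::sort(d.begin(), d.end());                 // ascending, in place
    int sum = thrust::reduce(d.begin(), d.end(), 0);  // 0 is the initial value

    std::vector<int> sorted(d.size());
    thrust::copy(d.begin(), d.end(), sorted.begin()); // device to host
    return {sorted, sum};
}
```

For example, `sort_and_sum({5, 3, 9, 1})` should yield the sorted sequence 1, 3, 5, 9 and a sum of 18.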

🚀 Getting Started

Prerequisites

- NVIDIA GPU: Compute Capability 5.0 or higher recommended.
- CUDA Toolkit: Version 10.0 or higher.
- Compiler: `nvcc` (bundled with the CUDA Toolkit).
- Build Tool: `make` (`nmake` is an option on Windows, but the build setup assumes standard `make`).

Building and Running

Each module contains a Makefile. To build a specific module, navigate to its directory and run make.

Example: Running the Matrix-Vector Operations Demo

cd modules/03_advanced_math/01_matrix_vector_ops
make
./mv_app.exe

📚 Documentation

For a detailed learning path, check out A_BEGINNERS_GUIDE.md.

For a guide on the broader ecosystem (cuBLAS, Thrust, TensorRT, etc.), read CUDA_ECOSYSTEM_GUIDE.md.

🛠️ Verification

All modules are self-contained and include verification mechanisms (comparing GPU results against CPU reference implementations) to ensure correctness.
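
A verification step of this kind might look like the sketch below, which runs a SAXPY kernel, computes the same result on the CPU, and compares the two with a relative tolerance. The kernel choice, the tolerance, and the names `saxpy`, `saxpy_cpu`, `all_close`, and `verify_saxpy` are assumptions for illustration; the modules may check their results differently.

```cuda
#include <algorithm>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// GPU version: y = a * x + y.
__global__ void saxpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// CPU reference for the same operation.
void saxpy_cpu(float a, const std::vector<float> &x, std::vector<float> &y) {
    for (size_t i = 0; i < y.size(); ++i) y[i] = a * x[i] + y[i];
}

// Element-wise comparison with a relative tolerance. Exact equality is too
// strict because the GPU may fuse the multiply and add into a single FMA.
bool all_close(const std::vector<float> &gpu, const std::vector<float> &ref,
               float rel_tol) {
    if (gpu.size() != ref.size()) return false;
    for (size_t i = 0; i < ref.size(); ++i) {
        float scale = std::max(1.0f, std::fabs(ref[i]));
        if (std::fabs(gpu[i] - ref[i]) > rel_tol * scale) return false;
    }
    return true;
}

// Hypothetical driver: returns true when GPU and CPU results agree.
bool verify_saxpy(int n, float rel_tol = 1e-5f) {
    std::vector<float> x(n), y(n, 1.0f);
    for (int i = 0; i < n; ++i) x[i] = 0.001f * static_cast<float>(i);

    std::vector<float> ref = y;
    saxpy_cpu(2.5f, x, ref);

    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    saxpy<<<(n + 255) / 256, 256>>>(2.5f, dx, dy, n);

    std::vector<float> gpu(n);
    cudaMemcpy(gpu.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    return all_close(gpu, ref, rel_tol);
}
```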

To verify the entire repository at once, run the included PowerShell script:

./scripts/verify_all.ps1

🤝 Contributing

Contributions are welcome! Please ensure code is formatted and includes verification logic.

📄 License

MIT License

🧠 Key Topics & Demonstrated Skills

GPU Architecture & Execution Model
- Concepts: Grid/Block/Thread hierarchy, SIMT (Single Instruction, Multiple Threads) architecture, Warp divergence, and SM (Streaming Multiprocessor) occupancy.
- Demonstrated in: modules/01_fundamentals

CUDA Memory Hierarchy & Management
- Concepts: Global, Shared, Constant, and Unified Memory (cudaMallocManaged). Understanding memory coalescing and minimizing global memory latency.
- Demonstrated in: modules/02_memory_management, modules/04_optimizations

Advanced Parallel Algorithms & Optimization Patterns
- Tiled Matrix Multiplication: Utilizing Shared Memory to drastically reduce global memory bandwidth requirements (often considered the "Holy Grail" of CUDA optimization).
- Parallel Reduction: Highly optimized sum reductions using low-level Warp Shuffle (__shfl_down_sync) instructions.
- Histograms & Atomics: Privatized shared memory atomics to resolve global memory contention.
- Demonstrated in: modules/04_optimizations, modules/06_advanced_algorithms

Concurrency & Stream Management
- Concepts: Overlapping kernel execution with asynchronous memory transfers (cudaMemcpyAsync) using CUDA Streams to maximize total GPU utilization.
- Demonstrated in: modules/05_concurrency

NVIDIA Ecosystem Integration
- Concepts: Utilizing industry-standard, highly-tuned libraries instead of reinventing the wheel for production systems.
- Demonstrated in: modules/07_ecosystem (cuBLAS, cuSPARSE, cuSOLVER, cuRAND, Thrust)

Profiling, Debugging & Verification
- Concepts: NVTX (NVIDIA Tools Extension) markers for performance profiling in Nsight Systems. Establishing reliable CPU reference implementations for numerical validation.
- Demonstrated in: modules/07_ecosystem/06_nvtx, scripts/verify_all.ps1
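
The warp-shuffle reduction named under Advanced Parallel Algorithms can be sketched as follows. This version assumes the block size is a multiple of 32 and combines per-block results with `atomicAdd`, which is a simplification chosen for this example; `warp_reduce_sum`, `reduce_sum`, and `sum_on_gpu` are hypothetical names.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Sums v across the 32 lanes of a warp; lane 0 ends up with the total.
__inline__ __device__ float warp_reduce_sum(float v) {
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        // Each lane adds the value held by the lane `offset` positions above it.
        v += __shfl_down_sync(0xffffffff, v, offset);
    }
    return v;
}

// Assumes blockDim.x is a multiple of 32 and at most 1024.
__global__ void reduce_sum(const float *in, float *out, int n) {
    __shared__ float warp_sums[32];  // one slot per warp in the block

    // Grid-stride loop: each thread builds a private partial sum.
    float v = 0.0f;
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        v += in[i];
    }

    int lane = threadIdx.x % warpSize;
    int warp = threadIdx.x / warpSize;
    v = warp_reduce_sum(v);
    if (lane == 0) warp_sums[warp] = v;
    __syncthreads();

    // The first warp reduces the per-warp sums, then publishes the block total.
    int num_warps = blockDim.x / warpSize;
    if (warp == 0) {
        v = (lane < num_warps) ? warp_sums[lane] : 0.0f;
        v = warp_reduce_sum(v);
        if (lane == 0) atomicAdd(out, v);
    }
}

// Hypothetical host helper: returns the sum of all elements of input.
float sum_on_gpu(const std::vector<float> &input) {
    int n = static_cast<int>(input.size());
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, input.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));  // all-zero bytes represent 0.0f

    reduce_sum<<<64, 256>>>(d_in, d_out, n);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Because the shuffle moves values directly between registers, the only shared memory needed is the small `warp_sums` array.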
