ridash2005/CUDA_Kernels

CUDA Kernels Repository

C++ CUDA Make Windows PowerShell License: MIT

Welcome to the CUDA Kernels repository! This project is a comprehensive collection of CUDA implementations ranging from fundamental concepts to advanced mathematical operations. It is designed for both beginners starting their CUDA journey and professionals looking for reference implementations.

📂 Project Structure

The repository is organized into modules of increasing complexity:

1. Fundamentals (modules/01_fundamentals)

  • 01_kernel_basics: Introduction to writing your first CUDA kernel.
  • 02_grid_block: Understanding the Grid-Block-Thread hierarchy.
  • 03_hardware: Querying GPU device properties and capabilities.
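The Grid-Block-Thread hierarchy comes down to two pieces of arithmetic: how many blocks are needed to cover the data, and which element each thread owns. The C++ sketch below emulates a 1D launch on the CPU. The helper names (`blocks_for`, `global_index`, `count_active_threads`) are hypothetical and are not taken from the modules.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of blocks needed to cover n elements
// with block_size threads per block (ceiling division).
inline std::size_t blocks_for(std::size_t n, std::size_t block_size) {
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;
}

// Mirrors the usual 1D CUDA expression
// blockIdx.x * blockDim.x + threadIdx.x.
inline std::size_t global_index(std::size_t block_idx,
                                std::size_t block_dim,
                                std::size_t thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// CPU emulation of a 1D launch: visits every (block, thread) pair and
// counts the threads that pass the bounds check idx < n.
inline std::size_t count_active_threads(std::size_t n, std::size_t block_size) {
    std::size_t active = 0;
    const std::size_t grid = blocks_for(n, block_size);
    for (std::size_t b = 0; b < grid; ++b) {
        for (std::size_t t = 0; t < block_size; ++t) {
            if (global_index(b, block_size, t) < n) {
                ++active;  // this thread would do real work
            }
        }
    }
    return active;
}
```

With 1000 elements and 256 threads per block, this launches 4 blocks (1024 threads), and the bounds check leaves exactly 1000 of them active.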

2. Memory Management (modules/02_memory_management)

  • 01_vector_ops: Standard vector operations (Add, Sub, Mul, Dot) demonstrating global memory usage.
  • 02_vector_dot: Dot product implementation using atomic operations.
  • 03_constant_memory: Optimization using Constant Memory for read-only data.
  • 04_unified_memory: Simplifies memory management using cudaMallocManaged and cudaMemPrefetchAsync.

3. Advanced Math (modules/03_advanced_math)

  • 01_matrix_vector_ops: High-performance Matrix-Vector multiplication (Standard, Banded, Symmetric, Triangular) and Rank-1/Rank-2 updates. Includes CPU verification.
  • 02_fft: Fast Fourier Transform implementations (Radix-2 and Stockham algorithms).
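As a point of reference for the Radix-2 variant, here is a minimal CPU sketch of an in-place iterative radix-2 FFT. It is not the repository's kernel: the function name `fft_radix2` and the `cplx` alias are assumptions made for illustration, and the input length is assumed to be a power of two.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cplx = std::complex<double>;

// In-place iterative radix-2 decimation-in-time FFT (forward transform).
// Assumes a.size() is a power of two.
inline void fft_radix2(std::vector<cplx>& a) {
    const std::size_t n = a.size();
    assert(n != 0 && (n & (n - 1)) == 0);

    // Bit-reversal permutation so the butterflies can run in place.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) {
            j ^= bit;
        }
        j ^= bit;
        if (i < j) {
            std::swap(a[i], a[j]);
        }
    }

    const double pi = std::acos(-1.0);
    // log2(n) stages; each stage merges pairs of length-(len/2) transforms.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const cplx wlen = std::polar(1.0, -2.0 * pi / static_cast<double>(len));
        for (std::size_t i = 0; i < n; i += len) {
            cplx w(1.0, 0.0);
            for (std::size_t k = 0; k < len / 2; ++k) {
                const cplx u = a[i + k];
                const cplx v = a[i + k + len / 2] * w;
                a[i + k] = u + v;            // upper butterfly output
                a[i + k + len / 2] = u - v;  // lower butterfly output
                w *= wlen;
            }
        }
    }
}
```

In a GPU version, the butterflies within one stage are independent of each other, so they are the natural unit of work to spread across threads.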

4. Optimizations (modules/04_optimizations)

  • 01_tiled_matmul: Tiled Matrix-Matrix multiplication using Shared Memory, a classic CUDA optimization that reuses each tile loaded from global memory across a whole thread block.
  • 02_reduction: Highly optimized parallel reduction (Sum) using Warp Shuffle instructions.
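The tiling idea can be sketched on the CPU. In the C++ example below, two small local arrays stand in for the shared-memory tiles a thread block would stage, and a naive triple loop serves as the reference. The names `matmul_naive` and `matmul_tiled` and the tile width `TILE = 4` are assumptions for illustration, not the module's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 4;  // hypothetical tile width

// Naive reference: C = A * B for row-major n x n matrices.
inline std::vector<float> matmul_naive(const std::vector<float>& A,
                                       const std::vector<float>& B,
                                       std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n);
    std::vector<float> C(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
    return C;
}

// Tiled version: As and Bs play the role of shared-memory tiles.
// Out-of-range tile entries are zero-padded so n need not divide by TILE.
inline std::vector<float> matmul_tiled(const std::vector<float>& A,
                                       const std::vector<float>& B,
                                       std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n);
    std::vector<float> C(n * n, 0.0f);
    float As[TILE][TILE];
    float Bs[TILE][TILE];
    for (std::size_t bi = 0; bi < n; bi += TILE) {
        for (std::size_t bj = 0; bj < n; bj += TILE) {
            for (std::size_t bk = 0; bk < n; bk += TILE) {
                // "Load phase": stage one tile of A and one tile of B.
                for (std::size_t i = 0; i < TILE; ++i) {
                    for (std::size_t j = 0; j < TILE; ++j) {
                        As[i][j] = (bi + i < n && bk + j < n) ? A[(bi + i) * n + bk + j] : 0.0f;
                        Bs[i][j] = (bk + i < n && bj + j < n) ? B[(bk + i) * n + bj + j] : 0.0f;
                    }
                }
                // "Compute phase": every staged value is reused TILE times.
                for (std::size_t i = 0; i < TILE && bi + i < n; ++i) {
                    for (std::size_t j = 0; j < TILE && bj + j < n; ++j) {
                        float acc = 0.0f;
                        for (std::size_t k = 0; k < TILE; ++k) {
                            acc += As[i][k] * Bs[k][j];
                        }
                        C[(bi + i) * n + bj + j] += acc;
                    }
                }
            }
        }
    }
    return C;
}
```

On a GPU, the load and compute phases would be separated by block-level synchronization; the sequential CPU loop needs none.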

5. Concurrency (modules/05_concurrency)

  • 01_streams: Demonstrates maximizing GPU throughput by overlapping Compute with Memory Transfers using CUDA Streams.

6. Advanced Algorithms (modules/06_advanced_algorithms)

  • 01_histogram: Optimized frequency counting using Privatized Shared Memory Atomics to reduce Global Memory contention.
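Privatization means each block accumulates into its own small histogram and merges into the global one only once at the end, so most updates never touch the contended global bins. Below is a sequential C++ emulation of that pattern. The function name, the strided work split, and the `value % num_bins` binning rule are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates privatized histogramming: each "block" fills a private
// histogram (shared memory on the GPU), then merges it into the global one.
inline std::vector<unsigned> histogram_privatized(const std::vector<unsigned char>& data,
                                                  std::size_t num_bins,
                                                  std::size_t num_blocks) {
    assert(num_bins > 0 && num_blocks > 0);
    std::vector<unsigned> global_hist(num_bins, 0u);
    for (std::size_t block = 0; block < num_blocks; ++block) {
        // Private copy: updates here would be shared-memory atomics on a GPU.
        std::vector<unsigned> private_hist(num_bins, 0u);
        for (std::size_t i = block; i < data.size(); i += num_blocks) {
            ++private_hist[data[i] % num_bins];  // hypothetical binning rule
        }
        // Merge step: one global update per bin per block, instead of
        // one global update per input element.
        for (std::size_t b = 0; b < num_bins; ++b) {
            global_hist[b] += private_hist[b];
        }
    }
    return global_hist;
}
```

The number of global updates drops from one per input element to `num_bins * num_blocks`, which is where the contention savings come from.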

7. The Ecosystem (modules/07_ecosystem)

  • 01_cublas: Industry-standard Matrix Multiplication using NVIDIA's hand-tuned cuBLAS library.
  • 02_thrust: High-level C++ template library ("STL for CUDA") for Sorting and Reducing without writing kernels.
  • 03_curand: Parallel Random Number Generation (Monte Carlo Pi Estimation).
  • 04_cusparse: Sparse Matrix-Vector multiplication using Compressed Sparse Row (CSR) format.
  • 05_cusolver: Dense Cholesky Decomposition ($A = L L^T$).
  • 06_nvtx: Profiling range markers for Nsight Systems.
  • 07_dynamic_parallelism: Child kernel launches from the GPU (CDP).
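To make the CSR layout used by the cuSPARSE example concrete, here is a small CPU sketch of CSR sparse matrix-vector multiplication. It does not call cuSPARSE. The `CsrMatrix` struct and `spmv_csr` function are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR container: row_ptr has rows + 1 entries, and the
// nonzeros of row r live at positions [row_ptr[r], row_ptr[r + 1]).
struct CsrMatrix {
    std::size_t rows;
    std::vector<std::size_t> row_ptr;
    std::vector<std::size_t> col_idx;
    std::vector<double> values;
};

// y = A * x, touching only the stored nonzeros.
inline std::vector<double> spmv_csr(const CsrMatrix& A, const std::vector<double>& x) {
    assert(A.row_ptr.size() == A.rows + 1);
    assert(A.col_idx.size() == A.values.size());
    std::vector<double> y(A.rows, 0.0);
    for (std::size_t r = 0; r < A.rows; ++r) {
        double acc = 0.0;
        for (std::size_t p = A.row_ptr[r]; p < A.row_ptr[r + 1]; ++p) {
            assert(A.col_idx[p] < x.size());
            acc += A.values[p] * x[A.col_idx[p]];
        }
        y[r] = acc;
    }
    return y;
}
```

For the 3x3 matrix with rows (10, 0, 0), (0, 20, 30), (40, 0, 50), the arrays are `row_ptr = {0, 1, 3, 5}`, `col_idx = {0, 1, 2, 0, 2}` and `values = {10, 20, 30, 40, 50}`.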

🚀 Getting Started

Prerequisites

  • NVIDIA GPU: Compute Capability 5.0 or higher recommended.
  • CUDA Toolkit: Version 10.0 or higher.
  • Compiler: nvcc (bundled with CUDA Toolkit).
  • Build Tool: make (nmake may work on Windows, but the Makefiles are written for standard make).

Building and Running

Each module contains a Makefile. To build a specific module, navigate to its directory and run make.

Example: Running the Matrix-Vector Operations Demo

cd modules/03_advanced_math/01_matrix_vector_ops
make
./mv_app.exe

📚 Documentation

For a detailed learning path, check out A_BEGINNERS_GUIDE.md.

For a guide on the broader ecosystem (cuBLAS, Thrust, TensorRT, etc.), read CUDA_ECOSYSTEM_GUIDE.md.

🛠️ Verification

All modules are self-contained and include verification mechanisms (comparing GPU results against CPU reference implementations) to ensure correctness.
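A GPU-versus-CPU comparison usually needs a tolerance, because floating-point sums computed in a different order rarely match bit for bit. The sketch below shows one common form of such a check. The function name `all_close` and the default tolerances are assumptions, not the values the modules actually use.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns true when every element of `got` (e.g. copied back from the GPU)
// is within abs_tol + rel_tol * |ref| of the CPU reference value.
// Size mismatches and NaNs are treated as failures.
inline bool all_close(const std::vector<float>& got,
                      const std::vector<float>& ref,
                      float rel_tol = 1e-5f,
                      float abs_tol = 1e-6f) {
    if (got.size() != ref.size()) {
        return false;
    }
    for (std::size_t i = 0; i < got.size(); ++i) {
        const float diff = std::fabs(got[i] - ref[i]);
        const float tol = abs_tol + rel_tol * std::fabs(ref[i]);
        if (!(diff <= tol)) {  // written this way so NaN fails the check
            return false;
        }
    }
    return true;
}
```

The relative term keeps the check meaningful for large values, while the absolute term avoids false failures when the reference is close to zero.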

To verify the entire repository at once, run the included PowerShell script:

./scripts/verify_all.ps1

🤝 Contributing

Contributions are welcome! Please ensure code is formatted and includes verification logic.

📄 License

MIT License


🧠 Key Topics & Demonstrated Skills

  1. GPU Architecture & Execution Model

    • Concepts: Grid/Block/Thread hierarchy, SIMT (Single Instruction, Multiple Threads) architecture, Warp divergence, and SM (Streaming Multiprocessor) occupancy.
    • Demonstrated in: modules/01_fundamentals
  2. CUDA Memory Hierarchy & Management

    • Concepts: Global, Shared, Constant, and Unified Memory (cudaMallocManaged). Understanding memory coalescing and minimizing global memory latency.
    • Demonstrated in: modules/02_memory_management, modules/04_optimizations
  3. Advanced Parallel Algorithms & Optimization Patterns

    • Tiled Matrix Multiplication: Staging tiles of the input matrices in Shared Memory so that each value loaded from global memory is reused across the thread block, cutting global memory traffic by roughly a factor of the tile width.
    • Parallel Reduction: Highly optimized sum reductions using low-level Warp Shuffle (__shfl_down_sync) instructions.
    • Histograms & Atomics: Privatized shared memory atomics to resolve global memory contention.
    • Demonstrated in: modules/04_optimizations, modules/06_advanced_algorithms
  4. Concurrency & Stream Management

    • Concepts: Overlapping kernel execution with asynchronous memory transfers (cudaMemcpyAsync) using CUDA Streams to maximize total GPU utilization.
    • Demonstrated in: modules/05_concurrency
  5. NVIDIA Ecosystem Integration

    • Concepts: Utilizing industry-standard, highly-tuned libraries instead of reinventing the wheel for production systems.
    • Demonstrated in: modules/07_ecosystem (cuBLAS, cuSPARSE, cuSOLVER, cuRAND, Thrust)
  6. Profiling, Debugging & Verification

    • Concepts: NVTX (NVIDIA Tools Extension) markers for performance profiling in Nsight Systems. Establishing reliable CPU reference implementations for numerical validation.
    • Demonstrated in: modules/07_ecosystem/06_nvtx, scripts/verify_all.ps1
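The warp-shuffle reduction pattern named under topic 3 can be traced on the CPU. The C++ sketch below emulates the data movement of a shuffle-down reduction across one 32-lane warp: at each step, every lane adds the value held by the lane `offset` positions above it, and the offset halves until lane 0 holds the total. This is an emulation written for illustration (the function name is hypothetical), not device code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t WARP_SIZE = 32;

// Emulates the halving-offset pattern used with __shfl_down_sync.
// A lane whose source lane would fall outside the warp reads its own
// value, matching the shuffle's out-of-range behavior.
inline float warp_reduce_sum_emulated(std::array<float, WARP_SIZE> lanes) {
    for (std::size_t offset = WARP_SIZE / 2; offset > 0; offset /= 2) {
        // Snapshot: all lanes read the previous step's values "simultaneously".
        const std::array<float, WARP_SIZE> prev = lanes;
        for (std::size_t lane = 0; lane < WARP_SIZE; ++lane) {
            const std::size_t src = lane + offset;
            const float shuffled = (src < WARP_SIZE) ? prev[src] : prev[lane];
            lanes[lane] = prev[lane] + shuffled;
        }
    }
    return lanes[0];  // only lane 0 is guaranteed to hold the full sum
}
```

Five steps (offsets 16, 8, 4, 2, 1) reduce 32 values, and no shared memory is needed because the values move directly between lanes.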

About

CUDA C++ repository demonstrating advanced GPU computing, optimized parallel algorithms (FFTs, Tiled MatMul), and NVIDIA ecosystem integrations (cuBLAS, Thrust). Engineered for maximum throughput and HPC learning.
