A high-performance computing (HPC) and advanced concurrency sandbox dedicated to optimizing workloads across heterogeneous hardware architectures. This repository features two core engineering tracks: deterministic Numerical Integration Solvers optimized across mass-parallel GPU and CPU layers, and a production-grade, asynchronous Multi-Threaded Pipeline Simulation modeling strict resource contention.
The core objective is to isolate, profile, and mitigate hardware-specific bottlenecks—such as synchronization overhead, thread contention, memory latency, and race conditions—under high-throughput constraints.
Parallel-Programming/ (Root)
- Workload-Distribution-using-CUDA-OpenCL/ ──► [Track 1: GPU Hardware Acceleration]
CUDA_barrier_experiment.cpp,CUDA_diffrent_thread_per_block.cpp,CUDA_diffrent_workload_distribution.cpp,CUDA_memory_models.cppOPENCL_barrier_experimen.cpp,OPENCL_diffrent_thread_per_block.cpp,OPENCL_diffrent_workload_distribution_CPU.cpp,OPENCL_memory_models.cppcuda-opencl-report.pdf- Workload-Distribution-using-OpenMP/ ──► [Track 1: OpenMP Parallelism]
integration_using_loop_with_reduction.cpp,integration_using_loop_with_no_reduction.cppintegration_task_based_recursion.cpp,integration_task_based_no_recursion.cpp,integration_alternative_routing.cppOpenMP-report.pdf- Workload-Distribution-using-pthreads/ ──► [Track 1: Low-Level OS Threading]
integration.cpp,integration_using_locks.cpp,integration-no-locks.cppintegration-dynamic-sharing-job-queue.cpp,integration_with_jumps.cpppthread-report.pdf- multithread-pizza-store-simulation/ ──► [Track 2: Asynchronous Event-Driven Pipeline]
thread-handling-pizza.c,thread-handling-pizza.h,test-res.sh
These modules benchmark an identical numerical integration workload across different abstraction layers to analyze hardware exploitation efficiency and programming model trade-offs.
- Hardware Barrier Synchronization: Profiling execution overhead and work-group fence efficiency during parallel reductions (
CUDA_barrier_experiment.cpp,OPENCL_barrier_experimen.cpp). - Occupancy Tuning: Thread-per-block configuration profiling to determine optimal streaming multiprocessor wrapping (
CUDA_diffrent_thread_per_block.cpp,OPENCL_diffrent_thread_per_block.cpp). - Workload Mapping: Comparative analysis of alternative spatial data distribution patterns over the GPU grid vs. host CPU routing (
CUDA_diffrent_workload_distribution.cpp,OPENCL_diffrent_workload_distribution_CPU.cpp). - Memory Subsystem Layouts: Evaluation of explicit Shared Memory / Local Scratchpad architectures against direct Global Memory bus access to eliminate bandwidth saturation (
CUDA_memory_models.cpp,OPENCL_memory_models.cpp).
- Loop-Based Parallelization: Testing traditional loop constructs with automated variable privatization, comparing built-in parallel reductions against raw un-synchronized loops (
integration_using_loop_with_reduction.cpp,integration_using_loop_with_no_reduction.cpp). - Task-Based Concurrency: Implementing dynamic compiler tasks to evaluate recursive divide-and-conquer processing against flat non-recursive queue routing (
integration_task_based_recursion.cpp,integration_task_based_no_recursion.cpp). - Custom Scheduling Paths: Engineering alternative work-routing algorithms to minimize scheduling overhead across asymmetric CPU threads (
integration_alternative_routing.cpp).
- Baseline Concurrency: Raw POSIX thread spawning, manual domain decomposition, and join fencing boundaries (
integration.cpp). - Lock Contention Analysis: Profiling the severe performance penalties of heavy thread contention using native Mutex locks against fully optimized lock-free architectures (
integration_using_locks.cpp,integration-no-locks.cpp). - Dynamic Work Distribution: Implementing an asynchronous dynamic sharing Job Queue paradigm to eliminate load imbalance among operating system threads (
integration-dynamic-sharing-job-queue.cpp). - Non-Sequential Traversal: Measuring cash locality and performance profiles using interleaved work distribution via custom jump step patterns (
integration_with_jumps.cpp).
- Architecture: A complete multi-stage, event-driven transaction engine (
thread-handling-pizza.c,thread-handling-pizza.h) modeling a high-throughput storefront via the Producer-Consumer design pattern. - Thread Lifecycle Pipelines: Orders move asynchronously through decoupled operational stages managed entirely via POSIX threads (Order Intake ➔ Kitchen Prep ➔ Oven Baking ➔ Logistics/Delivery).
- Granular Synchronization: Protecting global inventory, limited oven slots, and delivery drivers using fine-grained Mutex Locks and handling non-blocking thread communication via Condition Variables.
- Race & Deadlock Mitigation: Hardened architecture engineered to guarantee fail-safe state changes under heavy concurrent stress, eliminating race conditions, thread starvation, and circular wait deadlocks.
- Automated Validation: Includes a functional test harness (
test-res.sh) to stress-test the pipeline and extract execution telemetry under varying load profiles.
- Thread Contention vs. Throughput: Profiling the numerical solvers verified that native mutex constraints degrade performance exponentially under heavy core scaling. Vectorized reduction and lock-free execution pathing achieved near-linear scaling metrics.
- Pipeline Synchronization Efficiency: In the asynchronous storefront simulation, fine-grained locking over distinct resource groups drastically reduced thread blocking states compared to a coarse global lock approach.
- Compute vs. Memory Bounds: GPU kernels are bound heavily by memory access coalescence and hardware barrier latency, while deep-recursive CPU tasks are latency-bound by stack allocation and OS task-scheduling metadata overhead.
Ensure your target system has the necessary compiler infrastructures and hardware-vendor backends configured.
- GPU Targets: NVIDIA CUDA Toolkit (v11.0+) or an OpenCL 2.0+ compatible runtime ICD.
- CPU Targets: GCC Core toolchain (v9.0+) with native OpenMP (
-fopenmp) and POSIX standard threading (-pthread) support.