
Releases: SC-SGS/CPPuddle

Release 0.4.0

23 Apr 04:02
5d189e5


Description

This release reflects the state/implementation of CPPuddle I described in my dissertation (see Chapter 5 within).

The release contains several updates, mostly regarding the interface and the documentation. It also includes some convenience methods for using CPPuddle's kernel fusion features:

  • CPPuddle is now split into three modules: executor_recycling (containing the executor pools), memory_recycling (containing the recycling allocators and related functionality) and kernel_aggregation (containing the kernel fusion functionality). Each module has its own namespace. Methods and classes are now hidden away in the respective detail namespace unless explicitly intended to be user-facing. The file structure was also adapted accordingly.
  • The old interface (from v0.3.1 and older) still works for now but causes deprecation warnings. This release therefore stays fully compatible with previous releases: switching to the new interface and include files is encouraged but not yet required.
  • CPPuddle now comes with doxygen documentation, which is automatically built on each push to the master branch (see https://sc-sgs.github.io/CPPuddle/ ).
  • CPPuddle now contains additional convenience methods for using the kernel fusion features, particularly with Kokkos. Some of those were previously used in Octo-Tiger, but I have now generalized them to be usable with other HPX applications (with CPPuddle).
  • CPPuddle now includes a few examples, showcasing the kernel fusion and buffer recycling.
  • Fixed several small bugs.

What's Changed

  • Fixes small issue with stream_interface constructor types by @G-071 in #27
  • Remove erroneous assert by @G-071 in #28
  • Refactor Interface by @G-071 in #29
  • Add doxygen and github pages by @G-071 in #30
  • Move Kokkos Kernel Aggregation Utils from Octo-Tiger to CPPuddle by @G-071 in #31
  • Add aggregation region wrapper by @G-071 in #32
  • Add some initial examples by @G-071 in #34
  • Add more triggers for github actions tests by @G-071 in #35
  • Update aggregation_executors_and_allocators.hpp to compile with gcc 14 by @diehlpk in #33
  • Build documentation only on master/documentation branches and via a manual trigger by @G-071 in #37
  • Recent HPX master has changed some namespaces for the executor traits specialization by @hkaiser in #36
  • Only build doc on master branch by @G-071 in #38
  • Remove const from number slices by @G-071 in #39
  • Adapt to HPX namespace deprecations by @G-071 in #40
  • Small fixes to internal pool names by @G-071 in #41


Full Changelog: v0.3.1...v0.4.0

Release 0.3.1

08 Oct 03:21
c1ae269


Description

This is mostly a bugfix release:

  • Fixed executor reference counting in work aggregation areas. This enables CPU/GPU load balancing again (mostly useful in consumer-grade hardware).
  • Fixed aggregation mutex choice (should be hpx::mutex). Use hpx::mutex by default everywhere else now as well (though std::mutex remains a valid option here).
  • Added an option to turn off executor pools whilst still providing the same interface (useful for performance comparisons).

What's Changed

  • Update README.md by @G-071 in #23
  • Fix combined CPU GPU execution by @G-071 in #24
  • Add option to disable using the executor pool by @G-071 in #25
  • Change mutex defaults by @G-071 in #26

Full Changelog: v0.3.0...v0.3.1

Release 0.3.0

25 Aug 18:20
377ee35


Description

This release contains a refactored/overhauled buffer management core and adds proper MultiGPU support.

Feature list / Changelog:

  • CPPuddle is now usable as a header-only library.

  • Reworked buffer manager by adding an HPX-aware mode and variable internal buckets. This mode uses the OS thread ID as a hint to reduce locking and get buffers for the correct NUMA node.

  • Added cmake variable to steer the number of internally used buckets (tradeoff between speed and memory usage).

  • Repaired and added MultiGPU functionality (also works for the work aggregation executors / allocators).

  • Removed central reference counting for recycled Kokkos buffers (counting is now done per View).

  • Added a proper finalize method that prevents further usage after being called.

  • Added cmake toggles to enable/disable content recycling and buffer recycling as required (useful for benchmarking).

  • Made the internal CPPuddle allocation/recycling counters compatible with HPX performance counters.

  • Contains various bug fixes and a cleaned up codebase.
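The bucket idea from the reworked buffer manager above can be illustrated with a small, self-contained sketch. This is not CPPuddle's implementation: the class name `bucketed_pool`, its methods, and the `location_hint` parameter are hypothetical, and plain `std::mutex` and `std::vector<char>` stand in for whatever the library uses internally. The sketch only shows why a caller-supplied hint (such as a thread ID) reduces lock contention: each hint maps to its own bucket with its own lock, and more buckets mean less contention at the cost of more buffers being held back.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical sketch: a buffer pool split into NumBuckets independent buckets.
template <std::size_t NumBuckets>
class bucketed_pool {
  static_assert(NumBuckets > 0, "at least one bucket is required");

  struct bucket {
    std::mutex mutex;                              // protects only this bucket
    std::vector<std::vector<char>> free_buffers;   // buffers ready for reuse
  };
  std::array<bucket, NumBuckets> buckets_;

public:
  // Returns a buffer of exactly `bytes` bytes. A previously returned buffer
  // of the same size is reused if the bucket selected by the hint has one.
  std::vector<char> get(std::size_t bytes, std::size_t location_hint) {
    bucket& b = buckets_[location_hint % NumBuckets];
    std::lock_guard<std::mutex> guard(b.mutex);
    for (auto it = b.free_buffers.begin(); it != b.free_buffers.end(); ++it) {
      if (it->size() == bytes) {
        std::vector<char> reused = std::move(*it);
        b.free_buffers.erase(it);
        return reused;
      }
    }
    return std::vector<char>(bytes);  // nothing to reuse: allocate fresh
  }

  // Hands a buffer back to the bucket selected by the hint.
  void put_back(std::vector<char> buffer, std::size_t location_hint) {
    assert(!buffer.empty());
    bucket& b = buckets_[location_hint % NumBuckets];
    std::lock_guard<std::mutex> guard(b.mutex);
    b.free_buffers.push_back(std::move(buffer));
  }
};
```

In this sketch, two callers with hints that map to different buckets never wait on the same lock, while a buffer returned under one hint is only found again by requests that map to the same bucket.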

Note: The MultiGPU addition required some slight adjustment to the interface, requiring additional device_id parameters for various functions. Additionally, some gpu_id parameters from the defunct previous MultiGPU code have been removed. Other than that, the interface largely stayed the same.


Full Changelog: v0.2.0...v0.3.0

Release 0.2.1

24 Aug 21:12


Description

This release backports the interface changes from v0.3.0 to the older v0.2.0 release.

Feature list / Changelog:

  • Backports the interface changes from Release v0.3.0 to v0.2.0, effectively allowing applications such as Octo-Tiger to still use the old CPPuddle core (from 0.2.0) despite having been ported to the new CPPuddle interface (from 0.3.0).

  • Notably, the interface was backported for v0.2.1 in a way that keeps this release compatible with the interface of previous CPPuddle releases (which was not feasible for 0.3.0 due to the removal of the old MultiGPU code).

  • The release further fixes some small test issues.

Full Changelog: v0.2.0...v0.2.1

Release 0.2.0

18 Aug 20:46
c922bcc


Description

This release adds work aggregation/kernel fusion features, SYCL support and A64FX support:

  • Added explicit work aggregation executors and allocators. These allow multithreaded work aggregation / kernel fusion of GPU kernels when using HPX. They are intended to combine, on the fly, GPU kernels that do the same work but on different HPX components (within the same HPX locality, though). See here for a more detailed description and benchmarks with a real-world HPX application using both an NVIDIA A100 and an AMD MI100 (using CUDA, HIP and Kokkos).
  • Added basic tests for CPU work aggregation executor/allocators
  • Added more detailed CPU/GPU STREAM tests for work aggregation executor/allocators
  • Added SYCL allocators (used for the benchmarks here)
  • Fixed various CI bugs and compilation on A64FX (see here for a usage example on A64FX machines)
  • Note: Including the work aggregation executor/allocators requires C++17, other features still work with C++14.
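The work aggregation idea from the first item above can be sketched in plain C++. This is only an illustration under simplifying assumptions, not CPPuddle's executor or allocator interface: the class `slice_aggregator`, its methods, and the notion of fixed-length "slices" are hypothetical, the "kernel" is an ordinary callable rather than a GPU launch, and the real library works asynchronously with HPX. The sketch shows the core trade: several tasks each contribute a small piece of identical work, and one fused launch processes all pieces at once.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical sketch: collects equally sized slices from several tasks and
// runs one fused kernel over all of them instead of one kernel per task.
class slice_aggregator {
public:
  // The fused kernel receives the concatenated data and the slice count.
  using fused_kernel =
      std::function<void(std::vector<double>& data, std::size_t slices)>;

  slice_aggregator(std::size_t slice_length, std::size_t max_slices,
                   fused_kernel kernel)
      : slice_length_(slice_length), max_slices_(max_slices),
        kernel_(std::move(kernel)) {
    assert(slice_length_ > 0 && max_slices_ > 0);
  }

  // Adds one task's slice. Once max_slices slices are staged, the fused
  // kernel is launched a single time for all of them.
  void submit(const std::vector<double>& slice) {
    assert(slice.size() == slice_length_);
    std::lock_guard<std::mutex> guard(mutex_);
    staging_.insert(staging_.end(), slice.begin(), slice.end());
    if (staging_.size() == slice_length_ * max_slices_) launch_locked();
  }

  // Launches whatever is staged, even if fewer than max_slices slices arrived.
  void flush() {
    std::lock_guard<std::mutex> guard(mutex_);
    if (!staging_.empty()) launch_locked();
  }

  std::size_t launches() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return launches_;
  }

private:
  // Caller must hold mutex_. For simplicity the kernel runs under the lock.
  void launch_locked() {
    kernel_(staging_, staging_.size() / slice_length_);
    ++launches_;
    staging_.clear();
  }

  std::size_t slice_length_;
  std::size_t max_slices_;
  fused_kernel kernel_;
  mutable std::mutex mutex_;
  std::vector<double> staging_;
  std::size_t launches_ = 0;
};
```

With `max_slices` set to 4, five submitted slices lead to one launch over four slices, and a later `flush()` launches the remaining slice on its own.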



Full Changelog: v0.1.0...v0.2.0

Release 0.1.0

23 Feb 19:10
1b04b54


The version for this release has been in use for multiple months now and seems to work well, hence this initial release with the basic functionality before more experimental features are added!

The release contains the basic (multithreaded) recycling / reusage functionality for buffers and executors.

It provides allocators that enable:

  • Reusage of buffers allocated by std::allocator
  • Reusage of aligned buffers
  • Reusage of CUDA device memory buffers
  • Reusage of CUDA pinned host memory buffers
  • Reusage of HIP device memory buffers
  • Reusage of HIP pinned host memory buffers
  • Reusage of Kokkos Views (via wrapper class)
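To make the idea of a recycling allocator concrete, here is a minimal sketch for the `std::allocator` case. It is not CPPuddle's code or interface: `buffer_pool`, `global_pool` and `recycling_allocator` are hypothetical names, the pool matches buffers by exact byte size only, and memory comes from plain `operator new`, so it suits only types without extended alignment requirements. The point is the mechanism: `deallocate` parks the buffer in a pool instead of freeing it, and the next `allocate` of the same size gets that buffer back.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <new>
#include <vector>

// Hypothetical pool: freed buffers are kept in per-size free lists and
// handed out again on the next request of the same size.
class buffer_pool {
public:
  buffer_pool() = default;
  buffer_pool(const buffer_pool&) = delete;
  buffer_pool& operator=(const buffer_pool&) = delete;

  ~buffer_pool() {  // release everything still parked in the pool
    for (auto& entry : free_lists_)
      for (void* p : entry.second) ::operator delete(p);
  }

  void* get(std::size_t bytes) {
    std::lock_guard<std::mutex> guard(mutex_);
    auto it = free_lists_.find(bytes);
    if (it != free_lists_.end() && !it->second.empty()) {
      void* p = it->second.back();
      it->second.pop_back();
      ++recycled_;
      return p;
    }
    ++allocated_;
    return ::operator new(bytes);
  }

  void put_back(void* p, std::size_t bytes) {
    assert(p != nullptr);
    std::lock_guard<std::mutex> guard(mutex_);
    free_lists_[bytes].push_back(p);
  }

  std::size_t allocated() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return allocated_;
  }
  std::size_t recycled() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return recycled_;
  }

private:
  mutable std::mutex mutex_;
  std::map<std::size_t, std::vector<void*>> free_lists_;
  std::size_t allocated_ = 0;
  std::size_t recycled_ = 0;
};

// One process-wide pool shared by all allocator instances in this sketch.
inline buffer_pool& global_pool() {
  static buffer_pool pool;
  return pool;
}

// Standard-conforming allocator that routes requests through the pool.
template <typename T>
struct recycling_allocator {
  using value_type = T;
  recycling_allocator() = default;
  template <typename U>
  recycling_allocator(const recycling_allocator<U>&) noexcept {}

  T* allocate(std::size_t n) {
    return static_cast<T*>(global_pool().get(n * sizeof(T)));
  }
  void deallocate(T* p, std::size_t n) {
    global_pool().put_back(p, n * sizeof(T));
  }
};

template <typename T, typename U>
bool operator==(const recycling_allocator<T>&, const recycling_allocator<U>&) noexcept {
  return true;
}
template <typename T, typename U>
bool operator!=(const recycling_allocator<T>&, const recycling_allocator<U>&) noexcept {
  return false;
}
```

A container such as `std::vector<double, recycling_allocator<double>>` can then be created and destroyed repeatedly with the same size, and only the first creation pays for a real allocation.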

It further provides executor pools for arbitrary executors with various scheduling policies (tested with HPX CUDA/HIP and Kokkos executors):

  • Round robin scheduling policy
  • Priority scheduling policy
  • MultiGPU with Round Robin scheduling policy
  • MultiGPU with Priority scheduling policy
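As an illustration of the round robin policy listed above, the following sketch builds a fixed set of executors once and hands them out in turn. It is not CPPuddle's pool interface: `round_robin_pool` and `counting_executor` are hypothetical names, and the toy executor simply runs work inline, whereas a real pool would hold HPX CUDA/HIP or Kokkos executors. A priority policy would instead pick the executor with the least outstanding work, which is not shown here.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: creates `count` executors up front and hands them out
// in round robin order, so executors are reused instead of re-created.
template <typename Executor>
class round_robin_pool {
public:
  // Extra arguments are forwarded to the constructor of every executor.
  template <typename... Args>
  explicit round_robin_pool(std::size_t count, const Args&... args) {
    assert(count > 0);
    executors_.reserve(count);
    for (std::size_t i = 0; i < count; ++i) executors_.emplace_back(args...);
  }

  // Returns the next executor in turn; the counter is atomic so that several
  // threads can request executors concurrently.
  Executor& get() {
    const std::size_t slot =
        next_.fetch_add(1, std::memory_order_relaxed) % executors_.size();
    return executors_[slot];
  }

  std::size_t size() const { return executors_.size(); }

private:
  std::vector<Executor> executors_;
  std::atomic<std::size_t> next_{0};
};

// Toy executor for demonstration: runs the task inline and counts launches.
struct counting_executor {
  int launched = 0;

  template <typename F>
  void execute(F&& task) {
    ++launched;
    std::forward<F>(task)();
  }
};
```

With a pool of three executors, the fourth call to `get()` returns the same executor as the first call.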

The release also includes CI functionality on GitHub actions and Jenkins (for GPU and concurrency tests).