
Extract Particle Communication And Update Infrastructure From PR501#544

Open
aaadelmann wants to merge 11 commits into master from pr501-communication-particle-update

@aaadelmann

Member

Extract Particle Communication And Update Infrastructure From PR501

Summary

This PR extracts the particle communication/update infrastructure from the large
PR501 branch. It builds on the PCG allocation split and keeps FFT, NUFFT,
higher-order scatter/gather, and PIF examples out of scope.

The main goals are:

  • modernize particle migration around reusable send/receive buffers,
  • add direct particle-attribute archive serialization/deserialization,
  • reduce repeated allocations in receive-side particle unpacking,
  • add diagnostics for particle update timing,
  • introduce particle sorting/update support infrastructure,
  • thread nghost through field layout / halo APIs where particle layout needs
    consistent ghost-width awareness,
  • add focused particle update regression tests.

What Changed

Particle migration path

Main files:

  • src/Particle/ParticleSpatialLayout.h
  • src/Particle/ParticleSpatialLayout.hpp
  • src/Particle/ParticleBase.h
  • src/Particle/ParticleBase.hpp
  • src/Particle/ParticleAttrib.h
  • src/Particle/ParticleAttrib.hpp
  • src/Particle/ParticleAttribBase.h

ParticleSpatialLayout::update() now has a more explicit multi-stage migration
flow:

  1. apply particle boundary conditions,
  2. locate particles that leave the current rank,
  3. exchange send/receive counts,
  4. post sends and receives,
  5. destroy particles that left locally,
  6. wait for all MPI requests,
  7. pre-reserve receive storage,
  8. deserialize received particle attributes.
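
The local (non-MPI) stages of this flow can be sketched on a single rank. This is a minimal illustration, not the ParticleSpatialLayout implementation: it assumes a hypothetical 1D periodic domain split into equal slabs, and the names MigrationPlan, planMigration, and finalizeReceives are invented for the example. Steps 3, 4, and 6 (count exchange, sends/receives, waits) are omitted because they need MPI.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical 1D model: the periodic domain [0, length) is split into
// numRanks equal slabs, and slab r is owned by rank r.
struct MigrationPlan {
    std::vector<double> kept;                     // particles that stay local
    std::vector<std::vector<double>> sendByRank;  // leaving particles, per destination
};

// Step 1: periodic boundary condition, mapping x into [0, length).
inline double wrapPeriodic(double x, double length) {
    double w = std::fmod(x, length);
    if (w < 0.0) w += length;
    return w;
}

// Step 2: owner rank of an already wrapped position.
inline int ownerRank(double x, double length, int numRanks) {
    const int r = static_cast<int>(x / (length / numRanks));
    return r < numRanks ? r : numRanks - 1;  // guard against x rounding up to length
}

// Steps 1, 2 and 5: apply boundary conditions, sort leaving particles into
// per-destination send lists, and keep only the particles that stay.
inline MigrationPlan planMigration(const std::vector<double>& positions, int myRank,
                                   int numRanks, double length) {
    MigrationPlan plan;
    plan.sendByRank.resize(numRanks);
    for (double x : positions) {
        const double w = wrapPeriodic(x, length);
        const int dest = ownerRank(w, length, numRanks);
        if (dest == myRank) {
            plan.kept.push_back(w);
        } else {
            plan.sendByRank[dest].push_back(w);
        }
    }
    return plan;
}

// Steps 7 and 8: reserve the final size once, then append every received batch.
inline void finalizeReceives(std::vector<double>& local,
                             const std::vector<std::vector<double>>& received) {
    std::size_t totalRecvs = 0;
    for (const auto& batch : received) totalRecvs += batch.size();
    local.reserve(local.size() + totalRecvs);
    for (const auto& batch : received) {
        local.insert(local.end(), batch.begin(), batch.end());
    }
}
```

With two ranks on a unit domain, rank 0 keeps the particles that wrap into [0, 0.5) and queues the rest for rank 1; the received batches are then appended after a single reserve.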

The receive tail is timed explicitly:

  • particleWait
  • particleFreeBuffers
  • particleDeserialize
  • particleDeserResize
  • particleDeserCopy

This makes the previously hidden tail of updateParticle visible in profiles.
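
As a rough illustration of how such a timer split can be structured, the sketch below uses a hypothetical RAII ScopedTimer and a plain std::map registry; it is not the IpplTimings API. The nesting of particleDeserResize and particleDeserCopy inside particleDeserialize, and the order of the stages, are assumptions inferred from the timer names.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Hypothetical timer registry (name -> accumulated seconds); not IpplTimings.
using TimerTable = std::map<std::string, double>;

// RAII helper that adds the elapsed wall time to a named entry on destruction.
class ScopedTimer {
public:
    ScopedTimer(TimerTable& table, std::string name)
        : table_(table)
        , name_(std::move(name))
        , start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start_;
        table_[name_] += elapsed.count();
    }

    ScopedTimer(const ScopedTimer&)            = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    TimerTable& table_;
    std::string name_;
    std::chrono::steady_clock::time_point start_;
};

// Runs the four caller-supplied stages and times them with the timer names
// from the list above (nesting and order are assumptions of this sketch).
template <typename Wait, typename Free, typename Resize, typename Copy>
TimerTable timeReceiveTail(Wait wait, Free freeBuffers, Resize resize, Copy copy) {
    TimerTable table;
    { ScopedTimer t(table, "particleWait"); wait(); }
    { ScopedTimer t(table, "particleFreeBuffers"); freeBuffers(); }
    {
        ScopedTimer t(table, "particleDeserialize");
        { ScopedTimer r(table, "particleDeserResize"); resize(); }
        { ScopedTimer c(table, "particleDeserCopy"); copy(); }
    }
    return table;
}
```

Because the child timers stop before the parent does, the parent entry is never smaller than either child, which is what makes a split like this useful for attributing a hidden tail.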

Receive-side pre-reserve fix

The key performance fix is receive-side pre-reserving of particle attribute
capacity.

Root cause observed on LUMI:

  • Receive finalizers called ParticleAttrib::deserialize(offset, nrecvs) once
    per source rank and attribute.
  • Each of those calls performed a preserving resize while receiving
    particles.
  • Kokkos::resize preserves existing entries, so on GPU every grow reallocated
    the view and copied the already-live particle storage, many times within a
    single update step.
  • This dominated particleDeserialize and caused a large updateParticle
    regression.

Fix:

  • Add ParticleAttrib::reserve(size_type).
  • Before deferred receive finalizers run, compute the final receive capacity
    once:

    localAfterDestroy + totalRecvs

  • Reserve every particle attribute once to that capacity.
  • Keep receive finalizers focused on archive copy/deserialization.

Relevant code:

if (totalRecvs > 0) {
    const size_type receiveCapacity = localAfterDestroy + totalRecvs;
    pc.forAllAttributes([&]<typename Attribute>(Attribute*& attribute) {
        attribute->reserve(receiveCapacity);
    });
}
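
To make the cost difference concrete, here is a small host-only model that counts how many live entries get copied with and without the pre-reserve. ExactGrowAttribute and copiesDuringReceive are hypothetical names for this sketch; the class only imitates storage that grows to exactly the requested size and preserves its contents, and it is not ParticleAttrib.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a particle attribute whose storage grows to
// exactly the requested capacity and copies all live entries on every grow.
class ExactGrowAttribute {
public:
    explicit ExactGrowAttribute(std::size_t localCount) {
        reserve(localCount);
        size_ = localCount;
    }

    // Grow capacity if needed, preserving live entries and counting the copies.
    void reserve(std::size_t newCapacity) {
        if (newCapacity <= capacity_) return;
        auto fresh = std::make_unique<double[]>(newCapacity);
        std::copy_n(data_.get(), size_, fresh.get());
        copiedEntries_ += size_;
        data_     = std::move(fresh);
        capacity_ = newCapacity;
    }

    // Append one received batch; grows only when capacity is insufficient.
    void deserialize(const std::vector<double>& batch) {
        reserve(size_ + batch.size());
        std::copy(batch.begin(), batch.end(), data_.get() + size_);
        size_ += batch.size();
    }

    std::size_t copiedEntries() const { return copiedEntries_; }

private:
    std::unique_ptr<double[]> data_;
    std::size_t capacity_      = 0;
    std::size_t size_          = 0;
    std::size_t copiedEntries_ = 0;
};

// Number of live entries copied while receiving one batch per source rank.
inline std::size_t copiesDuringReceive(std::size_t localAfterDestroy,
                                       const std::vector<std::size_t>& recvCounts,
                                       bool preReserve) {
    ExactGrowAttribute attribute(localAfterDestroy);
    if (preReserve) {
        std::size_t totalRecvs = 0;
        for (std::size_t n : recvCounts) totalRecvs += n;
        attribute.reserve(localAfterDestroy + totalRecvs);
    }
    for (std::size_t n : recvCounts) {
        attribute.deserialize(std::vector<double>(n, 0.0));
    }
    return attribute.copiedEntries();
}
```

In this model, 1000 local particles receiving three batches of 100 copy 1000 + 1100 + 1200 = 3300 entries without the reserve and 1000 entries with it. The unreserved cost grows roughly quadratically with the number of source ranks, while the reserved cost stays at one copy of the local storage.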

Communication archive and buffer handling

Main files:

  • src/Communicate/Archive.h
  • src/Communicate/Archive.hpp
  • src/Communicate/BufferHandler.hpp
  • src/Communicate/Buffers.*
  • src/Communicate/Communicator.*
  • src/Communicate/LogEntry.*
  • src/Communicate/LoggingBufferHandler.h

The branch adds/updates:

  • archive serialization APIs for direct particle attribute transfer,
  • hash-based serialization for selected particles,
  • receive-side deserialization with offsets,
  • shared/default buffer handler plumbing,
  • buffer/logging diagnostics,
  • HIP/CUDA allocation handling in Archive.
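
The pairing of hash-based serialization with offset-based deserialization can be sketched on the host as follows. This assumes that "hash" means an index list of the particles selected for sending; ByteArchive, serializeSelected, and deserializeAt are hypothetical names and do not reflect the Archive interface in src/Communicate.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical byte archive for this sketch; not the Archive class.
struct ByteArchive {
    std::vector<unsigned char> bytes;
};

// Pack attribute[hash[i]] for every selected index, appending to the archive.
inline void serializeSelected(ByteArchive& archive, const std::vector<double>& attribute,
                              const std::vector<std::size_t>& hash) {
    const std::size_t start = archive.bytes.size();
    archive.bytes.resize(start + hash.size() * sizeof(double));
    for (std::size_t i = 0; i < hash.size(); ++i) {
        std::memcpy(archive.bytes.data() + start + i * sizeof(double),
                    &attribute[hash[i]], sizeof(double));
    }
}

// Unpack the first nrecvs values into attribute[offset, offset + nrecvs).
// The caller is expected to have sized the attribute beforehand.
inline void deserializeAt(const ByteArchive& archive, std::vector<double>& attribute,
                          std::size_t offset, std::size_t nrecvs) {
    assert(offset + nrecvs <= attribute.size());
    assert(nrecvs * sizeof(double) <= archive.bytes.size());
    if (nrecvs == 0) return;
    std::memcpy(attribute.data() + offset, archive.bytes.data(), nrecvs * sizeof(double));
}
```

The sender packs only the selected entries, and the receiver writes them behind its surviving local particles by passing the current local count as the offset.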

Notable HIP detail:

  • Archive rounds HIP GPU allocations to 64 KiB granularity to satisfy HSA IPC
    requirements used by Cray MPICH for large GPU transfers.
  • hipFree / cudaFree return values are intentionally cast to void to avoid
    warning noise from nodiscard return values in destructors/free paths.
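
Both details are easy to show in isolation. The sketch below assumes the rounding is upward to the next 64 KiB multiple; roundUpToGranularity, releaseDeviceMemory, and freeQuietly are hypothetical names, and releaseDeviceMemory merely stands in for a runtime call such as hipFree that returns a nodiscard status.

```cpp
#include <cassert>
#include <cstddef>

// 64 KiB granularity, as named in the text above.
constexpr std::size_t kAllocationGranularity = 64 * 1024;

// Round a byte count up to the next multiple of the granularity.
constexpr std::size_t roundUpToGranularity(std::size_t bytes,
                                           std::size_t granularity = kAllocationGranularity) {
    return ((bytes + granularity - 1) / granularity) * granularity;
}

// Hypothetical stand-in for a runtime free call with a nodiscard status code.
[[nodiscard]] inline int releaseDeviceMemory(void* /*pointer*/) {
    return 0;
}

// Free path that deliberately discards the status; the cast to void tells
// the compiler the result is ignored on purpose, so no warning is emitted.
inline void freeQuietly(void* pointer) {
    static_cast<void>(releaseDeviceMemory(pointer));
}
```

A request of 65537 bytes therefore becomes a 131072-byte allocation, while a request that is already a multiple of 64 KiB is unchanged.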

Particle sorting infrastructure

Main files:

  • src/Particle/ParticleSort.h
  • src/Particle/SortBuffer.h

The branch introduces reusable sorting buffers and particle sort helpers used by
the new spatial update path. The buffers grow on demand and are reused to avoid
allocation churn.
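
The grow-on-demand reuse pattern can be sketched with a small host-side class. ReusableBuffer is a hypothetical name and uses std::vector for storage; it is not the SortBuffer from src/Particle, which this sketch does not try to reproduce.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reusable scratch buffer that only reallocates when a request
// exceeds the largest size seen so far.
class ReusableBuffer {
public:
    // Return scratch space for at least n entries. Old contents are scratch
    // data, so a growth discards them instead of copying them over.
    int* acquire(std::size_t n) {
        if (n > storage_.size()) {
            std::vector<int>(n).swap(storage_);
            ++allocations_;
        }
        return storage_.data();
    }

    std::size_t capacity() const { return storage_.size(); }
    std::size_t allocations() const { return allocations_; }

private:
    std::vector<int> storage_;
    std::size_t allocations_ = 0;
};
```

Requests of 100, 50, 100, and 200 entries trigger only two allocations, because the second and third requests fit into the storage from the first.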

Field layout / halo nghost plumbing

Main files:

  • src/FieldLayout/FieldLayout.h
  • src/FieldLayout/FieldLayout.hpp
  • src/FieldLayout/SubFieldLayout.hpp
  • src/Field/HaloCells.h
  • src/Field/HaloCells.hpp
  • src/Field/BareField.hpp

The particle layout changes require consistent ghost-width awareness when field
layouts and halo neighbor regions are computed. This PR threads nghost through
the relevant field layout / halo APIs.

This is small in line count but important for correctness; reviewers should
look at it together with the particle layout changes.

Utility support

Main files:

  • src/Utility/BufferView.h
  • src/Utility/ParallelDispatch.h
  • src/Utility/Tuning.h
  • src/Utility/TypeUtils.h
  • src/Utility/IpplTimings.*
  • src/Utility/Timer.*

The utility changes provide reusable support for:

  • buffer views,
  • dispatch helpers,
  • tuning infrastructure,
  • richer timing diagnostics needed by the particle update investigation.

ALPINE Kokkos view lifetime fixes

The ALPINE managers no longer take addresses of temporary Kokkos view handles
returned by getView().

Changed files:

  • alpine/LandauDampingManager.h
  • alpine/BumponTailInstabilityManager.h
  • alpine/PenningTrapManager.h

Before:

view_type* R = &(this->pcontainer_m->R.getView());
samplingR.generate(*R, rand_pool64);

After:

view_type R = this->pcontainer_m->R.getView();
samplingR.generate(R, rand_pool64);

This avoids dangling pointers/references to temporary view handles and fixes
builds with compilers/backends that reject taking the address of a temporary
Kokkos view.
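
Holding the handle by value is safe because a Kokkos view is a reference-counted handle with shallow-copy semantics. The sketch below imitates that behavior with std::shared_ptr; ViewHandle, Attribute, and fillWithIndex are hypothetical names for illustration, not Kokkos or IPPL types.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical reference-counted handle, loosely modeled on a Kokkos view:
// copying the handle is shallow and shares the underlying storage.
class ViewHandle {
public:
    explicit ViewHandle(std::size_t n)
        : storage_(std::make_shared<std::vector<double>>(n, 0.0)) {}

    double& operator()(std::size_t i) const { return (*storage_)[i]; }
    std::size_t extent() const { return storage_->size(); }

private:
    std::shared_ptr<std::vector<double>> storage_;
};

// Hypothetical attribute whose getView() returns the handle by value, so the
// result of the call expression is a temporary.
class Attribute {
public:
    explicit Attribute(std::size_t n) : view_(n) {}
    ViewHandle getView() const { return view_; }

private:
    ViewHandle view_;
};

// Fill through a handle held by value; writes land in the attribute's storage.
inline void fillWithIndex(Attribute& attribute) {
    ViewHandle R = attribute.getView();  // by value: shares and keeps storage alive
    for (std::size_t i = 0; i < R.extent(); ++i) {
        R(i) = static_cast<double>(i);
    }
    // ViewHandle* bad = &attribute.getView();  // ill-formed: address of a temporary
}
```

The copy costs one reference-count update rather than a data copy, and the writes made through the local handle remain visible through later getView() calls.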

Validation And Performance Evidence

LUMI Results (ALPS will follow)

| Benchmark | Problem size | Nodes | Ranks | master | pr501-pcg | pr501-com | pr501-fft | pr501-hosg | pr501-nufft |
|---|---|---|---|---|---|---|---|---|---|
| FEM | 513_10 | 8 | 64 | 28.23 | 27.77 (-2%) | 27.71 (-2%) | 27.88 (-1%) | 28.23 (0%) | 27.95 (-1%) |
| FFT | 512_10 | 4 | 32 | 4.39 | 4.41 (0%) | 11.54 (+163%) | 11.53 (+162%) | 11.59 (+164%) | 11.53 (+162%) |
| FFT | 512_10 | 16 | 128 | 1.65 | 1.65 (0%) | 2.88 (+74%) | 2.88 (+75%) | 2.93 (+78%) | 2.93 (+78%) |
| PCG | 512_10 | 1 | 8 | 72.31 | 70.01 (-3%) | 76.99 (+6%) | 76.75 (+6%) | | |
| PCG | 512_10 | 4 | 32 | 34.60 | 32.93 (-5%) | 43.71 (+26%) | 33.62 (-3%) | 36.32 (+5%) | 33.55 (-3%) |
| PCG | 512_10 | 64 | 512 | 25.00 | 23.85 (-5%) | 22.89 (-8%) | 22.40 (-10%) | 22.74 (-9%) | 22.58 (-10%) |

The original symptom was a large updateParticle regression moving from
pr501-pcg to the communication/particle-update split. Initial child timers
looked small because deferred receive finalization/deserialization was hidden in
the tail of updateParticle.

Diagnostics split that tail into:

  • particleWait,
  • particleFreeBuffers,
  • particleDeserialize.

The regression was traced to particleDeserialize, not MPI wait time.

LUMI before/after pre-reserve fix

Recorded in PR501_SPLIT_MAP.md:

| ranks | metric | before pre-reserve | after pre-reserve | improvement |
|---|---|---|---|---|
| 32 | updateParticle wall max | 8.08981 | 0.338958 | 23.9x |
| 32 | particleDeserialize wall max | 7.88320 | 0.0106297 | 741x |
| 128 | updateParticle wall max | 1.59376 | 0.214328 | 7.4x |
| 128 | particleDeserialize wall max | 1.40305 | 0.0132888 | 106x |

Post-fix timer split:

| ranks | particleDeserialize wall max | particleDeserResize wall max | particleDeserCopy wall max |
|---|---|---|---|
| 32 | 0.0106297 | 0.0000129 | 0.0105913 |
| 128 | 0.0132888 | 0.0000196 | 0.0132491 |

Interpretation:

  • The repeated preserving resize was the dominant regression.

  • After pre-reserving, deserialize time is small and almost entirely actual
    archive copy.

  • At 128 ranks, updateParticle became balanced:

    before max/avg/min = 1.594 / 0.864 / 0.811
    after  max/avg/min = 0.214 / 0.211 / 0.207
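
One way to quantify "balanced" is the ratio of the slowest rank to the average rank. The helper below is a hypothetical one-liner for that ratio, applied to the 128-rank numbers above; it is not part of the PR.

```cpp
#include <cassert>
#include <cmath>

// Load-imbalance factor: wall time of the slowest rank relative to the
// average over all ranks. A value of 1.0 means perfectly balanced.
inline double imbalanceFactor(double wallMax, double wallAvg) {
    return wallMax / wallAvg;
}
```

For the figures above this gives 1.594 / 0.864 ≈ 1.84 before the fix and 0.214 / 0.211 ≈ 1.01 after it, so the slowest rank went from taking about 84% longer than average to about 1% longer.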
    

Local OpenMP check

The split map records a Mac OpenMP comparison between pr501-pcg and
pr501-communication-particle-update.

Command shape:

mpiexec -x OMP_NUM_THREADS=4 -x OMP_PROC_BIND=false -n 2 \
  ./LandauDamping 32 32 32 20000 5 FFT 0.01 LeapFrog --overallocate 2.0 --info 5

mpiexec -x OMP_NUM_THREADS=2 -x OMP_PROC_BIND=false -n 4 \
  ./LandauDamping 32 32 32 20000 5 FFT 0.01 LeapFrog --overallocate 2.0 --info 5

Observed locally:

  • the LUMI slowdown did not reproduce on Mac/OpenMP,
  • the communication branch was faster for the small CPU/OpenMP test,
  • this pointed toward GPU/MPI-specific receive-side behavior, later confirmed
    by the particleDeserialize diagnosis.

Test coverage added/updated

New or updated tests include:

  • unit_tests/Particle/ParticleUpdate.cpp
  • unit_tests/Particle/ParticleUpdateNonuniform.cpp
  • unit_tests/Particle/ParticleSendRecv.cpp
  • unit_tests/Particle/ParticleBase.cpp
  • unit_tests/Communicate/BufferHandler.cpp
  • particle integration benchmark updates in test/particle.

ParticleUpdateNonuniform.cpp covers ORB/nonuniform layout scenarios including:

  • conservation after ORB repartition and update,
  • burst migration into narrow ranks,
  • repeated ORB repartitions,
  • ranks with zero particles,
  • periodic wrap across ORB boundaries,
  • particle injection between repartitions,
  • corner migration in 3D.

Reviewer Notes

  • This PR intentionally includes the prior PCG allocation split as its base.
    Review the particle/communication changes as the new layer on top of PCG.
  • The branch is larger than the PCG split but still avoids FFT, NUFFT,
    higher-order scatter/gather, and PIF changes.
  • The nghost field layout / halo changes should be reviewed with the particle
    layout changes; they are part of the same correctness surface.
  • The receive pre-reserve fix is central to the LUMI performance recovery.
  • Archive byte-copy kernels are used for device-side serialization paths. If
    further GPU performance issues appear, profile particleDeserCopy and
    Archive::deserialize(offset).

PaulFisch and others added 11 commits May 31, 2026 21:34
Reuse PCG and preconditioner work fields across iterations instead of allocating temporary fields during every solve step. This also passes the solver field by reference through OperatorF to avoid extra halo-related allocations.
Split the communication and particle-update infrastructure from PR501 on top of the PCG split. This brings in reusable communication buffers, page-granular archive allocation, particle attribute serialization hooks, packed particle send IDs, particle sorting buffers, and the rewritten ParticleSpatialLayout update path.

Keep this branch independent from the later interpolation, FFT, NUFFT, and PIF splits by dropping those APIs from the extracted ParticleAttrib changes. Add particle update regression coverage and update existing tests for live-view and page-sized buffer semantics.

Validated with a Debug Serial Kokkos 5.0.0 build: full 1-rank ctest passes, plus ParticleSendRecv, ParticleUpdate, and ParticleUpdateNonuniform pass under mpiexec -n 2.
Consume CUDA/HIP runtime return values in Archive so HIP nodiscard annotations do not trigger warnings. Update particle benchmark/test callers to store ParticleAttrib::getView() by value now that it returns a live subview instead of a stable lvalue reference.
aaadelmann self-assigned this Jun 10, 2026
aaadelmann added the enhancement and cleanup labels Jun 10, 2026