Extract Particle Communication And Update Infrastructure From PR501 by aaadelmann · Pull Request #544 · IPPL-framework/ippl

aaadelmann · 2026-06-10T21:04:50Z

Extract Particle Communication And Update Infrastructure From PR501

Summary

This PR extracts the particle communication/update infrastructure from the large
PR501 branch. It builds on the PCG allocation split and keeps FFT, NUFFT,
higher-order scatter/gather, and PIF examples out of scope.

The main goals are:

modernize particle migration around reusable send/receive buffers,
add direct particle-attribute archive serialization/deserialization,
reduce repeated allocations in receive-side particle unpacking,
add diagnostics for particle update timing,
introduce particle sorting/update support infrastructure,
thread nghost through field layout / halo APIs where particle layout needs consistent ghost-width awareness,
add focused particle update regression tests.

What Changed

Particle migration path

Main files:

src/Particle/ParticleSpatialLayout.h
src/Particle/ParticleSpatialLayout.hpp
src/Particle/ParticleBase.h
src/Particle/ParticleBase.hpp
src/Particle/ParticleAttrib.h
src/Particle/ParticleAttrib.hpp
src/Particle/ParticleAttribBase.h

ParticleSpatialLayout::update() now has a more explicit multi-stage migration
flow:

apply particle boundary conditions,
locate particles that leave the current rank,
exchange send/receive counts,
post sends and receives,
destroy particles that left locally,
wait for all MPI requests,
pre-reserve receive storage,
deserialize received particle attributes.
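The local part of these stages (locating leavers, counting per destination, destroying what left) can be sketched in a single process without MPI. This is a minimal illustration, not the ParticleSpatialLayout code: `Particle`, `ownerRank`, `MigrationPlan`, and `planMigration` are hypothetical names, the decomposition is assumed to be a uniform 1D slab layout, and boundary conditions are assumed to have been applied already.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one particle: a 1D position and an ID.
struct Particle {
    double x;
    int id;
};

// Assumed uniform 1D decomposition: rank r owns [r * width, (r + 1) * width).
// Positions outside the domain are clamped to the first or last rank.
inline int ownerRank(double x, double width, int numRanks) {
    assert(width > 0.0 && numRanks > 0);
    int r = static_cast<int>(x / width);
    if (r < 0) r = 0;
    if (r >= numRanks) r = numRanks - 1;
    return r;
}

struct MigrationPlan {
    std::vector<std::vector<Particle>> sendLists;  // one list per destination rank
    std::vector<std::size_t> sendCounts;           // counts to exchange before posting sends
};

// Locate particles that leave `myRank`, pack them per destination,
// and compact the survivors in place.
inline MigrationPlan planMigration(std::vector<Particle>& local, int myRank,
                                   double width, int numRanks) {
    MigrationPlan plan;
    plan.sendLists.resize(static_cast<std::size_t>(numRanks));
    plan.sendCounts.assign(static_cast<std::size_t>(numRanks), 0);

    std::size_t keep = 0;
    for (std::size_t i = 0; i < local.size(); ++i) {
        const int dest = ownerRank(local[i].x, width, numRanks);
        if (dest == myRank) {
            local[keep++] = local[i];  // survivor stays; relative order is preserved
        } else {
            plan.sendLists[static_cast<std::size_t>(dest)].push_back(local[i]);
            ++plan.sendCounts[static_cast<std::size_t>(dest)];
        }
    }
    local.resize(keep);  // corresponds to "destroy particles that left locally"
    return plan;
}
```

In a real run the `sendCounts` would be exchanged between ranks so that each receiver knows how much storage to prepare before any payload arrives.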

The receive tail is timed explicitly:

particleWait
particleFreeBuffers
particleDeserialize
particleDeserResize
particleDeserCopy

This makes the previously hidden tail of updateParticle visible in profiles.

Receive-side pre-reserve fix

The key performance fix is receive-side pre-reserving of particle attribute
capacity.

Root cause observed on LUMI:

Receive finalizers called ParticleAttrib::deserialize(offset, nrecvs) once per source rank and attribute.
That path called a preserving resize on every call while particles were being received.
On GPU, Kokkos::resize preserves existing entries, so each grow copied the already-live particle storage again, many times within a single update step.
This dominated particleDeserialize and caused a large updateParticle regression.

Fix:

Add ParticleAttrib::reserve(size_type).
Before deferred receive finalizers run, compute final receive capacity once:
```
localAfterDestroy + totalRecvs
```
Reserve every particle attribute once to that capacity.
Keep receive finalizers focused on archive copy/deserialization.

Relevant code:

```
if (totalRecvs > 0) {
    // Grow every attribute once to its final size before any finalizer runs.
    const size_type receiveCapacity = localAfterDestroy + totalRecvs;
    pc.forAllAttributes([&]<typename Attribute>(Attribute*& attribute) {
        attribute->reserve(receiveCapacity);
    });
}
```
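To make the cost difference concrete, the following toy model counts how many already-live elements are copied when storage grows once per source rank versus once per update. It is a sketch under stated assumptions, not the IPPL implementation: `CountingAttrib` and `receiveAll` are hypothetical, host-only stand-ins, and every grow is assumed to copy all live entries, as a preserving resize does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy attribute storage that counts live elements copied on each capacity grow.
// Hypothetical stand-in; it does not model Kokkos or ParticleAttrib details.
class CountingAttrib {
public:
    // Grow capacity to at least n, preserving live entries.
    void reserve(std::size_t n) {
        if (n <= capacity_) return;
        std::vector<double> grown(n);
        for (std::size_t i = 0; i < size_; ++i) grown[i] = data_[i];
        copied_ += size_;
        data_.swap(grown);
        capacity_ = n;
    }

    // Append one received message at offset size().
    void deserialize(const std::vector<double>& incoming) {
        reserve(size_ + incoming.size());  // no-op when capacity was pre-reserved
        for (std::size_t i = 0; i < incoming.size(); ++i) data_[size_ + i] = incoming[i];
        size_ += incoming.size();
    }

    // Set the number of live local particles and reset the copy counter.
    void setLocal(std::size_t n) {
        reserve(n);
        size_   = n;
        copied_ = 0;
    }

    std::size_t size() const { return size_; }
    std::size_t copiedElements() const { return copied_; }

private:
    std::vector<double> data_;
    std::size_t size_ = 0, capacity_ = 0, copied_ = 0;
};

// Receive `numSources` messages of `perSource` particles each and return how
// many already-live elements were copied because of capacity growth.
inline std::size_t receiveAll(std::size_t localAfterDestroy, std::size_t numSources,
                              std::size_t perSource, bool preReserve) {
    assert(perSource > 0);
    CountingAttrib attrib;
    attrib.setLocal(localAfterDestroy);
    if (preReserve) attrib.reserve(localAfterDestroy + numSources * perSource);
    const std::vector<double> message(perSource, 1.0);
    for (std::size_t s = 0; s < numSources; ++s) attrib.deserialize(message);
    return attrib.copiedElements();
}
```

With 1000 local particles and 4 sources of 100 particles each, the per-source path copies 1000 + 1100 + 1200 + 1300 = 4600 live elements, while the pre-reserved path copies the 1000 local elements once. The gap widens with the number of source ranks, since the per-source total grows quadratically in that number.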

Communication archive and buffer handling

Main files:

src/Communicate/Archive.h
src/Communicate/Archive.hpp
src/Communicate/BufferHandler.hpp
src/Communicate/Buffers.*
src/Communicate/Communicator.*
src/Communicate/LogEntry.*
src/Communicate/LoggingBufferHandler.h

The branch adds/updates:

archive serialization APIs for direct particle attribute transfer,
hash-based serialization for selected particles,
receive-side deserialization with offsets,
shared/default buffer handler plumbing,
buffer/logging diagnostics,
HIP/CUDA allocation handling in Archive.

Notable HIP detail:

Archive rounds HIP GPU allocations to 64 KiB granularity to satisfy HSA IPC requirements used by Cray MPICH for large GPU transfers.
hipFree / cudaFree return values are intentionally cast to void to avoid warning noise from nodiscard return values in destructors/free paths.
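The rounding itself is ordinary round-up-to-multiple arithmetic. A minimal sketch follows; the helper name `roundUpToGranularity` and the constant `kGranularity` are hypothetical and not taken from Archive.

```cpp
#include <cassert>
#include <cstddef>

// 64 KiB, the granularity mentioned above.
constexpr std::size_t kGranularity = 64 * 1024;

// Round `bytes` up to the next multiple of `granularity`.
// Hypothetical helper; a request of zero bytes stays zero.
inline std::size_t roundUpToGranularity(std::size_t bytes,
                                        std::size_t granularity = kGranularity) {
    assert(granularity > 0);
    return ((bytes + granularity - 1) / granularity) * granularity;
}
```

Under this scheme a 1-byte request becomes 65536 bytes and a 65537-byte request becomes 131072 bytes, so tests that check buffer sizes need to expect granularity-sized values rather than the requested size.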

Particle sorting infrastructure

Main files:

src/Particle/ParticleSort.h
src/Particle/SortBuffer.h

The branch introduces reusable sorting buffers and particle sort helpers used by
the new spatial update path. The buffers grow on demand and are reused to avoid
allocation churn.

Field layout / halo `nghost` plumbing

Main files:

src/FieldLayout/FieldLayout.h
src/FieldLayout/FieldLayout.hpp
src/FieldLayout/SubFieldLayout.hpp
src/Field/HaloCells.h
src/Field/HaloCells.hpp
src/Field/BareField.hpp

The particle layout changes require consistent ghost-width awareness when field
layouts and halo neighbor regions are computed. This PR threads nghost through
the relevant field layout / halo APIs.

This is small in line count but important for correctness; reviewers should
look at it together with the particle layout changes.

Utility support

Main files:

src/Utility/BufferView.h
src/Utility/ParallelDispatch.h
src/Utility/Tuning.h
src/Utility/TypeUtils.h
src/Utility/IpplTimings.*
src/Utility/Timer.*

The utility changes provide reusable support for:

buffer views,
dispatch helpers,
tuning infrastructure,
richer timing diagnostics needed by the particle update investigation.

ALPINE Kokkos view lifetime fixes

The ALPINE managers no longer take addresses of temporary Kokkos view handles
returned by getView().

Changed files:

alpine/LandauDampingManager.h
alpine/BumponTailInstabilityManager.h
alpine/PenningTrapManager.h

Before:

```
view_type* R = &(this->pcontainer_m->R.getView());
samplingR.generate(*R, rand_pool64);
```

After:

```
view_type R = this->pcontainer_m->R.getView();
samplingR.generate(R, rand_pool64);
```

This avoids dangling pointers/references to temporary view handles and fixes
builds on compilers/backends that reject taking the address of a temporary Kokkos view.
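Why storing the handle by value is safe can be shown with a toy reference-counted handle. This is only an analogy: `ToyView`, `ToyAttrib`, and `fill` are hypothetical, and the sketch assumes, as the fix above does, that copying a view handle shares the underlying storage rather than duplicating it.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical handle type: copying it shares the storage it refers to.
class ToyView {
public:
    explicit ToyView(std::shared_ptr<std::vector<double>> s) : storage_(std::move(s)) {
        assert(storage_ != nullptr);
    }
    double& operator()(std::size_t i) const { return (*storage_)[i]; }
    std::size_t extent() const { return storage_->size(); }

private:
    std::shared_ptr<std::vector<double>> storage_;
};

class ToyAttrib {
public:
    explicit ToyAttrib(std::size_t n)
        : storage_(std::make_shared<std::vector<double>>(n, 0.0)) {}

    // Returns the handle by value, so the call expression is a temporary.
    // Writing `&attrib.getView()` would take the address of that temporary,
    // which standard C++ rejects.
    ToyView getView() const { return ToyView(storage_); }

private:
    std::shared_ptr<std::vector<double>> storage_;
};

// Write through the handle; the attribute's storage is what gets modified.
inline void fill(const ToyView& v, double value) {
    for (std::size_t i = 0; i < v.extent(); ++i) v(i) = value;
}
```

A caller that writes `ToyView view = attrib.getView(); fill(view, 2.5);` modifies the attribute's own data, because the by-value copy is still a handle to the same storage.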

Validation And Performance Evidence

LUMI Results (ALPS will follow)

| Benchmark | Problem size | Nodes | Ranks | master | pr501-pcg | pr501-com | pr501-fft | pr501-hosg | pr501-nufft |
|---|---|---|---|---|---|---|---|---|---|
| FEM | `513_10` | 8 | 64 | 28.23 | 27.77 (-2%) | 27.71 (-2%) | 27.88 (-1%) | 28.23 (0%) | 27.95 (-1%) |
| FFT | `512_10` | 4 | 32 | 4.39 | 4.41 (0%) | 11.54 (+163%) | 11.53 (+162%) | 11.59 (+164%) | 11.53 (+162%) |
| FFT | `512_10` | 16 | 128 | 1.65 | 1.65 (0%) | 2.88 (+74%) | 2.88 (+75%) | 2.93 (+78%) | 2.93 (+78%) |
| PCG | `512_10` | 1 | 8 | 72.31 | 70.01 (-3%) | | 76.99 (+6%) | | 76.75 (+6%) |
| PCG | `512_10` | 4 | 32 | 34.60 | 32.93 (-5%) | 43.71 (+26%) | 33.62 (-3%) | 36.32 (+5%) | 33.55 (-3%) |
| PCG | `512_10` | 64 | 512 | 25.00 | 23.85 (-5%) | 22.89 (-8%) | 22.40 (-10%) | 22.74 (-9%) | 22.58 (-10%) |

The original symptom was a large updateParticle regression moving from
pr501-pcg to the communication/particle-update split. Initial child timers
looked small because deferred receive finalization/deserialization was hidden in
the tail of updateParticle.

Diagnostics split that tail into:

particleWait,
particleFreeBuffers,
particleDeserialize.

The regression was traced to particleDeserialize, not MPI wait time.

LUMI before/after pre-reserve fix

Recorded in PR501_SPLIT_MAP.md:

| ranks | metric | before pre-reserve | after pre-reserve | improvement |
|---|---|---|---|---|
| 32 | `updateParticle` wall max | 8.08981 | 0.338958 | 23.9x |
| 32 | `particleDeserialize` wall max | 7.88320 | 0.0106297 | 741x |
| 128 | `updateParticle` wall max | 1.59376 | 0.214328 | 7.4x |
| 128 | `particleDeserialize` wall max | 1.40305 | 0.0132888 | 106x |

Post-fix timer split:

| ranks | `particleDeserialize` wall max | `particleDeserResize` wall max | `particleDeserCopy` wall max |
|---|---|---|---|
| 32 | 0.0106297 | 0.0000129 | 0.0105913 |
| 128 | 0.0132888 | 0.0000196 | 0.0132491 |

Interpretation:

The repeated preserving resize was the dominant regression.
After pre-reserving, deserialize time is small and almost entirely actual
archive copy.

At 128 ranks, updateParticle became balanced:

before max/avg/min = 1.594 / 0.864 / 0.811
after  max/avg/min = 0.214 / 0.211 / 0.207
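One way to quantify "balanced" is the ratio of the maximum to the average timer value, where 1.0 means perfectly even. The sketch below applies that ratio to the figures above; `TimerStats` and `imbalance` are hypothetical names, and max/avg is just one common choice of imbalance metric, not one the PR itself defines.

```cpp
#include <cassert>
#include <cmath>

// Timer summary across ranks; field names are illustrative.
struct TimerStats {
    double maxTime;
    double avgTime;
    double minTime;
};

// Load imbalance as max/avg: 1.0 is perfectly balanced, larger is worse.
inline double imbalance(const TimerStats& t) {
    assert(t.avgTime > 0.0);
    return t.maxTime / t.avgTime;
}
```

For the 128-rank numbers this gives roughly 1.594 / 0.864 ≈ 1.84 before the fix and 0.214 / 0.211 ≈ 1.01 after it, so the slowest rank went from taking about 84% longer than average to about 1% longer.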

Local OpenMP check

The split map records a Mac OpenMP comparison between pr501-pcg and
pr501-communication-particle-update.

Command shape:

```
mpiexec -x OMP_NUM_THREADS=4 -x OMP_PROC_BIND=false -n 2 \
  ./LandauDamping 32 32 32 20000 5 FFT 0.01 LeapFrog --overallocate 2.0 --info 5

mpiexec -x OMP_NUM_THREADS=2 -x OMP_PROC_BIND=false -n 4 \
  ./LandauDamping 32 32 32 20000 5 FFT 0.01 LeapFrog --overallocate 2.0 --info 5
```

Observed locally:

the LUMI slowdown did not reproduce on Mac/OpenMP,
the communication branch was faster for the small CPU/OpenMP test,
this pointed toward GPU/MPI-specific receive-side behavior, later confirmed by the particleDeserialize diagnosis.

Test coverage added/updated

New or updated tests include:

unit_tests/Particle/ParticleUpdate.cpp
unit_tests/Particle/ParticleUpdateNonuniform.cpp
unit_tests/Particle/ParticleSendRecv.cpp
unit_tests/Particle/ParticleBase.cpp
unit_tests/Communicate/BufferHandler.cpp
particle integration benchmark updates in test/particle.

ParticleUpdateNonuniform.cpp covers ORB/nonuniform layout scenarios including:

conservation after ORB repartition and update,
burst migration into narrow ranks,
repeated ORB repartitions,
ranks with zero particles,
periodic wrap across ORB boundaries,
particle injection between repartitions,
corner migration in 3D.

Reviewer Notes

This PR intentionally includes the prior PCG allocation split as its base.
Review the particle/communication changes as the new layer on top of PCG.
The branch is larger than the PCG split but still avoids FFT, NUFFT, higher-order scatter/gather, and PIF changes.
The nghost field layout / halo changes should be reviewed with the particle layout changes; they are part of the same correctness surface.
The receive pre-reserve fix is central to the LUMI performance recovery.
Archive byte-copy kernels are used for device-side serialization paths. If further GPU performance issues appear, profile particleDeserCopy and Archive::deserialize(offset).

Reuse PCG and preconditioner work fields across iterations instead of allocating temporary fields during every solve step. This also passes the solver field by reference through OperatorF to avoid extra halo-related allocations.

Split the communication and particle-update infrastructure from PR501 on top of the PCG split. This brings in reusable communication buffers, page-granular archive allocation, particle attribute serialization hooks, packed particle send IDs, particle sorting buffers, and the rewritten ParticleSpatialLayout update path. Keep this branch independent from the later interpolation, FFT, NUFFT, and PIF splits by dropping those APIs from the extracted ParticleAttrib changes. Add particle update regression coverage and update existing tests for live-view and page-sized buffer semantics. Validated with a Debug Serial Kokkos 5.0.0 build: full 1-rank ctest passes, plus ParticleSendRecv, ParticleUpdate, and ParticleUpdateNonuniform pass under mpiexec -n 2.

Consume CUDA/HIP runtime return values in Archive so HIP nodiscard annotations do not trigger warnings. Update particle benchmark/test callers to store ParticleAttrib::getView() by value now that it returns a live subview instead of a stable lvalue reference.

PaulFisch and others added 11 commits May 31, 2026 21:34

Hoist PCG allocations out of solve loop

60cf99e

Reuse PCG and preconditioner work fields across iterations instead of allocating temporary fields during every solve step. This also passes the solver field by reference through OperatorF to avoid extra halo-related allocations.

this is the how to split

87a546e

Fix ALPINE particle view handle lifetimes

855b2ef

Add particle update diagnostic timers

816e1a5

Add particle update load diagnostics

6741b4e

Pre-reserve particle receive storage

ed6b926

Clean particle deserialize diagnostics

484c5e2

Document particle receive pre-reserve fix

b800e3f

Merge remote-tracking branch 'origin/master' into pr501-communication…

478e379

…-particle-update

aaadelmann self-assigned this Jun 10, 2026

aaadelmann added the enhancement and cleanup labels Jun 10, 2026

aaadelmann wants to merge 11 commits into master from pr501-communication-particle-update
