Pr501 pcg by aaadelmann · Pull Request #543 · IPPL-framework/ippl

aaadelmann · 2026-06-10T20:45:58Z

PCG/Preconditioner

Summary

This PR extracts the PCG allocation/performance part from the larger PR501 work.
It keeps the scope intentionally small:

reuse CG/PCG work fields instead of constructing them inside every solve,
make preconditioners write into caller-provided output fields,
pass solver operator fields by reference instead of by value,
keep preconditioner scratch fields synchronized with layout changes,
fix ALPINE particle initialization code that took pointers/references to temporary Kokkos views.

The primary motivation is to remove repeated allocation and avoid repeated
halo-buffer growth on copied fields in the PCG/preconditioner path. This was
observed as repeated allocation work in multi-rank profiling.

What Changed

PCG workspaces are reused

src/LinearSolvers/PCG.h

In the CG/PCG algorithm these are:

r: the current residual, initially rhs - A * lhs,
d: the current search direction,
q: the operator-applied search direction, i.e. A * d.
PCG adds reusable s and pcond_out buffers.
Here s stores the preconditioned residual M^{-1} r, and pcond_out
stores the initial preconditioner output before it is folded into d.
PCG::operator() refreshes workspace layouts instead of constructing local
fields on every solve.
op_m(d) now assigns into the existing q storage instead of forcing a
deepCopy().
Preconditioner applications now write into pcond_out / s instead of
returning freshly allocated fields.

Important BC detail:

pcond_out and s intentionally keep their default NoBcFace boundary
conditions. This preserves the old behavior where the preconditioner returned
a fresh field without periodic BCs. If these buffers inherited periodic BCs,
preconditioner-internal operators could trigger extra PeriodicFace::apply
MPI calls and diverge from the expected multi-rank communication sequence.

Preconditioners now write into caller-owned output

src/LinearSolvers/Preconditioner.h

The base preconditioner interface changed from returning Field to:
```
void operator()(Field& input, Field& result)
```
Concrete preconditioners now reuse member scratch fields where possible.
init_fields() allocates scratch fields once and updates their layout on
subsequent calls.
Several preconditioners still keep necessary deepCopy() calls where
expression aliasing can occur, especially for field-valued inverse-diagonal
operators.

Covered preconditioners include:

identity/default,
Jacobi,
polynomial Newton,
polynomial Chebyshev,
Richardson,
alternative Richardson,
Gauss-Seidel,
SSOR.

Solver operators take fields by reference

src/LinearSolvers/PCG.h
src/LinearSolvers/Preconditioner.h
src/PoissonSolvers/PoissonCG.h

The solver operator wrapper now takes the field by reference:

#define IPPL_SOLVER_OPERATOR_WRAPPER(fun, type) \
    [](type& arg) {                             \
        return fun(arg);                        \
    }

This matters because a by-value Field copy shares the underlying data views but
does not share all mutable side state, notably halo exchange buffers that may be
grown through Kokkos::realloc. When the operator acts on a copied field, the
copy can grow its halo buffer without updating the original. In multi-rank runs
this showed up as repeated allocation work.

PoissonCG.h no longer redefines the wrapper locally. Keeping a single
by-reference wrapper avoids silently reintroducing the by-value behavior.

ALPINE Kokkos view lifetime fixes

The ALPINE managers no longer take addresses of temporary view handles returned
by getView().

Changed files:

alpine/LandauDampingManager.h
alpine/BumponTailInstabilityManager.h
alpine/PenningTrapManager.h

Before:

view_type* R = &(this->pcontainer_m->R.getView());
samplingR.generate(*R, rand_pool64);

After:

view_type R = this->pcontainer_m->R.getView();
samplingR.generate(R, rand_pool64);

This fixes compilers/backends that reject taking the address of a temporary
Kokkos view object and avoids dangling view-handle pointers.

Documentation Check

The implementation is documented in the places where review context matters:

Preconditioner.h documents why the operator wrapper must take fields by
reference.
PCG.h documents why workspaces are reused and why pcond_out / s keep
NoBcFace boundary conditions.
PoissonCG.h documents why the local wrapper definition was removed.
PR501_SPLIT_MAP.md documents the split rationale and dependencies between
the planned PR501 follow-up branches.

No user-facing manual update appears necessary because this PR changes internal
solver allocation behavior and preserves the public PCG/Poisson solver behavior.

Validation

Original split-map validation

The PCG split prototype was built and tested with unit tests:

100% tests passed, 0 tests failed out of 34
Total Test time (real) = 54.17 sec

Solver integration validation

A dedicated solver integration build was configured:

cmake -S . -B build-pr501-pcg-ssor-tests-release \
  -DCMAKE_BUILD_TYPE=Release \
  -DIPPL_PLATFORMS=SERIAL \
  -DIPPL_ENABLE_TESTS=ON \
  -DIPPL_ENABLE_UNIT_TESTS=OFF \
  -DIPPL_ENABLE_SOLVERS=ON \
  -DIPPL_ENABLE_FFT=OFF

Built targets:

cmake --build build-pr501-pcg-ssor-tests-release \
  --target TestCGSolver TestCGSolver_convergence TestSolverDesign -j 8

Results:

TestCGSolver: built and passed through CTest.
TestCGSolver_convergence: compile-only target built successfully.
TestSolverDesign: compile-only target built successfully.

Registered CTest:

ctest --test-dir build-pr501-pcg-ssor-tests-release \
  --output-on-failure -R '^TestCGSolver$'

Result:

TestCGSolver passed

The SSOR path was also run explicitly because the registered CTest uses Jacobi:

./test/solver/TestCGSolver 4 s 2 1 1.57079632679 --info 5

and with MPI:

/opt/homebrew/Cellar/open-mpi/5.0.8/bin/mpiexec -n 2 \
  ./test/solver/TestCGSolver 4 s 2 1 1.57079632679 --info 5

/opt/homebrew/Cellar/open-mpi/5.0.8/bin/mpiexec -n 4 \
  ./test/solver/TestCGSolver 4 s 2 1 1.57079632679 --info 5

Results:

ranks	residual	iterations
1	`1.514653902775893e-13`	42
2	`1.196970356284836e-13`	44
4	`1.588802017354866e-13`	38

The same explicit SSOR test was temporarily rerun with the preconditioner
changes reverted. Residuals and iteration counts were identical for 1, 2, and 4
ranks.

Reviewer Notes

The key semantic change is the preconditioner API: preconditioners now write
into a supplied output field rather than returning a new field.
The by-reference solver operator wrapper is intentional and important for halo
buffer reuse.
pcond_out and s should not inherit periodic BCs; that would change the MPI
call pattern inside preconditioner operator chains.
Some deepCopy() calls remain intentionally where expression aliasing can
otherwise corrupt field-valued preconditioner operations.

Reuse PCG and preconditioner work fields across iterations instead of allocating temporary fields during every solve step. This also passes the solver field by reference through OperatorF to avoid extra halo-related allocations.

s-mayani · 2026-06-11T09:18:45Z

Two things I don't understand:

"fix ALPINE particle initialization code that took pointers/references to temporary Kokkos views."
"pcond_out and s intentionally keep their default NoBcFace boundary
conditions. This preserves the old behavior where the preconditioner returned
a fresh field without periodic BCs. If these buffers inherited periodic BCs,
preconditioner-internal operators could trigger extra PeriodicFace::apply
MPI calls and diverge from the expected multi-rank communication sequence."

For the first one, I don't think I understand why this change is necessary. @srikrrish do you know?

For the second one, I am a bit confused by the wording, and I also think the comment in the code regarding this could be made more informative for future developers, especially considering (if I understood correctly) that this was the source of deadlocks in PCG.

s-mayani · 2026-06-11T09:20:33Z

+         * pcond_out so operator() does not allocate per solve. pcond_out keeps
+         * its default NoBcFace BCs: the preconditioner's internal operator
+         * chain therefore never triggers PeriodicFace::apply MPI calls -- if it
+         * did, the global MPI sequence would diverge from the master code path
+         * (where the preconditioner returned a fresh NoBcFace field) and
+         * intermittent multi-rank halo deadlocks would follow.


I do not understand this comment... I think it could be made more informative.

s-mayani · 2026-06-11T09:23:05Z

+            // r, d, q come from the CG base class; s and pcond_out are PCG
+            // members. All are pre-allocated via initializeFields(); operator()
+            // only refreshes their layout so a load-balance repartition is
+            // tracked. The preconditioner scratch must follow the lhs layout
+            // too, otherwise its halo-exchange neighbor list goes out of sync
+            // with r/d/s/q after a repartition and halo MPI calls deadlock.


too verbose, can change to
"Field layouts are updated such that we don't keep a stale domain decomposition if the lhs layout has been changed during the simulation, for example by the load balancer".

Yes maybe to verbose for the specialist but I personally like it and furthermore students maybe appreciate it very much

s-mayani · 2026-06-11T09:30:14Z

+            // Each preconditioner's init_fields() is responsible for being
+            // cheap on the steady-state path (refreshing layout in-place, not
+            // reallocating).


I don't understand this comment. Isn't the "refreshing layout in-place" is done by the lines of code above (updateLayout)? What does steady-state path mean?

s-mayani · 2026-06-11T09:31:24Z

+            // pcond_out keeps its default NoBcFace BCs so the preconditioner's
+            // internal operator chain does not trigger PeriodicFace::apply MPI;
+            // the expression assignment then folds it into d (which gets the
+            // periodic BCs via setFieldBC below).


I don't understand this comment

s-mayani · 2026-06-11T09:37:32Z

+        // NoBcFace BCs so the preconditioner's internal operator chain does
+        // NOT trigger PeriodicFace::apply MPI calls.


again I am not sure I understand this

s-mayani · 2026-06-11T09:47:11Z

+        // Recursive Newton expansion P_k(u) where:
+        //   P_0(u) = eta_0 * u
+        //   P_k(u) = eta_k * (2 P_{k-1}(u) - P_{k-1}(A P_{k-1}(u)))
+        // Writes the result of level `level` into `out`. Uses one scratch
+        // field per recursion depth (Pr, PA, PAPr) so that no Field is
+        // allocated per call. The two recursive calls at each level execute
+        // sequentially and may reuse scratch at lower depths because the
+        // first call's lower-depth values have already been folded into
+        // Pr_scratch[level] before the second call begins.
+        void recursive_preconditioner(Field& u, unsigned int level, Field& out) {
+            // Timers accumulated across all levels of the recursion: useful
+            // for "how much of pcond is laplace vs combine vs leaf-assign".
+            // Counts include every call at every level (so the call-count
+            // column equals 2^(level_m+1)-1, not the number of operator()
+            // invocations).
+            static IpplTimings::TimerRef t_total = IpplTimings::getTimer("newton/total");
+            static IpplTimings::TimerRef t_leaf  = IpplTimings::getTimer("newton/leaf_assign");
+            static IpplTimings::TimerRef t_op    = IpplTimings::getTimer("newton/op_m");
+            static IpplTimings::TimerRef t_comb  = IpplTimings::getTimer("newton/combine");
+
+            IpplTimings::startTimer(t_total);
+            if (level == 0) {
+                IpplTimings::startTimer(t_leaf);
+                out = eta_m[0] * u;
+                IpplTimings::stopTimer(t_leaf);
+                IpplTimings::stopTimer(t_total);
+                return;
+            }
+            Field& Pr   = Pr_scratch_m[level];
+            Field& PA   = PA_scratch_m[level];
+            Field& PAPr = PAPr_scratch_m[level];
+
+            // Inner recursive call owns its own timing; we don't double-count
+            // it here.
+            recursive_preconditioner(u, level - 1, Pr);
+            IpplTimings::startTimer(t_op);
+            PA = op_m(Pr);
+            IpplTimings::stopTimer(t_op);
+            recursive_preconditioner(PA, level - 1, PAPr);
+            IpplTimings::startTimer(t_comb);
+            out = eta_m[level] * (2.0 * Pr - PAPr);
+            IpplTimings::stopTimer(t_comb);
+            IpplTimings::stopTimer(t_total);
+        }


I think all these changes are not necessary, I am not even sure why this recursive preconditioner was touched.

s-mayani · 2026-06-11T09:48:37Z

+            // One outer timer per polynomial_newton apply, complementing the
+            // per-call recursion timers. Lets us see how many outer pcond
+            // calls a solve does vs. the time inside each.
+            static IpplTimings::TimerRef t_apply = IpplTimings::getTimer("newton/apply");
+            IpplTimings::startTimer(t_apply);
+            recursive_preconditioner(u, level_m, result);
+            IpplTimings::stopTimer(t_apply);


Same for all these extra comments and timers, not necessary changes, and this preconditioner is never used.

s-mayani · 2026-06-11T09:52:03Z

+        std::vector<Field> Pr_scratch_m;
+        std::vector<Field> PA_scratch_m;
+        std::vector<Field> PAPr_scratch_m;
+        bool fields_initialized_m = false;


Isn't it super expensive to store all these temporary fields which are not needed later (needed only for the recursion)?

s-mayani · 2026-06-11T09:54:21Z

+        void operator()(Field& r, Field& result) override {
+            // x_m, x_old_m, A_m, z_m are pre-allocated scratch (init_fields).
+            // Coefficients in rho_m are also computed once.
+            static IpplTimings::TimerRef t_apply = IpplTimings::getTimer("chebyshev/apply");
+            static IpplTimings::TimerRef t_init  = IpplTimings::getTimer("chebyshev/init_step");
+            static IpplTimings::TimerRef t_op    = IpplTimings::getTimer("chebyshev/op_m");
+            static IpplTimings::TimerRef t_combo = IpplTimings::getTimer("chebyshev/combine");
+            static IpplTimings::TimerRef t_copy  = IpplTimings::getTimer("chebyshev/deep_copy");
+
+            IpplTimings::startTimer(t_apply);
+
+            IpplTimings::startTimer(t_init);
+            x_old_m = r / theta_m;
+            IpplTimings::stopTimer(t_init);
+            IpplTimings::startTimer(t_op);
+            A_m     = op_m(r);
+            IpplTimings::stopTimer(t_op);
+            IpplTimings::startTimer(t_combo);
+            x_m     = 2.0 * rho_m[1] / delta_m * (2.0 * r - A_m / theta_m);
+            IpplTimings::stopTimer(t_combo);
+
+            if (degree_m == 0) {
+                // result = x_old
+                IpplTimings::startTimer(t_copy);
+                Kokkos::deep_copy(result.getView(), x_old_m.getView());
+                IpplTimings::stopTimer(t_copy);
+                IpplTimings::stopTimer(t_apply);
+                return;
+            }

-            Field res(mesh, layout);
-            Field x(mesh, layout);
-            Field x_old(mesh, layout);
-            Field A(mesh, layout);
-            Field z(mesh, layout);
+            if (degree_m == 1) {
+                // result = x
+                IpplTimings::startTimer(t_copy);
+                Kokkos::deep_copy(result.getView(), x_m.getView());
+                IpplTimings::stopTimer(t_copy);
+                IpplTimings::stopTimer(t_apply);
+                return;
+            }
+            for (unsigned int i = 2; i < degree_m + 1; ++i) {
+                IpplTimings::startTimer(t_op);
+                A_m   = op_m(x_m);
+                IpplTimings::stopTimer(t_op);
+                IpplTimings::startTimer(t_combo);
+                z_m   = 2.0 / delta_m * (r - A_m);
+                // Write the new x value into result (the caller's buffer);
+                // x_old gets a deep copy of the previous x.
+                result = rho_m[i] * (2 * sigma_m * x_m - rho_m[i - 1] * x_old_m + z_m);
+                IpplTimings::stopTimer(t_combo);
+                IpplTimings::startTimer(t_copy);
+                Kokkos::deep_copy(x_old_m.getView(), x_m.getView());
+                Kokkos::deep_copy(x_m.getView(), result.getView());
+                IpplTimings::stopTimer(t_copy);
+            }
+            IpplTimings::stopTimer(t_apply);
+        }


All these extra timers are not needed in this preconditioner, since we never use it, and the extra comments also don't seems necessary to me.
It makes the algorithm cumbersome to read.

s-mayani · 2026-06-11T10:03:36Z

+            // The running iterate lives in result; UL_m and r_m are scratch
+            // members. The inner deep copies remain because the (upper, lower,
+            // inverse) operators may return Field-valued expressions that
+            // alias their input.


Again same change: "In order for the operator calls to not rewrite on this same field over and over when calling the operators (upper, diag, inverse, lower, etc) we need deep copies to the preconditioner fields."

s-mayani · 2026-06-11T10:04:03Z

+            // result holds the running iterate; Ag_m and g_old_m are scratch
+            // members. The inner deep copies remain because the operators
+            // (op_m, inverse_diagonal_m) may return Field-valued expressions
+            // that alias their input.


Same change:
"In order for the operator calls to not rewrite on this same field over and over when calling the operators (upper, diag, inverse, lower, etc) we need deep copies to the preconditioner fields."

s-mayani · 2026-06-11T10:04:48Z

-                    // The inverse diagonal is applied to the
-                    // vector itself to return the result usually.
-                    // However, the operator for FEM already
-                    // returns the result of inv_diag * itself
-                    // due to the matrix-free evaluation.
-                    // Therefore, we need this if to differentiate
-                    // the two cases.


why was this comment removed?

s-mayani · 2026-06-11T10:04:59Z

-                    // The inverse diagonal is applied to the
-                    // vector itself to return the result usually.
-                    // However, the operator for FEM already
-                    // returns the result of inv_diag * itself
-                    // due to the matrix-free evaluation.
-                    // Therefore, we need this if to differentiate
-                    // the two cases.


Why was this comment removed?

s-mayani · 2026-06-11T10:05:27Z

+            // The running iterate lives in result; UL_m, r_m are scratch
+            // members. The inner deep copies remain because (upper, lower,
+            // inverse, diagonal) may return aliasing Field-valued expressions.


Again same change: "In order for the operator calls to not rewrite on this same field over and over when calling the operators (upper, diag, inverse, lower, etc) we need deep copies to the preconditioner fields."

s-mayani · 2026-06-11T10:05:39Z

-            // In the FEM solver, which uses the preconditioner, 
-            // we re-use a resultField to avoid allocating new
-            // memory at every iteration.
-            // In order for the operator calls to not rewrite
-            // on this same field over and over when calling 
-            // the operators (upper, diag, inverse, lower, etc)
-            // we need deep copies to the preconditioner fields.
-
-            // The inverse diagonal is applied to the
-            // vector itself to return the result usually.
-            // However, the operator for FEM already
-            // returns the result of inv_diag * itself
-            // due to the matrix-free evaluation.
-            // Therefore, we need this if to differentiate
-            // the two cases.


Why was this comment removed?

srikrrish · 2026-06-11T10:18:32Z

This fixes compilers/backends that reject taking the address of a temporary Kokkos view object and avoids dangling view-handle pointers.

I think it is due to the explanation written here
"This fixes compilers/backends that reject taking the address of a temporary
Kokkos view object and avoids dangling view-handle pointers."
I am not sure if these things were observed before or it is just a pre-cautionary measure.

s-mayani · 2026-06-11T11:17:42Z

This fixes compilers/backends that reject taking the address of a temporary Kokkos view object and avoids dangling view-handle pointers.

I think it is due to the explanation written here "This fixes compilers/backends that reject taking the address of a temporary Kokkos view object and avoids dangling view-handle pointers." I am not sure if these things were observed before or it is just a pre-cautionary measure.

Okay thanks!

s-mayani · 2026-06-11T11:36:16Z

In general, I think I disagree with the use of "result buffer" in most of the comments regarding the fields stored in CG and PCG. It might create a confusion with the communication buffers, "results field" would be less ambiguous.

PaulFisch and others added 4 commits May 31, 2026 21:34

Hoist PCG allocations out of solve loop

60cf99e

Reuse PCG and preconditioner work fields across iterations instead of allocating temporary fields during every solve step. This also passes the solver field by reference through OperatorF to avoid extra halo-related allocations.

this is the how to split

87a546e

Fix ALPINE particle view handle lifetimes

f1465da

Merge remote-tracking branch 'origin/master' into pr501-pcg

8b8d250

aaadelmann self-assigned this Jun 10, 2026

aaadelmann added enhancement New feature or request cleanup labels Jun 10, 2026

aaadelmann requested a review from s-mayani June 10, 2026 21:06