Skip to content

Pr501 pcg#543

Open
aaadelmann wants to merge 4 commits into
masterfrom
pr501-pcg
Open

Pr501 pcg#543
aaadelmann wants to merge 4 commits into
masterfrom
pr501-pcg

Conversation

@aaadelmann

Copy link
Copy Markdown
Member

PCG/Preconditioner

Summary

This PR extracts the PCG allocation/performance part from the larger PR501 work.
It keeps the scope intentionally small:

  • reuse CG/PCG work fields instead of constructing them inside every solve,
  • make preconditioners write into caller-provided output fields,
  • pass solver operator fields by reference instead of by value,
  • keep preconditioner scratch fields synchronized with layout changes,
  • fix ALPINE particle initialization code that took pointers/references to temporary Kokkos views.

The primary motivation is to remove repeated allocation and avoid repeated
halo-buffer growth on copied fields in the PCG/preconditioner path. This was
observed as repeated allocation work in multi-rank profiling.

What Changed

PCG workspaces are reused

src/LinearSolvers/PCG.h

In the CG/PCG algorithm these are:

  • r: the current residual, initially rhs - A * lhs,

  • d: the current search direction,

  • q: the operator-applied search direction, i.e. A * d.

  • PCG adds reusable s and pcond_out buffers.
    Here s stores the preconditioned residual M^{-1} r, and pcond_out
    stores the initial preconditioner output before it is folded into d.

  • PCG::operator() refreshes workspace layouts instead of constructing local
    fields on every solve.

  • op_m(d) now assigns into the existing q storage instead of forcing a
    deepCopy().

  • Preconditioner applications now write into pcond_out / s instead of
    returning freshly allocated fields.

Important BC detail:

  • pcond_out and s intentionally keep their default NoBcFace boundary
    conditions. This preserves the old behavior where the preconditioner returned
    a fresh field without periodic BCs. If these buffers inherited periodic BCs,
    preconditioner-internal operators could trigger extra PeriodicFace::apply
    MPI calls and diverge from the expected multi-rank communication sequence.

Preconditioners now write into caller-owned output

src/LinearSolvers/Preconditioner.h

  • The base preconditioner interface changed from returning Field to:

    void operator()(Field& input, Field& result)
  • Concrete preconditioners now reuse member scratch fields where possible.

  • init_fields() allocates scratch fields once and updates their layout on
    subsequent calls.

  • Several preconditioners still keep necessary deepCopy() calls where
    expression aliasing can occur, especially for field-valued inverse-diagonal
    operators.

Covered preconditioners include:

  • identity/default,
  • Jacobi,
  • polynomial Newton,
  • polynomial Chebyshev,
  • Richardson,
  • alternative Richardson,
  • Gauss-Seidel,
  • SSOR.

Solver operators take fields by reference

src/LinearSolvers/PCG.h
src/LinearSolvers/Preconditioner.h
src/PoissonSolvers/PoissonCG.h

The solver operator wrapper now takes the field by reference:

#define IPPL_SOLVER_OPERATOR_WRAPPER(fun, type) \
    [](type& arg) {                             \
        return fun(arg);                        \
    }

This matters because a by-value Field copy shares the underlying data views but
does not share all mutable side state, notably halo exchange buffers that may be
grown through Kokkos::realloc. When the operator acts on a copied field, the
copy can grow its halo buffer without updating the original. In multi-rank runs
this showed up as repeated allocation work.

PoissonCG.h no longer redefines the wrapper locally. Keeping a single
by-reference wrapper avoids silently reintroducing the by-value behavior.

ALPINE Kokkos view lifetime fixes

The ALPINE managers no longer take addresses of temporary view handles returned
by getView().

Changed files:

  • alpine/LandauDampingManager.h
  • alpine/BumponTailInstabilityManager.h
  • alpine/PenningTrapManager.h

Before:

view_type* R = &(this->pcontainer_m->R.getView());
samplingR.generate(*R, rand_pool64);

After:

view_type R = this->pcontainer_m->R.getView();
samplingR.generate(R, rand_pool64);

This fixes compilers/backends that reject taking the address of a temporary
Kokkos view object and avoids dangling view-handle pointers.

Documentation Check

The implementation is documented in the places where review context matters:

  • Preconditioner.h documents why the operator wrapper must take fields by
    reference.
  • PCG.h documents why workspaces are reused and why pcond_out / s keep
    NoBcFace boundary conditions.
  • PoissonCG.h documents why the local wrapper definition was removed.
  • PR501_SPLIT_MAP.md documents the split rationale and dependencies between
    the planned PR501 follow-up branches.

No user-facing manual update appears necessary because this PR changes internal
solver allocation behavior and preserves the public PCG/Poisson solver behavior.

Validation

Original split-map validation

The PCG split prototype was built and tested with unit tests:

100% tests passed, 0 tests failed out of 34
Total Test time (real) = 54.17 sec

Solver integration validation

A dedicated solver integration build was configured:

cmake -S . -B build-pr501-pcg-ssor-tests-release \
  -DCMAKE_BUILD_TYPE=Release \
  -DIPPL_PLATFORMS=SERIAL \
  -DIPPL_ENABLE_TESTS=ON \
  -DIPPL_ENABLE_UNIT_TESTS=OFF \
  -DIPPL_ENABLE_SOLVERS=ON \
  -DIPPL_ENABLE_FFT=OFF

Built targets:

cmake --build build-pr501-pcg-ssor-tests-release \
  --target TestCGSolver TestCGSolver_convergence TestSolverDesign -j 8

Results:

  • TestCGSolver: built and passed through CTest.
  • TestCGSolver_convergence: compile-only target built successfully.
  • TestSolverDesign: compile-only target built successfully.

Registered CTest:

ctest --test-dir build-pr501-pcg-ssor-tests-release \
  --output-on-failure -R '^TestCGSolver$'

Result:

TestCGSolver passed

The SSOR path was also run explicitly because the registered CTest uses Jacobi:

./test/solver/TestCGSolver 4 s 2 1 1.57079632679 --info 5

and with MPI:

/opt/homebrew/Cellar/open-mpi/5.0.8/bin/mpiexec -n 2 \
  ./test/solver/TestCGSolver 4 s 2 1 1.57079632679 --info 5

/opt/homebrew/Cellar/open-mpi/5.0.8/bin/mpiexec -n 4 \
  ./test/solver/TestCGSolver 4 s 2 1 1.57079632679 --info 5

Results:

ranks residual iterations
1 1.514653902775893e-13 42
2 1.196970356284836e-13 44
4 1.588802017354866e-13 38

The same explicit SSOR test was temporarily rerun with the preconditioner
changes reverted. Residuals and iteration counts were identical for 1, 2, and 4
ranks.

Reviewer Notes

  • The key semantic change is the preconditioner API: preconditioners now write
    into a supplied output field rather than returning a new field.
  • The by-reference solver operator wrapper is intentional and important for halo
    buffer reuse.
  • pcond_out and s should not inherit periodic BCs; that would change the MPI
    call pattern inside preconditioner operator chains.
  • Some deepCopy() calls remain intentionally where expression aliasing can
    otherwise corrupt field-valued preconditioner operations.

PaulFisch and others added 4 commits May 31, 2026 21:34
Reuse PCG and preconditioner work fields across iterations instead of allocating temporary fields during every solve step. This also passes the solver field by reference through OperatorF to avoid extra halo-related allocations.
@aaadelmann aaadelmann self-assigned this Jun 10, 2026
@aaadelmann aaadelmann added enhancement New feature or request cleanup labels Jun 10, 2026
@aaadelmann aaadelmann requested a review from s-mayani June 10, 2026 21:06
@s-mayani

Copy link
Copy Markdown
Collaborator

Two things I don't understand:

  • "fix ALPINE particle initialization code that took pointers/references to temporary Kokkos views."
  • "pcond_out and s intentionally keep their default NoBcFace boundary
    conditions. This preserves the old behavior where the preconditioner returned
    a fresh field without periodic BCs. If these buffers inherited periodic BCs,
    preconditioner-internal operators could trigger extra PeriodicFace::apply
    MPI calls and diverge from the expected multi-rank communication sequence."

For the first one, I don't think I understand why this change is necessary. @srikrrish do you know?

For the second one, I am a bit confused by the wording, and I also think the comment in the code regarding this could be made more informative for future developers, especially considering (if I understood correctly) that this was the source of deadlocks in PCG.

Comment thread alpine/BumponTailInstabilityManager.h
Comment thread src/LinearSolvers/PCG.h
Comment on lines +350 to +355
* pcond_out so operator() does not allocate per solve. pcond_out keeps
* its default NoBcFace BCs: the preconditioner's internal operator
* chain therefore never triggers PeriodicFace::apply MPI calls -- if it
* did, the global MPI sequence would diverge from the master code path
* (where the preconditioner returned a fresh NoBcFace field) and
* intermittent multi-rank halo deadlocks would follow.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand this comment... I think it could be made more informative.

Comment thread src/LinearSolvers/PCG.h
Comment on lines +452 to +457
// r, d, q come from the CG base class; s and pcond_out are PCG
// members. All are pre-allocated via initializeFields(); operator()
// only refreshes their layout so a load-balance repartition is
// tracked. The preconditioner scratch must follow the lhs layout
// too, otherwise its halo-exchange neighbor list goes out of sync
// with r/d/s/q after a repartition and halo MPI calls deadlock.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too verbose, can change to
"Field layouts are updated such that we don't keep a stale domain decomposition if the lhs layout has been changed during the simulation, for example by the load balancer".

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes maybe to verbose for the specialist but I personally like it and furthermore students maybe appreciate it very much

Comment thread src/LinearSolvers/PCG.h
Comment on lines +464 to +466
// Each preconditioner's init_fields() is responsible for being
// cheap on the steady-state path (refreshing layout in-place, not
// reallocating).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment. Isn't the "refreshing layout in-place" is done by the lines of code above (updateLayout)? What does steady-state path mean?

Comment thread src/LinearSolvers/PCG.h
Comment on lines +492 to +495
// pcond_out keeps its default NoBcFace BCs so the preconditioner's
// internal operator chain does not trigger PeriodicFace::apply MPI;
// the expression assignment then folds it into d (which gets the
// periodic BCs via setFieldBC below).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment

Comment thread src/LinearSolvers/PCG.h
Comment thread src/LinearSolvers/PCG.h
Comment on lines +544 to +545
// NoBcFace BCs so the preconditioner's internal operator chain does
// NOT trigger PeriodicFace::apply MPI calls.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again I am not sure I understand this

Comment on lines +148 to +191
// Recursive Newton expansion P_k(u) where:
// P_0(u) = eta_0 * u
// P_k(u) = eta_k * (2 P_{k-1}(u) - P_{k-1}(A P_{k-1}(u)))
// Writes the result of level `level` into `out`. Uses one scratch
// field per recursion depth (Pr, PA, PAPr) so that no Field is
// allocated per call. The two recursive calls at each level execute
// sequentially and may reuse scratch at lower depths because the
// first call's lower-depth values have already been folded into
// Pr_scratch[level] before the second call begins.
void recursive_preconditioner(Field& u, unsigned int level, Field& out) {
// Timers accumulated across all levels of the recursion: useful
// for "how much of pcond is laplace vs combine vs leaf-assign".
// Counts include every call at every level (so the call-count
// column equals 2^(level_m+1)-1, not the number of operator()
// invocations).
static IpplTimings::TimerRef t_total = IpplTimings::getTimer("newton/total");
static IpplTimings::TimerRef t_leaf = IpplTimings::getTimer("newton/leaf_assign");
static IpplTimings::TimerRef t_op = IpplTimings::getTimer("newton/op_m");
static IpplTimings::TimerRef t_comb = IpplTimings::getTimer("newton/combine");

IpplTimings::startTimer(t_total);
if (level == 0) {
IpplTimings::startTimer(t_leaf);
out = eta_m[0] * u;
IpplTimings::stopTimer(t_leaf);
IpplTimings::stopTimer(t_total);
return;
}
Field& Pr = Pr_scratch_m[level];
Field& PA = PA_scratch_m[level];
Field& PAPr = PAPr_scratch_m[level];

// Inner recursive call owns its own timing; we don't double-count
// it here.
recursive_preconditioner(u, level - 1, Pr);
IpplTimings::startTimer(t_op);
PA = op_m(Pr);
IpplTimings::stopTimer(t_op);
recursive_preconditioner(PA, level - 1, PAPr);
IpplTimings::startTimer(t_comb);
out = eta_m[level] * (2.0 * Pr - PAPr);
IpplTimings::stopTimer(t_comb);
IpplTimings::stopTimer(t_total);
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all these changes are not necessary, I am not even sure why this recursive preconditioner was touched.

Comment on lines +194 to +200
// One outer timer per polynomial_newton apply, complementing the
// per-call recursion timers. Lets us see how many outer pcond
// calls a solve does vs. the time inside each.
static IpplTimings::TimerRef t_apply = IpplTimings::getTimer("newton/apply");
IpplTimings::startTimer(t_apply);
recursive_preconditioner(u, level_m, result);
IpplTimings::stopTimer(t_apply);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for all these extra comments and timers, not necessary changes, and this preconditioner is never used.

Comment thread src/LinearSolvers/Preconditioner.h
Comment on lines +251 to +254
std::vector<Field> Pr_scratch_m;
std::vector<Field> PA_scratch_m;
std::vector<Field> PAPr_scratch_m;
bool fields_initialized_m = false;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it super expensive to store all these temporary fields which are not needed later (needed only for the recursion)?

Comment on lines +304 to +358
void operator()(Field& r, Field& result) override {
// x_m, x_old_m, A_m, z_m are pre-allocated scratch (init_fields).
// Coefficients in rho_m are also computed once.
static IpplTimings::TimerRef t_apply = IpplTimings::getTimer("chebyshev/apply");
static IpplTimings::TimerRef t_init = IpplTimings::getTimer("chebyshev/init_step");
static IpplTimings::TimerRef t_op = IpplTimings::getTimer("chebyshev/op_m");
static IpplTimings::TimerRef t_combo = IpplTimings::getTimer("chebyshev/combine");
static IpplTimings::TimerRef t_copy = IpplTimings::getTimer("chebyshev/deep_copy");

IpplTimings::startTimer(t_apply);

IpplTimings::startTimer(t_init);
x_old_m = r / theta_m;
IpplTimings::stopTimer(t_init);
IpplTimings::startTimer(t_op);
A_m = op_m(r);
IpplTimings::stopTimer(t_op);
IpplTimings::startTimer(t_combo);
x_m = 2.0 * rho_m[1] / delta_m * (2.0 * r - A_m / theta_m);
IpplTimings::stopTimer(t_combo);

if (degree_m == 0) {
// result = x_old
IpplTimings::startTimer(t_copy);
Kokkos::deep_copy(result.getView(), x_old_m.getView());
IpplTimings::stopTimer(t_copy);
IpplTimings::stopTimer(t_apply);
return;
}

Field res(mesh, layout);
Field x(mesh, layout);
Field x_old(mesh, layout);
Field A(mesh, layout);
Field z(mesh, layout);
if (degree_m == 1) {
// result = x
IpplTimings::startTimer(t_copy);
Kokkos::deep_copy(result.getView(), x_m.getView());
IpplTimings::stopTimer(t_copy);
IpplTimings::stopTimer(t_apply);
return;
}
for (unsigned int i = 2; i < degree_m + 1; ++i) {
IpplTimings::startTimer(t_op);
A_m = op_m(x_m);
IpplTimings::stopTimer(t_op);
IpplTimings::startTimer(t_combo);
z_m = 2.0 / delta_m * (r - A_m);
// Write the new x value into result (the caller's buffer);
// x_old gets a deep copy of the previous x.
result = rho_m[i] * (2 * sigma_m * x_m - rho_m[i - 1] * x_old_m + z_m);
IpplTimings::stopTimer(t_combo);
IpplTimings::startTimer(t_copy);
Kokkos::deep_copy(x_old_m.getView(), x_m.getView());
Kokkos::deep_copy(x_m.getView(), result.getView());
IpplTimings::stopTimer(t_copy);
}
IpplTimings::stopTimer(t_apply);
}

@s-mayani s-mayani Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these extra timers are not needed in this preconditioner, since we never use it, and the extra comments also don't seems necessary to me.
It makes the algorithm cumbersome to read.

Comment thread src/LinearSolvers/Preconditioner.h
Comment on lines +564 to +567
// The running iterate lives in result; UL_m and r_m are scratch
// members. The inner deep copies remain because the (upper, lower,
// inverse) operators may return Field-valued expressions that
// alias their input.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again same change: "In order for the operator calls to not rewrite on this same field over and over when calling the operators (upper, diag, inverse, lower, etc) we need deep copies to the preconditioner fields."

Comment on lines +493 to +496
// result holds the running iterate; Ag_m and g_old_m are scratch
// members. The inner deep copies remain because the operators
// (op_m, inverse_diagonal_m) may return Field-valued expressions
// that alias their input.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same change:
"In order for the operator calls to not rewrite on this same field over and over when calling the operators (upper, diag, inverse, lower, etc) we need deep copies to the preconditioner fields."

Comment on lines -462 to -468
// The inverse diagonal is applied to the
// vector itself to return the result usually.
// However, the operator for FEM already
// returns the result of inv_diag * itself
// due to the matrix-free evaluation.
// Therefore, we need this if to differentiate
// the two cases.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was this comment removed?

Comment on lines -482 to -488
// The inverse diagonal is applied to the
// vector itself to return the result usually.
// However, the operator for FEM already
// returns the result of inv_diag * itself
// due to the matrix-free evaluation.
// Therefore, we need this if to differentiate
// the two cases.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this comment removed?

Comment on lines +653 to +655
// The running iterate lives in result; UL_m, r_m are scratch
// members. The inner deep copies remain because (upper, lower,
// inverse, diagonal) may return aliasing Field-valued expressions.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again same change: "In order for the operator calls to not rewrite on this same field over and over when calling the operators (upper, diag, inverse, lower, etc) we need deep copies to the preconditioner fields."

Comment on lines -557 to -571
// In the FEM solver, which uses the preconditioner,
// we re-use a resultField to avoid allocating new
// memory at every iteration.
// In order for the operator calls to not rewrite
// on this same field over and over when calling
// the operators (upper, diag, inverse, lower, etc)
// we need deep copies to the preconditioner fields.

// The inverse diagonal is applied to the
// vector itself to return the result usually.
// However, the operator for FEM already
// returns the result of inv_diag * itself
// due to the matrix-free evaluation.
// Therefore, we need this if to differentiate
// the two cases.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this comment removed?

Comment thread src/PoissonSolvers/PoissonCG.h
Comment thread PR501_SPLIT_MAP.md
@srikrrish

Copy link
Copy Markdown
Member

This fixes compilers/backends that reject taking the address of a temporary Kokkos view object and avoids dangling view-handle pointers.

I think it is due to the explanation written here
"This fixes compilers/backends that reject taking the address of a temporary
Kokkos view object and avoids dangling view-handle pointers."
I am not sure if these things were observed before or it is just a pre-cautionary measure.

@s-mayani

Copy link
Copy Markdown
Collaborator

This fixes compilers/backends that reject taking the address of a temporary Kokkos view object and avoids dangling view-handle pointers.

I think it is due to the explanation written here "This fixes compilers/backends that reject taking the address of a temporary Kokkos view object and avoids dangling view-handle pointers." I am not sure if these things were observed before or it is just a pre-cautionary measure.

Okay thanks!

@s-mayani

s-mayani commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

In general, I think I disagree with the use of "result buffer" in most of the comments regarding the fields stored in CG and PCG. It might create a confusion with the communication buffers, "results field" would be less ambiguous.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cleanup enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants