Pr501 pcg#543
Conversation
Reuse PCG and preconditioner work fields across iterations instead of allocating temporary fields during every solve step. This also passes the solver field by reference through OperatorF to avoid extra halo-related allocations.
|
Two things I don't understand:
For the first one, I don't think I understand why this change is necessary. @srikrrish do you know? For the second one, I am a bit confused by the wording, and I also think the comment in the code regarding this could be made more informative for future developers, especially considering (if I understood correctly) that this was the source of deadlocks in PCG. |
| * pcond_out so operator() does not allocate per solve. pcond_out keeps | ||
| * its default NoBcFace BCs: the preconditioner's internal operator | ||
| * chain therefore never triggers PeriodicFace::apply MPI calls -- if it | ||
| * did, the global MPI sequence would diverge from the master code path | ||
| * (where the preconditioner returned a fresh NoBcFace field) and | ||
| * intermittent multi-rank halo deadlocks would follow. |
There was a problem hiding this comment.
I do not understand this comment... I think it could be made more informative.
| // r, d, q come from the CG base class; s and pcond_out are PCG | ||
| // members. All are pre-allocated via initializeFields(); operator() | ||
| // only refreshes their layout so a load-balance repartition is | ||
| // tracked. The preconditioner scratch must follow the lhs layout | ||
| // too, otherwise its halo-exchange neighbor list goes out of sync | ||
| // with r/d/s/q after a repartition and halo MPI calls deadlock. |
There was a problem hiding this comment.
too verbose, can change to
"Field layouts are updated such that we don't keep a stale domain decomposition if the lhs layout has been changed during the simulation, for example by the load balancer".
There was a problem hiding this comment.
Yes maybe to verbose for the specialist but I personally like it and furthermore students maybe appreciate it very much
| // Each preconditioner's init_fields() is responsible for being | ||
| // cheap on the steady-state path (refreshing layout in-place, not | ||
| // reallocating). |
There was a problem hiding this comment.
I don't understand this comment. Isn't the "refreshing layout in-place" is done by the lines of code above (updateLayout)? What does steady-state path mean?
| // pcond_out keeps its default NoBcFace BCs so the preconditioner's | ||
| // internal operator chain does not trigger PeriodicFace::apply MPI; | ||
| // the expression assignment then folds it into d (which gets the | ||
| // periodic BCs via setFieldBC below). |
There was a problem hiding this comment.
I don't understand this comment
| // NoBcFace BCs so the preconditioner's internal operator chain does | ||
| // NOT trigger PeriodicFace::apply MPI calls. |
There was a problem hiding this comment.
again I am not sure I understand this
| // Recursive Newton expansion P_k(u) where: | ||
| // P_0(u) = eta_0 * u | ||
| // P_k(u) = eta_k * (2 P_{k-1}(u) - P_{k-1}(A P_{k-1}(u))) | ||
| // Writes the result of level `level` into `out`. Uses one scratch | ||
| // field per recursion depth (Pr, PA, PAPr) so that no Field is | ||
| // allocated per call. The two recursive calls at each level execute | ||
| // sequentially and may reuse scratch at lower depths because the | ||
| // first call's lower-depth values have already been folded into | ||
| // Pr_scratch[level] before the second call begins. | ||
| void recursive_preconditioner(Field& u, unsigned int level, Field& out) { | ||
| // Timers accumulated across all levels of the recursion: useful | ||
| // for "how much of pcond is laplace vs combine vs leaf-assign". | ||
| // Counts include every call at every level (so the call-count | ||
| // column equals 2^(level_m+1)-1, not the number of operator() | ||
| // invocations). | ||
| static IpplTimings::TimerRef t_total = IpplTimings::getTimer("newton/total"); | ||
| static IpplTimings::TimerRef t_leaf = IpplTimings::getTimer("newton/leaf_assign"); | ||
| static IpplTimings::TimerRef t_op = IpplTimings::getTimer("newton/op_m"); | ||
| static IpplTimings::TimerRef t_comb = IpplTimings::getTimer("newton/combine"); | ||
|
|
||
| IpplTimings::startTimer(t_total); | ||
| if (level == 0) { | ||
| IpplTimings::startTimer(t_leaf); | ||
| out = eta_m[0] * u; | ||
| IpplTimings::stopTimer(t_leaf); | ||
| IpplTimings::stopTimer(t_total); | ||
| return; | ||
| } | ||
| Field& Pr = Pr_scratch_m[level]; | ||
| Field& PA = PA_scratch_m[level]; | ||
| Field& PAPr = PAPr_scratch_m[level]; | ||
|
|
||
| // Inner recursive call owns its own timing; we don't double-count | ||
| // it here. | ||
| recursive_preconditioner(u, level - 1, Pr); | ||
| IpplTimings::startTimer(t_op); | ||
| PA = op_m(Pr); | ||
| IpplTimings::stopTimer(t_op); | ||
| recursive_preconditioner(PA, level - 1, PAPr); | ||
| IpplTimings::startTimer(t_comb); | ||
| out = eta_m[level] * (2.0 * Pr - PAPr); | ||
| IpplTimings::stopTimer(t_comb); | ||
| IpplTimings::stopTimer(t_total); | ||
| } |
There was a problem hiding this comment.
I think all these changes are not necessary, I am not even sure why this recursive preconditioner was touched.
| // One outer timer per polynomial_newton apply, complementing the | ||
| // per-call recursion timers. Lets us see how many outer pcond | ||
| // calls a solve does vs. the time inside each. | ||
| static IpplTimings::TimerRef t_apply = IpplTimings::getTimer("newton/apply"); | ||
| IpplTimings::startTimer(t_apply); | ||
| recursive_preconditioner(u, level_m, result); | ||
| IpplTimings::stopTimer(t_apply); |
There was a problem hiding this comment.
Same for all these extra comments and timers, not necessary changes, and this preconditioner is never used.
| std::vector<Field> Pr_scratch_m; | ||
| std::vector<Field> PA_scratch_m; | ||
| std::vector<Field> PAPr_scratch_m; | ||
| bool fields_initialized_m = false; |
There was a problem hiding this comment.
Isn't it super expensive to store all these temporary fields which are not needed later (needed only for the recursion)?
| void operator()(Field& r, Field& result) override { | ||
| // x_m, x_old_m, A_m, z_m are pre-allocated scratch (init_fields). | ||
| // Coefficients in rho_m are also computed once. | ||
| static IpplTimings::TimerRef t_apply = IpplTimings::getTimer("chebyshev/apply"); | ||
| static IpplTimings::TimerRef t_init = IpplTimings::getTimer("chebyshev/init_step"); | ||
| static IpplTimings::TimerRef t_op = IpplTimings::getTimer("chebyshev/op_m"); | ||
| static IpplTimings::TimerRef t_combo = IpplTimings::getTimer("chebyshev/combine"); | ||
| static IpplTimings::TimerRef t_copy = IpplTimings::getTimer("chebyshev/deep_copy"); | ||
|
|
||
| IpplTimings::startTimer(t_apply); | ||
|
|
||
| IpplTimings::startTimer(t_init); | ||
| x_old_m = r / theta_m; | ||
| IpplTimings::stopTimer(t_init); | ||
| IpplTimings::startTimer(t_op); | ||
| A_m = op_m(r); | ||
| IpplTimings::stopTimer(t_op); | ||
| IpplTimings::startTimer(t_combo); | ||
| x_m = 2.0 * rho_m[1] / delta_m * (2.0 * r - A_m / theta_m); | ||
| IpplTimings::stopTimer(t_combo); | ||
|
|
||
| if (degree_m == 0) { | ||
| // result = x_old | ||
| IpplTimings::startTimer(t_copy); | ||
| Kokkos::deep_copy(result.getView(), x_old_m.getView()); | ||
| IpplTimings::stopTimer(t_copy); | ||
| IpplTimings::stopTimer(t_apply); | ||
| return; | ||
| } | ||
|
|
||
| Field res(mesh, layout); | ||
| Field x(mesh, layout); | ||
| Field x_old(mesh, layout); | ||
| Field A(mesh, layout); | ||
| Field z(mesh, layout); | ||
| if (degree_m == 1) { | ||
| // result = x | ||
| IpplTimings::startTimer(t_copy); | ||
| Kokkos::deep_copy(result.getView(), x_m.getView()); | ||
| IpplTimings::stopTimer(t_copy); | ||
| IpplTimings::stopTimer(t_apply); | ||
| return; | ||
| } | ||
| for (unsigned int i = 2; i < degree_m + 1; ++i) { | ||
| IpplTimings::startTimer(t_op); | ||
| A_m = op_m(x_m); | ||
| IpplTimings::stopTimer(t_op); | ||
| IpplTimings::startTimer(t_combo); | ||
| z_m = 2.0 / delta_m * (r - A_m); | ||
| // Write the new x value into result (the caller's buffer); | ||
| // x_old gets a deep copy of the previous x. | ||
| result = rho_m[i] * (2 * sigma_m * x_m - rho_m[i - 1] * x_old_m + z_m); | ||
| IpplTimings::stopTimer(t_combo); | ||
| IpplTimings::startTimer(t_copy); | ||
| Kokkos::deep_copy(x_old_m.getView(), x_m.getView()); | ||
| Kokkos::deep_copy(x_m.getView(), result.getView()); | ||
| IpplTimings::stopTimer(t_copy); | ||
| } | ||
| IpplTimings::stopTimer(t_apply); | ||
| } |
There was a problem hiding this comment.
All these extra timers are not needed in this preconditioner, since we never use it, and the extra comments also don't seems necessary to me.
It makes the algorithm cumbersome to read.
| // The running iterate lives in result; UL_m and r_m are scratch | ||
| // members. The inner deep copies remain because the (upper, lower, | ||
| // inverse) operators may return Field-valued expressions that | ||
| // alias their input. |
There was a problem hiding this comment.
Again same change: "In order for the operator calls to not rewrite on this same field over and over when calling the operators (upper, diag, inverse, lower, etc) we need deep copies to the preconditioner fields."
| // result holds the running iterate; Ag_m and g_old_m are scratch | ||
| // members. The inner deep copies remain because the operators | ||
| // (op_m, inverse_diagonal_m) may return Field-valued expressions | ||
| // that alias their input. |
There was a problem hiding this comment.
Same change:
"In order for the operator calls to not rewrite on this same field over and over when calling the operators (upper, diag, inverse, lower, etc) we need deep copies to the preconditioner fields."
| // The inverse diagonal is applied to the | ||
| // vector itself to return the result usually. | ||
| // However, the operator for FEM already | ||
| // returns the result of inv_diag * itself | ||
| // due to the matrix-free evaluation. | ||
| // Therefore, we need this if to differentiate | ||
| // the two cases. |
There was a problem hiding this comment.
why was this comment removed?
| // The inverse diagonal is applied to the | ||
| // vector itself to return the result usually. | ||
| // However, the operator for FEM already | ||
| // returns the result of inv_diag * itself | ||
| // due to the matrix-free evaluation. | ||
| // Therefore, we need this if to differentiate | ||
| // the two cases. |
There was a problem hiding this comment.
Why was this comment removed?
| // The running iterate lives in result; UL_m, r_m are scratch | ||
| // members. The inner deep copies remain because (upper, lower, | ||
| // inverse, diagonal) may return aliasing Field-valued expressions. |
There was a problem hiding this comment.
Again same change: "In order for the operator calls to not rewrite on this same field over and over when calling the operators (upper, diag, inverse, lower, etc) we need deep copies to the preconditioner fields."
| // In the FEM solver, which uses the preconditioner, | ||
| // we re-use a resultField to avoid allocating new | ||
| // memory at every iteration. | ||
| // In order for the operator calls to not rewrite | ||
| // on this same field over and over when calling | ||
| // the operators (upper, diag, inverse, lower, etc) | ||
| // we need deep copies to the preconditioner fields. | ||
|
|
||
| // The inverse diagonal is applied to the | ||
| // vector itself to return the result usually. | ||
| // However, the operator for FEM already | ||
| // returns the result of inv_diag * itself | ||
| // due to the matrix-free evaluation. | ||
| // Therefore, we need this if to differentiate | ||
| // the two cases. |
There was a problem hiding this comment.
Why was this comment removed?
I think it is due to the explanation written here |
Okay thanks! |
|
In general, I think I disagree with the use of "result buffer" in most of the comments regarding the fields stored in CG and PCG. It might create a confusion with the communication buffers, "results field" would be less ambiguous. |
PCG/Preconditioner
Summary
This PR extracts the PCG allocation/performance part from the larger PR501 work.
It keeps the scope intentionally small:
The primary motivation is to remove repeated allocation and avoid repeated
halo-buffer growth on copied fields in the PCG/preconditioner path. This was
observed as repeated allocation work in multi-rank profiling.
What Changed
PCG workspaces are reused
src/LinearSolvers/PCG.hIn the CG/PCG algorithm these are:
r: the current residual, initiallyrhs - A * lhs,d: the current search direction,q: the operator-applied search direction, i.e.A * d.PCG adds reusable
sandpcond_outbuffers.Here
sstores the preconditioned residualM^{-1} r, andpcond_outstores the initial preconditioner output before it is folded into
d.PCG::operator()refreshes workspace layouts instead of constructing localfields on every solve.
op_m(d)now assigns into the existingqstorage instead of forcing adeepCopy().Preconditioner applications now write into
pcond_out/sinstead ofreturning freshly allocated fields.
Important BC detail:
pcond_outandsintentionally keep their defaultNoBcFaceboundaryconditions. This preserves the old behavior where the preconditioner returned
a fresh field without periodic BCs. If these buffers inherited periodic BCs,
preconditioner-internal operators could trigger extra
PeriodicFace::applyMPI calls and diverge from the expected multi-rank communication sequence.
Preconditioners now write into caller-owned output
src/LinearSolvers/Preconditioner.hThe base preconditioner interface changed from returning
Fieldto:Concrete preconditioners now reuse member scratch fields where possible.
init_fields()allocates scratch fields once and updates their layout onsubsequent calls.
Several preconditioners still keep necessary
deepCopy()calls whereexpression aliasing can occur, especially for field-valued inverse-diagonal
operators.
Covered preconditioners include:
Solver operators take fields by reference
src/LinearSolvers/PCG.hsrc/LinearSolvers/Preconditioner.hsrc/PoissonSolvers/PoissonCG.hThe solver operator wrapper now takes the field by reference:
This matters because a by-value
Fieldcopy shares the underlying data views butdoes not share all mutable side state, notably halo exchange buffers that may be
grown through
Kokkos::realloc. When the operator acts on a copied field, thecopy can grow its halo buffer without updating the original. In multi-rank runs
this showed up as repeated allocation work.
PoissonCG.hno longer redefines the wrapper locally. Keeping a singleby-reference wrapper avoids silently reintroducing the by-value behavior.
ALPINE Kokkos view lifetime fixes
The ALPINE managers no longer take addresses of temporary view handles returned
by
getView().Changed files:
alpine/LandauDampingManager.halpine/BumponTailInstabilityManager.halpine/PenningTrapManager.hBefore:
view_type* R = &(this->pcontainer_m->R.getView()); samplingR.generate(*R, rand_pool64);After:
view_type R = this->pcontainer_m->R.getView(); samplingR.generate(R, rand_pool64);This fixes compilers/backends that reject taking the address of a temporary
Kokkos view object and avoids dangling view-handle pointers.
Documentation Check
The implementation is documented in the places where review context matters:
Preconditioner.hdocuments why the operator wrapper must take fields byreference.
PCG.hdocuments why workspaces are reused and whypcond_out/skeepNoBcFaceboundary conditions.PoissonCG.hdocuments why the local wrapper definition was removed.PR501_SPLIT_MAP.mddocuments the split rationale and dependencies betweenthe planned PR501 follow-up branches.
No user-facing manual update appears necessary because this PR changes internal
solver allocation behavior and preserves the public PCG/Poisson solver behavior.
Validation
Original split-map validation
The PCG split prototype was built and tested with unit tests:
Solver integration validation
A dedicated solver integration build was configured:
cmake -S . -B build-pr501-pcg-ssor-tests-release \ -DCMAKE_BUILD_TYPE=Release \ -DIPPL_PLATFORMS=SERIAL \ -DIPPL_ENABLE_TESTS=ON \ -DIPPL_ENABLE_UNIT_TESTS=OFF \ -DIPPL_ENABLE_SOLVERS=ON \ -DIPPL_ENABLE_FFT=OFFBuilt targets:
Results:
TestCGSolver: built and passed through CTest.TestCGSolver_convergence: compile-only target built successfully.TestSolverDesign: compile-only target built successfully.Registered CTest:
ctest --test-dir build-pr501-pcg-ssor-tests-release \ --output-on-failure -R '^TestCGSolver$'Result:
The SSOR path was also run explicitly because the registered CTest uses Jacobi:
and with MPI:
Results:
1.514653902775893e-131.196970356284836e-131.588802017354866e-13The same explicit SSOR test was temporarily rerun with the preconditioner
changes reverted. Residuals and iteration counts were identical for 1, 2, and 4
ranks.
Reviewer Notes
into a supplied output field rather than returning a new field.
buffer reuse.
pcond_outandsshould not inherit periodic BCs; that would change the MPIcall pattern inside preconditioner operator chains.
deepCopy()calls remain intentionally where expression aliasing canotherwise corrupt field-valued preconditioner operations.