DAOS-18633 rebuild: abort orphaned reclaim rpt after PS leader switch #17652

Draft
wangshilong wants to merge 2 commits into master from shilongw/DAOS-18633
@wangshilong

After a PS leader switch, ds_rebuild_regenerate_task() regenerates rebuild tasks only for DOWN/DRAIN/UP targets. RECLAIM tasks are not regenerated because reintegrated targets are already UPIN. This leaves an orphaned rpt with a stale leader term on every target; its IV updates are silently dropped by the new leader, which has no matching rgt. As a result, sp_rebuilding stays > 0 permanently, which blocks EC aggregation and degrades performance system-wide.

Fix: detect stale leader term in rebuild_tgt_status_check_ult() and abort the orphaned rpt.

TODO: persist in-progress reclaim tasks in RDB so they can be properly re-triggered on PS leader step_up.
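
The stale-term check described above can be sketched roughly as follows. This is a simplified, standalone model and not the actual patch: `struct rpt_model` and `rpt_abort_if_stale_term()` are hypothetical names, the field names only mirror those visible in the diff, and treating "stale" as "strictly older than the current term" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-target rebuild tracker (rpt). */
struct rpt_model {
	uint64_t     rt_leader_term; /* leader term recorded when the rpt was created */
	unsigned int rt_abort;       /* set to 1 to ask the local rebuild to stop */
};

/*
 * Mark the rpt as aborted when it was created under an older leader term
 * than the one currently known locally (assumption: stale == strictly older).
 * Returns true if the rpt was aborted by this call.
 */
static bool
rpt_abort_if_stale_term(struct rpt_model *rpt, uint64_t iv_master_term)
{
	assert(rpt != NULL);

	if (rpt->rt_leader_term < iv_master_term) {
		rpt->rt_abort = 1;
		return true;
	}
	return false;
}
```

In this model, a periodic status check would call `rpt_abort_if_stale_term()` with the latest known leader term, so an rpt left behind by the previous leader stops instead of reporting to a leader that has no matching rgt.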

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

github-actions bot commented Mar 6, 2026

Errors are Unable to load ticket data
https://daosio.atlassian.net/browse/DAOS-18633

DP_UUID(rpt->rt_pool_uuid), rpt->rt_rebuild_ver,
rpt->rt_rebuild_gen, RB_OP_STR(rpt->rt_rebuild_op),
rpt->rt_leader_term, ns->iv_master_term);
rpt->rt_abort = 1;

This does not look safe. Currently, on a PS leader change, if the rebuild has not already failed, each target engine simply continues its rebuild job.
See rebuild_task_ult(), "If the leader rebuild is aborted due to a leader change".

It is probably fine to abort only the RECLAIM/FAIL_RECLAIM rpt when the PS leader has switched; please consider whether this check can be combined with rpt_stale().
Even with that change, I am not sure this is the cause of the performance degradation: if RECLAIM SCAN has completed locally and only RECLAIM is not globally done, it should not affect IO performance.
It may be worth checking the logs of the engines that did not report RECLAIM SCAN done.
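
The narrower condition suggested in this comment could look roughly like the sketch below. It is a standalone illustration only: the `MODEL_OP_*` enum and `model_should_abort()` are hypothetical stand-ins for the real rebuild opcodes and for whatever helper (possibly rpt_stale()) would host the check, and the strict "older term" comparison is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical opcodes modelling the rebuild operations named in the review. */
enum model_rebuild_op {
	MODEL_OP_REBUILD,
	MODEL_OP_RECLAIM,
	MODEL_OP_FAIL_RECLAIM,
};

/*
 * Abort only reclaim-type work after a leader switch. Other rebuild
 * operations keep running, because the new leader regenerates their tasks.
 */
static bool
model_should_abort(enum model_rebuild_op op, uint64_t rpt_term, uint64_t master_term)
{
	bool is_reclaim = (op == MODEL_OP_RECLAIM || op == MODEL_OP_FAIL_RECLAIM);
	bool stale_term = (rpt_term < master_term);

	assert(rpt_term <= master_term); /* model assumption: terms never go backwards */
	return is_reclaim && stale_term;
}
```

Under this rule, a regular rebuild with an older term is left alone, which preserves the existing behaviour where target engines continue their rebuild job across a leader change.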

@wangshilong wangshilong force-pushed the shilongw/DAOS-18633 branch from c76d9fd to 01a4148 Compare March 6, 2026 09:04
@daosbuild3

After PS leader switch, ds_rebuild_regenerate_task() only regenerates
rebuild tasks for DOWN/DRAIN/UP targets. RECLAIM tasks are not
regenerated because reintegrated targets are already UPIN. This
leaves orphaned rpt on every target with a stale leader term, whose
IV updates are silently dropped by the new leader (no matching rgt).
The result is sp_rebuilding > 0 permanently, blocking EC aggregation
and causing system-wide performance degradation.

Fix: detect stale leader term in rebuild_tgt_status_check_ult() and
abort the orphaned rpt.

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
@wangshilong wangshilong force-pushed the shilongw/DAOS-18633 branch from 01a4148 to 0860470 Compare March 6, 2026 14:25
Signed-off-by: Wang Shilong <shilong.wang@hpe.com>