DAOS-19028 test: DO NOT LAND test_rebuild_29 repro attempt#18477

Draft
kccain wants to merge 2 commits into release/2.6 from kccain/daos_19028_debug_rel2p6

Conversation

@kccain kccain commented Jun 9, 2026

Contributor

Debug logging for MGMT_TGT_MAP_UPDATE in map_update_bcast() and ds_mgmt_tgt_map_update_pre_forward(). The goal is to inspect, on any reproduction of the failure, possible URI and incarnation mismatches between the PS leader, the forwarding engines in the knomial tree, and the restarted engine itself.

Latest changes include:

  • MGMT map distribution logging in map_update_bcast(), with verbose per-rank
    map dumps controlled by DAOS_MAP_UPDATE_VERBOSE (in addition to needing
    log_mask: DEBUG)
  • MGMT target pre-forward logging in ds_mgmt_tgt_map_update_pre_forward(),
    including a "MISMATCH " prefix when self/map state differs.
  • MGMT map update aggregation warning for non-zero member return codes.
  • CaRT group replace-path diagnostics in crt_group_primary_modify() for
    existing-rank SWIM-check flow (incoming rank/incarnation/URI visibility).
  • ftest suite env updates to enable DAOS_MAP_UPDATE_VERBOSE=1 on both engines.
  • launch.py CI repeat cap increased from 10 to 20.
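
For readers without the diff at hand, the checks described in the list above can be sketched roughly as follows. This is an illustration in Python, not the C code in this PR: `RankEntry`, `verbose_enabled`, `pre_forward_line`, and `failed_members` are hypothetical names, the log line layout is invented, and the assumption that any non-empty value other than `0` enables DAOS_MAP_UPDATE_VERBOSE is a guess at its semantics.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RankEntry:
    """Hypothetical view of one rank: its incarnation and URI."""
    rank: int
    incarnation: int
    uri: str


def verbose_enabled(env=None):
    """Assumed semantics: verbose dumps are on for any value other than '' or '0'."""
    env = os.environ if env is None else env
    return env.get("DAOS_MAP_UPDATE_VERBOSE", "0") not in ("", "0")


def pre_forward_line(self_entry, map_entry):
    """Build one log line, prefixed with 'MISMATCH ' when self and map state differ."""
    differs = (self_entry.incarnation != map_entry.incarnation
               or self_entry.uri != map_entry.uri)
    prefix = "MISMATCH " if differs else ""
    return (f"{prefix}rank={map_entry.rank} "
            f"self(inc={self_entry.incarnation}, uri={self_entry.uri}) "
            f"map(inc={map_entry.incarnation}, uri={map_entry.uri})")


def failed_members(member_rcs):
    """Return {rank: rc} for members that reported a non-zero return code."""
    return {rank: rc for rank, rc in sorted(member_rcs.items()) if rc != 0}
```

A mismatch line would then stand out in a grep for `MISMATCH `, and a non-empty result from `failed_members` corresponds to the case where the aggregation warning would fire.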

Test-tag: test_rebuild_29
Test-Repeat: 20
Skip-unit-tests: true
Skip-fault-injection-test: true
Skip-test-rpms: true
Test-provider-hw-medium: ofi+tcp
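
How the pragma lines above interact with the repeat cap can be sketched as below. This is a minimal illustration only: `parse_pragmas` and `effective_repeat` are hypothetical helpers, the regular expression is an assumed format for `Key: value` lines, and the actual parsing in the CI pipeline and launch.py is not shown in this PR text.

```python
import re

# Assumed pragma shape: an unindented "Key: value" line whose key
# starts with a capital letter and contains only letters and hyphens.
PRAGMA_RE = re.compile(r"^([A-Z][A-Za-z-]*):\s*(\S.*)$")


def parse_pragmas(commit_message):
    """Collect 'Key: value' lines from a commit message into a dict."""
    pragmas = {}
    for line in commit_message.splitlines():
        match = PRAGMA_RE.match(line)
        if match:
            pragmas[match.group(1)] = match.group(2).strip()
    return pragmas


def effective_repeat(pragmas, cap=20):
    """Clamp the requested Test-Repeat count to the range [1, cap]."""
    requested = int(pragmas.get("Test-Repeat", "1"))
    return min(max(requested, 1), cap)
```

With a cap of 10, a request of `Test-Repeat: 20` would be reduced to 10, which is presumably why the cap is raised to 20 in this PR.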

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions Bot commented Jun 9, 2026

Ticket title is 'daos_test/rebuild.py:DaosCoreTestRebuild.test_rebuild_29 - pool reintegrate failed'
Status is 'In Progress'
Labels: '2.6.5rc3,pr_test,scrubbed_2.8,tcp_provider,test_2.6.5rc1'
https://daosio.atlassian.net/browse/DAOS-19028

@kccain force-pushed the kccain/daos_19028_debug_rel2p6 branch from 00dc907 to 37c732e on June 10, 2026 00:41
@daosbuild3

Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18477/3/display/redirect

@daosbuild3

Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18477/4/display/redirect

@daosbuild3

Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18477/5/display/redirect

@kccain force-pushed the kccain/daos_19028_debug_rel2p6 branch from 37c732e to 470e41f on June 11, 2026 17:35
Debug logging for MGMT_TGT_MAP_UPDATE in map_update_bcast() and
ds_mgmt_tgt_map_update_pre_forward(), to inspect, on any reproduction
of the failure, possible URI and incarnation mismatches between the
PS leader, the forwarding engines in the knomial tree, and the
restarted engine itself.

Test-tag: test_rebuild_29
Test-Repeat: 10
Skip-unit-tests: true
Skip-fault-injection-test: true
Skip-test-rpms: true
Test-provider-hw-medium: ofi+tcp

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
@kccain force-pushed the kccain/daos_19028_debug_rel2p6 branch from 470e41f to e8b0ecd on June 11, 2026 17:37
Latest changes include:
- MGMT map distribution logging in map_update_bcast(), with verbose per-rank
  map dumps controlled by DAOS_MAP_UPDATE_VERBOSE (in addition to needing
  log_mask: DEBUG)
- MGMT target pre-forward logging in ds_mgmt_tgt_map_update_pre_forward(),
  including a "MISMATCH " prefix when self/map state differs.
- MGMT map update aggregation warning for non-zero member return codes.
- CaRT group replace-path diagnostics in crt_group_primary_modify() for
  existing-rank SWIM-check flow (incoming rank/incarnation/URI visibility).
- ftest suite env updates to enable DAOS_MAP_UPDATE_VERBOSE=1 on both engines.
- launch.py CI repeat cap increased from 10 to 20.

Looking for potential stale membership/address state during reintegrate hangs.

Test-tag: test_rebuild_29
Test-Repeat: 20
Skip-unit-tests: true
Skip-fault-injection-test: true
Skip-test-rpms: true
Test-provider-hw-medium: ofi+tcp

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>