fix(machine-controller): gate initial-discovery cleanup on a confirmed host identity by chet · Pull Request #2287 · NVIDIA/infra-controller

chet · 2026-06-07T17:30:18Z

Description

TLDR is stop doing a Transition for predicted hosts, so WaitingForDiscovery continues to be a Wait.

Storage cleanup during initial discovery wipes a host's NVMe/HDD by driving it through a reset, so it should run only against a machine whose identity is confirmed. This scopes that cleanup to identified hosts: a proactively-created predicted host now [goes back to] waiting at WaitingForDiscovery until discovery promotes it, and the promoted host performs the cleanup.

A veerrrry subtle bug had been introduced in #1748 which a predicted host (aka an fm100p*, derived from a DPU before its own discovery) could reach WaitingForDiscovery first, but then be routed into a cleanup it can never complete -- nothing is bound to its ID to answer cleanup_machine_completed -- leaving it wedged in WaitingForCleanup{InitialDiscovery}. That surfaced as a test flake in test_find_instances_by_ids, which waits for the host to get to HostInit while it works through the DPU states.

This became flaky because it was a race:

The dpu_state_controller_iterations step polls the [still predicted] host and expects it to settle in some HostInit-specific sub-state while the DPU phase wraps up.
Before #1748, HostInit{WaitingForDiscovery} would call a Wait on a predicted host — the host sat there waiting for discovery via scout -- so the poll reliably caught it.
After #1748, WaitingForDiscovery effectively became a passthrough predicted hosts: the instant the host would land there, we would see last_cleanup_time was unset and transition immediately into WaitingForCleanup{InitialDiscovery} (...which isn't HostInit, and as such can never finish, since nothing answers its cleanup), but since we call Wait, it... waits...

After narrowing it down to #1748, I was able to confirm this test passed 50/50 times (I ran it 50 times) before #1748, and then began to flake ~10% after. With this adjustment, its now back to passing 100% (50/50 times).

Fwiw, no blame! I purposely didn't look at who the author was of that PR, because this is just as likely to happen to me as it is to anyone else!

Signed-off-by: Chet Nichols III chetn@nvidia.com

Type of Change

Add - New feature or capability
Change - Changes in existing functionality
Fix - Bug fixes
Remove - Removed features or deprecated functionality
Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

This PR contains breaking changes

Testing

Unit tests added/updated
Integration tests added/updated
Manual testing performed
No testing required (docs, internal refactor, etc.)

Additional Notes

…d host identity TLDR is stop doing a `Transition` for predicted hosts, so `WaitingForDiscovery` continues to be a `Wait`. Storage cleanup during initial discovery wipes a host's NVMe/HDD by driving it through a reset, so it should run only against a machine whose identity is confirmed. This scopes that cleanup to identified hosts: a predicted host now [goes back to] waiting at `WaitingForDiscovery` until discovery promotes it, and the promoted host performs the cleanup. A veerrrry subtle bug had been introduced in NVIDIA#1748 which a predicted host (aka an `fm100p*`, derived from a DPU before its own discovery) could reach `WaitingForDiscovery` first, but then be routed into a cleanup it can never complete -- nothing is bound to its ID to answer `cleanup_machine_completed` -- leaving it wedged in `WaitingForCleanup{InitialDiscovery}`. That surfaced as a test flake in `test_find_instances_by_ids`, which waits for the host to get to `HostInit` while it works through the DPU states. This became flaky because it was a race: - The `dpu_state_controller_iterations` step polls the [still predicted] host and expects it to settle in some `HostInit`-specific sub-state while the DPU phase wraps up. - Before NVIDIA#1748, `HostInit{WaitingForDiscovery}` would call a `Wait` on a predicted host — the host sat there waiting for discovery via scout -- so the poll reliably caught it. - After NVIDIA#1748, `WaitingForDiscovery` effectively became a passthrough predicted hosts: the instant the host would land there, we would see `last_cleanup_time` was unset and transition immediately into `WaitingForCleanup{InitialDiscovery}` (...which isn't `HostInit`, and as such can never finish, since nothing answers its cleanup), but since we call `Wait`, it... waits... After narrowing it down to NVIDIA#1748, I was able to confirm this test passed 50/50 times (I ran it 50 times) *before* NVIDIA#1748, and then began to flake ~10% after. With this adjustment, its now back to passing 100% (50/50 times). Fwiw, no blame! I purposely didn't look at who the author was of that PR, because this is just as likely to happen to me as it is to anyone else! Signed-off-by: Chet Nichols III <chetn@nvidia.com>

coderabbitai · 2026-06-07T17:30:28Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e45e0de5-9256-41c6-83c0-a4fa465b35e2

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

chet requested a review from a team as a code owner June 7, 2026 17:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(machine-controller): gate initial-discovery cleanup on a confirmed host identity#2287

fix(machine-controller): gate initial-discovery cleanup on a confirmed host identity#2287
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:fix-test-find-instances-by-ids

chet commented Jun 7, 2026

Uh oh!

coderabbitai Bot commented Jun 7, 2026

Review skipped

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

chet commented Jun 7, 2026

Description

Type of Change

Related Issues (Optional)

Breaking Changes

Testing

Additional Notes

Uh oh!

coderabbitai Bot commented Jun 7, 2026

Review skipped

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant