Skip to content

fix(machine-controller): gate initial-discovery cleanup on a confirmed host identity#2287

Open
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:fix-test-find-instances-by-ids
Open

fix(machine-controller): gate initial-discovery cleanup on a confirmed host identity#2287
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:fix-test-find-instances-by-ids

Conversation

@chet
Copy link
Copy Markdown
Contributor

@chet chet commented Jun 7, 2026

Description

TLDR is stop doing a Transition for predicted hosts, so WaitingForDiscovery continues to be a Wait.

Storage cleanup during initial discovery wipes a host's NVMe/HDD by driving it through a reset, so it should run only against a machine whose identity is confirmed. This scopes that cleanup to identified hosts: a proactively-created predicted host now [goes back to] waiting at WaitingForDiscovery until discovery promotes it, and the promoted host performs the cleanup.

A veerrrry subtle bug had been introduced in #1748 which a predicted host (aka an fm100p*, derived from a DPU before its own discovery) could reach WaitingForDiscovery first, but then be routed into a cleanup it can never complete -- nothing is bound to its ID to answer cleanup_machine_completed -- leaving it wedged in WaitingForCleanup{InitialDiscovery}. That surfaced as a test flake in test_find_instances_by_ids, which waits for the host to get to HostInit while it works through the DPU states.

This became flaky because it was a race:

  • The dpu_state_controller_iterations step polls the [still predicted] host and expects it to settle in some HostInit-specific sub-state while the DPU phase wraps up.
  • Before #1748, HostInit{WaitingForDiscovery} would call a Wait on a predicted host — the host sat there waiting for discovery via scout -- so the poll reliably caught it.
  • After #1748, WaitingForDiscovery effectively became a passthrough predicted hosts: the instant the host would land there, we would see last_cleanup_time was unset and transition immediately into WaitingForCleanup{InitialDiscovery} (...which isn't HostInit, and as such can never finish, since nothing answers its cleanup), but since we call Wait, it... waits...

After narrowing it down to #1748, I was able to confirm this test passed 50/50 times (I ran it 50 times) before #1748, and then began to flake ~10% after. With this adjustment, its now back to passing 100% (50/50 times).

Fwiw, no blame! I purposely didn't look at who the author was of that PR, because this is just as likely to happen to me as it is to anyone else!

Signed-off-by: Chet Nichols III chetn@nvidia.com

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

…d host identity

TLDR is stop doing a `Transition` for predicted hosts, so `WaitingForDiscovery` continues to be a `Wait`.

Storage cleanup during initial discovery wipes a host's NVMe/HDD by driving it through a reset, so it should run only against a machine whose identity is confirmed. This scopes that cleanup to identified hosts: a predicted host now [goes back to] waiting at `WaitingForDiscovery` until discovery promotes it, and the promoted host performs the cleanup.

A veerrrry subtle bug had been introduced in NVIDIA#1748 which a predicted host (aka an `fm100p*`, derived from a DPU before its own discovery) could reach `WaitingForDiscovery` first, but then be routed into a cleanup it can never complete -- nothing is bound to its ID to answer `cleanup_machine_completed` -- leaving it wedged in `WaitingForCleanup{InitialDiscovery}`. That surfaced as a test flake in `test_find_instances_by_ids`, which waits for the host to get to `HostInit` while it works through the DPU states.

This became flaky because it was a race:
- The `dpu_state_controller_iterations` step polls the [still predicted] host and expects it to settle in some `HostInit`-specific sub-state while the DPU phase wraps up.
- Before NVIDIA#1748, `HostInit{WaitingForDiscovery}` would call a `Wait` on a predicted host — the host sat there waiting for discovery via scout -- so the poll reliably caught it.
- After NVIDIA#1748, `WaitingForDiscovery` effectively became a passthrough predicted hosts: the instant the host would land there, we would see `last_cleanup_time` was unset and transition immediately into `WaitingForCleanup{InitialDiscovery}` (...which isn't `HostInit`, and as such can never finish, since nothing answers its cleanup), but since we call `Wait`, it... waits...

After narrowing it down to NVIDIA#1748, I was able to confirm this test passed 50/50 times (I ran it 50 times) *before* NVIDIA#1748, and then began to flake ~10% after. With this adjustment, its now back to passing 100% (50/50 times).

Fwiw, no blame! I purposely didn't look at who the author was of that PR, because this is just as likely to happen to me as it is to anyone else!

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
@chet chet requested a review from a team as a code owner June 7, 2026 17:30
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 7, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e45e0de5-9256-41c6-83c0-a4fa465b35e2

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant