fix(machine-controller): gate initial-discovery cleanup on a confirmed host identity#2287
Open
chet wants to merge 1 commit into
Open
fix(machine-controller): gate initial-discovery cleanup on a confirmed host identity#2287chet wants to merge 1 commit into
chet wants to merge 1 commit into
Conversation
…d host identity TLDR is stop doing a `Transition` for predicted hosts, so `WaitingForDiscovery` continues to be a `Wait`. Storage cleanup during initial discovery wipes a host's NVMe/HDD by driving it through a reset, so it should run only against a machine whose identity is confirmed. This scopes that cleanup to identified hosts: a predicted host now [goes back to] waiting at `WaitingForDiscovery` until discovery promotes it, and the promoted host performs the cleanup. A veerrrry subtle bug had been introduced in NVIDIA#1748 which a predicted host (aka an `fm100p*`, derived from a DPU before its own discovery) could reach `WaitingForDiscovery` first, but then be routed into a cleanup it can never complete -- nothing is bound to its ID to answer `cleanup_machine_completed` -- leaving it wedged in `WaitingForCleanup{InitialDiscovery}`. That surfaced as a test flake in `test_find_instances_by_ids`, which waits for the host to get to `HostInit` while it works through the DPU states. This became flaky because it was a race: - The `dpu_state_controller_iterations` step polls the [still predicted] host and expects it to settle in some `HostInit`-specific sub-state while the DPU phase wraps up. - Before NVIDIA#1748, `HostInit{WaitingForDiscovery}` would call a `Wait` on a predicted host — the host sat there waiting for discovery via scout -- so the poll reliably caught it. - After NVIDIA#1748, `WaitingForDiscovery` effectively became a passthrough predicted hosts: the instant the host would land there, we would see `last_cleanup_time` was unset and transition immediately into `WaitingForCleanup{InitialDiscovery}` (...which isn't `HostInit`, and as such can never finish, since nothing answers its cleanup), but since we call `Wait`, it... waits... After narrowing it down to NVIDIA#1748, I was able to confirm this test passed 50/50 times (I ran it 50 times) *before* NVIDIA#1748, and then began to flake ~10% after. With this adjustment, its now back to passing 100% (50/50 times). Fwiw, no blame! I purposely didn't look at who the author was of that PR, because this is just as likely to happen to me as it is to anyone else! Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Contributor
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
TLDR is stop doing a
Transitionfor predicted hosts, soWaitingForDiscoverycontinues to be aWait.Storage cleanup during initial discovery wipes a host's NVMe/HDD by driving it through a reset, so it should run only against a machine whose identity is confirmed. This scopes that cleanup to identified hosts: a proactively-created predicted host now [goes back to] waiting at
WaitingForDiscoveryuntil discovery promotes it, and the promoted host performs the cleanup.A veerrrry subtle bug had been introduced in #1748 which a predicted host (aka an
fm100p*, derived from a DPU before its own discovery) could reachWaitingForDiscoveryfirst, but then be routed into a cleanup it can never complete -- nothing is bound to its ID to answercleanup_machine_completed-- leaving it wedged inWaitingForCleanup{InitialDiscovery}. That surfaced as a test flake intest_find_instances_by_ids, which waits for the host to get toHostInitwhile it works through the DPU states.This became flaky because it was a race:
dpu_state_controller_iterationsstep polls the [still predicted] host and expects it to settle in someHostInit-specific sub-state while the DPU phase wraps up.HostInit{WaitingForDiscovery}would call aWaiton a predicted host — the host sat there waiting for discovery via scout -- so the poll reliably caught it.WaitingForDiscoveryeffectively became a passthrough predicted hosts: the instant the host would land there, we would seelast_cleanup_timewas unset and transition immediately intoWaitingForCleanup{InitialDiscovery}(...which isn'tHostInit, and as such can never finish, since nothing answers its cleanup), but since we callWait, it... waits...After narrowing it down to #1748, I was able to confirm this test passed 50/50 times (I ran it 50 times) before #1748, and then began to flake ~10% after. With this adjustment, its now back to passing 100% (50/50 times).
Fwiw, no blame! I purposely didn't look at who the author was of that PR, because this is just as likely to happen to me as it is to anyone else!
Signed-off-by: Chet Nichols III chetn@nvidia.com
Type of Change
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes