fix(deploy): nil pointer in updateProcessGroup error path#4931
Open
Lucais11 wants to merge 1 commit into
Open
Conversation
When updateMachineWChecks returns an error for a scale-up pairing (oldMachine=nil), the error-formatting line dereferenced oldMachine.ID and panicked inside an errgroup goroutine, crashing the deploy. Use the locally-scoped machineID variable that already handles the nil-oldMachine case. The originally-reported panic in #4528 (at the now-removed healthChecksPassed.Load(machPair.oldMachine.ID) site) was patched in #4180; this closes the remaining oldMachine.ID deref in the same function. Sibling fix to #4750. Fixes #4528
dangra
approved these changes
May 27, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #4528
When fly deploy creates a brand new machine (scale-up) and that machine fails to come up. For example the API rejects the create, lease acquisition fails, or smoke checks don't pass. flyctl crashes with a nil pointer dereference panic instead of returning a clear error. Users then see a Go stack trace and a deploy left in an inconsistent state.
The error formatting line in updateProcessGroup tries to print "failed to update machine : " by reading oldMachine.ID. For a scale-up there is no old machine — oldMachine is nil — so reading .ID panics on the exact line that was supposed to report the error. Three lines above, the code already computes a machineID variable that handles both cases (existing machine → use the old ID; new machine → use the new ID). The fix is to use that variable.
updateProcessGroup historically had two unsafe oldMachine.ID reads. #4180 fixed the first (the health-check lookup), which made the common scale-up path stop panicking. This PR closes the second (the error path), which still panics any time a scale-up hits a downstream error. Same shape of bug as #4750, one function over.
Also adds TestUpdateProcessGroupScaleUpErrorDoesNotPanic, which reproduces the exact panic from #4528 when the fix is reverted and passes aft