fix: scale-to-zero and shutdown improvements #150
Conversation
- middleware should skip disabling stz for internal requests
- add logging for stz file writes
- add support for stz.Drain() which is meant to be called upon shutdown
- shutdown now kills all running processes and calls stz.Drain()
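For context, here is a rough sketch of the controller surface these changes extend; the method names come from the review discussion below, but the exact signatures are assumptions:

```go
// Hypothetical shape of the scale-to-zero controller touched by this PR.
// Disable/Enable bracket work that must keep the VM awake; Drain is the
// new terminal call made once during shutdown.
type Controller interface {
	Disable(ctx context.Context) error // take a hold: increment the STZ counter
	Enable(ctx context.Context) error  // release a hold: decrement the STZ counter
	Drain(ctx context.Context) error   // shutdown: reset state, freeze further calls
}
```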
Cursor Bugbot has reviewed your changes and found 3 potential issues.
```go
// The process was already started; kill it immediately.
_ = cmd.Process.Kill()
return oapi.ProcessSpawn500JSONResponse{InternalErrorJSONResponse: oapi.InternalErrorJSONResponse{Message: "server is shutting down"}}, nil
}
```
Shutdown spawn path leaks process lifecycle
Medium Severity
When ProcessSpawn hits s.shuttingDown after cmd.Start(), it returns immediately after cmd.Process.Kill() without waiting for exit or running normal cleanup. This skips the waiter path that closes pipes and balances s.stz.Enable, and it also ignores kill failure, leaving a potentially untracked process lifecycle.
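A minimal sketch of one way to close the leak, assuming the handler shape Bugbot describes (`s.stz.Enable` is named in the finding; the reaping and logging details are assumptions):

```go
// Shutdown was detected after cmd.Start(): kill the process, but still
// reap it and balance the scale-to-zero hold instead of returning early.
if err := cmd.Process.Kill(); err != nil {
	log.Printf("killing spawned process during shutdown: %v", err)
}
_ = cmd.Wait()  // reap the process so it is not left as a zombie
s.stz.Enable()  // balance the Disable taken when the spawn was accepted
return oapi.ProcessSpawn500JSONResponse{
	InternalErrorJSONResponse: oapi.InternalErrorJSONResponse{Message: "server is shutting down"},
}, nil
```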
```go
		o.mu.Unlock()
	}
})
return o.drainedErr
```
Drain failures become permanently unrecoverable
Low Severity
Oncer.Drain uses sync.Once, so a transient error from o.ctrl.Drain is cached forever. Later calls return the same old error and never retry, even if the underlying condition recovers. That makes Drain effectively one-shot-fail and can prevent scaletozero from ever entering the drained state in that process.
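A sketch of one fix: swap `sync.Once` for a mutex plus a success flag, so only a successful drain is latched and transient failures can be retried (field names other than `o.ctrl` are assumptions):

```go
// Drain latches only on success; an error from the underlying
// controller is returned to the caller and the next call retries.
func (o *Oncer) Drain(ctx context.Context) error {
	o.mu.Lock()
	defer o.mu.Unlock()
	if o.drained {
		return nil
	}
	if err := o.ctrl.Drain(ctx); err != nil {
		return err // not cached, unlike the sync.Once version
	}
	o.drained = true
	return nil
}
```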
```go
	}
	c.disabled = false
}
```
Drain skips underlying reset when already enabled
Medium Severity
DebouncedController.Drain only calls c.ctrl.Drain(ctx) when c.disabled is true. If c.disabled is false, shutdown marks only in-memory state as drained and never writes =0 via the underlying controller. This can leave the scale-to-zero control file unchanged and not reliably re-enabled after shutdown.
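A sketch of the fix Bugbot implies, delegating to the underlying controller unconditionally (the `drained` field is an assumption):

```go
// Drain resets the debounced in-memory state and always forwards to
// the wrapped controller, so the on-disk STZ state is reset even when
// c.disabled is already false.
func (c *DebouncedController) Drain(ctx context.Context) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.disabled = false
	c.drained = true
	return c.ctrl.Drain(ctx) // no longer gated on c.disabled
}
```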
Sayan-
left a comment
high level changes lgtm and help forward-fix some of the issues we've been seeing! this is a lot of changes though, so please update the PR description with how you tested these bits, esp. edge cases.
```go
defer c.mu.Unlock()

if !c.drained {
	err := c.write(ctx, "=0")
```
=0 is a global reset of the Unikraft STZ counter — it clobbers holds from all writers, not just kernel-images-api. right now I think the only other writers in the scope of a VM are neko and the startup script. I'm not sure we can guarantee that setting =0 is entirely safe, since it could cause surprises for other consumers in the VM. I wonder if instead this Drain should only be responsible for undoing the stz counter for this process (e.g. -N in https://unikraft.com/docs/features/scale-to-zero#application-support-for-scale-to-zero). That seems better scoped to this responsibility. Comparatively, from our control plane we could always process exec and nuke the value ourselves if we want.
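A sketch of that scoped alternative; `c.holds` is a hypothetical counter of this process's outstanding `Disable()` calls, and `-N` is the decrement syntax from the linked Unikraft docs:

```go
// Drain undoes only the holds this process took ("-N") instead of
// globally resetting the VM-wide counter ("=0"), leaving holds from
// other writers (neko, the startup script) untouched.
func (c *unikraftCloudController) Drain(ctx context.Context) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.drained {
		return nil
	}
	c.drained = true
	if c.holds == 0 {
		return nil
	}
	return c.write(ctx, fmt.Sprintf("-%d", c.holds))
}
```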
```go
defer shutdownCancel()
g, _ := errgroup.WithContext(shutdownCtx)

g.Go(func() error {
```
stz.Drain runs concurrently with srv.Shutdown here. this works but makes the state transitions harder to reason about — Drain races with in-flight requests' deferred Enable() calls. consider sequencing: drain HTTP servers first (letting all in-flight Enables run normally), then call stz.Drain after. at that point the controller is already at rest and Drain is just a safety net freeze. same outcome, easier to verify correctness.
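A sketch of that sequencing, using the `srv`, `stz`, and `shutdownCtx` names visible in the surrounding diff:

```go
// Drain HTTP first so every in-flight request runs its deferred
// Enable(), then freeze the controller once it is at rest.
if err := srv.Shutdown(shutdownCtx); err != nil {
	log.Printf("http shutdown: %v", err)
}
// At this point Drain is just a safety-net freeze.
if err := stz.Drain(shutdownCtx); err != nil {
	log.Printf("stz drain: %v", err)
}
```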
I had opus do a broader inventory / analysis since the API server has a lot of moving pieces, some of which are inadvertently stateful. This isn't to say we shouldn't be taking this approach, but if we are we need to be mindful of the current and future complexity it introduces.

Deep Review: PR #150 - Scale-to-Zero and Shutdown Improvements

Summary

This PR hardens the kernel-images API for restart scenarios (driven by supervisorctl). The key changes are:

- STZ middleware skips disabling scale-to-zero for internal (loopback) requests
- a new stz.Drain() resets and freezes scale-to-zero state during graceful shutdown
- ApiService.Shutdown() now stops recordings and kills all tracked running processes
Your question is exactly right: this introduces an API restart cycle as a deliberate mechanism, and the API is deeply stateful. Below is an analysis of every stateful component and whether it survives, is cleanly torn down, or is silently lost.

Stateful Components Matrix
Critical Findings

1. Filesystem Watches Are Not Cleaned Up (Must Fix)
Recommendation: add watch cleanup to Shutdown:

```go
func (s *ApiService) Shutdown(ctx context.Context) error {
	// Close all filesystem watches
	s.watchMu.Lock()
	for id, w := range s.watches {
		w.Close()
		delete(s.watches, id)
	}
	s.watchMu.Unlock()
	// ... existing kill + stop logic ...
}
```

2. Shutdown Timeout Does Not Guarantee Component Completion (Correctness Note)

The 10-second shutdown window bounds how long graceful teardown may run; it does not guarantee that each component actually finished.
In practice, the orchestrating layer drains recordings before initiating restart, so this is mostly a theoretical gap. If the process hangs long enough, supervisorctl will SIGKILL it, but there's no visibility into what was left incomplete. Consider logging which components were still in flight when the shutdown deadline fires.
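One way to get that visibility, sketched with a hypothetical `pending()` helper that reports which teardown steps have not completed:

```go
// Wait for teardown, but log what was still in flight if the
// shutdown deadline fires first.
done := make(chan struct{})
go func() {
	_ = g.Wait()
	close(done)
}()
select {
case <-done:
case <-shutdownCtx.Done():
	log.Printf("shutdown deadline hit; incomplete: %v", pending())
}
```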
3. CDP WebSocket Connections Are Not Closed During Shutdown (Recommended Fix)
This means CDP clients get an unclean disconnect (TCP RST or similar) rather than a proper WebSocket close frame.

4. Recording State Is Lost Across Restarts (Recommended Improvement)

After restart, the in-memory recording state is gone: any recording that was in progress before the restart is not tracked or resumed by the new process.

Brittleness Assessment

Yes, this architecture is brittle. Three reasons:

First, there is no explicit registry of "things that need cleanup during shutdown." Each feature author adds state independently (watches, recorders, tracked processes, CDP connections).

Second, if someone adds a new stateful component in the future (a persistent cache, a background sync loop, a new external process), they need to: (a) know that the API can be restarted mid-flight, (b) decide whether to add cleanup logic to Shutdown().

Third, …

Recommendations
imo the goal of the restart is to reset the state of the api: Shutdown() should gracefully reset pieces external to the api, but things internal to the api can simply be destroyed and reset (of course we don't want to leave orphaned goroutines and memory leaks). The restart should only be called in the case where the VM is not in use anymore.
rgarcia
left a comment
solid work — drain/freeze semantics are clean and well-tested. left a few inline comments, mostly around naming and the supervisorctl exception. will also need a rebase onto main since the loopback middleware changes from #151 are already merged and this branch has its own copy.
```go
	return c.write(ctx, "-")
}

func (c *unikraftCloudController) Drain(ctx context.Context) error {
```
should this be called Shutdown instead of Drain to signal that it's terminal and only meant to be used during server shutdown? "drain" might be confused with just "set the count to 0" without the freeze semantics.
```go
	return oapi.ListRecorders200JSONResponse(infos), nil
}

// killAllProcesses sends SIGKILL to every tracked process that is still running.
```
this is new behavior — we'll need to reach out to heavy browser pool users and make sure they don't depend on process execs carrying over between session re-use.
```go
if h.cmd.Process == nil {
	continue
}
// supervisorctl handles the lifecycle of long running processes so we don't want to kill
```
not sure we should make an exception for supervisorctl here. if someone sent a process exec for a supervisorctl command it shouldn't matter — we're going to hard reset supervisor services anyway, right? or are we doing our own supervisorctl hard-reset of things like chromium during server shutdown? that feels a little weird but i could live with it — just want to make sure the reasoning is clear.
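For reference, the exception under discussion is roughly this shape (a sketch; the exact matching logic in the PR may differ):

```go
// Skip supervisor-managed commands: supervisorctl owns those
// lifecycles, so killing them here would fight the supervisor.
if filepath.Base(h.cmd.Path) == "supervisorctl" {
	continue
}
_ = h.cmd.Process.Kill()
```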
Note
Medium Risk
Touches shutdown and process lifecycle plus STZ state transitions; mistakes could leave processes running or keep scale-to-zero incorrectly disabled/enabled under load or during restarts.
Overview
Scale-to-zero control is hardened for shutdown/restart scenarios. The `scaletozero.Controller` now supports `Drain()`, which resets STZ state and freezes further `Disable`/`Enable` calls; this is implemented across `UnikraftCloudController`, `DebouncedController`, `Oncer`, and `NoopController`, with added logging for control-file writes and comprehensive new tests.

Request/middleware and server shutdown behavior also change: STZ middleware now ignores loopback clients (so internal calls don't prevent scale-to-zero), `main` invokes `stz.Drain()` during graceful shutdown, and `ApiService.Shutdown()` now concurrently stops recordings and SIGKILLs tracked running processes (skipping `supervisorctl`) while rejecting new `ProcessSpawn` requests once shutdown begins.

Written by Cursor Bugbot for commit 03e7c29. This will update automatically on new commits.