
fix: scale-to-zero and shutdown improvements #150

Open
tnsardesai wants to merge 2 commits into main from
tanmay/kernel-965-billing-swap-debug-high-browser-usage-costs-from-orphaned

Conversation


@tnsardesai tnsardesai commented Feb 11, 2026

  • middleware should skip disabling stz for internal requests
  • add logging for stz file writes
  • add support for stz.Drain() which is meant to be called upon shutdown
  • shutdown now kills all running processes and calls stz.Drain()

Note

Medium Risk
Touches shutdown and process lifecycle plus STZ state transitions; mistakes could leave processes running or keep scale-to-zero incorrectly disabled/enabled under load or during restarts.

Overview
Scale-to-zero control is hardened for shutdown/restart scenarios. The scaletozero.Controller now supports Drain(), which resets STZ state and freezes further Disable/Enable calls; this is implemented across UnikraftCloudController, DebouncedController, Oncer, and NoopController, with added logging for control-file writes and comprehensive new tests.
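
For orientation, here is a minimal sketch of the drain/freeze contract described above; the interface shape and the drained-flag guard are illustrative assumptions, not the PR's actual code.

package scaletozero

import (
    "context"
    "sync"
)

// Controller is the assumed shape of the scale-to-zero control surface.
type Controller interface {
    Disable(ctx context.Context) error // add a hold that keeps the VM awake
    Enable(ctx context.Context) error  // release one hold
    Drain(ctx context.Context) error   // reset state and freeze further transitions
}

// frozenController illustrates the freeze semantics: after Drain, Disable
// becomes a no-op so nothing can re-disable scale-to-zero during shutdown.
type frozenController struct {
    mu      sync.Mutex
    drained bool
}

func (c *frozenController) Disable(ctx context.Context) error {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.drained {
        return nil // frozen: shutdown already reset the counter
    }
    // ... take the hold (e.g. write to the control file) here ...
    return nil
}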

Request/middleware and server shutdown behavior changes. STZ middleware now ignores loopback clients (so internal calls don’t prevent scale-to-zero), main invokes stz.Drain() during graceful shutdown, and ApiService.Shutdown() now concurrently stops recordings and SIGKILLs tracked running processes (skipping supervisorctl) while rejecting new ProcessSpawn requests once shutdown begins.
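
A rough illustration of the loopback bypass, reusing the Controller shape sketched above (the middleware name and wiring are assumptions; imports net and net/http):

// stzMiddleware holds off scale-to-zero for the duration of external requests,
// but lets loopback (internal) calls through without taking a hold.
func stzMiddleware(ctrl Controller, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        host, _, err := net.SplitHostPort(r.RemoteAddr)
        if err == nil {
            if ip := net.ParseIP(host); ip != nil && ip.IsLoopback() {
                next.ServeHTTP(w, r) // internal request: skip Disable/Enable
                return
            }
        }
        _ = ctrl.Disable(r.Context())
        defer func() { _ = ctrl.Enable(r.Context()) }()
        next.ServeHTTP(w, r)
    })
}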

Written by Cursor Bugbot for commit 03e7c29. This will update automatically on new commits.

@tnsardesai tnsardesai marked this pull request as ready for review February 11, 2026 20:46
@tnsardesai tnsardesai requested review from Sayan- and rgarcia February 11, 2026 20:46
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.


// The process was already started; kill it immediately.
_ = cmd.Process.Kill()
return oapi.ProcessSpawn500JSONResponse{InternalErrorJSONResponse: oapi.InternalErrorJSONResponse{Message: "server is shutting down"}}, nil
}

Shutdown spawn path leaks process lifecycle

Medium Severity

When ProcessSpawn hits s.shuttingDown after cmd.Start(), it returns immediately after cmd.Process.Kill() without waiting for the process to exit or running the normal cleanup. This skips the waiter path that closes pipes and balances s.stz.Enable, and it ignores a kill failure, so the killed process is left unreaped and its lifecycle untracked.
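
A sketch of what fuller cleanup on this path could look like (the shutdown flag, the stz field, and the response literal follow the snippet above; the rest is an assumption, not this repo's code):

if s.shuttingDown {
    // The process was already started; kill it, but still reap it and
    // release the STZ hold taken for this request.
    if err := cmd.Process.Kill(); err != nil {
        log.Printf("kill spawned process during shutdown: %v", err)
    }
    _ = cmd.Wait() // closes the stdio pipes and avoids a zombie
    s.stz.Enable(ctx)
    return oapi.ProcessSpawn500JSONResponse{InternalErrorJSONResponse: oapi.InternalErrorJSONResponse{Message: "server is shutting down"}}, nil
}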


o.mu.Unlock()
}
})
return o.drainedErr

Drain failures become permanently unrecoverable

Low Severity

Oncer.Drain uses sync.Once, so a transient error from o.ctrl.Drain is cached forever. Later calls return the same old error and never retry, even if the underlying condition recovers. That makes Drain effectively one-shot-fail and can prevent scaletozero from ever entering the drained state in that process.
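
One way to make this retryable, sketched against the fields visible in the snippet (the success-only latch and the drained field are assumptions):

func (o *Oncer) Drain(ctx context.Context) error {
    o.mu.Lock()
    defer o.mu.Unlock()
    if o.drained {
        return nil // a previous Drain already succeeded
    }
    if err := o.ctrl.Drain(ctx); err != nil {
        return err // transient failure: do not latch, let callers retry
    }
    o.drained = true
    return nil
}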


}
c.disabled = false
}


Drain skips underlying reset when already enabled

Medium Severity

DebouncedController.Drain only calls c.ctrl.Drain(ctx) when c.disabled is true. If c.disabled is false, shutdown marks only in-memory state as drained and never writes =0 via the underlying controller. This can leave the scale-to-zero control file unchanged, so scale-to-zero is not reliably re-enabled after shutdown.
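
A possible fix, sketched with assumed field names: delegate to the underlying controller unconditionally so its drained/frozen state is always recorded.

func (c *DebouncedController) Drain(ctx context.Context) error {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.drained = true
    c.disabled = false
    c.activeCount = 0
    // Always propagate, even when no hold is active, so the underlying
    // controller freezes and resets its own state too.
    return c.ctrl.Drain(ctx)
}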



@Sayan- Sayan- left a comment


high level changes lgtm and help forward fix some of the issues we've been seeing! this is a lot of changes though so please update pr description with how you tested these bits, esp edge cases.

defer c.mu.Unlock()

if !c.drained {
err := c.write(ctx, "=0")

=0 is a global reset of the Unikraft STZ counter — it clobbers holds from all writers, not just kernel-images-api. right now I think the only other writers in the scope of a VM are neko and the startup script. I'm not sure we can guarantee setting =0 is entirely safe since it could cause surprises to other consumers in the VM. I wonder if instead this Drain should only be responsible for undoing the stz counter for this process instead (e.g. -N in https://unikraft.com/docs/features/scale-to-zero#application-support-for-scale-to-zero). That seems more well scoped to this responsibility. Comparatively from our control plane we could always process exec and nuke the value ourselves if we want.
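
For comparison, a rough sketch of that per-process variant (field names and the write format are assumptions based on the linked Unikraft docs, not this PR's code; uses fmt):

func (c *unikraftCloudController) Drain(ctx context.Context) error {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.drained {
        return nil
    }
    if c.activeHolds > 0 {
        // Undo only the holds this process added to the shared STZ counter,
        // leaving other writers (neko, the startup script) untouched.
        if err := c.write(ctx, fmt.Sprintf("-%d", c.activeHolds)); err != nil {
            return err
        }
        c.activeHolds = 0
    }
    c.drained = true
    return nil
}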

defer shutdownCancel()
g, _ := errgroup.WithContext(shutdownCtx)

g.Go(func() error {

stz.Drain runs concurrently with srv.Shutdown here. this works but makes the state transitions harder to reason about — Drain races with in-flight requests' deferred Enable() calls. consider sequencing: drain HTTP servers first (letting all in-flight Enables run normally), then call stz.Drain after. at that point the controller is already at rest and Drain is just a safety net freeze. same outcome, easier to verify correctness.
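
Sketched, with server and variable names assumed:

// Drain HTTP first so every in-flight request runs its deferred Enable.
if err := srv.Shutdown(shutdownCtx); err != nil {
    log.Printf("http server shutdown: %v", err)
}
// The controller is now at rest; Drain is purely a safety-net freeze.
if err := stz.Drain(shutdownCtx); err != nil {
    log.Printf("scale-to-zero drain: %v", err)
}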


Sayan- commented Feb 11, 2026

I had opus do a broader inventory / analysis since the API server has a lot of moving pieces, some of which are inadvertently stateful. This isn't to say we shouldn't be taking this approach, but if we are we need to be mindful of the current and future complexity it introduces.

Deep Review: PR #150 - Scale-to-Zero and Shutdown Improvements

Summary

This PR hardens the kernel-images API for restart scenarios (driven by supervisorctl). The key changes are:

  1. Scale-to-zero Drain() - resets the STZ counter and freezes the controller so no further Disable/Enable calls can prevent the VM from scaling down
  2. Loopback middleware bypass - internal (127.0.0.1) requests no longer disable STZ
  3. killAllProcesses() - on shutdown, SIGKILLs all tracked spawned processes (except supervisorctl itself)
  4. shuttingDown gate - rejects new ProcessSpawn requests once shutdown begins

Your question is exactly right: this introduces an API restart cycle as a deliberate mechanism, and the API is deeply stateful. Below is an analysis of every stateful component and whether it survives, is cleanly torn down, or is silently lost.


Stateful Components Matrix

1. Process Map (procs) — in-memory map + goroutines (api.go:80). Risk: Medium.
   Cleaned up in shutdown: yes (PR adds killAllProcesses). Preserved across restart: no (all processes killed, map lost).
   Notes: processes are SIGKILL'd (no graceful shutdown). Waiter goroutines, retention timers, and stdout/stderr reader goroutines race the process exit but clean up once cmd.Wait() returns.

2. shuttingDown flag — in-memory bool (api.go:36). Risk: Low.
   Cleaned up in shutdown: set to true during shutdown. Preserved across restart: no (resets on restart).
   Notes: one-way latch, correct behavior.

3. Recording Manager (recorders map) — in-memory map + FFmpeg processes (main.go:93). Risk: Medium.
   Cleaned up in shutdown: yes (StopAll calls Stop per recorder: SIGINT, then SIGTERM, then SIGKILL). Preserved across restart: no (map lost, but .mp4 files persist on disk).
   Notes: after restart, no recorder is registered, so the API has no knowledge of previously-recorded files. Download/delete of old recordings becomes impossible via API.

4. FFmpeg Finalization (singleflight, finalizeComplete) — in-memory state + spawned ffmpeg remux process (ffmpeg.go:282). Risk: High.
   Cleaned up in shutdown: partially (StopAll triggers Stop(), which blocks on finalization). Preserved across restart: no.
   Notes: Stop() calls shutdownInPhases with context.Background(), allowing up to ~62 seconds for the signal cascade. Then finalizeRecording uses context.WithoutCancel to resist cancellation. The total time for StopAll is not bounded by the 10s shutdown context (see Finding 2 below).

5. Filesystem Watches (watches map) — in-memory map + fsnotify watchers + goroutines (fs.go:412). Risk: High.
   Cleaned up in shutdown: no (not cleaned up in Shutdown()). Preserved across restart: no (lost on restart).
   Notes: active fsnotify.Watcher handles and their forwarding goroutines are not torn down during shutdown. They continue until the process is killed by supervisorctl or the OS.

6. DevTools Upstream Manager — background goroutine (tail -f) + atomic URL + subscriber channels (main.go:78-79). Risk: Low.
   Cleaned up in shutdown: yes (Stop() cancels context, which kills the tail process). Preserved across restart: no (rediscovered on restart).
   Notes: clean shutdown.

7. WebSocket Proxy Connections (CDP) — per-connection pair of goroutines + 2 WebSocket conns (proxy.go:199). Risk: Medium.
   Cleaned up in shutdown: no (srvDevtools.Shutdown() stops the listener but does not close hijacked WebSocket connections). Preserved across restart: no.
   Notes: coder/websocket hijacks the underlying TCP connection via http.Hijack. Once hijacked, http.Server.Shutdown() no longer tracks the connection. Existing CDP sessions survive until the process is killed.

8. Neko Auth Client — cached auth token (main.go:86). Risk: Low.
   Cleaned up in shutdown: no explicit cleanup. Preserved across restart: no (token lost, re-authenticates on demand).
   Notes: token is lazily obtained. No concern.

9. Playwright Daemon (playwrightDaemonCmd) — external node process + unix socket (playwright.go:62-71). Risk: High.
   Cleaned up in shutdown: no (not killed in Shutdown()). Preserved across restart: no, but ensurePlaywrightDaemon will reconnect or re-launch.
   Notes: the daemon is stored in s.playwrightDaemonCmd, which is separate from the procs map. killAllProcesses only iterates procs, so the daemon is untouched. After restart, ensurePlaywrightDaemon tries the unix socket first. If the old daemon is still alive, it's reused. If it died, a new one starts.

10. Scale-to-Zero Controller — file-backed (/uk/libukp/scale_to_zero_disable) + in-memory debounce counters (main.go:51). Risk: Low.
    Cleaned up in shutdown: yes (PR adds Drain()). Preserved across restart: file state yes; in-memory counters no (start fresh at 0).
    Notes: this is the best-handled component. Drain() writes =0 to the file if STZ was currently disabled, and freezes the controller. On restart, a fresh DebouncedController starts with activeCount=0, matching the file state. If STZ was not disabled at drain time, no file write is needed since the file is already in the enabled state.

11. PTY Attach WebSocket Sessions — per-session: 4 goroutines + WebSocket conn + WaitGroup (process.go, HandleProcessAttachWS). Risk: Medium.
    Cleaned up in shutdown: indirectly (processes are killed, which closes the PTY and propagates errors to the goroutines). Preserved across restart: no.
    Notes: SIGKILL on the underlying process closes the PTY master fd, which causes the PTY read goroutines to error out. The WebSocket read goroutine also terminates once the write side errors. The WaitGroup ensures all 4 goroutines finish before the handler returns.

12. Input/Playwright Mutexes — sync.Mutex (api.go:46-49). Risk: Low.
    Cleaned up in shutdown: N/A (in-memory, no cleanup needed). Preserved across restart: no (fresh on restart).
    Notes: if a mutex is held when the process dies, it doesn't matter.

13. Process Reader Goroutines (stdout/stderr) — 2 goroutines per non-PTY process (process.go:344-373). Risk: Low.
    Cleaned up in shutdown: indirectly (pipe closes when process is killed). Preserved across restart: no.
    Notes: SIGKILL closes the process's end of the pipe, causing the readers to get EOF and exit.

14. Process Retention Timers — time.Sleep(10s) goroutines (process.go:429-434). Risk: Low.
    Cleaned up in shutdown: no (orphaned goroutines). Preserved across restart: no.
    Notes: these goroutines sleep 10s then delete from the map. On shutdown they're orphaned but harmless since the process is about to die.

15. Policy — stateless struct (api.go:84). Risk: None.
    Cleaned up in shutdown: N/A. Preserved across restart: N/A.
    Notes: stateless.



Critical Findings

1. Filesystem Watches Are Not Cleaned Up (Must Fix)

Shutdown() does not close filesystem watches. Each fsWatch holds an fsnotify.Watcher (which holds inotify/kqueue file descriptors), a forwarding goroutine, and a buffered event channel. On shutdown, these remain alive until the process is killed externally.

Recommendation: add watch cleanup to Shutdown():

func (s *ApiService) Shutdown(ctx context.Context) error {
    // Close all filesystem watches
    s.watchMu.Lock()
    for id, w := range s.watches {
        w.Close()
        delete(s.watches, id)
    }
    s.watchMu.Unlock()

    // ... existing kill + stop logic ...
}

2. Shutdown Timeout Does Not Guarantee Component Completion (Correctness Note)

The 10-second shutdownCtx does not actually bound the shutdown. g.Wait() blocks until all errgroup goroutines complete regardless of the context expiring, and several shutdown paths ignore the context entirely:

  • FFmpegRecorder.Stop() calls shutdownInPhases(context.Background(), ...), allowing up to ~62 seconds for the signal cascade
  • finalizeRecording spawns a new ffmpeg remux process with context.WithoutCancel(ctx), which runs unbounded

In practice, the orchestrating layer drains recordings before initiating restart, so StopAll should not encounter active recordings. But if any component does fail to complete within the timeout, there is no recovery path. The errgroup collects errors but the only action is a log line. There's no mechanism to detect that a component is stuck, skip it, or retry cleanup on restart.

If the process hangs long enough, supervisorctl will SIGKILL it, but there's no visibility into what was left incomplete. Consider:

  • Logging which shutdown steps completed vs timed out
  • Adding a hard select deadline in main() after g.Wait() that force-exits if shutdown exceeds a maximum duration (see the sketch after this list)
  • On restart, scanning for orphaned state (e.g., leftover .mp4.tmp finalization files, stale unix sockets)
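
A sketch of that hard deadline, assuming the errgroup g from above (the timeout value and exit path are illustrative):

done := make(chan error, 1)
go func() { done <- g.Wait() }()
select {
case err := <-done:
    if err != nil {
        log.Printf("shutdown finished with error: %v", err)
    }
case <-time.After(30 * time.Second):
    log.Printf("shutdown exceeded hard deadline; forcing exit")
    os.Exit(1)
}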

3. CDP WebSocket Connections Are Not Closed During Shutdown (Recommended Fix)

srvDevtools.Shutdown() stops the listener and waits for non-hijacked connections to drain. But coder/websocket's Accept uses http.Hijack, so the HTTP server no longer tracks these connections. Existing CDP proxy sessions (two goroutines per connection, see proxyWebSocket) survive the server shutdown and continue until process death.

This means CDP clients get an unclean disconnect (TCP RST or similar) rather than a proper WebSocket close frame.
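
One possible approach, sketched: track accepted CDP sockets in a small registry and close them with a proper close frame during shutdown. The registry type is an assumption; the Close call matches github.com/coder/websocket.

type wsRegistry struct {
    mu    sync.Mutex
    conns map[*websocket.Conn]struct{}
}

func (r *wsRegistry) add(c *websocket.Conn)    { r.mu.Lock(); r.conns[c] = struct{}{}; r.mu.Unlock() }
func (r *wsRegistry) remove(c *websocket.Conn) { r.mu.Lock(); delete(r.conns, c); r.mu.Unlock() }

func (r *wsRegistry) closeAll() {
    r.mu.Lock()
    defer r.mu.Unlock()
    for c := range r.conns {
        // Send a close frame so CDP clients see a clean disconnect.
        _ = c.Close(websocket.StatusGoingAway, "server shutting down")
    }
}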

4. Recording State Is Lost Across Restarts (Recommended Improvement)

After restart, the FFmpegManager.recorders map is empty. Any recordings completed before the restart become invisible. The .mp4 files still exist on disk, but the API can't list, download, or delete them.




Brittleness Assessment

Yes, this architecture is brittle. Three reasons:

First, there is no explicit registry of "things that need cleanup during shutdown." Each feature author adds state independently (procs, watches, recorders, playwrightDaemonCmd, upstreamMgr) but there's no compile-time or runtime guarantee that Shutdown() handles all of them. This PR fixes processes and recordings but misses watches. Some components are intentionally left alive across restarts (the playwright daemon), but that decision is implicit rather than documented, making it hard to distinguish intentional omissions from accidental ones.

Second, if someone adds a new stateful component in the future (a persistent cache, a background sync loop, a new external process), they need to: (a) know that the API can be restarted mid-flight, (b) decide whether to add cleanup logic to Shutdown(), and (c) handle the case where state from a previous incarnation might exist. There's no framework or convention that enforces this.

Third, killAllProcesses sends SIGKILL directly, denying user-spawned processes the opportunity to clean up. For processes doing file I/O or holding locks, this can cause data corruption. A SIGTERM-first approach with a short timeout before SIGKILL (similar to what the FFmpeg recorder already does) would be more robust.
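
A sketch of that escalation (the helper name and grace period are illustrative assumptions; uses os/exec, syscall, and time):

// terminateWithGrace sends SIGTERM, waits briefly, then escalates to SIGKILL.
func terminateWithGrace(cmd *exec.Cmd, grace time.Duration) {
    if cmd.Process == nil {
        return
    }
    _ = cmd.Process.Signal(syscall.SIGTERM)
    done := make(chan struct{})
    go func() { _, _ = cmd.Process.Wait(); close(done) }()
    select {
    case <-done:
        // Exited on its own within the grace period.
    case <-time.After(grace):
        _ = cmd.Process.Kill() // escalate to SIGKILL
    }
}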


Recommendations

  1. Add a shutdown registry pattern: instead of hand-maintaining Shutdown(), introduce a hook slice that components register into at construction time. Components that are intentionally long-lived across restarts (like the playwright daemon) simply don't register. Shutdown() iterates and calls each hook. A sketch follows this list.

  2. Clean up filesystem watches in Shutdown() (see Finding 1).

  3. Use SIGTERM before SIGKILL for user-spawned processes. Give them 2-3 seconds to exit gracefully.

  4. Add observability to the shutdown path: log which steps completed, which timed out, and what was left incomplete. On restart, consider scanning for orphaned state (leftover .mp4.tmp finalization files, stale unix sockets) so the system can self-heal rather than silently lose track of resources.

  5. Add a comment or doc block on ApiService listing all stateful components and their shutdown contracts (cleaned up, intentionally preserved, or not applicable), so future contributors know what to maintain.

  6. Test the full shutdown path. The existing tests cover STZ drain and middleware well, but there are no integration tests for the Shutdown() orchestration itself (concurrent kill + recording stop + watch cleanup).
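
A sketch of the registry pattern from recommendation 1 (names and the struct layout are assumptions, not existing code; uses context and errors):

type shutdownHook func(ctx context.Context) error

type ApiService struct {
    // ... existing fields ...
    shutdownHooks []shutdownHook
}

// OnShutdown registers a cleanup hook at construction time. Components that are
// intentionally long-lived across restarts simply never register.
func (s *ApiService) OnShutdown(h shutdownHook) {
    s.shutdownHooks = append(s.shutdownHooks, h)
}

func (s *ApiService) Shutdown(ctx context.Context) error {
    var errs []error
    for _, h := range s.shutdownHooks {
        if err := h(ctx); err != nil {
            errs = append(errs, err)
        }
    }
    return errors.Join(errs...)
}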

@tnsardesai
Contributor Author

> I had opus do a broader inventory / analysis since the API server has a lot of moving pieces, some of which are inadvertently stateful.

imo the goal of the restart is to reset the state of the api: Shutdown() should gracefully reset pieces external to the api, but things internal to the api can simply be destroyed and reset (of course we don't want to leave orphaned goroutines and memory leaks). The restart should only be called when the VM is not in use anymore, like browser-pools release. Opened #151 to focus on skipping stz for local connections while I work on a broader solution for this.


@rgarcia rgarcia left a comment


solid work — drain/freeze semantics are clean and well-tested. left a few inline comments, mostly around naming and the supervisorctl exception. will also need a rebase onto main since the loopback middleware changes from #151 are already merged and this branch has its own copy.

return c.write(ctx, "-")
}

func (c *unikraftCloudController) Drain(ctx context.Context) error {

should this be called Shutdown instead of Drain to signal that it's terminal and only meant to be used during server shutdown? "drain" might be confused with just "set the count to 0" without the freeze semantics.

return oapi.ListRecorders200JSONResponse(infos), nil
}

// killAllProcesses sends SIGKILL to every tracked process that is still running.

this is new behavior — we'll need to reach out to heavy browser pool users and make sure they don't depend on process execs carrying over between session re-use.

if h.cmd.Process == nil {
continue
}
// supervisorctl handles the lifecycle of long running processes so we don't want to kill

not sure we should make an exception for supervisorctl here. if someone sent a process exec for a supervisorctl command it shouldn't matter — we're going to hard reset supervisor services anyway, right? or are we doing our own supervisorctl hard-reset of things like chromium during server shutdown? that feels a little weird but i could live with it — just want to make sure the reasoning is clear.

