
fix: scale-to-zero and shutdown improvements #150

Open
tnsardesai wants to merge 2 commits into main from
tanmay/kernel-965-billing-swap-debug-high-browser-usage-costs-from-orphaned

Conversation


@tnsardesai tnsardesai commented Feb 11, 2026

  • middleware should skip disabling stz for internal requests
  • add logging for stz file writes
  • add support for stz.Drain() which is meant to be called upon shutdown
  • shutdown now kills all running processes and calls stz.Drain()

Note

Medium Risk
Touches shutdown and process lifecycle plus STZ state transitions; mistakes could leave processes running or keep scale-to-zero incorrectly disabled/enabled under load or during restarts.

Overview
Scale-to-zero control is hardened for shutdown/restart scenarios. The scaletozero.Controller now supports Drain(), which resets STZ state and freezes further Disable/Enable calls; this is implemented across UnikraftCloudController, DebouncedController, Oncer, and NoopController, with added logging for control-file writes and comprehensive new tests.
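
For orientation, here is a minimal sketch of the drain/freeze contract described above; the interface shape and the drained-flag guard are illustrative assumptions, not the PR's actual code.

package scaletozero

import (
    "context"
    "sync"
)

// Controller is the assumed shape of the scale-to-zero control surface.
type Controller interface {
    Disable(ctx context.Context) error // add a hold that keeps the VM awake
    Enable(ctx context.Context) error  // release one hold
    Drain(ctx context.Context) error   // reset state and freeze further transitions
}

// frozenController illustrates the freeze semantics: after Drain, Disable
// becomes a no-op so nothing can re-disable scale-to-zero during shutdown.
type frozenController struct {
    mu      sync.Mutex
    drained bool
}

func (c *frozenController) Disable(ctx context.Context) error {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.drained {
        return nil // frozen: shutdown already reset the counter
    }
    // ... take the hold (e.g. write to the control file) here ...
    return nil
}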

Request/middleware and server shutdown behavior changes. STZ middleware now ignores loopback clients (so internal calls don’t prevent scale-to-zero), main invokes stz.Drain() during graceful shutdown, and ApiService.Shutdown() now concurrently stops recordings and SIGKILLs tracked running processes (skipping supervisorctl) while rejecting new ProcessSpawn requests once shutdown begins.
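
A rough illustration of the loopback bypass, reusing the Controller shape sketched above (the middleware name and wiring are assumptions; imports net and net/http):

// stzMiddleware holds off scale-to-zero for the duration of external requests,
// but lets loopback (internal) calls through without taking a hold.
func stzMiddleware(ctrl Controller, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        host, _, err := net.SplitHostPort(r.RemoteAddr)
        if err == nil {
            if ip := net.ParseIP(host); ip != nil && ip.IsLoopback() {
                next.ServeHTTP(w, r) // internal request: skip Disable/Enable
                return
            }
        }
        _ = ctrl.Disable(r.Context())
        defer func() { _ = ctrl.Enable(r.Context()) }()
        next.ServeHTTP(w, r)
    })
}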

Written by Cursor Bugbot for commit 03e7c29. This will update automatically on new commits.

@tnsardesai tnsardesai marked this pull request as ready for review February 11, 2026 20:46
@tnsardesai tnsardesai requested review from Sayan- and rgarcia February 11, 2026 20:46
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.


// The process was already started; kill it immediately.
_ = cmd.Process.Kill()
return oapi.ProcessSpawn500JSONResponse{InternalErrorJSONResponse: oapi.InternalErrorJSONResponse{Message: "server is shutting down"}}, nil
}

Shutdown spawn path leaks process lifecycle

Medium Severity

When ProcessSpawn hits s.shuttingDown after cmd.Start(), it returns immediately after cmd.Process.Kill() without waiting for the process to exit or running the normal cleanup. This skips the waiter path that closes pipes and balances s.stz.Enable, and it ignores a kill failure, so the killed process is left unreaped and its lifecycle untracked.
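
A sketch of what fuller cleanup on this path could look like (the shutdown flag, the stz field, and the response literal follow the snippet above; the rest is an assumption, not this repo's code):

if s.shuttingDown {
    // The process was already started; kill it, but still reap it and
    // release the STZ hold taken for this request.
    if err := cmd.Process.Kill(); err != nil {
        log.Printf("kill spawned process during shutdown: %v", err)
    }
    _ = cmd.Wait() // closes the stdio pipes and avoids a zombie
    s.stz.Enable(ctx)
    return oapi.ProcessSpawn500JSONResponse{InternalErrorJSONResponse: oapi.InternalErrorJSONResponse{Message: "server is shutting down"}}, nil
}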


o.mu.Unlock()
}
})
return o.drainedErr

Drain failures become permanently unrecoverable

Low Severity

Oncer.Drain uses sync.Once, so a transient error from o.ctrl.Drain is cached forever. Later calls return the same old error and never retry, even if the underlying condition recovers. That makes Drain effectively one-shot-fail and can prevent scaletozero from ever entering the drained state in that process.
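
One way to make this retryable, sketched against the fields visible in the snippet (the success-only latch and the drained field are assumptions):

func (o *Oncer) Drain(ctx context.Context) error {
    o.mu.Lock()
    defer o.mu.Unlock()
    if o.drained {
        return nil // a previous Drain already succeeded
    }
    if err := o.ctrl.Drain(ctx); err != nil {
        return err // transient failure: do not latch, let callers retry
    }
    o.drained = true
    return nil
}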


}
c.disabled = false
}


Drain skips underlying reset when already enabled

Medium Severity

DebouncedController.Drain only calls c.ctrl.Drain(ctx) when c.disabled is true. If c.disabled is false, shutdown marks only in-memory state as drained and never writes =0 via the underlying controller. This can leave the scale-to-zero control file unchanged, so scale-to-zero is not reliably re-enabled after shutdown.
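
A possible fix, sketched with assumed field names: delegate to the underlying controller unconditionally so its drained/frozen state is always recorded.

func (c *DebouncedController) Drain(ctx context.Context) error {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.drained = true
    c.disabled = false
    c.activeCount = 0
    // Always propagate, even when no hold is active, so the underlying
    // controller freezes and resets its own state too.
    return c.ctrl.Drain(ctx)
}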



@Sayan- Sayan- left a comment


high level changes lgtm and help forward fix some of the issues we've been seeing! this is a lot of changes though so please update pr description with how you tested these bits, esp edge cases.

defer c.mu.Unlock()

if !c.drained {
err := c.write(ctx, "=0")

=0 is a global reset of the Unikraft STZ counter — it clobbers holds from all writers, not just kernel-images-api. right now I think the only other writers in the scope of a VM are neko and the startup script. I'm not sure we can guarantee setting =0 is entirely safe since it could cause surprises to other consumers in the VM. I wonder if instead this Drain should only be responsible for undoing the stz counter for this process instead (e.g. -N in https://unikraft.com/docs/features/scale-to-zero#application-support-for-scale-to-zero). That seems more well scoped to this responsibility. Comparatively from our control plane we could always process exec and nuke the value ourselves if we want.
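
For comparison, a rough sketch of that per-process variant (field names and the write format are assumptions based on the linked Unikraft docs, not this PR's code; uses fmt):

func (c *unikraftCloudController) Drain(ctx context.Context) error {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.drained {
        return nil
    }
    if c.activeHolds > 0 {
        // Undo only the holds this process added to the shared STZ counter,
        // leaving other writers (neko, the startup script) untouched.
        if err := c.write(ctx, fmt.Sprintf("-%d", c.activeHolds)); err != nil {
            return err
        }
        c.activeHolds = 0
    }
    c.drained = true
    return nil
}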

defer shutdownCancel()
g, _ := errgroup.WithContext(shutdownCtx)

g.Go(func() error {

stz.Drain runs concurrently with srv.Shutdown here. this works but makes the state transitions harder to reason about — Drain races with in-flight requests' deferred Enable() calls. consider sequencing: drain HTTP servers first (letting all in-flight Enables run normally), then call stz.Drain after. at that point the controller is already at rest and Drain is just a safety net freeze. same outcome, easier to verify correctness.
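
Sketched, with server and variable names assumed:

// Drain HTTP first so every in-flight request runs its deferred Enable.
if err := srv.Shutdown(shutdownCtx); err != nil {
    log.Printf("http server shutdown: %v", err)
}
// The controller is now at rest; Drain is purely a safety-net freeze.
if err := stz.Drain(shutdownCtx); err != nil {
    log.Printf("scale-to-zero drain: %v", err)
}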


Sayan- commented Feb 11, 2026

I had opus do a broader inventory / analysis since the API server has a lot of moving pieces, some of which are inadvertently stateful. This isn't to say we shouldn't be taking this approach, but if we are we need to be mindful of the current and future complexity it introduces.

Deep Review: PR #150 - Scale-to-Zero and Shutdown Improvements

Summary

This PR hardens the kernel-images API for restart scenarios (driven by supervisorctl). The key changes are:

  1. Scale-to-zero Drain() - resets the STZ counter and freezes the controller so no further Disable/Enable calls can prevent the VM from scaling down
  2. Loopback middleware bypass - internal (127.0.0.1) requests no longer disable STZ
  3. killAllProcesses() - on shutdown, SIGKILLs all tracked spawned processes (except supervisorctl itself)
  4. shuttingDown gate - rejects new ProcessSpawn requests once shutdown begins

Your question is exactly right: this introduces an API restart cycle as a deliberate mechanism, and the API is deeply stateful. Below is an analysis of every stateful component and whether it survives, is cleanly torn down, or is silently lost.


Stateful Components Matrix

1. Process Map (procs) — in-memory map + goroutines (api.go:80). Risk: Medium.
   Cleaned up in shutdown: yes (PR adds killAllProcesses). Preserved across restart: no (all processes killed, map lost).
   Notes: processes are SIGKILL'd (no graceful shutdown). Waiter goroutines, retention timers, and stdout/stderr reader goroutines race the process exit but clean up once cmd.Wait() returns.

2. shuttingDown flag — in-memory bool (api.go:36). Risk: Low.
   Cleaned up in shutdown: set to true during shutdown. Preserved across restart: no (resets on restart).
   Notes: one-way latch, correct behavior.

3. Recording Manager (recorders map) — in-memory map + FFmpeg processes (main.go:93). Risk: Medium.
   Cleaned up in shutdown: yes (StopAll calls Stop per recorder: SIGINT, then SIGTERM, then SIGKILL). Preserved across restart: no (map lost, but .mp4 files persist on disk).
   Notes: after restart, no recorder is registered, so the API has no knowledge of previously-recorded files. Download/delete of old recordings becomes impossible via API.

4. FFmpeg Finalization (singleflight, finalizeComplete) — in-memory state + spawned ffmpeg remux process (ffmpeg.go:282). Risk: High.
   Cleaned up in shutdown: partially (StopAll triggers Stop(), which blocks on finalization). Preserved across restart: no.
   Notes: Stop() calls shutdownInPhases with context.Background(), allowing up to ~62 seconds for the signal cascade. Then finalizeRecording uses context.WithoutCancel to resist cancellation. The total time for StopAll is not bounded by the 10s shutdown context (see Finding 2 below).

5. Filesystem Watches (watches map) — in-memory map + fsnotify watchers + goroutines (fs.go:412). Risk: High.
   Cleaned up in shutdown: no (not cleaned up in Shutdown()). Preserved across restart: no (lost on restart).
   Notes: active fsnotify.Watcher handles and their forwarding goroutines are not torn down during shutdown. They continue until the process is killed by supervisorctl or the OS.

6. DevTools Upstream Manager — background goroutine (tail -f) + atomic URL + subscriber channels (main.go:78-79). Risk: Low.
   Cleaned up in shutdown: yes (Stop() cancels context, which kills the tail process). Preserved across restart: no (rediscovered on restart).
   Notes: clean shutdown.

7. WebSocket Proxy Connections (CDP) — per-connection pair of goroutines + 2 WebSocket conns (proxy.go:199). Risk: Medium.
   Cleaned up in shutdown: no (srvDevtools.Shutdown() stops the listener but does not close hijacked WebSocket connections). Preserved across restart: no.
   Notes: coder/websocket hijacks the underlying TCP connection via http.Hijack. Once hijacked, http.Server.Shutdown() no longer tracks the connection. Existing CDP sessions survive until the process is killed.

8. Neko Auth Client — cached auth token (main.go:86). Risk: Low.
   Cleaned up in shutdown: no explicit cleanup. Preserved across restart: no (token lost, re-authenticates on demand).
   Notes: token is lazily obtained. No concern.

9. Playwright Daemon (playwrightDaemonCmd) — external node process + unix socket (playwright.go:62-71). Risk: High.
   Cleaned up in shutdown: no (not killed in Shutdown()). Preserved across restart: no, but ensurePlaywrightDaemon will reconnect or re-launch.
   Notes: the daemon is stored in s.playwrightDaemonCmd, which is separate from the procs map. killAllProcesses only iterates procs, so the daemon is untouched. After restart, ensurePlaywrightDaemon tries the unix socket first. If the old daemon is still alive, it's reused. If it died, a new one starts.

10. Scale-to-Zero Controller — file-backed (/uk/libukp/scale_to_zero_disable) + in-memory debounce counters (main.go:51). Risk: Low.
    Cleaned up in shutdown: yes (PR adds Drain()). Preserved across restart: file state yes; in-memory counters no (start fresh at 0).
    Notes: this is the best-handled component. Drain() writes =0 to the file if STZ was currently disabled, and freezes the controller. On restart, a fresh DebouncedController starts with activeCount=0, matching the file state. If STZ was not disabled at drain time, no file write is needed since the file is already in the enabled state.

11. PTY Attach WebSocket Sessions — per-session: 4 goroutines + WebSocket conn + WaitGroup (process.go, HandleProcessAttachWS). Risk: Medium.
    Cleaned up in shutdown: indirectly (processes are killed, which closes the PTY and propagates errors to the goroutines). Preserved across restart: no.
    Notes: SIGKILL on the underlying process closes the PTY master fd, which causes the PTY read goroutines to error out. The WebSocket read goroutine also terminates once the write side errors. The WaitGroup ensures all 4 goroutines finish before the handler returns.

12. Input/Playwright Mutexes — sync.Mutex (api.go:46-49). Risk: Low.
    Cleaned up in shutdown: N/A (in-memory, no cleanup needed). Preserved across restart: no (fresh on restart).
    Notes: if a mutex is held when the process dies, it doesn't matter.

13. Process Reader Goroutines (stdout/stderr) — 2 goroutines per non-PTY process (process.go:344-373). Risk: Low.
    Cleaned up in shutdown: indirectly (pipe closes when process is killed). Preserved across restart: no.
    Notes: SIGKILL closes the process's end of the pipe, causing the readers to get EOF and exit.

14. Process Retention Timers — time.Sleep(10s) goroutines (process.go:429-434). Risk: Low.
    Cleaned up in shutdown: no (orphaned goroutines). Preserved across restart: no.
    Notes: these goroutines sleep 10s then delete from the map. On shutdown they're orphaned but harmless since the process is about to die.

15. Policy — stateless struct (api.go:84). Risk: None.
    Cleaned up in shutdown: N/A. Preserved across restart: N/A.
    Notes: stateless.



Critical Findings

1. Filesystem Watches Are Not Cleaned Up (Must Fix)

Shutdown() does not close filesystem watches. Each fsWatch holds an fsnotify.Watcher (which holds inotify/kqueue file descriptors), a forwarding goroutine, and a buffered event channel. On shutdown, these remain alive until the process is killed externally.

Recommendation: add watch cleanup to Shutdown():

func (s *ApiService) Shutdown(ctx context.Context) error {
    // Close all filesystem watches
    s.watchMu.Lock()
    for id, w := range s.watches {
        w.Close()
        delete(s.watches, id)
    }
    s.watchMu.Unlock()

    // ... existing kill + stop logic ...
}

2. Shutdown Timeout Does Not Guarantee Component Completion (Correctness Note)

The 10-second shutdownCtx does not actually bound the shutdown. g.Wait() blocks until all errgroup goroutines complete regardless of the context expiring, and several shutdown paths ignore the context entirely:

  • FFmpegRecorder.Stop() calls shutdownInPhases(context.Background(), ...), allowing up to ~62 seconds for the signal cascade
  • finalizeRecording spawns a new ffmpeg remux process with context.WithoutCancel(ctx), which runs unbounded

In practice, the orchestrating layer drains recordings before initiating restart, so StopAll should not encounter active recordings. But if any component does fail to complete within the timeout, there is no recovery path. The errgroup collects errors but the only action is a log line. There's no mechanism to detect that a component is stuck, skip it, or retry cleanup on restart.

If the process hangs long enough, supervisorctl will SIGKILL it, but there's no visibility into what was left incomplete. Consider:

  • Logging which shutdown steps completed vs timed out
  • Adding a hard select deadline in main() after g.Wait() that force-exits if shutdown exceeds a maximum duration (see the sketch after this list)
  • On restart, scanning for orphaned state (e.g., leftover .mp4.tmp finalization files, stale unix sockets)
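
A sketch of that hard deadline, assuming the errgroup g from above (the timeout value and exit path are illustrative):

done := make(chan error, 1)
go func() { done <- g.Wait() }()
select {
case err := <-done:
    if err != nil {
        log.Printf("shutdown finished with error: %v", err)
    }
case <-time.After(30 * time.Second):
    log.Printf("shutdown exceeded hard deadline; forcing exit")
    os.Exit(1)
}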

3. CDP WebSocket Connections Are Not Closed During Shutdown (Recommended Fix)

srvDevtools.Shutdown() stops the listener and waits for non-hijacked connections to drain. But coder/websocket's Accept uses http.Hijack, so the HTTP server no longer tracks these connections. Existing CDP proxy sessions (two goroutines per connection, see proxyWebSocket) survive the server shutdown and continue until process death.

This means CDP clients get an unclean disconnect (TCP RST or similar) rather than a proper WebSocket close frame.
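
One possible approach, sketched: track accepted CDP sockets in a small registry and close them with a proper close frame during shutdown. The registry type is an assumption; the Close call matches github.com/coder/websocket.

type wsRegistry struct {
    mu    sync.Mutex
    conns map[*websocket.Conn]struct{}
}

func (r *wsRegistry) add(c *websocket.Conn)    { r.mu.Lock(); r.conns[c] = struct{}{}; r.mu.Unlock() }
func (r *wsRegistry) remove(c *websocket.Conn) { r.mu.Lock(); delete(r.conns, c); r.mu.Unlock() }

func (r *wsRegistry) closeAll() {
    r.mu.Lock()
    defer r.mu.Unlock()
    for c := range r.conns {
        // Send a close frame so CDP clients see a clean disconnect.
        _ = c.Close(websocket.StatusGoingAway, "server shutting down")
    }
}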

4. Recording State Is Lost Across Restarts (Recommended Improvement)

After restart, the FFmpegManager.recorders map is empty. Any recordings completed before the restart become invisible. The .mp4 files still exist on disk, but the API can't list, download, or delete them.




Brittleness Assessment

Yes, this architecture is brittle. Three reasons:

First, there is no explicit registry of "things that need cleanup during shutdown." Each feature author adds state independently (procs, watches, recorders, playwrightDaemonCmd, upstreamMgr) but there's no compile-time or runtime guarantee that Shutdown() handles all of them. This PR fixes processes and recordings but misses watches. Some components are intentionally left alive across restarts (the playwright daemon), but that decision is implicit rather than documented, making it hard to distinguish intentional omissions from accidental ones.

Second, if someone adds a new stateful component in the future (a persistent cache, a background sync loop, a new external process), they need to: (a) know that the API can be restarted mid-flight, (b) decide whether to add cleanup logic to Shutdown(), and (c) handle the case where state from a previous incarnation might exist. There's no framework or convention that enforces this.

Third, killAllProcesses sends SIGKILL directly, denying user-spawned processes the opportunity to clean up. For processes doing file I/O or holding locks, this can cause data corruption. A SIGTERM-first approach with a short timeout before SIGKILL (similar to what the FFmpeg recorder already does) would be more robust.
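
A sketch of that escalation (the helper name and grace period are illustrative assumptions; uses os/exec, syscall, and time):

// terminateWithGrace sends SIGTERM, waits briefly, then escalates to SIGKILL.
func terminateWithGrace(cmd *exec.Cmd, grace time.Duration) {
    if cmd.Process == nil {
        return
    }
    _ = cmd.Process.Signal(syscall.SIGTERM)
    done := make(chan struct{})
    go func() { _, _ = cmd.Process.Wait(); close(done) }()
    select {
    case <-done:
        // Exited on its own within the grace period.
    case <-time.After(grace):
        _ = cmd.Process.Kill() // escalate to SIGKILL
    }
}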


Recommendations

  1. Add a shutdown registry pattern: instead of hand-maintaining Shutdown(), introduce a hook slice that components register into at construction time. Components that are intentionally long-lived across restarts (like the playwright daemon) simply don't register. Shutdown() iterates and calls each hook. A sketch follows this list.

  2. Clean up filesystem watches in Shutdown() (see Finding 1).

  3. Use SIGTERM before SIGKILL for user-spawned processes. Give them 2-3 seconds to exit gracefully.

  4. Add observability to the shutdown path: log which steps completed, which timed out, and what was left incomplete. On restart, consider scanning for orphaned state (leftover .mp4.tmp finalization files, stale unix sockets) so the system can self-heal rather than silently lose track of resources.

  5. Add a comment or doc block on ApiService listing all stateful components and their shutdown contracts (cleaned up, intentionally preserved, or not applicable), so future contributors know what to maintain.

  6. Test the full shutdown path. The existing tests cover STZ drain and middleware well, but there are no integration tests for the Shutdown() orchestration itself (concurrent kill + recording stop + watch cleanup).
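
A sketch of the registry pattern from recommendation 1 (names and the struct layout are assumptions, not existing code; uses context and errors):

type shutdownHook func(ctx context.Context) error

type ApiService struct {
    // ... existing fields ...
    shutdownHooks []shutdownHook
}

// OnShutdown registers a cleanup hook at construction time. Components that are
// intentionally long-lived across restarts simply never register.
func (s *ApiService) OnShutdown(h shutdownHook) {
    s.shutdownHooks = append(s.shutdownHooks, h)
}

func (s *ApiService) Shutdown(ctx context.Context) error {
    var errs []error
    for _, h := range s.shutdownHooks {
        if err := h(ctx); err != nil {
            errs = append(errs, err)
        }
    }
    return errors.Join(errs...)
}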

@tnsardesai
Contributor Author

> I had opus do a broader inventory / analysis since the API server has a lot of moving pieces, some of which are inadvertently stateful.

imo the goal of the restart is to reset the state of the api: Shutdown() should gracefully reset pieces external to the api, but things internal to the api can simply be destroyed and reset (of course we don't want to leave orphaned goroutines and memory leaks). The restart should only be called when the VM is not in use anymore, like browser-pools release. Opened #151 to focus on skipping stz for local connections while I work on a broader solution for this.


@rgarcia rgarcia left a comment


solid work — drain/freeze semantics are clean and well-tested. left a few inline comments, mostly around naming and the supervisorctl exception. will also need a rebase onto main since the loopback middleware changes from #151 are already merged and this branch has its own copy.

return c.write(ctx, "-")
}

func (c *unikraftCloudController) Drain(ctx context.Context) error {

should this be called Shutdown instead of Drain to signal that it's terminal and only meant to be used during server shutdown? "drain" might be confused with just "set the count to 0" without the freeze semantics.

return oapi.ListRecorders200JSONResponse(infos), nil
}

// killAllProcesses sends SIGKILL to every tracked process that is still running.

this is new behavior — we'll need to reach out to heavy browser pool users and make sure they don't depend on process execs carrying over between session re-use.

if h.cmd.Process == nil {
continue
}
// supervisorctl handles the lifecycle of long running processes so we don't want to kill

not sure we should make an exception for supervisorctl here. if someone sent a process exec for a supervisorctl command it shouldn't matter — we're going to hard reset supervisor services anyway, right? or are we doing our own supervisorctl hard-reset of things like chromium during server shutdown? that feels a little weird but i could live with it — just want to make sure the reasoning is clear.

