Commit d541cae

feat(supervisor): wide events + warm-start trace propagation (#3669)
Adds wide-event observability for the supervisor: one flat-keyed JSON line per dequeue iteration, workload-server route, and run socket lifecycle event. Events carry `trace_id` sourced from the inbound W3C traceparent plus `meta.run_id` and related identifiers, so they join across services by run. The outbound warm-start POST also forwards the inbound traceparent so the upstream receiver continues the same trace instead of minting a new one.

Off by default behind `TRIGGER_WIDE_EVENTS_ENABLED`. With the flag off, no events are emitted, no ALS state is allocated, and the outbound warm-start request is unchanged; every call site was audited to confirm the off path is byte-identical to current behavior.

Dequeue-path phase timings are recorded under `phase.<name>.duration_ms`: `restore`, `warm_start`, `workload_create`. A `path_taken` extra distinguishes `restore` / `warm_start` / `cold_create` / `skipped_no_image`.

Refs TRI-9480.
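To make the event shape described above concrete, here is a minimal sketch of how one flat-keyed event for a dequeue iteration might be assembled. Only the key names `trace_id`, `meta.run_id`, `phase.<name>.duration_ms` and the `path_taken` values come from the commit message; the function names, the input shape, the placement of `path_taken` as a top-level key, and the simplified traceparent validation are assumptions for illustration, not the supervisor's actual code.

```typescript
// Hypothetical sketch: helper names and input shape are not from the commit.
type WideEvent = Record<string, string | number | boolean>;

// W3C traceparent layout: version-traceid-parentid-flags, all lowercase hex.
// Simplified check; the full spec has further rules (e.g. version "ff" is invalid).
const TRACEPARENT_RE = /^[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}$/;

function traceIdFromTraceparent(header: string | undefined): string | undefined {
  if (!header) return undefined;
  const traceId = TRACEPARENT_RE.exec(header.trim())?.[1];
  // An all-zero trace id is invalid, so treat it as absent.
  if (!traceId || /^0+$/.test(traceId)) return undefined;
  return traceId;
}

type DequeuePhase = "restore" | "warm_start" | "workload_create";
type PathTaken = "restore" | "warm_start" | "cold_create" | "skipped_no_image";

function buildDequeueEvent(input: {
  traceparent: string | undefined;
  runId: string;
  pathTaken: PathTaken;
  phaseDurationsMs: Partial<Record<DequeuePhase, number>>;
}): WideEvent {
  // Flat keys only: dots are part of the key name, not nesting.
  const event: WideEvent = {
    "meta.run_id": input.runId,
    path_taken: input.pathTaken, // assumed key placement
  };

  const traceId = traceIdFromTraceparent(input.traceparent);
  if (traceId) event["trace_id"] = traceId;

  // Record only the phases that actually ran on this iteration.
  for (const [phase, ms] of Object.entries(input.phaseDurationsMs)) {
    if (typeof ms === "number") event[`phase.${phase}.duration_ms`] = ms;
  }
  return event;
}
```

Serializing the result with `JSON.stringify(event)` yields a single line, which is what lets a log pipeline treat each unit of work as one record and join records on `trace_id` or `meta.run_id`.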
1 parent 4f8cf4c commit d541cae

25 files changed

Lines changed: 2036 additions & 375 deletions

.server-changes/README.md

Lines changed: 8 additions & 0 deletions
@@ -38,6 +38,14 @@ Speed up batch queue processing by removing stalls and fixing retry race
 
 The body text (below the frontmatter) is a one-line description of the change. Keep it concise — it will appear in release notes.
 
+### Writing guidance
+
+These entries are public-facing - they ship verbatim in user-visible release notes. A few rules to keep them clean:
+
+- **One sentence is usually enough.** The body is the bullet in the changelog. If you need a paragraph, you're probably describing the implementation rather than the change.
+- **Describe behavior, not implementation.** Skip internal scopes, middleware names, library specifics, framework internals. Users care about what's different for them, not how it's wired.
+- **Never name internal tools or infra.** Observability stacks, internal services, infra components, monitoring backends, CI surfaces, AWS specifics - none of these belong in user-facing notes.
+
 ## Lifecycle
 
 1. Engineer adds a `.server-changes/` file in their PR

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+---
+area: supervisor
+type: feature
+---
+
+Optional structured event logging for the supervisor - one canonical event per request and per run lifecycle step, with trace context propagated to downstream services so distributed traces stay continuous. Off by default behind `TRIGGER_WIDE_EVENTS_ENABLED`.
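
The trace propagation described in this entry can be sketched as a small header-building helper. This is an illustrative assumption rather than code from the commit: `warmStartHeaders` and its parameters are hypothetical names. It shows the property the commit message calls out, namely that with the flag off the outbound request is left untouched.

```typescript
// Hypothetical helper: the name and signature are assumptions, not from the commit.
function warmStartHeaders(
  baseHeaders: Record<string, string>,
  inboundTraceparent: string | undefined,
  wideEventsEnabled: boolean,
): Record<string, string> {
  // Flag off, or no inbound trace context: hand back the very same object,
  // so the outbound request is identical to the pre-change behavior.
  if (!wideEventsEnabled || !inboundTraceparent) return baseHeaders;

  // Flag on: forward the inbound traceparent unchanged so the receiver
  // continues the existing trace instead of starting a new one.
  return { ...baseHeaders, traceparent: inboundTraceparent };
}
```

Returning the original object on the off path, rather than a copy, keeps the disabled case trivially easy to audit: nothing is allocated and nothing is added.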

apps/supervisor/src/env.ts

Lines changed: 9 additions & 0 deletions
@@ -256,6 +256,15 @@ const Env = z
     // Debug
     DEBUG: BoolEnv.default(false),
     SEND_RUN_DEBUG_LOGS: BoolEnv.default(false),
+
+    // Wide-event observability - off by default. Emits one flat-keyed JSON
+    // line per natural unit of work (dequeue iteration, HTTP request, socket
+    // lifecycle). High-QPS hotpath, so the kill switch must be honoured.
+    TRIGGER_WIDE_EVENTS_ENABLED: BoolEnv.default(false),
+    // When true, also emit wide events for high-frequency HTTP routes
+    // (heartbeat, snapshots-since, logs/debug). Off in prod to keep event
+    // volume manageable; on in test environments for full-fidelity debugging.
+    TRIGGER_WIDE_EVENTS_NOISY_ROUTES: BoolEnv.default(false),
   })
   .superRefine((data, ctx) => {
     if (data.COMPUTE_SNAPSHOTS_ENABLED && !data.TRIGGER_METADATA_URL) {
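
How the two flags above might interact at a route handler can be sketched as follows. The flag semantics and the noisy route names (heartbeat, snapshots-since, logs/debug) come from the comments in the diff; the `WideEventFlags` shape, the `shouldEmitRouteEvent` function, the substring matching, and the example paths are assumptions made for illustration only.

```typescript
// Hypothetical sketch: type and function names are not from the commit.
interface WideEventFlags {
  enabled: boolean; // mirrors TRIGGER_WIDE_EVENTS_ENABLED
  noisyRoutes: boolean; // mirrors TRIGGER_WIDE_EVENTS_NOISY_ROUTES
}

// Route fragments the env comments describe as high-frequency.
// Matching by substring is an assumption about how routes are identified.
const NOISY_ROUTE_MARKERS = ["heartbeat", "snapshots-since", "logs/debug"] as const;

function shouldEmitRouteEvent(flags: WideEventFlags, routePath: string): boolean {
  // Kill switch first: with the main flag off nothing is emitted at all.
  if (!flags.enabled) return false;

  const isNoisy = NOISY_ROUTE_MARKERS.some((marker) => routePath.includes(marker));
  // Noisy routes need the second opt-in; every other route emits as usual.
  return !isNoisy || flags.noisyRoutes;
}
```

Checking the main flag before anything else keeps the disabled path to a single boolean test, which matters on a high-QPS hot path.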
