
Commit 9cb3dcb

feat(supervisor): compute workload manager (#3114)
Adds the `ComputeWorkloadManager` for routing task execution through the compute gateway, including full checkpoint/restore support, OTel trace integration, and template pre-warming.

## Changes

**Compute workload manager** (`apps/supervisor/src/workloadManager/compute.ts`)
- Routes instance create, snapshot, delete, and restore through the compute gateway API
- Wide event logging on create with full timing and context
- Configurable gateway timeout, auth token, image digest stripping

**Compute snapshot service** (`apps/supervisor/src/services/computeSnapshotService.ts`)
- Timer wheel for delayed snapshot dispatch (avoids wasted work on short-lived waitpoints)
- Configurable dispatch concurrency limit (`COMPUTE_SNAPSHOT_DISPATCH_LIMIT`)
- Snapshot-complete callback handler with suspend completion reporting
- Trace context management and OTel span emission for snapshot operations

**OTel trace service** (`apps/supervisor/src/services/otlpTraceService.ts`)
- Fire-and-forget OTLP span emission for compute operations (provision, restore, snapshot)
- BigInt nanosecond conversion preserving sub-ms precision for span ordering

**Template creation** (`apps/webapp/app/v3/services/computeTemplateCreation.server.ts`)
- Three-mode rollout: required (MICROVM projects), shadow (feature flag / percentage), skip
- Integrated into deploy finalize flow

**Shared compute package** (`internal-packages/compute/`)
- Gateway client with namespace-based API (instances, templates, snapshots)
- Zod schemas for all gateway request/response types

**Database**
- `COMPUTE` variant added to `TaskRunCheckpointType` enum
- `WorkloadType` enum and column on `WorkerInstanceGroup`
- `hasComputeAccess` feature flag

**Env / config**
- Compute gateway URL, auth token, timeout
- Snapshot enable flag, delay, dispatch limit
- Dedicated OTLP endpoint for compute spans (`COMPUTE_TRACE_OTLP_ENDPOINT`)
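The commit message mentions BigInt nanosecond conversion that preserves sub-millisecond precision. The following is a minimal sketch of that idea, not the code in `otlpTraceService.ts`. It assumes timestamps arrive as fractional epoch milliseconds, and the helper name `msToUnixNano` is hypothetical.

```typescript
// Hypothetical helper (assumption): not taken from otlpTraceService.ts.
// Converts fractional epoch milliseconds into a nanosecond string,
// the shape used by OTLP/JSON fields such as startTimeUnixNano.
function msToUnixNano(epochMs: number): string {
  // Handle the whole-millisecond part with BigInt so the large integer
  // portion never goes through a lossy floating-point multiply.
  const wholeMs = Math.floor(epochMs);
  // Only the small fractional part is scaled as a float.
  const fractionalNs = Math.round((epochMs - wholeMs) * 1_000_000);
  const nanos = BigInt(wholeMs) * BigInt(1_000_000) + BigInt(fractionalNs);
  return nanos.toString();
}

// Two timestamps a quarter of a millisecond apart stay distinct and ordered.
const earlier = msToUnixNano(1711200000000);
const later = msToUnixNano(1711200000000.25);
console.log(earlier, later);
```

Splitting the value before scaling is the design choice here: a direct `epochMs * 1e6` exceeds `Number.MAX_SAFE_INTEGER` for present-day timestamps, so it cannot represent every nanosecond value exactly.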
1 parent 2366b21 commit 9cb3dcb


44 files changed, +2500 -294 lines changed

.changeset/fix-local-build-load.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+---
+"trigger.dev": patch
+---
+
+Fix `--load` flag being silently ignored on local/self-hosted builds.
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
+---
+name: span-timeline-events
+description: Use when adding, modifying, or debugging OTel span timeline events in the trace view. Covers event structure, ClickHouse storage constraints, rendering in SpanTimeline component, admin visibility, and the step-by-step process for adding new events.
+allowed-tools: Read, Write, Edit, Glob, Grep, Bash
+---
+
+# Span Timeline Events
+
+The trace view's right panel shows a timeline of events for the selected span. These are OTel span events rendered by `app/utils/timelineSpanEvents.ts` and the `SpanTimeline` component.
+
+## How They Work
+
+1. **Span events** in OTel are attached to a parent span. In ClickHouse, they're stored as separate rows with `kind: "SPAN_EVENT"` sharing the parent span's `span_id`. The `#mergeRecordsIntoSpanDetail` method reassembles them into the span's `events` array at query time.
+2. The timeline only renders events whose `name` starts with `trigger.dev/` - all others are silently filtered out.
+3. The **display name** comes from `properties.event` (not the span event name), mapped through `getFriendlyNameForEvent()`.
+4. Events are shown on the **span they belong to** - events on one span don't appear in another span's timeline.
+
+## ClickHouse Storage Constraint
+
+When events are written to ClickHouse, `spanEventsToTaskEventV1Input()` filters out events whose `start_time` is not greater than the parent span's `startTime`. Events at or before the span start are silently dropped. This means span events must have timestamps strictly after the span's own `startTimeUnixNano`.
+
+## Timeline Rendering (SpanTimeline component)
+
+The `SpanTimeline` component in `app/components/run/RunTimeline.tsx` renders:
+
+1. **Events** (thin 1px line with hollow dots) - all events from `createTimelineSpanEventsFromSpanEvents()`
+2. **"Started"** marker (thick cap) - at the span's `startTime`
+3. **Duration bar** (thick 7px line) - from "Started" to "Finished"
+4. **"Finished"** marker (thick cap) - at `startTime + duration`
+
+The thin line before "Started" only appears when there are events with timestamps between the span start and the first child span. For the Attempt span this works well (Dequeued -> Pod scheduled -> Launched -> etc. all happen before execution starts). Events all get `lineVariant: "light"` (thin) while the execution bar gets `variant: "normal"` (thick).
+
+## Trace View Sort Order
+
+Sibling spans (same parent) are sorted by `start_time ASC` from the ClickHouse query. The `createTreeFromFlatItems` function preserves this order. Event timestamps don't affect sort order - only the span's own `start_time`.
+
+## Event Structure
+
+```typescript
+// OTel span event format
+{
+  name: "trigger.dev/run", // Must start with "trigger.dev/" to render
+  timeUnixNano: "1711200000000000000",
+  attributes: [
+    { key: "event", value: { stringValue: "dequeue" } }, // The actual event type
+    { key: "duration", value: { intValue: 150 } }, // Optional: duration in ms
+  ]
+}
+```
+
+## Admin-Only Events
+
+`getAdminOnlyForEvent()` controls visibility. Events default to **admin-only** (`true`).
+
+| Event | Admin-only | Friendly name |
+|-------|-----------|---------------|
+| `dequeue` | No | Dequeued |
+| `fork` | No | Launched |
+| `import` | No (if no fork event) | Importing task file |
+| `create_attempt` | Yes | Attempt created |
+| `lazy_payload` | Yes | Lazy attempt initialized |
+| `pod_scheduled` | Yes | Pod scheduled |
+| (default) | Yes | (raw event name) |
+
+## Adding New Timeline Events
+
+1. Add OTLP span event with `name: "trigger.dev/<scope>"` and `properties.event: "<type>"`
+2. Event timestamp must be strictly after the parent span's `startTimeUnixNano` (ClickHouse drops earlier events)
+3. Add friendly name in `getFriendlyNameForEvent()` in `app/utils/timelineSpanEvents.ts`
+4. Set admin visibility in `getAdminOnlyForEvent()`
+5. Optionally add help text in `getHelpTextForEvent()`
+
+## Key Files
+
+- `app/utils/timelineSpanEvents.ts` - filtering, naming, admin logic
+- `app/components/run/RunTimeline.tsx` - `SpanTimeline` component (thin line + thick bar rendering)
+- `app/presenters/v3/SpanPresenter.server.ts` - loads span data including events
+- `app/v3/eventRepository/clickhouseEventRepository.server.ts` - `spanEventsToTaskEventV1Input()` (storage filter), `#mergeRecordsIntoSpanDetail` (reassembly)
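The rules described in the skill file above (strictly-after-start storage filter, `trigger.dev/` name prefix, friendly names, admin visibility) can be sketched in one place. This is an illustrative sketch only: the types and the function `toTimelineEntries` are hypothetical, and in the repository these steps run at different layers (write time in the ClickHouse repository, render time in `timelineSpanEvents.ts`). Whether the `import` rule looks at all events or only stored ones is an assumption here.

```typescript
// Illustrative shapes and names (assumptions); the real logic lives in
// app/utils/timelineSpanEvents.ts and clickhouseEventRepository.server.ts.
type SpanEvent = {
  name: string; // e.g. "trigger.dev/run"
  timeUnixNano: string;
  properties: { event: string };
};

type TimelineEntry = { label: string; adminOnly: boolean; timeUnixNano: string };

// Friendly names copied from the Admin-Only Events table.
const FRIENDLY_NAMES: Record<string, string> = {
  dequeue: "Dequeued",
  fork: "Launched",
  import: "Importing task file",
  create_attempt: "Attempt created",
  lazy_payload: "Lazy attempt initialized",
  pod_scheduled: "Pod scheduled",
};

function toTimelineEntries(spanStartUnixNano: string, events: SpanEvent[]): TimelineEntry[] {
  const spanStart = BigInt(spanStartUnixNano);
  const kept = events
    // Storage constraint: events at or before the span start are dropped.
    .filter((e) => BigInt(e.timeUnixNano) > spanStart)
    // Rendering filter: only names starting with "trigger.dev/" are shown.
    .filter((e) => e.name.startsWith("trigger.dev/"));

  // Assumption: "if no fork event" is evaluated over the kept events.
  const hasFork = kept.some((e) => e.properties.event === "fork");

  return kept.map((e) => {
    const type = e.properties.event;
    const isPublic =
      type === "dequeue" || type === "fork" || (type === "import" && !hasFork);
    return {
      label: FRIENDLY_NAMES[type] ?? type, // default: raw event name
      adminOnly: !isPublic, // default: admin-only
      timeUnixNano: e.timeUnixNano,
    };
  });
}

const spanStart = "1711200000000000000";
const entries = toTimelineEntries(spanStart, [
  { name: "trigger.dev/run", timeUnixNano: spanStart, properties: { event: "dequeue" } },
  { name: "trigger.dev/run", timeUnixNano: "1711200000000000001", properties: { event: "dequeue" } },
  { name: "other/scope", timeUnixNano: "1711200000150000000", properties: { event: "fork" } },
  { name: "trigger.dev/run", timeUnixNano: "1711200000200000000", properties: { event: "import" } },
  { name: "trigger.dev/run", timeUnixNano: "1711200000300000000", properties: { event: "custom" } },
]);
console.log(entries);
```

In the usage above, the first event shares the span's start timestamp and the third has a non-`trigger.dev/` name, so neither reaches the timeline.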
Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
---
2+
area: webapp
3+
type: feature
4+
---
5+
6+
Pre-warm compute templates on deploy for orgs with compute access. Required for projects using a compute region, background-only for others.

CLAUDE.md

Lines changed: 4 additions & 0 deletions
@@ -80,6 +80,10 @@ pnpm run changeset:add
 
 When modifying only server components (`apps/webapp/`, `apps/supervisor/`, etc.) with no package changes, add a `.server-changes/` file instead. See `.server-changes/README.md` for format and documentation.
 
+## Dependency Pinning
+
+Zod is pinned to a single version across the entire monorepo (currently `3.25.76`). When adding zod to a new or existing package, use the **exact same version** as the rest of the repo - never a different version or a range. Mismatched zod versions cause runtime type incompatibilities (e.g., schemas from one package can't be used as body validators in another).
+
 ## Architecture Overview
 
 ### Request Flow
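The pinning rule added to `CLAUDE.md` above could be checked mechanically. The sketch below is hypothetical and not part of the repository: `findZodMismatches` and its input shape (package name mapped to its declared zod version spec) are assumptions made for illustration. Only the version `3.25.76` comes from the diff.

```typescript
// Hypothetical check (assumption): not a script that exists in the repo.
// The pinned version is the one stated in the CLAUDE.md hunk.
const PINNED_ZOD = "3.25.76";

// Takes a map of package name -> declared zod version spec and returns
// a description of every entry that is not the exact pinned version.
function findZodMismatches(specs: Record<string, string>): string[] {
  return Object.entries(specs)
    // Ranges such as "^3.25.76" count as mismatches: the rule asks for
    // the exact same version, never a range.
    .filter(([, spec]) => spec !== PINNED_ZOD)
    .map(([pkg, spec]) => `${pkg}: ${spec}`);
}

// The package names below are made up for the example.
const mismatches = findZodMismatches({
  "pkg-a": "3.25.76",
  "pkg-b": "^3.25.76",
  "pkg-c": "3.24.0",
});
console.log(mismatches);
```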

apps/supervisor/package.json

Lines changed: 2 additions & 0 deletions
@@ -14,9 +14,11 @@
   },
   "dependencies": {
     "@aws-sdk/client-ecr": "^3.839.0",
+    "@internal/compute": "workspace:*",
     "@kubernetes/client-node": "^1.0.0",
     "@trigger.dev/core": "workspace:*",
     "dockerode": "^4.0.6",
+    "p-limit": "^6.2.0",
     "prom-client": "^15.1.0",
     "socket.io": "4.7.4",
     "std-env": "^3.8.0",
