Observability — OpenTelemetry Tracing v1 (initiative tracking) #108

# Observability — OpenTelemetry Tracing v1

> **Type:** initiative tracking issue. **Total effort:** ~15–20 engineer-days across 7 phases. **Risk:** medium overall (Phase 3 is highest at medium-high; Phases 0 and 4 are low). **Depends on:** nothing (Phase 0 starts fresh). Canonical implementation guide: the `FORGE_OTEL_TRACING.md` design doc.

## Goal

Forge emits OpenTelemetry-compatible **traces** automatically so any OTLP backend — including the Initializ Platform — can ingest them. End-to-end multi-agent tracing across the A2A boundary is in scope for v1.

## Ruled decisions (do not re-litigate)

1. **Enablement gate**: tracing is active **iff `enabled == true` AND a non-empty OTLP endpoint resolves.** If `enabled == true` but no endpoint resolves, log a warning and install the **no-op tracer** — never crash the agent over telemetry config. If `enabled == false`, no-op regardless of endpoint.
2. **forge-core owns the otel dependency**, behind a no-op `TracerProvider` seam. Real OTLP provider lives in a new `forge-core/observability` subpackage and is injected from `forge-cli` (same pattern as audit: defined in core, wired in cli).
3. **Content capture (prompts/responses) is OFF by default** — enterprise opt-in only. The config toggle exists in v1 but the GenAI content events stay gated and default-inert.
4. **End-to-end multi-agent A2A propagation is in v1** (inbound `traceparent` extraction + outbound injection).
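The enablement gate in decision 1 can be sketched as a small selection function. This is a minimal illustration, not the Phase 0 seam itself: `Tracer`, `noopTracer`, `otlpTracer`, and `SelectTracer` are hypothetical names, and treating a whitespace-only endpoint as empty is an assumption.

```go
package main

import (
	"log"
	"strings"
)

// Tracer is a hypothetical stand-in for the runtime seam defined in Phase 0.
type Tracer interface {
	Name() string
}

// noopTracer is the default that is installed whenever tracing is inactive.
type noopTracer struct{}

func (noopTracer) Name() string { return "noop" }

// otlpTracer stands in for the real OTLP-backed provider (hypothetical).
type otlpTracer struct{ endpoint string }

func (otlpTracer) Name() string { return "otlp" }

// SelectTracer applies the gate: a real tracer only when tracing is enabled
// AND a non-empty endpoint resolved. Every other combination yields the no-op
// tracer, so a telemetry misconfiguration never stops the agent.
func SelectTracer(enabled bool, endpoint string) Tracer {
	endpoint = strings.TrimSpace(endpoint) // assumption: whitespace counts as empty
	switch {
	case !enabled:
		return noopTracer{}
	case endpoint == "":
		log.Println("WARN tracing enabled but no OTLP endpoint resolved; using no-op tracer")
		return noopTracer{}
	default:
		return otlpTracer{endpoint: endpoint}
	}
}
```

Returning a value instead of an error keeps the "never crash over telemetry config" rule visible in the signature: callers have no failure path to mishandle.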

## Invariants this work must not violate

- Single static binary; **OTel Go SDK is pure Go (no cgo / no OS deps)** — it vendors cleanly. No sidecar collector required (OTLP can target one, but it's optional).
- Outbound-only. The OTLP exporter's traffic is outbound and **must pass the egress enforcer** — the endpoint host is registered in the build-time allowlist and NetworkPolicy, and the exporter uses the egress-enforced transport.
- Audit NDJSON (`forge-core/runtime/audit.go`) is **unchanged in behavior** — traces are purely additive. The only audit edit is adding two optional cross-link fields (`trace_id` / `span_id`), and those are `omitempty`.
- **Capability-enumeration guardrails are respected:** spans carry the generic tool name (`cli_execute`), **never** the underlying binary name.
- forge-core stays a pure library: it may import otel, but the call sites depend on the runtime seam, not on cli/build packages.
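The audit invariant relies on `omitempty` to keep untraced output unchanged. The sketch below shows the effect with a trimmed-down, hypothetical `AuditEvent` struct; the real record in `forge-core/runtime/audit.go` has its own fields, and only the `trace_id` / `span_id` names come from this issue.

```go
package main

import "encoding/json"

// AuditEvent is a hypothetical, reduced audit record used for illustration.
type AuditEvent struct {
	Event   string `json:"event"`
	Tool    string `json:"tool,omitempty"`
	TraceID string `json:"trace_id,omitempty"` // cross-link field, absent when untraced
	SpanID  string `json:"span_id,omitempty"`  // cross-link field, absent when untraced
}

// MarshalLine renders one NDJSON line without the trailing newline.
func MarshalLine(e AuditEvent) (string, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

With empty `TraceID` and `SpanID`, the encoder omits both keys, so the line is the same as it would be without the new fields. Note that the tool field carries the generic name (`cli_execute`), in line with the capability-enumeration guardrail.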

## Out of scope for v1 (do not build)

- OTel **logs** signal.
- OTel **metrics** signal — **deferred to v1.1.** Span attributes already carry token counts and latency, so backends can derive aggregates from spans for now. v1.1 adds the two GenAI histograms (`gen_ai.client.token.usage` with `gen_ai.token.type`, and `gen_ai.client.operation.duration`) for native rollups.
- Trace visualization inside `forge ui`.
- Platform-side ingestion endpoint.
- Replacing or restructuring audit NDJSON.
- Sampling strategies beyond the standard OTel samplers.
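The metrics deferral above leans on span attributes already carrying token counts. A rough sketch of that idea follows, using plain maps rather than the OTel SDK types. `InferenceCall`, `InferenceSpanName`, `InferenceAttributes`, and `SumAttr` are hypothetical names; the attribute keys and the `chat {model}` span name are the ones listed in this issue.

```go
package main

// InferenceCall holds what the LLM call site knows once a request completes.
// The struct and its field names are assumptions for illustration.
type InferenceCall struct {
	Provider     string
	Model        string
	InputTokens  int64
	OutputTokens int64
}

// InferenceSpanName builds the "chat {model}" span name.
func InferenceSpanName(c InferenceCall) string {
	return "chat " + c.Model
}

// InferenceAttributes returns the usage attributes for one inference span.
// No prompt or response content is included.
func InferenceAttributes(c InferenceCall) map[string]any {
	return map[string]any{
		"gen_ai.provider.name":       c.Provider,
		"gen_ai.usage.input_tokens":  c.InputTokens,
		"gen_ai.usage.output_tokens": c.OutputTokens,
	}
}

// SumAttr shows how a backend could derive a rollup from span attributes:
// it adds up an int64 attribute across spans and skips anything else.
func SumAttr(spans []map[string]any, key string) int64 {
	var total int64
	for _, attrs := range spans {
		if v, ok := attrs[key].(int64); ok {
			total += v
		}
	}
	return total
}
```

The rollup here is done on the consumer side over every span, which is the inefficiency the v1.1 histograms are meant to remove.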

## Phase plan (each is its own issue + PR)

| Phase | Issue | Effort | Risk | Title |
|---|---|---|---|---|
| **Phase 0** | #101 | XS (~0.5 day) | low | Tracer seam in forge-core (no-op default) |
| **Phase 1** | #102 | S (2–3 days) | medium | OTLP provider subpackage (`forge-core/observability`) |
| **Phase 2** | #103 | S (2–3 days) | medium | Config resolution + CLI flags + runner wiring |
| **Phase 3** | #104 | M (5–7 days) | medium-high | Span instrumentation (session / llm / tool / scheduled) |
| **Phase 4** | #105 | XS–S (1–2 days) | low | Audit ↔ trace cross-linking (`trace_id` / `span_id`) |
| **Phase 5** | #106 | S (2–3 days) | medium | End-to-end A2A propagation (multi-agent traces) |
| **Phase 6** | #107 | S (2–3 days) | medium | Build-time manifest + egress wiring |

### Dependencies

```
Phase 0  →  Phase 1  →  Phase 2  →  Phase 3  →  Phase 4
                           ↓           ↓
                        Phase 6     Phase 5
```

Phases 4 and 5 can start once Phase 3 lands; Phase 6 needs only Phase 2. All three can then ship in parallel.

## Final acceptance (run after all phases)

1. `go work sync && go build ./... && go test ./...` clean.
2. Tracing off by default — audit NDJSON byte-identical to pre-change for an identical task.
3. `enabled:true` + endpoint → session/inference/tool spans land in Jaeger, correctly nested; the inference span is named `chat {model}` (CLIENT) with `gen_ai.provider.name` + `gen_ai.usage.input_tokens`/`output_tokens`; no binary names / no content.
4. `enabled:true` + no endpoint → WARN + agent healthy on noop.
5. Two-agent A2A chain → single connected trace.
6. Audit lines carry matching `trace_id` / `span_id` when traced, absent when not.
7. `forge build` with `tracing.enabled` (no committed endpoint) → dynamic `$OTEL_EXPORTER_OTLP_ENDPOINT` entry in `egress_allowlist.json` (source `otel`) + Deployment env reference (`configMapKeyRef`, `optional:true`) + ConfigMap stub; at runtime the enforcer allowlists the *host* extracted from the injected endpoint, matching the exporter destination with no drift.
8. `capture_content:false` (default) → zero prompt/response content on any span.
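Acceptance item 7 depends on the enforcer and the exporter agreeing on the same host. One way to derive that host from the injected endpoint is sketched below with the standard library only. `AllowlistHost` is a hypothetical helper, and the sketch assumes the endpoint is a full URL with a scheme, as `OTEL_EXPORTER_OTLP_ENDPOINT` values normally are; a bare `host:port` is rejected rather than guessed at.

```go
package main

import (
	"fmt"
	"net/url"
)

// AllowlistHost extracts the host an egress allowlist entry would need for
// the given OTLP endpoint URL. The port and path are dropped.
func AllowlistHost(endpoint string) (string, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", fmt.Errorf("parse OTLP endpoint: %w", err)
	}
	host := u.Hostname() // strips the port and any IPv6 brackets
	if host == "" {
		return "", fmt.Errorf("OTLP endpoint %q has no host", endpoint)
	}
	return host, nil
}
```

Deriving the allowlist entry and the exporter destination from the same parsed value is what avoids drift between the two.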

## Cross-references with other initiatives

- **FWS-3 (#87) — token usage / duration emission**: shares the LLM call site with Phase 3. FWS-3 captures usage / duration for the audit path; Phase 3 emits the same data as `gen_ai.*` span attributes. Single capture, multiple emission targets, independent failure domains. See FWS-3's "Relationship with OTel tracing" section for the full rationale.
- **FWS-2 (#86) — workflow correlation**: orthogonal but complementary. Workflow correlation IDs travel on `context.Context` the same way trace context will, and the audit emitter (post-FWS-3 rebase) routes through `EmitFromContext` so both correlation kinds auto-tag every event. Phase 5's `traceparent` extraction sits at the same dispatch boundary as FWS-2's `X-Workflow-*` extraction.
- **FWS-7 (#95) — audit export sink**: parallel observability path. OTel traces serve customer observability stacks (Datadog/Jaeger/SigNoz); audit NDJSON via the UDS/HTTP sink serves the Platform's cost aggregator and SIEM. Both pipelines share signal sources in Forge but emit independently. **Do not couple** the OTel exporter to the audit sink — when something interesting happens in Forge, instrumentation should emit to OTel AND to audit at the same call site, not tap one from the other.
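For readers unfamiliar with the `traceparent` header that Phase 5 extracts and injects, here is a standard-library sketch of the W3C Trace Context version `00` format (`00-<trace-id>-<parent-id>-<flags>`). It is illustrative only: `TraceParent`, `ParseTraceParent`, and `FormatTraceParent` are hypothetical names, and the actual implementation would more likely rely on the OTel SDK's propagator than on hand-rolled parsing.

```go
package main

import (
	"errors"
	"strconv"
	"strings"
)

// TraceParent is the decoded form of a version-00 traceparent header.
type TraceParent struct {
	TraceID  string // 32 lowercase hex characters
	ParentID string // 16 lowercase hex characters
	Sampled  bool   // lowest bit of the flags byte
}

var errInvalidTraceParent = errors.New("invalid traceparent")

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')) {
			return false
		}
	}
	return true
}

// ParseTraceParent decodes an inbound header. Only version 00 is handled.
func ParseTraceParent(header string) (TraceParent, error) {
	parts := strings.Split(strings.TrimSpace(header), "-")
	if len(parts) != 4 || parts[0] != "00" {
		return TraceParent{}, errInvalidTraceParent
	}
	traceID, parentID, flags := parts[1], parts[2], parts[3]
	if !isLowerHex(traceID, 32) || !isLowerHex(parentID, 16) || !isLowerHex(flags, 2) {
		return TraceParent{}, errInvalidTraceParent
	}
	// All-zero IDs are invalid per the specification.
	if traceID == strings.Repeat("0", 32) || parentID == strings.Repeat("0", 16) {
		return TraceParent{}, errInvalidTraceParent
	}
	f, err := strconv.ParseUint(flags, 16, 8)
	if err != nil {
		return TraceParent{}, errInvalidTraceParent
	}
	return TraceParent{TraceID: traceID, ParentID: parentID, Sampled: f&0x01 == 1}, nil
}

// FormatTraceParent builds the outbound header: the trace ID is kept and the
// calling agent's own span ID becomes the parent ID for the next hop.
func FormatTraceParent(tp TraceParent, spanID string) string {
	flags := "00"
	if tp.Sampled {
		flags = "01"
	}
	return "00-" + tp.TraceID + "-" + spanID + "-" + flags
}
```

Keeping the trace ID constant across the hop while swapping in the local span ID is what makes a two-agent chain show up as a single connected trace.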

## v1.1 roadmap (not in this initiative)

OTel **metrics** signal — the two GenAI histograms (`gen_ai.client.token.usage` with `gen_ai.token.type`, and `gen_ai.client.operation.duration`) for native rollups. Backends can derive rollups from v1's span attributes today; v1.1 makes the cost-aggregation path more efficient. File when v1 ships.

