Observability — OpenTelemetry Tracing v1 (initiative tracking) #108

@initializ-mk

Observability — OpenTelemetry Tracing v1

Type: initiative tracking issue.
Total effort: ~15–20 engineer-days across 7 phases.
Risk: medium overall (Phase 3 is highest at medium-high; Phases 0 and 4 are low).
Depends on: nothing (Phase 0 starts fresh).
Canonical implementation guide: the FORGE_OTEL_TRACING.md design doc.

Goal

Forge emits OpenTelemetry-compatible traces automatically so any OTLP backend — including the Initializ Platform — can ingest them. End-to-end multi-agent tracing across the A2A boundary is in scope for v1.

Ruled decisions (do not re-litigate)

  1. Enablement gate: tracing is active iff enabled == true AND a non-empty OTLP endpoint resolves. If enabled == true but no endpoint resolves, log a warning and install the no-op tracer — never crash the agent over telemetry config. If enabled == false, no-op regardless of endpoint.
  2. forge-core owns the otel dependency, behind a no-op TracerProvider seam. Real OTLP provider lives in a new forge-core/observability subpackage and is injected from forge-cli (same pattern as audit: defined in core, wired in cli).
  3. Content capture (prompts/responses) is OFF by default — enterprise opt-in only. The config toggle exists in v1 but the GenAI content events stay gated and default-inert.
  4. End-to-end multi-agent A2A propagation is in v1 (inbound traceparent extraction + outbound injection).
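
The enablement gate in decision 1 reduces to a small pure function. The following is a minimal sketch under stated assumptions, not the actual Forge code: `TracingMode`, `resolveTracingMode`, and the choice to treat a whitespace-only endpoint as empty are all hypothetical.

```go
package main

import (
	"fmt"
	"log"
	"strings"
)

// TracingMode is a hypothetical name for the outcome of the enablement gate.
type TracingMode int

const (
	ModeNoop TracingMode = iota // install the no-op tracer
	ModeOTLP                    // install the real OTLP provider
)

// resolveTracingMode applies ruled decision 1. It returns the mode to install
// and whether a warning should be logged. The name and signature are
// illustrative only.
func resolveTracingMode(enabled bool, endpoint string) (mode TracingMode, warn bool) {
	if !enabled {
		return ModeNoop, false // disabled: no-op regardless of endpoint
	}
	// Assumption: a whitespace-only endpoint counts as "no endpoint resolved".
	if strings.TrimSpace(endpoint) == "" {
		return ModeNoop, true // enabled but unresolved: warn, never crash
	}
	return ModeOTLP, false
}

func main() {
	mode, warn := resolveTracingMode(true, "")
	if warn {
		log.Println("WARN: tracing enabled but no OTLP endpoint resolved; using no-op tracer")
	}
	fmt.Println("noop installed:", mode == ModeNoop)
}
```

Keeping the gate free of side effects means the warning and the provider installation stay with the caller, which makes the three cases easy to cover in a table test.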

Invariants this work must not violate

  • Single static binary; OTel Go SDK is pure Go (no cgo / no OS deps) — it vendors cleanly. No sidecar collector required (OTLP can target one, but it's optional).
  • Outbound-only. The OTLP exporter's traffic is outbound and must pass the egress enforcer — the endpoint host is registered in the build-time allowlist and NetworkPolicy, and the exporter uses the egress-enforced transport.
  • Audit NDJSON (forge-core/runtime/audit.go) is unchanged in behavior — traces are purely additive. The only audit edit is adding two optional cross-link fields (trace_id / span_id), and those are omitempty.
  • Capability-enumeration guardrails are respected: spans carry the generic tool name (cli_execute), never the underlying binary name.
  • forge-core stays a pure library: it may import otel, but the call sites depend on the runtime seam, not on cli/build packages.
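
To illustrate why the audit invariant holds, the sketch below shows how `omitempty` keeps an untraced audit line unchanged while a traced line gains the two cross-link fields. The `auditEvent` struct and its `event` field are hypothetical stand-ins; only the `trace_id` / `span_id` names come from this issue.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// auditEvent is a hypothetical, trimmed-down stand-in for the real audit
// record. Only the trace_id and span_id JSON names are taken from the issue.
type auditEvent struct {
	Event   string `json:"event"`
	TraceID string `json:"trace_id,omitempty"`
	SpanID  string `json:"span_id,omitempty"`
}

// encodeAudit renders one NDJSON line (without the trailing newline).
func encodeAudit(e auditEvent) string {
	b, err := json.Marshal(e)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// Untraced: the cross-link fields are omitted entirely.
	fmt.Println(encodeAudit(auditEvent{Event: "tool_call"}))

	// Traced: both fields appear after the existing ones.
	fmt.Println(encodeAudit(auditEvent{
		Event:   "tool_call",
		TraceID: "4bf92f3577b34da6a3ce929d0e0e4736",
		SpanID:  "00f067aa0ba902b7",
	}))
}
```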

Out of scope for v1 (do not build)

  • OTel logs signal.
  • OTel metrics signal — deferred to v1.1. Span attributes already carry token counts and latency, so backends can derive aggregates from spans for now. v1.1 adds the two GenAI histograms (gen_ai.client.token.usage with gen_ai.token.type, and gen_ai.client.operation.duration) for native rollups.
  • Trace visualization inside forge ui.
  • Platform-side ingestion endpoint.
  • Replacing or restructuring audit NDJSON.
  • Sampling strategies beyond the standard OTel samplers.

Phase plan (each is its own issue + PR)

| Phase | Issue | Effort | Risk | Title |
|---|---|---|---|---|
| Phase 0 | #101 | XS (~0.5 day) | low | Tracer seam in forge-core (no-op default) |
| Phase 1 | #102 | S (2–3 days) | medium | OTLP provider subpackage (forge-core/observability) |
| Phase 2 | #103 | S (2–3 days) | medium | Config resolution + CLI flags + runner wiring |
| Phase 3 | #104 | M (5–7 days) | medium-high | Span instrumentation (session / llm / tool / scheduled) |
| Phase 4 | #105 | XS–S (1–2 days) | low | Audit ↔ trace cross-linking (trace_id / span_id) |
| Phase 5 | #106 | S (2–3 days) | medium | End-to-end A2A propagation (multi-agent traces) |
| Phase 6 | #107 | S (2–3 days) | medium | Build-time manifest + egress wiring |
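
For orientation, the Phase 0 seam could look roughly like the sketch below. It is stdlib-only and deliberately simplified: the real seam sits behind OTel's TracerProvider, and every type and function name here (`Tracer`, `Span`, `Runtime`, `NewRuntime`, `recordingTracer`, `demoRun`) is hypothetical. What it shows is the pattern from decision 2: the runtime depends on an interface with a no-op default, and a real implementation is injected from outside.

```go
package main

import (
	"context"
	"fmt"
)

// Span and Tracer are hypothetical seam interfaces for this sketch only.
type Span interface {
	End()
}

type Tracer interface {
	Start(ctx context.Context, name string) (context.Context, Span)
}

// noopTracer is the default: it records nothing and never fails.
type noopTracer struct{}
type noopSpan struct{}

func (noopTracer) Start(ctx context.Context, _ string) (context.Context, Span) {
	return ctx, noopSpan{}
}
func (noopSpan) End() {}

// recordingTracer stands in for the real provider that the CLI would inject.
type recordingTracer struct{ names []string }

func (r *recordingTracer) Start(ctx context.Context, name string) (context.Context, Span) {
	r.names = append(r.names, name)
	return ctx, noopSpan{}
}

// Runtime depends only on the seam, never on a concrete exporter.
type Runtime struct{ tracer Tracer }

// NewRuntime falls back to the no-op tracer when nothing is injected.
func NewRuntime(t Tracer) *Runtime {
	if t == nil {
		t = noopTracer{}
	}
	return &Runtime{tracer: t}
}

// RunTool wraps a tool call in a span named after the generic tool name.
func (r *Runtime) RunTool(ctx context.Context, tool string) string {
	_, span := r.tracer.Start(ctx, tool)
	defer span.End()
	return "ran " + tool
}

// demoRun exercises the seam with whichever tracer is supplied.
func demoRun(t Tracer) string {
	return NewRuntime(t).RunTool(context.Background(), "cli_execute")
}

func main() {
	fmt.Println(demoRun(nil)) // no-op path: behavior is unchanged

	rec := &recordingTracer{}
	fmt.Println(demoRun(rec), rec.names) // injected path: span name is recorded
}
```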

Dependencies

Phase 0  →  Phase 1  →  Phase 2  →  Phase 3  →  Phase 4
                           ↓           ↓
                        Phase 6     Phase 5

Phases 4 and 5 can start once Phase 3 lands, and Phase 6 can start once Phase 2 lands; all three can ship in parallel.

Final acceptance (run after all phases)

  1. go work sync && go build ./... && go test ./... clean.
  2. Tracing off by default — audit NDJSON byte-identical to pre-change for an identical task.
  3. enabled:true + endpoint → session/inference/tool spans land in Jaeger, correctly nested; the inference span is named chat {model} (CLIENT) with gen_ai.provider.name + gen_ai.usage.input_tokens/output_tokens; no binary names / no content.
  4. enabled:true + no endpoint → WARN + agent healthy on noop.
  5. Two-agent A2A chain → single connected trace.
  6. Audit lines carry matching trace_id / span_id when traced, absent when not.
  7. forge build with tracing.enabled (no committed endpoint) → dynamic $OTEL_EXPORTER_OTLP_ENDPOINT entry in egress_allowlist.json (source otel) + Deployment env reference (configMapKeyRef, optional:true) + ConfigMap stub; at runtime the enforcer allowlists the host extracted from the injected endpoint, matching the exporter destination with no drift.
  8. capture_content:false (default) → zero prompt/response content on any span.
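
Acceptance item 7 hinges on the enforcer and the exporter deriving the same host from the same endpoint value. A minimal sketch of that extraction, assuming the endpoint is an http or https URL, is shown below. `egressHostFromEndpoint` is a hypothetical helper name; only the `OTEL_EXPORTER_OTLP_ENDPOINT` variable comes from this issue.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
)

// egressHostFromEndpoint returns the bare hostname that would need an
// allowlist entry for the given OTLP endpoint. Hypothetical helper: the
// name and the http/https-only rule are assumptions of this sketch.
func egressHostFromEndpoint(endpoint string) (string, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", fmt.Errorf("parse OTLP endpoint: %w", err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", fmt.Errorf("unsupported OTLP endpoint scheme %q", u.Scheme)
	}
	host := u.Hostname() // strips the port and any IPv6 brackets
	if host == "" {
		return "", fmt.Errorf("OTLP endpoint %q has no host", endpoint)
	}
	return host, nil
}

func main() {
	host, err := egressHostFromEndpoint(os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT"))
	if err != nil {
		fmt.Println("no egress entry added:", err)
		return
	}
	fmt.Println("allowlist host:", host)
}
```

Routing both the allowlist entry and the exporter configuration through one function like this is one way to avoid drift between the two.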

Cross-references with other initiatives

  • FWS-3 (#87, "Token usage and execution duration emission (per LLM call + per invocation)"): shares the LLM call site with Phase 3. FWS-3 captures usage / duration for the audit path; Phase 3 emits the same data as gen_ai.* span attributes. Single capture, multiple emission targets, independent failure domains. See FWS-3's "Relationship with OTel tracing" section for the full rationale.
  • FWS-2 (#86, "Workflow correlation ID threading (X-Initializ-* headers → audit events)"): orthogonal but complementary. Workflow correlation IDs travel on context.Context the same way trace context will, and the audit emitter (post-FWS-3 rebase) routes through EmitFromContext so both correlation kinds auto-tag every event. Phase 5's traceparent extraction sits at the same dispatch boundary as FWS-2's X-Workflow-* extraction.
  • FWS-7 (#95, "Audit event export capability (Unix Domain Socket sink + HTTP fallback)"): a parallel observability path. OTel traces serve customer observability stacks (Datadog/Jaeger/SigNoz); audit NDJSON via the UDS/HTTP sink serves the Platform's cost aggregator and SIEM. Both pipelines share signal sources in Forge but emit independently. Do not couple the OTel exporter to the audit sink: when something notable happens in Forge, instrumentation should emit to both OTel and audit at the same call site, not tap one from the other.
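
For reference, the traceparent header that Phase 5 extracts and injects follows the W3C Trace Context layout `version-traceid-parentid-flags`. The sketch below parses and re-emits it using only the standard library, to show the wire format. In practice the OTel Go SDK's `propagation.TraceContext` propagator handles this; the `traceParent` type and its helpers are hypothetical, and this sketch only accepts version `00`.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// traceParent holds the fields of a version-00 traceparent header.
// Hypothetical type for illustration; not part of any OTel API.
type traceParent struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
	Flags   uint8
}

var errInvalidTraceparent = errors.New("invalid traceparent")

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// parseTraceparent validates an inbound header (version 00 only).
func parseTraceparent(header string) (traceParent, error) {
	parts := strings.Split(strings.TrimSpace(header), "-")
	if len(parts) != 4 || parts[0] != "00" {
		return traceParent{}, errInvalidTraceparent
	}
	traceID, spanID, rawFlags := parts[1], parts[2], parts[3]
	if !isLowerHex(traceID, 32) || !isLowerHex(spanID, 16) || !isLowerHex(rawFlags, 2) {
		return traceParent{}, errInvalidTraceparent
	}
	// All-zero IDs are invalid per the W3C specification.
	if strings.Trim(traceID, "0") == "" || strings.Trim(spanID, "0") == "" {
		return traceParent{}, errInvalidTraceparent
	}
	flags, err := strconv.ParseUint(rawFlags, 16, 8)
	if err != nil {
		return traceParent{}, errInvalidTraceparent
	}
	return traceParent{TraceID: traceID, SpanID: spanID, Flags: uint8(flags)}, nil
}

// Sampled reports whether the sampled bit is set.
func (tp traceParent) Sampled() bool { return tp.Flags&0x01 != 0 }

// childHeader builds the outbound header: same trace ID and flags, with the
// calling agent's own span ID as the new parent.
func (tp traceParent) childHeader(childSpanID string) string {
	return fmt.Sprintf("00-%s-%s-%02x", tp.TraceID, childSpanID, tp.Flags)
}

func main() {
	tp, err := parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		fmt.Println("starting a new root trace:", err)
		return
	}
	fmt.Println(tp.TraceID, tp.Sampled())
	fmt.Println(tp.childHeader("b7ad6b7169203331"))
}
```

Because the trace ID survives the hop unchanged, spans from both agents join a single connected trace, which is what acceptance item 5 checks.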

v1.1 roadmap (not in this initiative)

OTel metrics signal: the two GenAI histograms (gen_ai.client.token.usage with gen_ai.token.type, and gen_ai.client.operation.duration) for native rollups. Backends can derive rollups from v1's span attributes today; v1.1 makes the cost-aggregation path more efficient. File the v1.1 issue when v1 ships.
