OTel Phase 3 — Span instrumentation (session / llm / tool / scheduled) #104

@initializ-mk

Part of the Observability — OpenTelemetry Tracing v1 initiative (master tracking: #108). Effort: M (5–7 engineer-days). Risk: medium-high (many call sites, semconv evolution, redaction guarantees, content gating). Depends on: Phase 2 (#103).

Goal

Emit a correct nested span tree from real call sites. Redaction on by default. GenAI conventions spec-correct so Datadog / New Relic / SigNoz auto-map.

Files

| File | Change |
| --- | --- |
| forge-core runtime executor entry (agent loop Execute) | Start root forge.session span; end on return |
| forge-core/runtime LLMExecutor (LLM round-trip) | Child GenAI inference span (convention-named) via the semconv helper |
| forge-core/runtime/genai_semconv.go | New. Single helper that owns ALL gen_ai.* emission + the version switch |
| forge-core/tools registry Execute | Child forge.tool_exec span (generic tool name only) |
| forge-core/scheduler dispatch | Root forge.scheduled_task span per tick |
| forge-core egress decision + guardrail check sites | span.AddEvent on the active span (not child spans) |

Span tree

forge.session                 (root; one per task / Execute)
 ├─ chat {model}              (GenAI inference span, CLIENT kind — per round-trip)
 ├─ forge.tool_exec           (per tool call)
 │    ├─ event: egress.allowed / egress.blocked  (domain attr)
 │    └─ event: guardrail.block / guardrail.redact (rule attr)
 └─ ...
forge.scheduled_task          (root; scheduler dispatch — no inbound ctx)

forge.session attributes

agent.id, agent.version, forge.task_id, forge.correlation_id, forge.channel (if any), forge.session.state (set at end). Start spans with Tracer().Start(ctx, ...) and read the active span with trace.SpanFromContext(ctx) so that children inherit their parent via ctx. Thread the returned ctx through the loop; do not start orphan spans.

GenAI inference span — spec-correct gen_ai.* (Datadog / New Relic / SigNoz auto-map)

All gen_ai.* emission goes through genai_semconv.go, and only there. The conventions are still experimental and rename often (tokens went prompt/completion → input/output; gen_ai.system → gen_ai.provider.name). Centralizing means the next rename is a one-file change, not a call-site hunt. The helper reads OTEL_SEMCONV_STABILITY_OPT_IN: when it contains gen_ai_latest_experimental, emit the newest names; otherwise dual-emit (new + legacy) for older backends. This is the spec's own transition mechanism, so follow it rather than picking one naming and freezing.
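A minimal sketch of the version switch, assuming the helper builds a plain attribute map before handing it to the span. The function names `latestOnly` and `usageAttrs` are hypothetical, and the fully qualified legacy token keys (`gen_ai.usage.prompt_tokens`, `gen_ai.usage.completion_tokens`) are my reading of the "legacy prompt_tokens/completion_tokens" names mentioned in this issue.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

const optInEnv = "OTEL_SEMCONV_STABILITY_OPT_IN"

// latestOnly reports whether the comma-separated opt-in value contains
// gen_ai_latest_experimental.
func latestOnly(optIn string) bool {
	for _, v := range strings.Split(optIn, ",") {
		if strings.TrimSpace(v) == "gen_ai_latest_experimental" {
			return true
		}
	}
	return false
}

// usageAttrs (hypothetical) returns provider and token attributes under the
// newest names, plus the legacy names unless latest-only mode is selected.
func usageAttrs(optIn, provider string, inTok, outTok int) map[string]any {
	attrs := map[string]any{
		"gen_ai.provider.name":       provider,
		"gen_ai.usage.input_tokens":  inTok,
		"gen_ai.usage.output_tokens": outTok,
	}
	if !latestOnly(optIn) {
		// Dual-emit: older backends may still key on the legacy names.
		attrs["gen_ai.system"] = provider
		attrs["gen_ai.usage.prompt_tokens"] = inTok
		attrs["gen_ai.usage.completion_tokens"] = outTok
	}
	return attrs
}

func main() {
	attrs := usageAttrs(os.Getenv(optInEnv), "anthropic", 120, 45)
	fmt.Println(len(attrs), attrs)
}
```

Taking the opt-in value as a parameter, rather than reading the environment inside `usageAttrs`, keeps both modes easy to cover in a `-run GenAISemconv` style test.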

Span name: chat {model} (e.g. chat claude-sonnet-4-6); span kind CLIENT. This convention is what GenAI-aware backends key on for auto-recognition — do NOT name it forge.llm_call.

Tier 1 — always-on metadata (non-sensitive; safe with CaptureContent=false)

  • gen_ai.operation.name = chat
  • gen_ai.provider.name (anthropic / openai / gcp.vertex_ai / ollama / …); dual-emit gen_ai.system unless opted into latest-only
  • gen_ai.request.model, gen_ai.response.model, gen_ai.response.id, gen_ai.response.finish_reasons
  • gen_ai.request.temperature, gen_ai.request.top_p, gen_ai.request.max_tokens (only when set)
  • gen_ai.usage.input_tokens, gen_ai.usage.output_tokens (legacy prompt_tokens/completion_tokens only under dual-emit)
  • error.type + span.RecordError(err) + span.SetStatus(codes.Error, …) on failure
  • Forge-namespaced extras (won't collide with future gen_ai.*): forge.llm.fallback_used (bool), forge.llm.fallback_provider
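The "only when set" rule for request parameters is easy to get wrong with zero values: a temperature of 0 is a legitimate setting, not an absent one. One way to model this, sketched below, is pointer fields where nil means unset. The `requestParams` type and `requestAttrs` function are hypothetical names for illustration, not the LLMExecutor's actual types.

```go
package main

import "fmt"

// requestParams is a hypothetical view of the sampling options.
// A nil pointer means the caller did not set the option.
type requestParams struct {
	Model       string
	Temperature *float64
	TopP        *float64
	MaxTokens   *int
}

// requestAttrs (hypothetical) builds the Tier 1 request attributes and
// emits the optional ones only when they were explicitly set.
func requestAttrs(p requestParams) map[string]any {
	attrs := map[string]any{
		"gen_ai.operation.name": "chat",
		"gen_ai.request.model":  p.Model,
	}
	if p.Temperature != nil {
		attrs["gen_ai.request.temperature"] = *p.Temperature
	}
	if p.TopP != nil {
		attrs["gen_ai.request.top_p"] = *p.TopP
	}
	if p.MaxTokens != nil {
		attrs["gen_ai.request.max_tokens"] = *p.MaxTokens
	}
	return attrs
}

func main() {
	temp := 0.0 // explicitly set to zero, so it is still emitted
	attrs := requestAttrs(requestParams{Model: "claude-sonnet-4-6", Temperature: &temp})
	fmt.Println(attrs)
}
```

Here `gen_ai.request.temperature` appears with value 0, while `top_p` and `max_tokens` are absent because they were never set.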

Tier 2 — content (ONLY when CaptureContent=true; default off)

Emit as span events, not large string attributes: gen_ai.system.message, gen_ai.user.message, gen_ai.assistant.message, gen_ai.choice. The spec settled on events rather than attributes, which also matches our default-off posture. PII scrubbing is expected to happen downstream in the Collector, so never rely on captured content being safe.
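The gating can be kept to a single early return, so that no content is even assembled when capture is off. This is a sketch under assumptions: `message`, `event`, and `contentEvents` are hypothetical types and names, the `content` attribute key is illustrative, and a real implementation would call `span.AddEvent` instead of returning a slice.

```go
package main

import "fmt"

// message is a hypothetical chat message as seen at the LLM call site.
type message struct{ Role, Content string }

// event is a hypothetical stand-in for a span event.
type event struct {
	Name  string
	Attrs map[string]string
}

// contentEvents (hypothetical) maps messages to per-role event names.
// With captureContent=false it returns nil before touching any content.
func contentEvents(captureContent bool, msgs []message) []event {
	if !captureContent {
		return nil
	}
	names := map[string]string{
		"system":    "gen_ai.system.message",
		"user":      "gen_ai.user.message",
		"assistant": "gen_ai.assistant.message",
	}
	var out []event
	for _, m := range msgs {
		name, ok := names[m.Role]
		if !ok {
			continue // unknown roles are skipped in this sketch
		}
		// "content" is an assumed attribute key, used for illustration only.
		out = append(out, event{Name: name, Attrs: map[string]string{"content": m.Content}})
	}
	return out
}

func main() {
	msgs := []message{{"system", "You are helpful."}, {"user", "Hello"}}
	fmt.Println(len(contentEvents(false, msgs)), len(contentEvents(true, msgs)))
}
```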

Do NOT emit

gen_ai.usage.cost — it needs price tables that go stale; the Platform/backend derives cost from token counts. No pricing logic in forge-core.

forge.tool_exec attributes

forge.tool.name (the registry name, e.g. cli_execute, http_request, tavily_research), forge.skill (skill id if skill-backed), forge.tool.success (bool), duration via span timing. Never set the underlying binary name, the raw command, or tool args when Redact == true (default). On failure: RecordError + error status.

Redaction rule

When cfg.Redact (default true), tool spans omit args/commands/binary names; egress events keep only the domain (already coarse). CaptureContent is the only switch that can add prompt/response text, and it's independent of Redact.
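As a rough illustration of the rule, the sketch below builds tool-span attributes and an egress domain with redaction applied. Everything outside the attribute names listed in this issue is an assumption: the `config` and `toolCall` types, the function names, and in particular `forge.tool.command`, which is a hypothetical key (the issue does not say what, if anything, is emitted when Redact is false).

```go
package main

import (
	"fmt"
	"net/url"
)

// config is a hypothetical subset of the tracing configuration.
type config struct{ Redact bool }

// toolCall is a hypothetical record of one registry Execute call.
type toolCall struct {
	Name, Skill, Command string
	Success              bool
}

// toolAttrs (hypothetical) returns forge.tool_exec attributes. With
// Redact=true only the generic registry name, skill and outcome are kept.
func toolAttrs(cfg config, c toolCall) map[string]any {
	attrs := map[string]any{
		"forge.tool.name":    c.Name,
		"forge.tool.success": c.Success,
	}
	if c.Skill != "" {
		attrs["forge.skill"] = c.Skill
	}
	if !cfg.Redact {
		// Hypothetical key; only reachable when redaction is turned off.
		attrs["forge.tool.command"] = c.Command
	}
	return attrs
}

// egressDomain keeps only the host of an outbound URL, dropping path,
// query and credentials. It returns "" if the URL does not parse.
func egressDomain(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return ""
	}
	return u.Hostname()
}

func main() {
	call := toolCall{Name: "cli_execute", Command: "ls -la /srv", Success: true}
	fmt.Println(toolAttrs(config{Redact: true}, call))
	fmt.Println(egressDomain("https://api.example.com/v1/x?key=secret"))
}
```

Note that nothing here consults CaptureContent: the two switches stay independent, as the rule requires.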

Verify

go build ./... && go test ./forge-core/...
go test ./forge-core/runtime/ -run GenAISemconv -v
# token rename + provider.name dual-emit covered

# With Jaeger from Phase 2 running:
FORGE_TRACING_ENABLED=true OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
  forge run --port 8098 &
# send a task that triggers an LLM call + a tool call, then open http://localhost:16686
# Confirm:
#   - forge.session root with a "chat {model}" CLIENT span + forge.tool_exec children, correctly nested
#   - gen_ai.provider.name + gen_ai.usage.input_tokens/output_tokens present
#   - NO binary name / NO content on any span

# Latest-only mode:
OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental ... forge run ...
# confirm legacy names dropped

Anti-patterns to avoid

  • Naming the inference span forge.llm_call instead of the chat {model} convention (breaks backend auto-mapping).
  • Scattering gen_ai.* across call sites instead of genai_semconv.go.
  • Freezing one naming version instead of honoring OTEL_SEMCONV_STABILITY_OPT_IN.
  • Content as big attributes instead of events.
  • Computing gen_ai.usage.cost in core.
  • Flat sibling spans (broken nesting from not threading ctx).
  • Binary names / commands in attributes (capability-enumeration guardrail violation).
  • Content with CaptureContent=false.
  • Starting spans in hooks (use call sites — hooks stay for audit only).

Cross-reference with FWS-3 (#87)

FWS-3 captures token usage / duration / model / provider at the LLM call site for the audit path. Phase 3 of this issue is the OTel span side of that same capture point. They share the call site by design — when an LLM call completes, one piece of code captures token counts and duration once, then writes to audit AND sets span attributes (no-op when tracing disabled). See FWS-3's "Relationship with OTel tracing" section.
