Part of the Observability — OpenTelemetry Tracing v1 initiative (master tracking: #108). Effort: M (5–7 engineer-days). Risk: medium-high (many call sites, semconv evolution, redaction guarantees, content gating). Depends on: Phase 2 (#103).
Goal
Emit a correct nested span tree from real call sites. Redaction on by default. GenAI conventions spec-correct so Datadog / New Relic / SigNoz auto-map.
Files
| File | Change |
|---|---|
| forge-core runtime executor entry (agent loop Execute) | Start root forge.session span; end on return |
| forge-core/runtime LLMExecutor (LLM round-trip) | Child GenAI inference span (convention-named) via the semconv helper |
| forge-core/runtime/genai_semconv.go | New. Single helper that owns ALL gen_ai.* emission + the version switch |
| forge-core/tools registry Execute | Child forge.tool_exec span (generic tool name only) |
| forge-core/scheduler dispatch | Root forge.scheduled_task span per tick |
| forge-core egress decision + guardrail check sites | span.AddEvent on the active span (not child spans) |
Span tree
forge.session (root; one per task / Execute)
├─ chat {model} (GenAI inference span, CLIENT kind — per round-trip)
├─ forge.tool_exec (per tool call)
│  ├─ event: egress.allowed / egress.blocked (domain attr)
│  └─ event: guardrail.block / guardrail.redact (rule attr)
└─ ...
forge.scheduled_task (root; scheduler dispatch — no inbound ctx)
forge.session attributes
agent.id, agent.version, forge.task_id, forge.correlation_id, forge.channel (if any), forge.session.state (set at end). Start spans with Tracer().Start(ctx, ...) and read the active one with trace.SpanFromContext so that children inherit their parent via ctx. Thread the returned ctx through the loop; do not start orphan spans.
GenAI inference span — spec-correct gen_ai.* (Datadog / New Relic / SigNoz auto-map)
All gen_ai.* emission goes through genai_semconv.go — one place. The conventions are still experimental and rename often (tokens went prompt/completion → input/output; gen_ai.system → gen_ai.provider.name). Centralizing means the next rename is a one-file change, not a call-site hunt. The helper reads OTEL_SEMCONV_STABILITY_OPT_IN: when it contains gen_ai_latest_experimental, emit the newest names; otherwise dual-emit (new + legacy) for older backends. This is the spec's own transition mechanism — follow it rather than picking one naming and freezing.
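A minimal sketch of the version switch, assuming the helper exposes something like the hypothetical `usageAttrs` below. It returns a plain map rather than OpenTelemetry attribute values so that it runs with the standard library alone; the function and parameter names are illustrative, not the actual contents of genai_semconv.go.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// latestOnly reports whether a comma-separated OTEL_SEMCONV_STABILITY_OPT_IN
// value contains gen_ai_latest_experimental.
func latestOnly(optIn string) bool {
	for _, v := range strings.Split(optIn, ",") {
		if strings.TrimSpace(v) == "gen_ai_latest_experimental" {
			return true
		}
	}
	return false
}

// usageAttrs (hypothetical name) builds provider and token attributes.
// The newest names are always emitted; legacy names are added under dual-emit.
func usageAttrs(optIn, provider string, in, out int) map[string]any {
	attrs := map[string]any{
		"gen_ai.provider.name":       provider,
		"gen_ai.usage.input_tokens":  in,
		"gen_ai.usage.output_tokens": out,
	}
	if !latestOnly(optIn) {
		// Dual-emit for older backends that still key on the legacy names.
		attrs["gen_ai.system"] = provider
		attrs["gen_ai.usage.prompt_tokens"] = in
		attrs["gen_ai.usage.completion_tokens"] = out
	}
	return attrs
}

func main() {
	optIn := os.Getenv("OTEL_SEMCONV_STABILITY_OPT_IN")
	attrs := usageAttrs(optIn, "anthropic", 1200, 340)
	fmt.Println(len(attrs), "attributes emitted")
}
```

Keeping the rename table inside one function like this is what makes the next convention change a one-file edit.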
Span name: chat {model} (e.g. chat claude-sonnet-4-6); span kind CLIENT. This convention is what GenAI-aware backends key on for auto-recognition — do NOT name it forge.llm_call.
Tier 1 — always-on metadata (non-sensitive; safe with CaptureContent=false)
- gen_ai.operation.name = chat
- gen_ai.provider.name (anthropic / openai / gcp.vertex_ai / ollama / …); dual-emit gen_ai.system unless opted into latest-only
- gen_ai.request.model, gen_ai.response.model, gen_ai.response.id, gen_ai.response.finish_reasons
- gen_ai.request.temperature, gen_ai.request.top_p, gen_ai.request.max_tokens (only when set)
- gen_ai.usage.input_tokens, gen_ai.usage.output_tokens (legacy prompt_tokens/completion_tokens only under dual-emit)
- error.type + span.RecordError(err) + span.SetStatus(codes.Error, …) on failure
- Forge-namespaced extras (won't collide with future gen_ai.*): forge.llm.fallback_used (bool), forge.llm.fallback_provider
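The "only when set" rule for request parameters could look roughly like the sketch below. The `request` struct and `requestAttrs` function are hypothetical; the sketch assumes optional parameters are modeled as pointers so that an explicit zero (for example a temperature of 0) can be told apart from "not set".

```go
package main

import "fmt"

// request is a hypothetical view of an LLM request; nil pointers mean "not set".
type request struct {
	Model       string
	Temperature *float64
	TopP        *float64
	MaxTokens   *int
}

// requestAttrs emits always-on metadata plus only the optional
// parameters that the caller actually set.
func requestAttrs(r request) map[string]any {
	attrs := map[string]any{
		"gen_ai.operation.name": "chat",
		"gen_ai.request.model":  r.Model,
	}
	if r.Temperature != nil {
		attrs["gen_ai.request.temperature"] = *r.Temperature
	}
	if r.TopP != nil {
		attrs["gen_ai.request.top_p"] = *r.TopP
	}
	if r.MaxTokens != nil {
		attrs["gen_ai.request.max_tokens"] = *r.MaxTokens
	}
	return attrs
}

func main() {
	temp := 0.0 // an explicit zero is still "set"
	attrs := requestAttrs(request{Model: "claude-sonnet-4-6", Temperature: &temp})
	_, hasTopP := attrs["gen_ai.request.top_p"]
	fmt.Println(len(attrs), hasTopP)
}
```

Here the explicit zero temperature is emitted while top_p and max_tokens are omitted, so a backend never sees a default that the caller did not choose.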
Tier 2 — content (ONLY when CaptureContent=true; default off)
Emit as span events, not large string attributes — gen_ai.system.message, gen_ai.user.message, gen_ai.assistant.message, gen_ai.choice. Events (not attributes) is where the spec landed and matches our default-off posture; downstream PII scrubbing is expected to happen in the Collector, so never rely on content being safe.
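As a rough illustration of the gating, the sketch below builds the content events for a request and returns nothing when capture is off. The `message` and `event` types and the `contentEvents` function are hypothetical stand-ins; in forge-core each returned event would presumably become a `span.AddEvent` call inside the semconv helper.

```go
package main

import "fmt"

// message is a hypothetical prompt message.
type message struct {
	Role string
	Text string
}

// event is a hypothetical stand-in for a span event.
type event struct {
	Name string
	Body string
}

// eventNames maps a message role to the event name listed for Tier 2.
var eventNames = map[string]string{
	"system":    "gen_ai.system.message",
	"user":      "gen_ai.user.message",
	"assistant": "gen_ai.assistant.message",
}

// contentEvents returns one event per known-role message, or nil when
// captureContent is false, so no text is ever attached by default.
func contentEvents(captureContent bool, msgs []message) []event {
	if !captureContent {
		return nil
	}
	var out []event
	for _, m := range msgs {
		if name, ok := eventNames[m.Role]; ok {
			out = append(out, event{Name: name, Body: m.Text})
		}
	}
	return out
}

func main() {
	msgs := []message{{Role: "system", Text: "be brief"}, {Role: "user", Text: "hi"}}
	fmt.Println(len(contentEvents(false, msgs)), len(contentEvents(true, msgs)))
}
```

Putting the check at the top of the single function that produces content events means no call site can attach text by accident.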
Do NOT emit
gen_ai.usage.cost — it needs price tables that go stale; the Platform/backend derives cost from token counts. No pricing logic in forge-core.
forge.tool_exec attributes
forge.tool.name (the registry name, e.g. cli_execute, http_request, tavily_research), forge.skill (skill id if skill-backed), forge.tool.success (bool), duration via span timing. Never set the underlying binary name, the raw command, or tool args when Redact == true (default). On failure: RecordError + error status.
Redaction rule
When cfg.Redact is true (the default), tool spans omit args, commands, and binary names; egress events keep only the domain, which is already coarse. CaptureContent is the only switch that can add prompt/response text, and it is independent of Redact.
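The redaction rule for tool spans might be sketched as follows. The `toolCall` struct and `toolAttrs` function are hypothetical, and the two attribute names used in the non-redacted branch (`forge.tool.binary`, `forge.tool.command`) are assumptions made for illustration only; this issue does not define them.

```go
package main

import "fmt"

// toolCall is a hypothetical record of one tool execution.
type toolCall struct {
	Name    string // registry name, e.g. cli_execute
	Skill   string // skill id, empty if not skill-backed
	Binary  string // underlying binary (sensitive)
	Command string // raw command (sensitive)
	Success bool
}

// toolAttrs returns the attributes for a forge.tool_exec span.
// With redact=true only the coarse, generic fields are emitted.
func toolAttrs(redact bool, c toolCall) map[string]any {
	attrs := map[string]any{
		"forge.tool.name":    c.Name,
		"forge.tool.success": c.Success,
	}
	if c.Skill != "" {
		attrs["forge.skill"] = c.Skill
	}
	if !redact {
		// Assumed attribute names; not defined by this issue.
		attrs["forge.tool.binary"] = c.Binary
		attrs["forge.tool.command"] = c.Command
	}
	return attrs
}

func main() {
	call := toolCall{Name: "cli_execute", Binary: "curl", Command: "curl example.com", Success: true}
	fmt.Println(len(toolAttrs(true, call)), len(toolAttrs(false, call)))
}
```

Building the attribute set in one function keeps the sensitive fields behind a single branch, which is easier to cover with a test than per-call-site checks.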
Verify
go build ./... && go test ./forge-core/...
go test ./forge-core/runtime/ -run GenAISemconv -v
# token rename + provider.name dual-emit covered
# With Jaeger from Phase 2 running:
FORGE_TRACING_ENABLED=true OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
forge run --port 8098 &
# send a task that triggers an LLM call + a tool call, then open http://localhost:16686
# Confirm:
# - forge.session root with a "chat {model}" CLIENT span + forge.tool_exec children, correctly nested
# - gen_ai.provider.name + gen_ai.usage.input_tokens/output_tokens present
# - NO binary name / NO content on any span
# Latest-only mode:
OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental ... forge run ...
# confirm legacy names dropped
Anti-patterns to avoid
- Naming the inference span forge.llm_call instead of the chat {model} convention (breaks backend auto-mapping).
- Scattering gen_ai.* across call sites instead of genai_semconv.go.
- Freezing one naming version instead of honoring OTEL_SEMCONV_STABILITY_OPT_IN.
- Content as big attributes instead of events.
- Computing gen_ai.usage.cost in core.
- Flat sibling spans (broken nesting from not threading ctx).
- Binary names / commands in attributes (capability-enumeration guardrail violation).
- Content with CaptureContent=false.
- Starting spans in hooks (use call sites — hooks stay for audit only).
Cross-reference with FWS-3 (#87)
FWS-3 captures token usage / duration / model / provider at the LLM call site for the audit path. Phase 3 (this issue) is the OTel span side of that same capture point. They share the call site by design: when an LLM call completes, one piece of code captures token counts and duration once, then writes to audit AND sets span attributes (a no-op when tracing is disabled). See FWS-3's "Relationship with OTel tracing" section.
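One way to picture the shared capture point is a single record fanned out to two consumers. Everything in this sketch (`llmUsage`, `sink`, `auditLog`, `spanAttrs`, `capture`) is a hypothetical stand-in, not the FWS-3 or forge-core API; it only shows the "capture once, write twice" shape.

```go
package main

import (
	"fmt"
	"time"
)

// llmUsage is captured once per completed LLM call.
type llmUsage struct {
	Provider     string
	Model        string
	InputTokens  int
	OutputTokens int
	Duration     time.Duration
}

// sink is a hypothetical consumer of a captured record.
type sink interface{ Record(u llmUsage) }

// auditLog stands in for the FWS-3 audit path.
type auditLog struct{ entries []llmUsage }

func (a *auditLog) Record(u llmUsage) { a.entries = append(a.entries, u) }

// spanAttrs stands in for the active inference span.
type spanAttrs struct {
	enabled bool
	attrs   map[string]any
}

func (s *spanAttrs) Record(u llmUsage) {
	if !s.enabled {
		return // no-op when tracing is disabled
	}
	s.attrs = map[string]any{
		"gen_ai.provider.name":       u.Provider,
		"gen_ai.response.model":      u.Model,
		"gen_ai.usage.input_tokens":  u.InputTokens,
		"gen_ai.usage.output_tokens": u.OutputTokens,
	}
}

// capture hands the same record to every sink, so the values are read once.
func capture(u llmUsage, sinks ...sink) {
	for _, s := range sinks {
		s.Record(u)
	}
}

func main() {
	audit, span := &auditLog{}, &spanAttrs{enabled: false}
	u := llmUsage{Provider: "anthropic", Model: "claude-sonnet-4-6",
		InputTokens: 1200, OutputTokens: 340, Duration: 850 * time.Millisecond}
	capture(u, audit, span)
	fmt.Println(len(audit.entries), span.attrs == nil)
}
```

With tracing disabled, the audit entry is still written and the span side does nothing, which matches the no-op behavior described above.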