feat(harness): unified harness surface — foundation (span derivation, delivery adapters, emitter) #412
d21c54a to ebc468d
@greptile review
@greptile review
b4b8b33 to da780a1
On AGX1-371 (no Golden agent's Since it's an additive optional ( |
Will land this in scale-agentex and then have a follow-up here.
Approach A (Agentex event stream as canonical source of truth): one tap per harness feeds shared yield/auto-send delivery adapters and a span-deriving tracing tap. Additive backwards-compat, stacked PRs <1000 lines, conformance + live-matrix testing (3 test agents per harness: sync/async/temporal). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… golden-agent integration
- Make tracing-tap span derivation explicit (tool open on Done of a ToolRequestContent index, close on matching ToolResponseContent by tool_call_id; parallel-safe; reasoning start->done). Flag missing is_error on ToolResponseContent as an additive upstream decision.
- Add first-class TurnUsage/TurnResult shape (aligned to llm_metrics token taxonomy) attached to the turn span via span(data=) and reused for metrics.
- Document golden-agent integration: all SGP/sandbox/secret/MCP coupling stays in the agent; only parsing/streaming/tracing/usage move to SDK taps + emitter; sandbox-setup events chain before the harness stream.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
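The span-derivation rules in this commit message can be sketched as a small pure reducer. This is only an illustration under stated assumptions: `Start`, `Done`, the `kind` strings, and the span-key format are simplified stand-ins invented here, not the Agentex `StreamTaskMessage*` types, and closing a tool span as soon as the response content is first seen is an assumption about timing.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


# Hypothetical, simplified stand-ins for the stream events (not the SDK types).
@dataclass
class Start:
    index: int
    kind: str  # "text", "reasoning", "tool_request" or "tool_response"
    tool_call_id: Optional[str] = None
    name: str = ""


@dataclass
class Done:
    index: int


@dataclass(frozen=True)
class OpenSpan:
    key: str
    name: str


@dataclass(frozen=True)
class CloseSpan:
    key: str


Signal = Union[OpenSpan, CloseSpan]


class SpanDeriver:
    """Pure reducer: observes stream events and returns span signals."""

    def __init__(self) -> None:
        self._pending: Dict[int, Start] = {}  # index -> Start still streaming
        self._open: Dict[str, str] = {}       # span key -> span name

    def observe(self, event: object) -> List[Signal]:
        if isinstance(event, Start):
            if event.kind == "reasoning":
                # Reasoning spans open on Start and close on the matching Done.
                self._pending[event.index] = event
                return self._open_span(f"reasoning:{event.index}", "reasoning")
            if event.kind == "tool_request":
                # Tool spans wait for Done, when the request is fully streamed.
                self._pending[event.index] = event
                return []
            if event.kind == "tool_response":
                # Matched by tool_call_id, so parallel calls cannot cross over.
                return self._close_span(f"tool:{event.tool_call_id}")
            return []
        if isinstance(event, Done):
            start = self._pending.pop(event.index, None)
            if start is None:
                return []
            if start.kind == "tool_request":
                return self._open_span(f"tool:{start.tool_call_id}", start.name)
            return self._close_span(f"reasoning:{event.index}")
        return []

    def flush(self) -> List[Signal]:
        """Close every span that is still open (early exit or error)."""
        signals: List[Signal] = [CloseSpan(key) for key in self._open]
        self._open.clear()
        self._pending.clear()
        return signals

    def _open_span(self, key: str, name: str) -> List[Signal]:
        if key in self._open:  # idempotent: never double-open a span
            return []
        self._open[key] = name
        return [OpenSpan(key, name)]

    def _close_span(self, key: str) -> List[Signal]:
        if key not in self._open:
            return []
        del self._open[key]
        return [CloseSpan(key)]
```

Because the reducer holds no I/O and no `adk` import, it can be exercised with plain event lists, which is what makes the conformance-fixture approach practical.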
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… 1-3) Bite-sized TDD tasks: foundation types, pure SpanDeriver, SpanTracer adapter, yield + auto_send delivery, UnifiedEmitter facade, conformance scaffold + CI job. Migration/parser PRs (4-9) listed as follow-on plans. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… signals Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… handling in SpanDeriver Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sts for SpanTracer Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…on early close Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…reaming + tracing) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… + cover error/finally paths in auto_send Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…send_turn + doc tracer modes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…egistry semantics Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…he package Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…or consistency Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…reportImplicitOverride) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…MissingImports) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ast-segment, created_at (AGX1-377, AGX1-378)
auto_send.py:
- Replace single current_ctx with ctx_map[index] so parallel streams route correctly
- Open a streaming context for ALL content types on Start (not just text/reasoning), fixing tool_request/tool_response stream delivery (AGX1-377)
- Reset final_text_parts on each new Start(TextContent) and on Full(TextContent) so multi-step turns return the LAST text segment, not the full accumulation
- Add created_at: datetime | None param; forward to every streaming_task_message_context call (AGX1-378)
span_derivation.py:
- _on_full: handle Full(ToolRequestContent) by opening a tool span keyed by tool_call_id if not already open; adds LangGraph full-event harness support
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
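The index-keyed routing and last-segment behaviour described in this commit can be illustrated with a synchronous toy. Everything here is an assumption made for illustration: `Start`, `Delta`, `Done`, `FakeContext` and `route` are hypothetical names, and the real adapter is presumably asynchronous and talks to `adk.streaming` rather than to in-memory objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


# Hypothetical simplified events; the real stream uses Agentex types.
@dataclass
class Start:
    index: int
    kind: str  # "text", "tool_request", ...


@dataclass
class Delta:
    index: int
    text: str


@dataclass
class Done:
    index: int


@dataclass
class FakeContext:
    """In-memory stand-in for one streaming task-message context."""
    kind: str
    updates: List[str] = field(default_factory=list)
    closed: bool = False


def route(events: list) -> Tuple[List[FakeContext], str]:
    ctx_map: Dict[int, FakeContext] = {}  # one open context per stream index
    contexts: List[FakeContext] = []      # every context ever opened, in order
    final_text_parts: List[str] = []

    for ev in events:
        if isinstance(ev, Start):
            # Open a context for every content type, not only text/reasoning.
            ctx = FakeContext(ev.kind)
            ctx_map[ev.index] = ctx
            contexts.append(ctx)
            if ev.kind == "text":
                # A new text segment replaces whatever was collected earlier.
                final_text_parts = []
        elif isinstance(ev, Delta):
            ctx = ctx_map.get(ev.index)
            if ctx is None:
                continue  # delta for an index that was never started
            ctx.updates.append(ev.text)
            if ctx.kind == "text":
                final_text_parts.append(ev.text)
        elif isinstance(ev, Done):
            ctx = ctx_map.pop(ev.index, None)
            if ctx is not None:
                ctx.closed = True

    return contexts, "".join(final_text_parts)
```

With a single `current_ctx` slot, the interleaved deltas for two open indexes would all land in whichever context was opened last; keying by index keeps each stream's updates together.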
…ting, last-segment, created_at, Full ToolRequest spans
test_auto_send.py:
- Fix test 2: remove bare Start(ToolRequestContent) from events (old behavior was that Start did not open a ctx; new behavior does, so test was updated to use Full-only events that still verify the two-context behavior)
- Extend _FakeStreaming to record created_at on each context call
- Add test 6: streamed tool_request opens a ctx + routes deltas (AGX1-377 core)
- Add test 7: interleaved indexes route deltas to correct per-index contexts
- Add test 8: multi-step turns return the LAST text segment only
- Add test 9: Full(TextContent) contributes its content to final_text
- Add test 10: created_at is forwarded to every streaming context call (AGX1-378)
test_span_derivation.py:
- Add test_full_tool_request_opens_span: Full(ToolRequestContent) opens a span
- Add test_full_tool_request_and_response_paired: paired Full request+response produces a complete OpenSpan+CloseSpan
- Add test_full_tool_request_does_not_double_open: idempotent; a Full for an already-open tool_call_id is a no-op
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n (AGX1-378)
So migration helpers can restore the deterministic first-message timestamp on the temporal path. Default None preserves current behavior.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…turn
Bug: auto_send_turn passed usage=turn.usage() eagerly, capturing the empty default before the stream was consumed, so TurnResult.usage came back empty for every auto_send harness (real turns populate usage during consumption). Now drives delivery first, then reads usage. Adds a contract-honoring regression test.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
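The ordering bug described above is easy to reproduce in isolation. The following is a minimal sketch, assuming a turn whose usage is only populated once its event stream has been exhausted; `FakeTurn`, `deliver`, and the `TurnUsage` field names are hypothetical stand-ins, not the SDK's definitions.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Tuple


@dataclass
class TurnUsage:
    # Field names are illustrative assumptions.
    input_tokens: int = 0
    output_tokens: int = 0


class FakeTurn:
    """Stand-in harness turn: usage is only known after the stream ends."""

    def __init__(self) -> None:
        self._usage = TurnUsage()
        self.events = self._stream()

    async def _stream(self) -> AsyncIterator[str]:
        for chunk in ("hel", "lo"):
            yield chunk
        # Usage is reported by the harness at the end of the stream.
        self._usage = TurnUsage(input_tokens=12, output_tokens=2)

    def usage(self) -> TurnUsage:
        return self._usage


async def deliver(events: AsyncIterator[str]) -> str:
    parts = []
    async for event in events:
        parts.append(event)
    return "".join(parts)


async def send_turn_eager(turn: FakeTurn) -> Tuple[str, TurnUsage]:
    usage = turn.usage()  # BUG: captured before the stream is consumed
    text = await deliver(turn.events)
    return text, usage


async def send_turn_deferred(turn: FakeTurn) -> Tuple[str, TurnUsage]:
    text = await deliver(turn.events)  # drive delivery first...
    return text, turn.usage()          # ...then read usage


eager_result = asyncio.run(send_turn_eager(FakeTurn()))
deferred_result = asyncio.run(send_turn_deferred(FakeTurn()))
```

In the eager variant the returned usage is the empty default even though the text was delivered correctly, which is why the failure is easy to miss without a regression test that checks usage specifically.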
_close_all() iterated open contexts and closed them sequentially with no per-context error guard, so a failure on the first context (e.g. a backend hiccup during teardown) abandoned every remaining open context and their task messages were never finalized. Wrap each close() in its own try/except (warn and continue) so all contexts are attempted regardless of individual failures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
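The per-context guard can be sketched as follows. This is an illustration rather than the code in `auto_send.py`: `FakeContext`, `close_all`, and the logger name are hypothetical, and the only behaviour taken from the commit message is "wrap each close() in its own try/except, warn, and continue".

```python
import asyncio
import logging
from typing import Dict

logger = logging.getLogger("harness.auto_send.sketch")  # hypothetical name


class FakeContext:
    """Stand-in streaming context whose close() may fail."""

    def __init__(self, fail: bool = False) -> None:
        self.fail = fail
        self.closed = False

    async def close(self) -> None:
        if self.fail:
            raise RuntimeError("backend hiccup during teardown")
        self.closed = True


async def close_all(ctx_map: Dict[int, FakeContext]) -> None:
    # Iterate over a snapshot so the map can be mutated elsewhere safely.
    for index, ctx in list(ctx_map.items()):
        try:
            await ctx.close()
        except Exception:
            # Warn and continue: one failure must not strand the others.
            logger.warning(
                "failed to close streaming context %s; continuing",
                index,
                exc_info=True,
            )


contexts = {0: FakeContext(fail=True), 1: FakeContext(), 2: FakeContext()}
asyncio.run(close_all(contexts))
```

Catching `Exception` rather than `BaseException` keeps cancellation and interpreter shutdown propagating normally while still isolating ordinary teardown failures.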
These were brainstorming/spec artifacts not meant to ship; per review feedback they don't belong in the foundation PR. Also de-references the now-removed docs path in the harness-integration TODO comment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ntent.is_error
Now that the SDK carries ToolResponseContent.is_error (AGX1-371), thread it through the harness surface: CloseSpan gains an is_error field, the span deriver populates it when closing a tool span on a ToolResponseContent, and the tracer records the status on span.data (Span has no dedicated error field; None means the harness reported no status, so data is left untouched).
This preserves tool-error tracing fidelity for harnesses that report tool failures (e.g. golden agent's ToolCompleted.is_error) once they move onto the unified surface, instead of silently dropping the status.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
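The three-state handling of `is_error` (True, False, or not reported) can be sketched like this. `FakeSpan`, `apply_close`, and the `"is_error"` key used in `span.data` are assumptions for illustration; the commit message only states that the status is recorded on `span.data` and that `None` leaves the data untouched.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class CloseSpan:
    key: str
    # None means the harness reported no status at all.
    is_error: Optional[bool] = None


@dataclass
class FakeSpan:
    """Stand-in for a tracing span that only has a free-form data dict."""
    name: str
    data: Dict[str, Any] = field(default_factory=dict)


def apply_close(span: FakeSpan, signal: CloseSpan) -> FakeSpan:
    # Compare against None explicitly: False is a real, reportable status.
    if signal.is_error is not None:
        span.data["is_error"] = signal.is_error  # key name is an assumption
    return span
```

The explicit `is not None` check matters: a truthiness test would treat a reported success (`False`) the same as a missing status and drop it.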
3384505 to 41b1a32
What this is
Foundation (PRs 1–3 of the rollout) for a unified harness tracing/message-emitting surface: the Agentex StreamTaskMessage* stream is the single source of truth, and shared harness-independent machinery derives spans from it and delivers it over both channels:
- yield (sync HTTP ACP agents),
- adk.streaming auto-send (async + temporal agents, from inside an activity),
with tracing on by default (derived from the same stream) and overridable, and a unified TurnUsage/TurnResult shape for per-harness usage normalization.
Design: docs/superpowers/specs/2026-06-18-unified-harness-surface-design.md
Plan: docs/superpowers/plans/2026-06-18-unified-harness-surface-foundation.md
What's in
- src/agentex/lib/core/harness/types.py: StreamTaskMessage, OpenSpan / CloseSpan / SpanSignal, TurnUsage, TurnResult, HarnessTurn protocol.
- span_derivation.py: SpanDeriver, a pure reducer (no adk dep), canonical stream → span signals. Tool span opens on the Done of a ToolRequestContent index, closes on the matching ToolResponseContent by tool_call_id; reasoning span open-on-Start / close-on-Done; parallel-safe; flush() closes unclosed spans.
- tracer.py: SpanTracer, a best-effort adapter from span signals to adk.tracing (never raises; overridable; guarded make_logger).
- yield_delivery.py / auto_send.py: the two delivery adapters (both feed the same SpanDeriver/SpanTracer; finally-flush on early close/error).
- emitter.py: UnifiedEmitter, which ties trace context + delivery + usage; default-on/overridable tracing; injectable tracing/streaming backends.
- conformance/: shared conformance scaffold each future harness tap registers fixtures with.
- .github/workflows/harness-integration.yml: conformance CI job (via ./scripts/test) + an if: false live-matrix placeholder enabled by the migration PRs.
Scope / what's NOT here
Per-harness migration (pydantic-ai / langgraph / openai) and parser taps (claude-code / codex), plus their 3 e2e test agents each (sync/async/temporal), are future migration PRs (4–8) — not in this branch.
Quality gates
./scripts/test tests/lib/core/harness/).
# type: ignore in the package.
Follow-ups (filed)
- Full tool-message wire shape (blocks migration backward-compat claims).
- adk facade before the first consumer migration.
- paths to agentex.types; SpanTracer duplicate-open guard.
- is_error on ToolResponseContent (tool-span error status).
🤖 Generated with Claude Code
Greptile Summary
This PR introduces the foundation for a unified harness tracing/message-emitting surface: a StreamTaskMessage*-based canonical stream is the single source of truth, and shared machinery derives spans from it and delivers it via two channels, yield (sync HTTP ACP) and auto-send (async/Temporal activities), with tracing on by default and a unified TurnUsage/TurnResult shape.
- span_derivation.py: pure SpanDeriver reducer with dual open paths: Start+Delta+Done (streamed tool requests) and Full(ToolRequestContent) (LangGraph-style); parallel-safe via tool_call_id keying; flush() closes any unclosed spans on early exit.
- auto_send.py: index-keyed ctx_map routing, per-context error-guarded _close_all(), last-segment final_text semantics, and correct finally-flush; all previously flagged gaps (single-slot ctx, eager usage read, missing ToolRequest Start delivery) are resolved in this HEAD.
- emitter.py: UnifiedEmitter correctly reads turn.usage() after auto_send exhausts the stream, not eagerly.
Confidence Score: 5/5
Safe to merge; all delivery paths have corresponding tests and the foundational gaps called out in earlier review rounds have been addressed in this HEAD.
The core machinery — index-keyed context routing, per-context error-guarded teardown, correct post-exhaustion usage read, dual span-open paths in SpanDeriver, and finally-flush on early close — all work correctly and are well-covered by 30 tests across 6 modules. The only items worth flagging are the conformance test's idempotency-only assertion (correctness verification is deferred to AGX1-373) and the string-comparison reasoning guard in SpanDeriver.
tests/lib/core/harness/conformance/test_conformance.py — conformance suite currently provides no signal-correctness coverage, only idempotency.
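The finally-flush on early close mentioned in the summary above can be illustrated with a plain generator. This sketch is synchronous for brevity, while the real yield adapter is presumably asynchronous; `ToyDeriver` and its `"open:x"` / `"close:x"` string events are hypothetical stand-ins for SpanDeriver and the canonical stream.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Signal = Tuple[str, str]  # ("open" | "close", span key)


class ToyDeriver:
    """Toy reducer: 'open:x' opens span x, 'close:x' closes it."""

    def __init__(self) -> None:
        self._open: List[str] = []

    def observe(self, event: str) -> List[Signal]:
        verb, _, key = event.partition(":")
        if verb == "open":
            self._open.append(key)
            return [("open", key)]
        if verb == "close" and key in self._open:
            self._open.remove(key)
            return [("close", key)]
        return []

    def flush(self) -> List[Signal]:
        signals = [("close", key) for key in self._open]
        self._open.clear()
        return signals


def yield_events(
    events: Iterable[str],
    deriver: ToyDeriver,
    handle: Callable[[Signal], None],
) -> Iterator[str]:
    try:
        for event in events:
            for signal in deriver.observe(event):
                handle(signal)
            yield event  # pass the event through unchanged
    finally:
        # Runs on normal exhaustion, on error, and when the consumer
        # closes the generator early, so no span is left open.
        for signal in deriver.flush():
            handle(signal)
```

If a consumer stops after the first event and calls `close()` on the generator, the `finally` block still emits a close signal for the span that was opened, which is the property the early-close tests in this PR are described as covering.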
Important Files Changed
content.type == "reasoning") rather than isinstance, which is fragile if the type string changes.
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Harness as Harness tap
    participant Emitter as UnifiedEmitter
    participant YD as yield_events
    participant AS as auto_send
    participant SD as SpanDeriver
    participant ST as SpanTracer
    participant Streaming as adk.streaming
    Note over Emitter: yield_turn (sync HTTP ACP)
    Harness->>Emitter: yield_turn(turn)
    Emitter->>YD: yield_events(turn.events, tracer)
    loop each event
        YD->>SD: observe(event)
        SD-->>YD: [SpanSignals]
        YD->>ST: handle(signal)
        YD-->>Emitter: yield event
    end
    YD->>SD: flush()
    SD-->>YD: [unclosed signals]
    YD->>ST: handle(signal)
    Note over Emitter: auto_send_turn (async/Temporal)
    Harness->>Emitter: auto_send_turn(turn)
    Emitter->>AS: auto_send(turn.events, ...)
    loop each event
        AS->>SD: observe(event)
        SD-->>AS: [SpanSignals]
        AS->>ST: handle(signal)
        AS->>Streaming: streaming_task_message_context(...)
        Streaming-->>AS: ctx
        AS->>Streaming: ctx.stream_update(delta) / ctx.close()
    end
    AS->>SD: flush()
    AS-->>Emitter: TurnResult(final_text)
    Emitter->>Harness: "result (with result.usage = turn.usage())"
Comments Outside Diff (1)
General comment
handle() method to UnifiedEmitter(..., tracer=custom_tracer) did not use that object. Instead, the emitter fell through to default construction of SpanTracer, which attempted to import the real ADK stack and failed in this environment with ModuleNotFoundError: No module named 'temporalio'. This contradicts the requested override behavior for tracing in the unified emitter surface.
src/agentex/lib/core/harness/emitter.py only accepts overrides when isinstance(tracer, SpanTracer) is true. A valid injected/duck-typed tracer object is ignored, causing fallback to SpanTracer(...) whenever trace_id is truthy.
handle(signal) contract, or define and use a runtime-checkable tracer Protocol instead of requiring isinstance(tracer, SpanTracer).
Reviews (12): Last reviewed commit: "feat(harness): mark derived tool spans a..."
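One way the runtime-checkable Protocol suggested in the general comment could look is sketched below. This is not the emitter's actual code: `TracerLike`, `DefaultTracer`, `RecordingTracer`, and `resolve_tracer` are hypothetical names, and `DefaultTracer` merely stands in for the real SpanTracer so the example has no ADK dependency.

```python
from typing import Any, List, Optional, Protocol, runtime_checkable


@runtime_checkable
class TracerLike(Protocol):
    """Anything with a handle(signal) method counts as a tracer."""

    def handle(self, signal: Any) -> None: ...


class DefaultTracer:
    """Stand-in for the default tracer (the real one needs the ADK stack)."""

    def handle(self, signal: Any) -> None:
        pass


class RecordingTracer:
    """Duck-typed test double: does not inherit from DefaultTracer."""

    def __init__(self) -> None:
        self.signals: List[Any] = []

    def handle(self, signal: Any) -> None:
        self.signals.append(signal)


def resolve_tracer(
    tracer: object = None, trace_id: Optional[str] = None
) -> Optional[TracerLike]:
    # Accept any object satisfying the protocol, not just one concrete class.
    if isinstance(tracer, TracerLike):
        return tracer
    # Fall back to the default only when no usable override was supplied.
    if trace_id:
        return DefaultTracer()
    return None
```

Note that `isinstance` against a runtime-checkable Protocol only verifies that a `handle` attribute exists, not its signature, so a static type checker is still needed to catch a tracer whose `handle` takes the wrong arguments.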