feat(harness): unified harness surface — foundation (span derivation, delivery adapters, emitter)#412

Merged: declan-scale merged 30 commits into next from declan-scale/unified-harness-surface on Jun 22, 2026

Conversation

@declan-scale commented Jun 18, 2026 (Contributor)

What this is

Foundation (PRs 1–3 of the rollout) for a unified harness tracing/message-emitting surface: the Agentex StreamTaskMessage* stream is the single source of truth, and shared harness-independent machinery derives spans from it and delivers it over both channels:

  • yield — pass the canonical stream through to the caller (sync HTTP ACP agents),
  • auto-send — push to the task stream via adk.streaming (async + temporal agents, from inside an activity),

with tracing on by default (derived from the same stream) and overridable, and a unified TurnUsage/TurnResult shape for per-harness usage normalization.

Design: docs/superpowers/specs/2026-06-18-unified-harness-surface-design.md
Plan: docs/superpowers/plans/2026-06-18-unified-harness-surface-foundation.md

What's in src/agentex/lib/core/harness/

  • types.py: StreamTaskMessage, OpenSpan/CloseSpan/SpanSignal, TurnUsage, TurnResult, HarnessTurn protocol.
  • span_derivation.py: SpanDeriver, a pure reducer (no adk dep) from the canonical stream to span signals. A tool span opens on the Done of a ToolRequestContent index and closes on the matching ToolResponseContent by tool_call_id; a reasoning span opens on Start and closes on Done; parallel-safe; flush() closes unclosed spans.
  • tracer.py: SpanTracer, a best-effort adapter from span signals to adk.tracing (never raises; overridable; guarded make_logger).
  • yield_delivery.py / auto_send.py: the two delivery adapters (both feed the same SpanDeriver/SpanTracer; finally-flush on early close/error).
  • emitter.py: UnifiedEmitter, which ties together trace context + delivery + usage; default-on/overridable tracing; injectable tracing/streaming backends.
  • conformance/: shared conformance scaffold that each future harness tap registers fixtures with.
  • .github/workflows/harness-integration.yml: conformance CI job (via ./scripts/test) plus an if: false live-matrix placeholder, to be enabled by the migration PRs.
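
To make the span-derivation rules in the list above concrete, here is a minimal sketch of a pure reducer with the same open/close behavior. It is not the code in span_derivation.py: the event classes (Start, Done, ToolResponse), the signal fields, and the names SketchSpanDeriver and derive_all are simplified, hypothetical stand-ins for the real StreamTaskMessage* types.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical, simplified stand-ins for the real StreamTaskMessage* events.
@dataclass(frozen=True)
class Start:
    index: int
    kind: str  # "text", "reasoning", or "tool_request"
    tool_call_id: Optional[str] = None
    name: Optional[str] = None


@dataclass(frozen=True)
class Done:
    index: int


@dataclass(frozen=True)
class ToolResponse:
    tool_call_id: str


@dataclass(frozen=True)
class OpenSpan:
    key: str
    name: str


@dataclass(frozen=True)
class CloseSpan:
    key: str


class SketchSpanDeriver:
    """Pure reducer: no I/O, only event -> list of span signals."""

    def __init__(self):
        self._pending = {}  # stream index -> Start seen but not yet Done
        self._open = {}     # span key -> span name (insertion-ordered)

    def observe(self, event):
        if isinstance(event, Start):
            self._pending[event.index] = event
            if event.kind == "reasoning":
                return [self._open_span(f"reasoning:{event.index}", "reasoning")]
        elif isinstance(event, Done):
            start = self._pending.pop(event.index, None)
            if start is not None and start.kind == "reasoning":
                return self._close_span(f"reasoning:{event.index}")
            if start is not None and start.kind == "tool_request" and start.tool_call_id:
                # The tool span opens only once the request is fully streamed.
                return [self._open_span(start.tool_call_id, start.name or "tool")]
        elif isinstance(event, ToolResponse):
            return self._close_span(event.tool_call_id)
        return []

    def flush(self):
        """Close whatever is still open, e.g. after an early exit."""
        signals = [CloseSpan(key) for key in self._open]
        self._open.clear()
        return signals

    def _open_span(self, key, name):
        self._open[key] = name
        return OpenSpan(key, name)

    def _close_span(self, key):
        # A close for a span that was never opened is ignored.
        return [CloseSpan(key)] if self._open.pop(key, None) is not None else []


def derive_all(events):
    deriver = SketchSpanDeriver()
    signals = [s for event in events for s in deriver.observe(event)]
    return signals + deriver.flush()
```

Keying tool spans by tool_call_id rather than by stream index is what lets two interleaved tool calls close independently in this sketch; flush() then closes any span whose response never arrived.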

Scope / what's NOT here

Per-harness migration (pydantic-ai / langgraph / openai) and parser taps (claude-code / codex), plus their 3 e2e test agents each (sync/async/temporal), are future migration PRs (4–8) — not in this branch.

Quality gates

  • 30 tests passing on Python 3.12 + 3.13 (via ./scripts/test tests/lib/core/harness/).
  • pyright clean (0 errors/warnings), no # type: ignore in the package.
  • Each task spec- + quality-reviewed; final whole-branch review passed with no Critical issues.

Follow-ups (filed)

  • AGX1-373 (High) — make conformance assert true yield-vs-auto-send equivalence + reconcile Full tool-message wire shape (blocks migration backward-compat claims).
  • AGX1-374 (Medium) — auto_send reasoning + mixed-ordering tests.
  • AGX1-375 (Medium) — expose the surface via the public adk facade before the first consumer migration.
  • AGX1-376 (Low) — widen CI paths to agentex.types; SpanTracer duplicate-open guard.
  • AGX1-371 — deferred optional is_error on ToolResponseContent (tool-span error status).

Note: total diff is ~3k lines but ~1.6k of that is the spec + plan docs; the package code + tests + CI is ~1.4k. Reviewable per-commit (one commit per plan task).

🤖 Generated with Claude Code

Greptile Summary

This PR introduces the foundation for a unified harness tracing/message-emitting surface: a StreamTaskMessage*-based canonical stream is the single source of truth, and shared machinery derives spans from it and delivers it via two channels — yield (sync HTTP ACP) and auto-send (async/Temporal activities) — with tracing on by default and a unified TurnUsage/TurnResult shape.

  • span_derivation.py — pure SpanDeriver reducer with dual open paths: Start+Delta+Done (streamed tool requests) and Full(ToolRequestContent) (LangGraph-style); parallel-safe via tool_call_id keying; flush() closes any unclosed spans on early exit.
  • auto_send.py — index-keyed ctx_map routing, per-context error-guarded _close_all(), last-segment final_text semantics, and correct finally-flush; all previously flagged gaps (single-slot ctx, eager usage read, missing ToolRequest Start delivery) are resolved in this HEAD.
  • emitter.py: UnifiedEmitter correctly reads turn.usage() after auto_send exhausts the stream, not eagerly.

Confidence Score: 5/5

Safe to merge; all delivery paths have corresponding tests and the foundational gaps called out in earlier review rounds have been addressed in this HEAD.

The core machinery — index-keyed context routing, per-context error-guarded teardown, correct post-exhaustion usage read, dual span-open paths in SpanDeriver, and finally-flush on early close — all work correctly and are well-covered by 30 tests across 6 modules. The only items worth flagging are the conformance test's idempotency-only assertion (correctness verification is deferred to AGX1-373) and the string-comparison reasoning guard in SpanDeriver.

tests/lib/core/harness/conformance/test_conformance.py — conformance suite currently provides no signal-correctness coverage, only idempotency.
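
The gap flagged above can be illustrated with a short sketch: an idempotency-only assertion passes even for a deriver that drops every span, while an expected-signal assertion catches it. Everything here (derive_all_broken, the string event encoding, the EXPECTED tuples) is hypothetical and is not the real conformance fixture format.

```python
def derive_all_broken(events):
    # A deliberately broken deriver: it emits no span signals at all.
    return []


def is_idempotent(derive, events):
    # The shape of the current check: same input, same output.
    return derive(events) == derive(events)


def matches_expected(derive, events, expected):
    # The stronger check: output equals the fixture's expected signals.
    return derive(events) == expected


# Hypothetical fixture for a single tool call.
EVENTS = ["tool_request_done:call-1", "tool_response:call-1"]
EXPECTED = [("open", "call-1"), ("close", "call-1")]
```

With these definitions, is_idempotent(derive_all_broken, EVENTS) is true while matches_expected(derive_all_broken, EVENTS, EXPECTED) is false, which is the regression signal the idempotency-only test cannot provide.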

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/agentex/lib/core/harness/types.py | Defines core types: StreamTaskMessage union, OpenSpan/CloseSpan/SpanSignal, TurnUsage, TurnResult, HarnessTurn protocol. Clean dataclass/Pydantic design; runtime-checkable protocol for structural duck-typing. |
| src/agentex/lib/core/harness/span_derivation.py | Pure stateful reducer converting StreamTaskMessage events to span open/close signals. Handles both Start+Delta+Done (streamed) and Full (LangGraph-style) tool-request paths. Minor: reasoning detection uses string comparison (content.type == "reasoning") rather than isinstance, which is fragile if the type string changes. |
| src/agentex/lib/core/harness/auto_send.py | Async delivery adapter: index-keyed ctx_map routing, per-context error-guarded _close_all(), last-segment final_text semantics, Full(TextContent) contributing to final_text, and correct streaming of Start(ToolRequestContent). All previously flagged gaps addressed in current HEAD. |
| src/agentex/lib/core/harness/emitter.py | UnifiedEmitter facade wiring trace context + delivery mode. auto_send_turn correctly calls turn.usage() after auto_send consumes the stream; the previously flagged eager-evaluation bug is resolved. |
| src/agentex/lib/core/harness/tracer.py | Best-effort span adapter to adk.tracing; correctly swallows tracing errors, records is_error via span.data mutation. Duplicate-open span orphan behavior is documented and tracked in AGX1-376. |
| src/agentex/lib/core/harness/yield_delivery.py | Straightforward passthrough async generator with try/finally flush; correctly handles early aclose() via GeneratorExit triggering the finally block. |
| tests/lib/core/harness/conformance/test_conformance.py | Conformance test asserts only idempotency (derive_all(x)==derive_all(x)), not correctness of the actual span signals. The registered builtin-single-tool fixture never has its expected signals verified. Tracked in AGX1-373 but currently provides near-zero regression value. |
| .github/workflows/harness-integration.yml | New CI job delegating to ./scripts/test (preserving exact multi-Python-version invocation); placeholder live-matrix job with if:false is intentional and harmless. |
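
The yield_delivery.py entry above relies on a standard Python behavior: closing an async generator raises GeneratorExit at the suspended yield, so a finally block still runs. A minimal sketch of that pattern follows; the yield_events signature here (plain observe and flush callables) is an assumption for illustration and differs from the real adapter, which takes a tracer.

```python
import asyncio


async def yield_events(events, observe, flush):
    """Pass events through unchanged; always flush, even on early close."""
    try:
        async for event in events:
            observe(event)  # feed the span deriver before handing the event on
            yield event
    finally:
        flush()  # runs on exhaustion, on error, and on aclose()


async def _source():
    for i in range(3):
        yield i


async def demo():
    seen, flushed = [], []
    gen = yield_events(_source(), seen.append, lambda: flushed.append(True))
    first = await gen.__anext__()  # consume exactly one event
    await gen.aclose()             # caller stops early
    return first, seen, flushed
```

Note that the sketch calls aclose() explicitly. A consumer that merely stops iterating leaves the generator suspended until it is finalized, so the flush would be delayed rather than immediate.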

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Harness as Harness tap
    participant Emitter as UnifiedEmitter
    participant YD as yield_events
    participant AS as auto_send
    participant SD as SpanDeriver
    participant ST as SpanTracer
    participant Streaming as adk.streaming

    Note over Emitter: yield_turn (sync HTTP ACP)
    Harness->>Emitter: yield_turn(turn)
    Emitter->>YD: yield_events(turn.events, tracer)
    loop each event
        YD->>SD: observe(event)
        SD-->>YD: [SpanSignals]
        YD->>ST: handle(signal)
        YD-->>Emitter: yield event
    end
    YD->>SD: flush()
    SD-->>YD: [unclosed signals]
    YD->>ST: handle(signal)

    Note over Emitter: auto_send_turn (async/Temporal)
    Harness->>Emitter: auto_send_turn(turn)
    Emitter->>AS: auto_send(turn.events, ...)
    loop each event
        AS->>SD: observe(event)
        SD-->>AS: [SpanSignals]
        AS->>ST: handle(signal)
        AS->>Streaming: streaming_task_message_context(...)
        Streaming-->>AS: ctx
        AS->>Streaming: ctx.stream_update(delta) / ctx.close()
    end
    AS->>SD: flush()
    AS-->>Emitter: TurnResult(final_text)
    Emitter->>Harness: "result (with result.usage = turn.usage())"

Comments Outside Diff (1)

  1. General comment

    P1 UnifiedEmitter ignores duck-typed tracer override and constructs real SpanTracer

    • Bug
      • Passing a custom tracer object with an async handle() method to UnifiedEmitter(..., tracer=custom_tracer) did not use that object. Instead, the emitter fell through to default construction of SpanTracer, which attempted to import the real ADK stack and failed in this environment with ModuleNotFoundError: No module named 'temporalio'. This contradicts the requested override behavior for tracing in the unified emitter surface.
    • Cause
      • src/agentex/lib/core/harness/emitter.py only accepts overrides when isinstance(tracer, SpanTracer) is true. A valid injected/duck-typed tracer object is ignored, causing fallback to SpanTracer(...) whenever trace_id is truthy.
    • Fix
      • Relax the override branch to accept any non-None, non-False tracer object that implements the expected handle(signal) contract, or define and use a runtime-checkable tracer Protocol instead of requiring isinstance(tracer, SpanTracer).
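
One way the Protocol-based fix suggested above could look is sketched below. TracerLike, DefaultTracer, and resolve_tracer are hypothetical names, and treating tracer=False as "tracing disabled" is an assumption inferred from the "non-None, non-False" wording, not confirmed emitter behavior.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class TracerLike(Protocol):
    """Anything exposing handle(signal) counts as a tracer."""

    async def handle(self, signal: Any) -> None: ...


class DefaultTracer:
    """Stand-in for the real SpanTracer (which would import the ADK stack)."""

    async def handle(self, signal: Any) -> None:
        pass


def resolve_tracer(tracer, trace_id):
    if tracer is False:
        return None  # assumption: False explicitly disables tracing
    if tracer is not None and isinstance(tracer, TracerLike):
        return tracer  # duck-typed override is honored as-is
    if trace_id:
        return DefaultTracer()  # default-on tracing
    return None


class CustomTracer:
    def __init__(self):
        self.signals = []

    async def handle(self, signal):
        self.signals.append(signal)
```

A runtime_checkable Protocol only checks that the handle attribute exists, not that it is async or has the right signature, so this is a looser guard than isinstance(tracer, SpanTracer) by design.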

    Ran code and verified through T-Rex

Reviews (12): Last reviewed commit: "feat(harness): mark derived tool spans a..."

Comment thread src/agentex/lib/core/harness/auto_send.py
Comment thread src/agentex/lib/core/harness/auto_send.py Outdated
@declan-scale declan-scale force-pushed the declan-scale/unified-harness-surface branch from d21c54a to ebc468d Compare June 18, 2026 17:29
Comment thread src/agentex/lib/core/harness/span_derivation.py
Comment thread src/agentex/lib/core/harness/auto_send.py
Comment thread src/agentex/lib/core/harness/auto_send.py Outdated
Comment thread src/agentex/lib/core/harness/auto_send.py
@declan-scale (Contributor, Author)

@greptile review

Comment thread src/agentex/lib/core/harness/emitter.py Outdated
@declan-scale (Contributor, Author)

@greptile review

@declan-scale declan-scale force-pushed the declan-scale/unified-harness-surface branch from b4b8b33 to da780a1 Compare June 22, 2026 13:48
Comment thread src/agentex/lib/core/harness/auto_send.py
Comment thread docs/superpowers/plans/2026-06-18-unified-harness-surface-foundation.md Outdated
Comment thread docs/superpowers/specs/2026-06-18-unified-harness-surface-design.md Outdated
@danielmillerp (Contributor)

On AGX1-371 (no is_error/status on ToolResponseContent) — flagging this as more than a nice-to-have for the golden-agent adoption (#422).

Golden agent's ToolCompleted carries is_error today, so a derived tool span can mark failure. Without that field, golden agent loses tool-error status the moment it moves onto this surface — i.e. adoption is a regression on tracing fidelity, not a clean swap.

Since it's an additive optional (is_error: bool | None), can we pull it into this stack rather than deferring? Would rather land it here than block #422 on it later.

@declan-scale (Contributor, Author)

Will land this in scale-agentex and then have a follow up here.

declan-scale and others added 14 commits June 22, 2026 15:46
Approach A (Agentex event stream as canonical source of truth): one tap per
harness feeds shared yield/auto-send delivery adapters and a span-deriving
tracing tap. Additive backwards-compat, stacked PRs <1000 lines, conformance +
live-matrix testing (3 test agents per harness: sync/async/temporal).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… golden-agent integration

- Make tracing-tap span derivation explicit (tool open on Done of a
  ToolRequestContent index, close on matching ToolResponseContent by
  tool_call_id; parallel-safe; reasoning start->done). Flag missing
  is_error on ToolResponseContent as an additive upstream decision.
- Add first-class TurnUsage/TurnResult shape (aligned to llm_metrics token
  taxonomy) attached to the turn span via span(data=) and reused for metrics.
- Document golden-agent integration: all SGP/sandbox/secret/MCP coupling
  stays in the agent; only parsing/streaming/tracing/usage move to SDK taps +
  emitter; sandbox-setup events chain before the harness stream.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… 1-3)

Bite-sized TDD tasks: foundation types, pure SpanDeriver, SpanTracer adapter,
yield + auto_send delivery, UnifiedEmitter facade, conformance scaffold + CI
job. Migration/parser PRs (4-9) listed as follow-on plans.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… signals

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… handling in SpanDeriver

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sts for SpanTracer

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…on early close

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…reaming + tracing)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… + cover error/finally paths in auto_send

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
declan-scale and others added 16 commits June 22, 2026 15:46
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…send_turn + doc tracer modes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…egistry semantics

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…he package

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…or consistency

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…reportImplicitOverride)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…MissingImports)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ast-segment, created_at (AGX1-377, AGX1-378)

auto_send.py:
- Replace single current_ctx with ctx_map[index] so parallel streams route correctly
- Open a streaming context for ALL content types on Start (not just text/reasoning),
  fixing tool_request/tool_response stream delivery (AGX1-377)
- Reset final_text_parts on each new Start(TextContent) and on Full(TextContent)
  so multi-step turns return the LAST text segment, not the full accumulation
- Add created_at: datetime | None param; forward to every
  streaming_task_message_context call (AGX1-378)

span_derivation.py:
- _on_full: handle Full(ToolRequestContent) by opening a tool span keyed by
  tool_call_id if not already open; adds LangGraph full-event harness support

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
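
The routing and last-segment rules in the commit above can be sketched synchronously as follows. This is a rough model under stated assumptions, not auto_send.py: events are hypothetical (kind, index, content_type, text) tuples, and the "streaming context" is just an in-memory buffer.

```python
def route(events):
    """Return (finalized messages in close order, final_text of the turn)."""
    ctx_map = {}      # stream index -> (content_type, buffered deltas)
    closed = []       # finalized (content_type, text) messages
    final_parts = []  # text of the most recent text segment only

    for kind, index, content_type, text in events:
        if kind == "start":
            # Every content type gets its own context, keyed by index.
            ctx_map[index] = (content_type, [])
            if content_type == "text":
                final_parts = []  # a new text segment replaces the old one
        elif kind == "delta":
            ctype, buf = ctx_map[index]
            buf.append(text)
            if ctype == "text":
                final_parts.append(text)
        elif kind == "done":
            ctype, buf = ctx_map.pop(index)
            closed.append((ctype, "".join(buf)))
        elif kind == "full":
            closed.append((content_type, text))
            if content_type == "text":
                final_parts = [text]

    return closed, "".join(final_parts)


EVENTS = [
    ("start", 0, "text", None),
    ("start", 1, "tool_request", None),
    ("delta", 0, "text", "Let me "),
    ("delta", 1, "tool_request", '{"q":'),
    ("delta", 0, "text", "check."),
    ("delta", 1, "tool_request", '"x"}'),
    ("done", 0, "text", None),
    ("done", 1, "tool_request", None),
    ("full", None, "tool_response", "42"),
    ("start", 2, "text", None),
    ("delta", 2, "text", "The answer is 42."),
    ("done", 2, "text", None),
]
```

With a single current context, the interleaved deltas for indexes 0 and 1 would land in the same message; keying by index keeps them apart, and resetting final_parts on each new text segment is what makes the turn return only "The answer is 42.".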
…ting, last-segment, created_at, Full ToolRequest spans

test_auto_send.py:
- Fix test 2: remove bare Start(ToolRequestContent) from events (old behavior was
  that Start did not open a ctx; new behavior does, so test was updated to use
  Full-only events that still verify the two-context behavior)
- Extend _FakeStreaming to record created_at on each context call
- Add test 6: streamed tool_request opens a ctx + routes deltas (AGX1-377 core)
- Add test 7: interleaved indexes route deltas to correct per-index contexts
- Add test 8: multi-step turns return the LAST text segment only
- Add test 9: Full(TextContent) contributes its content to final_text
- Add test 10: created_at is forwarded to every streaming context call (AGX1-378)

test_span_derivation.py:
- Add test_full_tool_request_opens_span: Full(ToolRequestContent) opens a span
- Add test_full_tool_request_and_response_paired: paired Full request+response
  produces a complete OpenSpan+CloseSpan
- Add test_full_tool_request_does_not_double_open: idempotent; a Full for an
  already-open tool_call_id is a no-op

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n (AGX1-378)

So migration helpers can restore the deterministic first-message timestamp on
the temporal path. Default None preserves current behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…turn

Bug: auto_send_turn passed usage=turn.usage() eagerly, capturing the empty
default before the stream was consumed — so TurnResult.usage came back empty
for every auto_send harness (real turns populate usage during consumption).
Now drives delivery first, then reads usage. Adds a contract-honoring
regression test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
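
The eager-usage bug described in the commit above comes down to evaluation order. A small sketch, assuming a hypothetical FakeTurn whose usage is populated while its events are consumed (the token counting is invented for illustration):

```python
import asyncio


class FakeTurn:
    """Hypothetical turn: usage is only known once events are consumed."""

    def __init__(self, chunks):
        self._chunks = chunks
        self._tokens = 0

    async def events(self):
        for chunk in self._chunks:
            self._tokens += len(chunk.split())  # toy token count
            yield chunk

    def usage(self):
        return {"output_tokens": self._tokens}


async def deliver(events):
    parts = []
    async for event in events:
        parts.append(event)
    return "".join(parts)


async def auto_send_turn_eager(turn):
    usage = turn.usage()  # BUG: read before the stream has run
    text = await deliver(turn.events())
    return text, usage


async def auto_send_turn_fixed(turn):
    text = await deliver(turn.events())  # drive delivery first
    return text, turn.usage()            # then read usage
```

In this sketch the eager variant always reports zero tokens, while the fixed variant reports the count accumulated during consumption.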
_close_all() iterated open contexts and closed them sequentially with no
per-context error guard, so a failure on the first context (e.g. a backend
hiccup during teardown) abandoned every remaining open context and their task
messages were never finalized. Wrap each close() in its own try/except (warn
and continue) so all contexts are attempted regardless of individual failures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
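
A minimal sketch of the per-context guard described in the commit above, assuming a hypothetical FakeContext with an async close() and a free-standing close_all helper (the real _close_all lives inside auto_send.py and may differ):

```python
import asyncio
import logging

logger = logging.getLogger("harness.sketch")


class FakeContext:
    """Hypothetical streaming context that can be made to fail on close."""

    def __init__(self, fail=False):
        self.fail = fail
        self.closed = False

    async def close(self):
        if self.fail:
            raise RuntimeError("backend hiccup during teardown")
        self.closed = True


async def close_all(ctx_map):
    # Snapshot the items so the map can be cleared afterwards.
    for index, ctx in list(ctx_map.items()):
        try:
            await ctx.close()
        except Exception:
            # Warn and continue: one failure must not strand the rest.
            logger.warning("failed to close context %s", index, exc_info=True)
    ctx_map.clear()


def demo():
    contexts = {0: FakeContext(fail=True), 1: FakeContext(), 2: FakeContext()}
    snapshot = dict(contexts)
    asyncio.run(close_all(contexts))
    return [snapshot[i].closed for i in (0, 1, 2)], contexts
```

Without the try/except, the RuntimeError from index 0 would propagate and contexts 1 and 2 would never be finalized.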
These were brainstorming/spec artifacts not meant to ship; per review
feedback they don't belong in the foundation PR. Also de-references the
now-removed docs path in the harness-integration TODO comment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ntent.is_error

Now that the SDK carries ToolResponseContent.is_error (AGX1-371), thread it
through the harness surface: CloseSpan gains an is_error field, the span
deriver populates it when closing a tool span on a ToolResponseContent, and
the tracer records the status on span.data (Span has no dedicated error
field; None means the harness reported no status, so data is left untouched).

This preserves tool-error tracing fidelity for harnesses that report tool
failures (e.g. golden agent's ToolCompleted.is_error) once they move onto the
unified surface, instead of silently dropping the status.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
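
The three-valued is_error handling described in the commit above can be sketched as below. CloseSpan here is a simplified stand-in, FakeSpan and record_close are hypothetical, and the "is_error" key on span.data is an assumption about how the status is stored.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class CloseSpan:
    key: str
    is_error: Optional[bool] = None  # None: the harness reported no status


@dataclass
class FakeSpan:
    data: dict = field(default_factory=dict)


def record_close(span, signal):
    # Only touch span.data when the harness actually reported a status.
    if signal.is_error is not None:
        span.data["is_error"] = signal.is_error
    return span
```

Keeping None distinct from False lets a consumer tell "the tool succeeded" apart from "this harness does not report tool status".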
@declan-scale declan-scale force-pushed the declan-scale/unified-harness-surface branch from 3384505 to 41b1a32 Compare June 22, 2026 19:54
@declan-scale declan-scale merged commit a9cacf4 into next Jun 22, 2026
42 checks passed
@declan-scale declan-scale deleted the declan-scale/unified-harness-surface branch June 22, 2026 19:59
@stainless-app stainless-app Bot mentioned this pull request Jun 22, 2026