tracking: openab-agent native MCP — Phase 3 (Resilience) #966

@brettchien

Phase 3 of the openab-agent native MCP client (ADR docs/adr/openab-agent-mcp.md §5.9 + §8 + §9).

Tracks the final phase of the rollout plan. Phase 1 (foundation) shipped in #959; Phase 2 (network + auth) shipped in #962. Phase 3 closes out the rollout with resilience features and feature-flag retirement.

Progress

  • Commit 1 — Remove --features mcp flag (branch feat/openab-agent-mcp-resilience, commit b3a310b)
  • Commit 2 — Circuit breaker (design decisions resolved below — implementation in progress)
  • Commit 3 — mcp doctor CLI

Scope

1. Circuit breaker — Hermes pattern (ADR §5.9)

Per-server backoff/breaker on top of the existing Failed(_) status. Prevents repeated connect attempts from hammering a flapping server (and from filling the agent log with the same error 100 times during a tool-call burst).

Design questions — RESOLVED 2026-06-01

Q1. Backoff curve — fixed cooldown 3-state breaker vs per-attempt exponential backoff.

| Option | Model | Hermes? |
|---|---|---|
| (a) Fixed cooldown 3-state breaker | 3 fails / 30s → Open 60s → HalfOpen 1-probe → Closed/Open | ✅ ADR §5.9 already specs this |
| (b) Per-attempt exponential | Each Failed adds growing delay (1s/5s/30s/5min), no state machine | |

Pros / cons:

  • (a) ✅ ADR-aligned; clear state semantics (Closed/Open/HalfOpen); single tunable (cooldown 60s); Hermes-proven in prod. ❌ Fixed 60s is uniform — doesn't adapt to short flake vs long outage; long-outage case wakes up + re-trips every cooldown.
  • (b) ✅ Self-adapts (short blip = short delay, sustained outage = long delay); idiomatic HTTP-retry pattern (AWS SDK, gRPC). ❌ No "Open" concept (every call still attempts, just throttled — harder to surface in mcp doctor); more state to track; ADR §5.9 already specs the breaker form — switching means ADR rewrite.

Decision: (a) Fixed cooldown 3-state breaker. ADR-aligned and Hermes-validated. If consecutive Open-trips become a real signal later, can layer exponential cooldown on top in v2.
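The outcome-driven transitions of option (a) can be sketched as below. This is a minimal illustration, not the ADR's or Hermes's implementation: the type and method names (`Breaker`, `BreakerState`, `on_failure`, `on_success`) are hypothetical, and "3 fails / 30s" is read here as three consecutive failures inside one 30-second window, which is an assumption. How the breaker enters HalfOpen is left out because that is the subject of Q3.

```rust
use std::time::{Duration, Instant};

// Values quoted in the issue: 3 fails / 30s trips the breaker, cooldown is 60s.
const FAILURE_THRESHOLD: u32 = 3;
const FAILURE_WINDOW: Duration = Duration::from_secs(30);
const COOLDOWN: Duration = Duration::from_secs(60);

#[derive(Debug, Clone, Copy, PartialEq)]
enum BreakerState {
    Closed,
    Open { until: Instant },
    HalfOpen,
}

#[derive(Debug)]
struct Breaker {
    state: BreakerState,
    consecutive_failures: u32,
    first_failure_at: Option<Instant>,
}

impl Breaker {
    fn new() -> Self {
        Breaker { state: BreakerState::Closed, consecutive_failures: 0, first_failure_at: None }
    }

    /// Record one transport-level failure observed at `now`.
    fn on_failure(&mut self, now: Instant) {
        match self.state {
            // A failed probe goes straight back to Open with a fresh cooldown.
            BreakerState::HalfOpen => self.trip(now),
            // Nothing to count here: this sketch assumes calls are rejected while Open.
            BreakerState::Open { .. } => {}
            BreakerState::Closed => {
                // Keep the current window if it is still fresh, otherwise start a new one.
                let window_start = match self.first_failure_at {
                    Some(t) if now.duration_since(t) <= FAILURE_WINDOW => t,
                    _ => {
                        self.consecutive_failures = 0;
                        now
                    }
                };
                self.first_failure_at = Some(window_start);
                self.consecutive_failures += 1;
                if self.consecutive_failures >= FAILURE_THRESHOLD {
                    self.trip(now);
                }
            }
        }
    }

    /// Any success closes the breaker and clears the counter.
    fn on_success(&mut self) {
        self.state = BreakerState::Closed;
        self.consecutive_failures = 0;
        self.first_failure_at = None;
    }

    fn trip(&mut self, now: Instant) {
        self.state = BreakerState::Open { until: now + COOLDOWN };
        self.consecutive_failures = 0;
        self.first_failure_at = None;
    }
}
```

Passing `now` in as a parameter, rather than calling `Instant::now()` inside the methods, keeps the transitions deterministic under test.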


Q2. Counter scope — single consecutive_failures per server vs split connect / tool-call counters.

Pre-discussion clarifier — only transport-level failures count. JSON-RPC error responses ({error: {code, message}}) and tool isError: true content are protocol-normal and must NOT increment.

| Option | Model |
|---|---|
| (a) Single counter | One consecutive_failures: u32; connect fail + tool-call transport fail both increment; any success resets; trip on 3 fails / 30s |
| (b) Two counters | connect_failures + tool_call_failures separate, possibly different thresholds; either-or trip |

Pros / cons:

  • (a) ✅ Hermes uses this (single _server_error_counts dict, validated in prod); "server unhealthy" is one concept; one tunable; remaining failure modes are transport-only so partitioning gives no real signal. ❌ Can't selectively trip earlier on connect-only failures.
  • (b) ✅ Can tune connect to trip earlier than tool-call (e.g. connect=2 / tool-call=10); mcp doctor can surface "5 mid-call disconnects but 0 connect failures = network OK, server unstable mid-call". ❌ Tunable surface doubles; state machine edge cases ("either trips → Open" vs "both must trip"); no prior art — bespoke design beyond Hermes.

Decision: (a) Single counter. Failure modes are highly homogeneous after JSON-RPC/isError exclusion. Easy to split later (1 → 2) if data shows it's needed; harder to merge back (2 → 1).
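A rough sketch of the counting rule follows, to make the transport-only clarifier concrete. The `CallOutcome` and `CounterEffect` types and both function names are hypothetical, not taken from the ADR or from rmcp. The issue only says that JSON-RPC errors and `isError: true` results must not increment; whether they also reset the counter is not stated, so this sketch leaves the counter unchanged for them, which is an assumption.

```rust
/// Hypothetical classification of what came back from one connect or tool call.
#[derive(Debug)]
enum CallOutcome {
    /// Normal result.
    Success,
    /// Tool result whose content carries isError: true (protocol-normal).
    ToolError,
    /// JSON-RPC error response {error: {code, message}} (protocol-normal).
    JsonRpcError { code: i64, message: String },
    /// Connect failure or mid-call disconnect.
    TransportFailure(String),
}

#[derive(Debug, PartialEq)]
enum CounterEffect {
    Increment,
    Reset,
    Unchanged,
}

/// Only transport-level failures count toward the breaker.
fn counter_effect(outcome: &CallOutcome) -> CounterEffect {
    match outcome {
        CallOutcome::TransportFailure(_) => CounterEffect::Increment,
        CallOutcome::Success => CounterEffect::Reset,
        // Assumption: protocol-level errors neither increment nor reset.
        CallOutcome::ToolError | CallOutcome::JsonRpcError { .. } => CounterEffect::Unchanged,
    }
}

/// Apply one outcome to the single per-server consecutive_failures counter.
fn apply(consecutive_failures: u32, outcome: &CallOutcome) -> u32 {
    match counter_effect(outcome) {
        CounterEffect::Increment => consecutive_failures + 1,
        CounterEffect::Reset => 0,
        CounterEffect::Unchanged => consecutive_failures,
    }
}
```

Connect failures and tool-call transport failures both map to the same `TransportFailure` variant here, which is what makes a single counter sufficient.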


Q3. Reset trigger — what generates the half-open probe attempt?

| Option | Behavior |
|---|---|
| (a) Background poll | Periodic background mcp status task; HalfOpen + cooldown elapsed → next poll = probe |
| (b) Explicit-only | No auto-probe; Open blocks all calls until user runs mcp connect <name> |
| (c) Lazy / piggyback | No background task; HalfOpen + cooldown elapsed → next user-initiated tool call = probe |

Pros / cons:

  • (a) ✅ Self-healing without user input; mcp doctor can show "last poll: healthy". ❌ Extra tokio task + lifecycle; touches potentially-flaky server on schedule; probe may land on a server mid-restart → false-positive re-Open.
  • (b) ✅ Zero background traffic; simplest impl. ❌ Not really a circuit breaker — just a threshold-disable; manual recovery for any transient outage = poor UX; Hermes doesn't do this.
  • (c) ✅ Industry-standard (Hystrix, circuit-breaker Rust crate, Hermes all use this); no background task; self-healing; matches ADR §5.9 diagram exactly. ❌ First post-cooldown call sees one error if still down (acceptable — Open state already returned error before).

Decision: (c) Lazy / piggyback. Matches Hermes (tools/mcp_tool.py, lines 2480-2510: "we let the next call through as a probe. On success the success-path resets the breaker; on failure the error paths bump the count again").

Hermes additionally has (i) a manual /mcp refresh slash command and (ii) an OAuth-recovery-success auto-close path. Both are deferred: implement the (c) primary path first, then add the manual mcp connect <name> escape hatch and the OAuth-recovery close in a follow-up if needed.
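The lazy probe of option (c) might look roughly like the admission check below: no timer or background task exists, and the cooldown is only evaluated when a user-initiated call arrives. All names (`LazyBreaker`, `Admission`, `admit`, `probe_finished`) are hypothetical. Rejecting further calls while one probe is in flight is an assumption based on the "HalfOpen 1-probe" wording in Q1.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Closed,
    Open { since: Instant },
    HalfOpen,
}

#[derive(Debug, PartialEq)]
enum Admission {
    /// Breaker is closed; run the call normally.
    Allow,
    /// Cooldown has elapsed; this call doubles as the probe.
    Probe,
    /// Fail fast without touching the server.
    Reject,
}

struct LazyBreaker {
    state: State,
    cooldown: Duration,
}

impl LazyBreaker {
    /// Called at the start of every tool call; the only place time is checked.
    fn admit(&mut self, now: Instant) -> Admission {
        match self.state {
            State::Closed => Admission::Allow,
            State::Open { since } if now.duration_since(since) >= self.cooldown => {
                self.state = State::HalfOpen;
                Admission::Probe
            }
            State::Open { .. } => Admission::Reject,
            // Assumption: only one probe at a time, so concurrent calls are rejected.
            State::HalfOpen => Admission::Reject,
        }
    }

    /// Called when the probe call returns.
    fn probe_finished(&mut self, ok: bool, now: Instant) {
        self.state = if ok { State::Closed } else { State::Open { since: now } };
    }
}
```

A caller would invoke `admit` before dispatching a tool call and, on `Admission::Probe`, report the result through `probe_finished`, so the first call after the cooldown is the one that either closes the breaker or re-arms it.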

2. mcp doctor CLI (ADR §8)

Interactive diagnostic subcommand. Helps users debug "why won't my server connect" without grep-ing logs. Output mirrors the ADR §8 design:

  • Run connect() for each configured server, surface result
  • For oauth servers: check cached token, refresh if expired, report status
  • List missing inputs — client_id env var unset, redirect_uri missing, device_authorization_endpoint absent on Custom provider, etc.
  • Recommend remediation (run mcp login X, set FOO_CLIENT_ID, ...)
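As an illustration of the "list missing inputs" and "recommend remediation" steps above, the sketch below checks one OAuth server entry and maps the findings to a process exit code. The `OAuthServerConfig` struct, its field names, the message wording, and the exit-code convention are all hypothetical; the real mcp.json schema and the ADR §8 output format are not shown in this issue. The environment lookup is passed in as a closure so the check can run without touching the real process environment.

```rust
/// Hypothetical view of one OAuth-enabled server entry; not the real mcp.json schema.
struct OAuthServerConfig {
    name: String,
    /// Name of the env var expected to hold the client id, e.g. "FOO_CLIENT_ID".
    client_id_env: String,
    redirect_uri: Option<String>,
    is_custom_provider: bool,
    device_authorization_endpoint: Option<String>,
}

/// Return one human-readable finding (problem plus suggested fix) per missing input.
fn missing_inputs<F>(cfg: &OAuthServerConfig, env: F) -> Vec<String>
where
    F: Fn(&str) -> Option<String>,
{
    let mut findings = Vec::new();

    let client_id_set = env(&cfg.client_id_env).map_or(false, |v| !v.is_empty());
    if !client_id_set {
        findings.push(format!(
            "{}: client id env var is unset (set {})",
            cfg.name, cfg.client_id_env
        ));
    }
    if cfg.redirect_uri.is_none() {
        findings.push(format!("{}: redirect_uri is missing (add it to the server entry)", cfg.name));
    }
    if cfg.is_custom_provider && cfg.device_authorization_endpoint.is_none() {
        findings.push(format!(
            "{}: device_authorization_endpoint is absent on Custom provider",
            cfg.name
        ));
    }
    findings
}

/// Assumed convention: exit 0 when nothing is wrong, 1 otherwise.
fn exit_code(findings: &[String]) -> i32 {
    if findings.is_empty() { 0 } else { 1 }
}
```

In a real binary the closure would typically be `|k| std::env::var(k).ok()`, and the findings from all configured servers would be printed as the summary before exiting with the computed code.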

3. Remove --features mcp flag (ADR §9) — DONE

✅ Shipped in commit b3a310b (branch feat/openab-agent-mcp-resilience). Cargo [features] block dropped, rmcp promoted to unconditional dep, all #[cfg(feature = "mcp")] / #[cfg(not(feature = "mcp"))] gates removed, CI matrix collapsed to single combo.

Out of scope

  • Per-tool permission gates (post-Phase-3 opt-in flag — see ADR §10 resolved-at-design-time list)
  • resources / prompts capabilities (v2)
  • Broker-side [agent].inherit_cloud_mcp_servers (issue #753, "feat(acp): allow opting out of inherited cloud MCP connectors via [agent].inherit_cloud_mcp_servers"; independent track)
  • Manual mcp connect <name> escape hatch (TBD follow-up — Hermes equivalent: /mcp refresh)
  • OAuth-recovery-success auto-close of breaker (TBD follow-up — Hermes does this, may matter once OAuth refresh races land)

Sequencing

Suggested commit order (low → high coupling):

  1. ✅ Remove --features mcp flag (mechanical; smallest PR, lands first) — b3a310b
  2. Circuit breaker (design decisions above are locked; implementation in progress)
  3. mcp doctor CLI (uses breaker state — lands last)

May land as a single stacked PR or three separate PRs depending on diff size. Decision deferred until commit 2 size is known.

Test plan

  • Breaker opens after 3 consecutive transport-level failures within 30s
  • Breaker auto-resets on next call after 60s cooldown (HalfOpen probe success)
  • HalfOpen probe failure re-arms cooldown (back to Open)
  • JSON-RPC error responses do NOT increment failure counter
  • Tool isError: true content does NOT increment failure counter
  • mcp doctor reports each server's status with actionable remediation
  • mcp doctor against fully-broken mcp.json produces non-zero exit + readable summary
  • cargo build works (no dead #[cfg] left after flag removal)
  • No #[cfg(feature = "mcp")] left in tree
