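Phase 3 of the openab-agent native MCP client (ADR docs/adr/openab-agent-mcp.md §5.9 + §8 + §9).
Tracks the final phase of the rollout plan. Phase 1 (foundation) shipped in #959; Phase 2 (network + auth) shipped in #962. Phase 3 closes out the rollout with resilience features and feature-flag retirement.
Progress
Remove --features mcp flag (branch feat/openab-agent-mcp-resilience, commit b3a310b)
Circuit breaker
mcp doctor CLI
Scope
1. Circuit breaker: Hermes pattern (ADR §5.9)
Per-server backoff/breaker on top of the existing Failed(_) status. Prevents repeated connect attempts from hammering a flapping server (and from filling the agent log with the same error 100 times during a tool-call burst).
Design questions (RESOLVED 2026-06-01)
Q1. Backoff curve: fixed cooldown 3-state breaker vs per-attempt exponential backoff.
(a) Fixed cooldown 3-state breaker
(b) Per-attempt exponential backoff: each Failed adds a growing delay (1s/5s/30s/5min), no state machine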
Pros / cons:
(a) ✅ ADR-aligned; clear state semantics (Closed/Open/HalfOpen); single tunable (cooldown 60s); Hermes-proven in prod. ❌ Fixed 60s is uniform — doesn't adapt to short flake vs long outage; long-outage case wakes up + re-trips every cooldown.
(b) ✅ Self-adapts (short blip = short delay, sustained outage = long delay); idiomatic HTTP-retry pattern (AWS SDK, gRPC). ❌ No "Open" concept (every call still attempts, just throttled — harder to surface in mcp doctor); more state to track; ADR §5.9 already specs the breaker form — switching means ADR rewrite.
Decision: (a) Fixed cooldown 3-state breaker. ADR-aligned and Hermes-validated. If consecutive Open-trips become a real signal later, can layer exponential cooldown on top in v2.
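To make the Q1 trade-off concrete, here is a minimal sketch of both options. All type and function names (`BreakerState`, `observe`, `exponential_delay`) are hypothetical and chosen for illustration; they are not taken from the ADR or from the openab-agent code. The 60s cooldown and the 1s/5s/30s/5min ladder are the values quoted above.

```rust
use std::time::{Duration, Instant};

/// Hypothetical breaker state; names follow the Closed/Open/HalfOpen
/// vocabulary above, not the actual openab-agent types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BreakerState {
    Closed,
    Open { since: Instant },
    HalfOpen,
}

/// Option (a): one fixed cooldown, the single tunable.
pub const COOLDOWN: Duration = Duration::from_secs(60);

impl BreakerState {
    /// State as seen at `now`: Open reads as HalfOpen once the cooldown elapses.
    pub fn observe(self, now: Instant) -> BreakerState {
        match self {
            BreakerState::Open { since }
                if now.saturating_duration_since(since) >= COOLDOWN =>
            {
                BreakerState::HalfOpen
            }
            other => other,
        }
    }
}

/// Option (b), for contrast: the delay grows with each consecutive failure
/// (1s/5s/30s/5min) and there is no Open state.
pub fn exponential_delay(consecutive_failures: u32) -> Duration {
    const LADDER_SECS: [u64; 4] = [1, 5, 30, 300];
    let idx = (consecutive_failures.saturating_sub(1) as usize).min(LADDER_SECS.len() - 1);
    Duration::from_secs(LADDER_SECS[idx])
}
```

With (a), a long outage re-trips once per cooldown; with (b), the delay saturates at the top of the ladder but every call still attempts a connection.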
Q2. Counter scope — single consecutive_failures per server vs split connect / tool-call counters.
Pre-discussion clarifier — only transport-level failures count. JSON-RPC error responses ({error: {code, message}}) and tool isError: true content are protocol-normal and must NOT increment.
| Option | Model |
| --- | --- |
| (a) Single counter | One consecutive_failures: u32; connect fail + tool-call transport fail both increment; any success resets; trip on 3 fails / 30s |
| (b) Two counters | connect_failures + tool_call_failures separate, possibly different thresholds; either-or trip |
Pros / cons:
(a) ✅ Hermes uses this (single _server_error_counts dict, validated in prod); "server unhealthy" is one concept; one tunable; remaining failure modes are transport-only so partitioning gives no real signal. ❌ Can't selectively trip earlier on connect-only failures.
(b) ✅ Can tune connect to trip earlier than tool-call (e.g. connect=2 / tool-call=10); mcp doctor can surface "5 mid-call disconnects but 0 connect failures = network OK, server unstable mid-call". ❌ Tunable surface doubles; state machine edge cases ("either trips → Open" vs "both must trip"); no prior art — bespoke design beyond Hermes.
Decision: (a) Single counter. Failure modes are highly homogeneous after JSON-RPC/isError exclusion. Easy to split later (1 → 2) if data shows it's needed; harder to merge back (2 → 1).
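The single-counter rule, together with the clarifier that protocol-level errors never count, can be sketched as follows. This is an illustration under stated assumptions, not the agreed implementation: the names `Outcome` and `FailureCounter` are hypothetical, "3 fails / 30s" is read here as three consecutive transport failures within a 30-second window, and protocol-level errors are assumed to leave the counter unchanged (the text above only says they must not increment it).

```rust
use std::time::{Duration, Instant};

/// Hypothetical classification of one connect attempt or tool call.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Success,
    /// Server answered with a JSON-RPC `error` object: protocol-normal.
    JsonRpcError,
    /// Tool result carried `isError: true`: protocol-normal.
    ToolIsError,
    /// Connect failure or mid-call transport failure.
    TransportFailure,
}

pub const TRIP_THRESHOLD: u32 = 3;
/// Assumption: the "30s" in "3 fails / 30s" is a window for the failure streak.
pub const WINDOW: Duration = Duration::from_secs(30);

#[derive(Debug, Default)]
pub struct FailureCounter {
    consecutive_failures: u32,
    streak_started: Option<Instant>,
}

impl FailureCounter {
    /// Records one outcome and returns true when the breaker should trip.
    pub fn record(&mut self, outcome: Outcome, now: Instant) -> bool {
        match outcome {
            Outcome::TransportFailure => {
                // Start a fresh streak if there is none or the window has passed.
                let expired = self
                    .streak_started
                    .map_or(true, |t| now.saturating_duration_since(t) > WINDOW);
                if expired {
                    self.consecutive_failures = 0;
                    self.streak_started = Some(now);
                }
                self.consecutive_failures += 1;
                self.consecutive_failures >= TRIP_THRESHOLD
            }
            Outcome::Success => {
                // Any success resets the counter.
                self.consecutive_failures = 0;
                self.streak_started = None;
                false
            }
            // Protocol-normal errors: assumed to neither increment nor reset.
            Outcome::JsonRpcError | Outcome::ToolIsError => false,
        }
    }
}
```

Connect failures and tool-call transport failures both map to `TransportFailure`, which is what keeps this a single counter.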
Q3. Reset trigger — what generates the half-open probe attempt?
| Option | Behavior |
| --- | --- |
| (a) Background poll | Periodic background mcp status task; HalfOpen + cooldown elapsed → next poll = probe |
| (b) Explicit-only | No auto-probe; Open blocks all calls until user runs mcp connect <name> |
| (c) Lazy / piggyback | No background task; HalfOpen + cooldown elapsed → next user-initiated tool call = probe |
Pros / cons:
(a) ✅ Self-healing without user input; mcp doctor can show "last poll: healthy". ❌ Extra tokio task + lifecycle; touches potentially-flaky server on schedule; probe may land on a server mid-restart → false-positive re-Open.
(b) ✅ Zero background traffic; simplest impl. ❌ Not really a circuit breaker — just a threshold-disable; manual recovery for any transient outage = poor UX; Hermes doesn't do this.
(c) ✅ Industry-standard (Hystrix, circuit-breaker Rust crate, Hermes all use this); no background task; self-healing; matches ADR §5.9 diagram exactly. ❌ First post-cooldown call sees one error if still down (acceptable — Open state already returned error before).
Decision: (c) Lazy / piggyback. Matches Hermes (tools/mcp_tool.py lines 2480-2510, quote: "we let the next call through as a probe. On success the success-path resets the breaker; on failure the error paths bump the count again").
Hermes additionally has (i) a manual /mcp refresh slash command and (ii) an OAuth-recovery-success auto-close path. These are TBD — implement the (c) primary path first, defer manual mcp connect <name> escape hatch + OAuth-recovery close to follow-up if needed.
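The lazy probe chosen in Q3 needs no background task: the decision is made at the start of each user-initiated call. A rough sketch follows, with hypothetical names (`LazyBreaker`, `Gate`, `before_call`, `finish_probe`). Allowing only one probe in flight at a time is an assumption added for the sketch, not something the text above specifies.

```rust
use std::time::{Duration, Instant};

/// What the breaker tells the caller before a tool call (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Gate {
    /// Closed: proceed normally.
    Allow,
    /// Cooldown elapsed: this call doubles as the half-open probe.
    Probe,
    /// Open and still cooling down (or a probe is already running).
    Reject,
}

pub struct LazyBreaker {
    cooldown: Duration,
    /// `None` means Closed; `Some(t)` means the breaker opened at `t`.
    opened_at: Option<Instant>,
    /// Assumption: at most one probe is in flight at a time.
    probe_in_flight: bool,
}

impl LazyBreaker {
    pub fn new(cooldown: Duration) -> Self {
        Self { cooldown, opened_at: None, probe_in_flight: false }
    }

    /// Trips the breaker (called once the failure counter hits its threshold).
    pub fn trip(&mut self, now: Instant) {
        self.opened_at = Some(now);
        self.probe_in_flight = false;
    }

    /// Called at the start of every user-initiated tool call.
    pub fn before_call(&mut self, now: Instant) -> Gate {
        match self.opened_at {
            None => Gate::Allow,
            Some(t)
                if !self.probe_in_flight
                    && now.saturating_duration_since(t) >= self.cooldown =>
            {
                self.probe_in_flight = true;
                Gate::Probe
            }
            Some(_) => Gate::Reject,
        }
    }

    /// Called when the probe call finishes: success closes, failure re-opens.
    pub fn finish_probe(&mut self, transport_ok: bool, now: Instant) {
        self.opened_at = if transport_ok { None } else { Some(now) };
        self.probe_in_flight = false;
    }
}
```

A failed probe restarts the cooldown from the time of the failure, which matches the "wakes up + re-trips every cooldown" behavior noted under Q1.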
2. mcp doctor CLI (ADR §8)
Interactive diagnostic subcommand. Helps users debug "why won't my server connect" without grep-ing logs. Output mirrors the ADR §8 design:
Run connect() for each configured server, surface result
For oauth servers: check cached token, refresh if expired, report status
List missing inputs — client_id env var unset, redirect_uri missing, device_authorization_endpoint absent on Custom provider, etc.
Recommend remediation (run mcp login X, set FOO_CLIENT_ID, ...)
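The missing-input and remediation checks listed above could look roughly like this. The struct, its fields, and the wording of the findings are all hypothetical; the real `mcp doctor` output is defined by ADR §8, which is not reproduced here. The non-zero exit code on findings is likewise an assumption for the sketch.

```rust
/// Hypothetical view of one OAuth server's configuration as seen by the doctor.
pub struct OauthInputs {
    /// Name of the env var expected to hold the client id, e.g. "FOO_CLIENT_ID".
    pub client_id_env: String,
    /// Whether that env var is currently set.
    pub client_id_set: bool,
    pub redirect_uri: Option<String>,
    /// True when the server uses a Custom provider.
    pub custom_provider: bool,
    pub device_authorization_endpoint: Option<String>,
    pub has_cached_token: bool,
}

/// Returns one remediation hint per problem found; empty means healthy.
pub fn remediation(server: &str, cfg: &OauthInputs) -> Vec<String> {
    let mut findings = Vec::new();
    if !cfg.client_id_set {
        findings.push(format!("set {}", cfg.client_id_env));
    }
    if cfg.redirect_uri.is_none() {
        findings.push("add redirect_uri".to_string());
    }
    if cfg.custom_provider && cfg.device_authorization_endpoint.is_none() {
        findings.push("add device_authorization_endpoint (Custom provider)".to_string());
    }
    if !cfg.has_cached_token {
        findings.push(format!("run mcp login {server}"));
    }
    findings
}

/// Assumption: any finding yields a non-zero process exit code.
pub fn exit_code(findings: &[String]) -> i32 {
    if findings.is_empty() { 0 } else { 1 }
}
```

Returning plain strings keeps the sketch short; a real implementation would more likely use a typed finding so the CLI can group results per server.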
3. Remove --features mcp flag (ADR §9) — DONE
✅ Shipped in commit b3a310b (branch feat/openab-agent-mcp-resilience). Cargo [features] block dropped, rmcp promoted to unconditional dep, all #[cfg(feature = "mcp")] / #[cfg(not(feature = "mcp"))] gates removed, CI matrix collapsed to single combo.
Out of scope
Per-tool permission gates (post-Phase-3 opt-in flag — see ADR §10 resolved-at-design-time list)
resources/prompts capabilities (v2)
[agent].inherit_cloud_mcp_servers (issue #753, "feat(acp): allow opting out of inherited cloud MCP connectors via [agent].inherit_cloud_mcp_servers"; independent track)
mcp connect <name> escape hatch (TBD follow-up; Hermes equivalent: /mcp refresh)
Sequencing
Suggested commit order (low → high coupling):
Remove --features mcp flag (mechanical; smallest PR, lands first): b3a310b
Circuit breaker
mcp doctor CLI (uses breaker state; lands last)
May land as a single stacked PR or three separate PRs depending on diff size. Decision deferred until commit 2 size is known.
Test plan
isError: true content does NOT increment failure counter
mcp doctor reports each server's status with actionable remediation
mcp doctor against fully-broken mcp.json produces non-zero exit + readable summary
cargo build works (no dead #[cfg] left after flag removal)
No #[cfg(feature = "mcp")] left in tree
References
docs/adr/openab-agent-mcp.md: §5.9 (circuit breaker), §8 (doctor CLI), §9 (rollout)
Hermes tools/mcp_tool.py (lines 1868-1912 + 2480-2510): circuit breaker reference implementation
🤖 Generated with Claude Code