tracking: openab-agent native MCP — Phase 3 (Resilience) #966

Phase 3 of the openab-agent native MCP client (ADR [`docs/adr/openab-agent-mcp.md`](../blob/main/docs/adr/openab-agent-mcp.md) §5.9 + §8 + §9).

Tracks the final phase of the rollout plan. Phase 1 (foundation) shipped in #959; Phase 2 (network + auth) shipped in #962. Phase 3 closes out the rollout with resilience features and feature-flag retirement.

## Progress

- [x] **Commit 1** — Remove `--features mcp` flag (branch `feat/openab-agent-mcp-resilience`, commit [`b3a310b`](https://github.com/brettchien/openab/commit/b3a310b))
- [ ] **Commit 2** — Circuit breaker (design decisions resolved below — implementation in progress)
- [ ] **Commit 3** — `mcp doctor` CLI

## Scope

### 1. Circuit breaker — Hermes pattern (ADR §5.9)

Per-server backoff/breaker on top of the existing `Failed(_)` status. Prevents repeated connect attempts from hammering a flapping server (and from filling the agent log with the same error 100 times during a tool-call burst).

#### Design questions — RESOLVED 2026-06-01

**Q1. Backoff curve** — fixed cooldown 3-state breaker vs per-attempt exponential backoff.

| Option | Model | Hermes? |
|---|---|---|
| (a) Fixed cooldown 3-state breaker | 3 fails / 30s → Open 60s → HalfOpen 1-probe → Closed/Open | ✅ ADR §5.9 already specs this |
| (b) Per-attempt exponential | Each `Failed` adds growing delay (1s/5s/30s/5min), no state machine | ❌ |

Pros / cons:

- **(a)** ✅ ADR-aligned; clear state semantics (`Closed`/`Open`/`HalfOpen`); single tunable (cooldown 60s); Hermes-proven in prod. ❌ Fixed 60s is uniform — doesn't adapt to short flake vs long outage; long-outage case wakes up + re-trips every cooldown.
- **(b)** ✅ Self-adapts (short blip = short delay, sustained outage = long delay); idiomatic HTTP-retry pattern (AWS SDK, gRPC). ❌ No "Open" concept (every call still attempts, just throttled — harder to surface in `mcp doctor`); more state to track; ADR §5.9 already specs the breaker form — switching means ADR rewrite.

**Decision: (a) Fixed cooldown 3-state breaker.** ADR-aligned and Hermes-validated. If repeated Open trips turn out to be a real signal later, an exponential cooldown can be layered on top in v2.
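
For illustration, the option (a) model can be sketched as a pure transition function. This is a minimal sketch, not the ADR's or Hermes' implementation: the names `State`, `Event`, and `step` are hypothetical, and the handling of the 30s window (a streak older than the window simply starts over) is an assumption about how "3 fails / 30s" would be counted.

```rust
use std::time::{Duration, Instant};

// Thresholds taken from the option (a) row: 3 fails / 30s, Open for 60s.
const FAILURE_THRESHOLD: u32 = 3;
const FAILURE_WINDOW: Duration = Duration::from_secs(30);
const COOLDOWN: Duration = Duration::from_secs(60);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    /// Calls pass through; `streak` is (time of first failure, failure count).
    Closed { streak: Option<(Instant, u32)> },
    /// Calls are rejected until `COOLDOWN` has passed since `since`.
    Open { since: Instant },
    /// One probe call is in flight; its outcome decides the next state.
    HalfOpen,
}

#[derive(Debug, Clone, Copy)]
enum Event {
    /// A call is about to be attempted.
    Attempt,
    /// A call ended in a transport-level failure.
    Failure,
    /// A call succeeded.
    Success,
}

/// Pure transition function: (state, event, time) -> next state.
fn step(state: State, event: Event, now: Instant) -> State {
    match (state, event) {
        // Any success closes the breaker and clears the streak.
        (_, Event::Success) => State::Closed { streak: None },

        // Open with the cooldown elapsed: this attempt becomes the single probe.
        (State::Open { since }, Event::Attempt)
            if now.duration_since(since) >= COOLDOWN =>
        {
            State::HalfOpen
        }

        // A failed probe goes back to Open with a fresh cooldown.
        (State::HalfOpen, Event::Failure) => State::Open { since: now },

        (State::Closed { streak }, Event::Failure) => {
            // Assumption: a streak older than the window starts over at 1.
            let (start, count) = match streak {
                Some((start, count)) if now.duration_since(start) <= FAILURE_WINDOW => {
                    (start, count + 1)
                }
                _ => (now, 1),
            };
            if count >= FAILURE_THRESHOLD {
                State::Open { since: now }
            } else {
                State::Closed { streak: Some((start, count)) }
            }
        }

        // Attempts while Closed or HalfOpen, early attempts while Open, and
        // failures reported while Open leave the state unchanged.
        (state, _) => state,
    }
}
```

Keeping the transition logic free of I/O and clocks (time is passed in as `now`) would let the first three test-plan items be checked with plain unit tests and synthetic timestamps.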

---

**Q2. Counter scope** — single `consecutive_failures` per server vs split connect / tool-call counters.

Pre-discussion clarifier — only **transport-level** failures count. JSON-RPC error responses (`{error: {code, message}}`) and tool `isError: true` content are protocol-normal and must NOT increment.

| Option | Model |
|---|---|
| (a) Single counter | One `consecutive_failures: u32`; connect fail + tool-call transport fail both increment; any success resets; trip on 3 fails / 30s |
| (b) Two counters | `connect_failures` + `tool_call_failures` separate, possibly different thresholds; either-or trip |

Pros / cons:

- **(a)** ✅ Hermes uses this (single `_server_error_counts` dict, validated in prod); "server unhealthy" is one concept; one tunable; remaining failure modes are transport-only so partitioning gives no real signal. ❌ Can't selectively trip earlier on connect-only failures.
- **(b)** ✅ Can tune connect to trip earlier than tool-call (e.g. connect=2 / tool-call=10); `mcp doctor` can surface "5 mid-call disconnects but 0 connect failures = network OK, server unstable mid-call". ❌ Tunable surface doubles; state machine edge cases ("either trips → Open" vs "both must trip"); no prior art — bespoke design beyond Hermes.

**Decision: (a) Single counter.** Failure modes are highly homogeneous after JSON-RPC/`isError` exclusion. Easy to split later (1 → 2) if data shows it's needed; harder to merge back (2 → 1).
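
The counting rule can be sketched as a single classification step. The types below (`CallOutcome`, `ServerHealth`) are hypothetical names for illustration only. One point the text above leaves open is whether a JSON-RPC error or an `isError: true` result should also reset the counter; this sketch assumes they leave it unchanged, which is the narrowest reading of "must NOT increment".

```rust
/// Hypothetical outcome of one MCP interaction, as seen by the client.
#[derive(Debug)]
enum CallOutcome {
    /// The tool ran and returned content with `isError: false`.
    ToolOk,
    /// The tool ran and returned content with `isError: true`.
    ToolError,
    /// The server answered with a JSON-RPC `{error: {code, message}}` object.
    JsonRpcError,
    /// Connect failed, or the transport broke mid-call.
    TransportFailure,
}

/// Single per-server counter, as in option (a).
#[derive(Debug, Default)]
struct ServerHealth {
    consecutive_failures: u32,
}

impl ServerHealth {
    fn observe(&mut self, outcome: &CallOutcome) {
        match outcome {
            // Connect failures and mid-call transport failures share one counter.
            CallOutcome::TransportFailure => self.consecutive_failures += 1,
            // Any success resets.
            CallOutcome::ToolOk => self.consecutive_failures = 0,
            // Protocol-normal errors: never increment. Assumption: they also
            // do not reset, since the text only says they must not count.
            CallOutcome::ToolError | CallOutcome::JsonRpcError => {}
        }
    }
}
```

If a later split into connect and tool-call counters proves necessary, only the `TransportFailure` arm would need to change.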

---

**Q3. Reset trigger** — what generates the half-open probe attempt?

| Option | Behavior |
|---|---|
| (a) Background poll | Periodic background `mcp status` task; Open + cooldown elapsed → next poll = HalfOpen probe |
| (b) Explicit-only | No auto-probe; Open blocks all calls until user runs `mcp connect <name>` |
| (c) Lazy / piggyback | No background task; Open + cooldown elapsed → next user-initiated tool call = HalfOpen probe |

Pros / cons:

- **(a)** ✅ Self-healing without user input; `mcp doctor` can show "last poll: healthy". ❌ Extra tokio task + lifecycle; touches potentially-flaky server on schedule; probe may land on a server mid-restart → false-positive re-Open.
- **(b)** ✅ Zero background traffic; simplest impl. ❌ Not really a circuit breaker — just a threshold-disable; manual recovery for any transient outage = poor UX; Hermes doesn't do this.
- **(c)** ✅ Industry-standard (Hystrix, `circuit-breaker` Rust crate, Hermes all use this); no background task; self-healing; matches ADR §5.9 diagram exactly. ❌ First post-cooldown call sees one error if still down (acceptable — Open state already returned error before).

**Decision: (c) Lazy / piggyback.** Matches Hermes (`tools/mcp_tool.py` lines 2480-2510; quote: *"we let the next call through as a probe. On success the success-path resets the breaker; on failure the error paths bump the count again"*).

Hermes additionally has (i) a manual `/mcp refresh` slash command and (ii) an OAuth-recovery-success auto-close path. Both are deferred: implement the (c) primary path first, then add the manual `mcp connect <name>` escape hatch and the OAuth-recovery close as follow-ups if needed.
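
The piggyback behavior of option (c) can be sketched at the call site. This is a minimal single-threaded sketch, not the Hermes code: `Gate`, `CallError`, `trip`, and `guarded` are hypothetical names, failure counting while Closed is left out, and the "one probe at a time" rule under concurrency is not modeled. The point it shows is that the cooldown is only checked when a caller arrives, so no timer or background task is needed.

```rust
use std::time::{Duration, Instant};

const COOLDOWN: Duration = Duration::from_secs(60);

/// Hypothetical stand-in for per-server breaker state.
#[derive(Debug, Default)]
struct Gate {
    /// `Some(t)` while the breaker is Open (tripped at `t`); `None` while Closed.
    opened_at: Option<Instant>,
}

#[derive(Debug, PartialEq, Eq)]
enum CallError<E> {
    /// Rejected locally; the server was not contacted.
    BreakerOpen { retry_in: Duration },
    /// The call ran and failed at transport level.
    Transport(E),
}

impl Gate {
    /// Called by the failure-counting logic (not shown) when the threshold is hit.
    fn trip(&mut self, now: Instant) {
        self.opened_at = Some(now);
    }

    /// Run `call` through the breaker.
    fn guarded<T, E>(
        &mut self,
        now: Instant,
        call: impl FnOnce() -> Result<T, E>,
    ) -> Result<T, CallError<E>> {
        let probing = match self.opened_at {
            Some(opened_at) => {
                let elapsed = now.duration_since(opened_at);
                if elapsed < COOLDOWN {
                    // Still cooling down: fail fast without touching the server.
                    return Err(CallError::BreakerOpen { retry_in: COOLDOWN - elapsed });
                }
                // Cooldown elapsed: this user-initiated call is the probe.
                true
            }
            None => false,
        };
        match call() {
            Ok(value) => {
                // Success path closes the breaker.
                self.opened_at = None;
                Ok(value)
            }
            Err(err) => {
                if probing {
                    // Failed probe re-arms the cooldown.
                    self.opened_at = Some(now);
                }
                Err(CallError::Transport(err))
            }
        }
    }
}
```

Returning a `retry_in` hint from the fast-fail path is one way the remaining cooldown could later be surfaced by `mcp doctor`; that field is an illustrative addition, not something the ADR specifies.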

### 2. `mcp doctor` CLI (ADR §8)

Interactive diagnostic subcommand. Helps users debug "why won't my server connect" without grepping logs. Behavior follows the ADR §8 design:

- Run `connect()` for each configured server, surface result
- For OAuth servers: check cached token, refresh if expired, report status
- List missing inputs — `client_id` env var unset, `redirect_uri` missing, `device_authorization_endpoint` absent on Custom provider, etc.
- Recommend remediation (`run mcp login X`, `set FOO_CLIENT_ID`, ...)
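
The "missing inputs" and "recommend remediation" steps could be sketched as a pure check that returns findings. Everything here is hypothetical: `OauthConfig`, `Finding`, `missing_inputs`, and `exit_code` are illustrative names, the config fields are not the real `mcp.json` schema, and only two of the listed checks are shown.

```rust
/// Hypothetical slice of one server's OAuth config; field names are illustrative.
struct OauthConfig<'a> {
    server: &'a str,
    /// Name of the env var expected to hold the client id.
    client_id_env: &'a str,
    redirect_uri: Option<&'a str>,
}

#[derive(Debug, PartialEq, Eq)]
struct Finding {
    problem: String,
    remediation: String,
}

/// `env` is injected so the check can be tested without touching the
/// process environment.
fn missing_inputs(cfg: &OauthConfig, env: impl Fn(&str) -> Option<String>) -> Vec<Finding> {
    let mut findings = Vec::new();
    if env(cfg.client_id_env).is_none() {
        findings.push(Finding {
            problem: format!("{}: env var {} is unset", cfg.server, cfg.client_id_env),
            remediation: format!("set {}", cfg.client_id_env),
        });
    }
    if cfg.redirect_uri.is_none() {
        findings.push(Finding {
            problem: format!("{}: redirect_uri missing", cfg.server),
            remediation: format!("add redirect_uri to the {} entry in mcp.json", cfg.server),
        });
    }
    findings
}

/// Non-zero exit when anything is wrong, so scripts can detect a broken config.
fn exit_code(findings: &[Finding]) -> i32 {
    if findings.is_empty() { 0 } else { 1 }
}
```

In a real binary the `env` argument would typically be `|name| std::env::var(name).ok()`, and the result of `exit_code` would be passed to `std::process::exit` after the summary is printed.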

### 3. Remove `--features mcp` flag (ADR §9) — DONE

✅ Shipped in commit [`b3a310b`](https://github.com/brettchien/openab/commit/b3a310b) (branch `feat/openab-agent-mcp-resilience`). Cargo `[features]` block dropped, `rmcp` promoted to unconditional dep, all `#[cfg(feature = "mcp")]` / `#[cfg(not(feature = "mcp"))]` gates removed, CI matrix collapsed to single combo.

## Out of scope

- Per-tool permission gates (post-Phase-3 opt-in flag — see ADR §10 resolved-at-design-time list)
- `resources` / `prompts` capabilities (v2)
- Broker-side `[agent].inherit_cloud_mcp_servers` (issue #753 — independent track)
- Manual `mcp connect <name>` escape hatch (TBD follow-up — Hermes equivalent: `/mcp refresh`)
- OAuth-recovery-success auto-close of breaker (TBD follow-up — Hermes does this, may matter once OAuth refresh races land)

## Sequencing

Suggested commit order (low → high coupling):

1. ✅ Remove `--features mcp` flag (mechanical; smallest PR, lands first) — `b3a310b`
2. Circuit breaker (design decisions above are locked; implementation in progress)
3. `mcp doctor` CLI (uses breaker state — lands last)

May land as a single stacked PR or three separate PRs depending on diff size. Decision deferred until commit 2 size is known.

## Test plan

- [ ] Breaker opens after 3 consecutive transport-level failures within 30s
- [ ] Breaker auto-resets on next call after 60s cooldown (HalfOpen probe success)
- [ ] HalfOpen probe failure re-arms cooldown (back to Open)
- [ ] JSON-RPC error responses do NOT increment failure counter
- [ ] Tool `isError: true` content does NOT increment failure counter
- [ ] `mcp doctor` reports each server's status with actionable remediation
- [ ] `mcp doctor` against fully-broken `mcp.json` produces non-zero exit + readable summary
- [x] `cargo build` works (no dead `#[cfg]` left after flag removal)
- [x] No `#[cfg(feature = "mcp")]` left in tree

## References

- ADR `docs/adr/openab-agent-mcp.md` — §5.9 (circuit breaker), §8 (doctor CLI), §9 (rollout)
- Phase 1 PR #959 (tracking + ADR + foundation)
- Phase 2 PR #962 (network + auth) — closes the network/auth track
- Hermes Agent `tools/mcp_tool.py` (lines 1868-1912 + 2480-2510) — circuit breaker reference implementation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

