feat(openab-agent): MCP Phase 3 — circuit breaker + flag removal (stacks on #959 + #962) by brettchien · Pull Request #969 · openabdev/openab

brettchien · 2026-06-01T16:19:03Z

Discord Discussion URL: https://discord.com/channels/1490643539682529300/1510657283552706663

📚 PR stack — merge in this order:

feat(openab-agent): native MCP client (ADR + implementation) #959 — Phase 1: Foundation (ADR + rmcp + stdio + meta-tool)
feat(openab-agent): MCP Phase 2 — OAuth 2.1 paste-back (stacks on #959) #962 — Phase 2: OAuth 2.1 paste-back (stacks on feat(openab-agent): native MCP client (ADR + implementation) #959)
feat(openab-agent): MCP Phase 3 — circuit breaker + flag removal (stacks on #959 + #962) #969 (this PR) — Phase 3: Resilience ← stacks on feat(openab-agent): MCP Phase 2 — OAuth 2.1 paste-back (stacks on #959) #962, merge last

97 commits appear in the diff because GitHub does not support cross-repo PRs with a non-main base (the upstream feat/openab-agent-mcp / feat/openab-agent-mcp-oauth branches only live on the fork). Once #959 + #962 merge, this branch rebases and the diff collapses to the Phase 3 commits below.

Reviewers can take all three in parallel; merges must serialize.

Closes #966.

Summary

Phase 3 of the openab-agent native MCP client (ADR docs/adr/openab-agent-mcp.md §5.9 + §8 + §9). Adds the circuit-breaker resilience layer, the mcp doctor diagnostic CLI, and retires the --features mcp Cargo flag.

Phase 3 commits (review only these)

commit	content	ADR ref
`b3a310b`	Remove `--features mcp` flag — drop `[features]` block, promote `rmcp` to unconditional dep, strip all `#[cfg(feature = "mcp")]` gates, collapse CI matrix	§9
`26c7d87`	Per-server circuit-breaker module (Hermes pattern: Closed / Open / HalfOpen, fixed 60s cooldown, lazy probe) + 9 unit tests	§5.9
`6772a51`	Wire breaker into `McpRuntimeManager::connect()` + `meta_tool::call_tool` (transport-only counter; NeedsAuth bounces excluded; `isError: true` excluded) + 2 integration tests	§5.9
`bf62732`	`mcp doctor` CLI — per-server env / OAuth / live-connect diagnostic with remediation hints; non-zero exit on any failure	§8

#966 design decisions

Resolved in #966 issue body (2026-06-01); summary:

Q1 — Backoff curve: fixed cooldown 3-state breaker (vs per-attempt exponential). ADR-aligned + Hermes-validated; single tunable (60s).
Q2 — Counter scope: single consecutive_failures per server (vs split connect / tool-call). Hermes uses one counter; failure modes are homogeneous after JSON-RPC / isError exclusion.
Q3 — Reset trigger: lazy / piggyback probe (vs background poll / explicit-only). Industry-standard (Hystrix, Hermes); no background task.

Transport-level only: JSON-RPC error responses + tool isError: true content do not count. NeedsAuth and OAuth refresh-failure paths do not count either (auth-level, not transport).

Test plan

Breaker opens after 3 consecutive transport-level failures
Breaker Reject short-circuits with retry in {n}s hint
HalfOpen probe success closes; failure re-arms cooldown
NeedsAuth bounces do NOT increment counter
cargo test 130 passed, cargo clippy -- -D warnings clean, cargo fmt --check clean
mcp doctor runs per-server diagnostic with actionable remediation hints
mcp doctor exits non-zero when any server fails (composes existing tested primitives — connect() / load_namespaced_token_at / resolved())

Out of scope (deferred follow-ups)

Manual mcp connect <name> escape hatch (Hermes equivalent: /mcp refresh)
OAuth-recovery-success auto-close of breaker
Per-tool permission gates (post-Phase-3 opt-in flag)

🤖 Generated with Claude Code

Adds ADR for in-core rmcp + progressive-disclosure meta-tool, deferred to symmetry with the Skills extension pattern from PR openabdev#955. Memory analysis rules out the sidecar alternative; per-session config refresh replaces file-watcher hot reload to drop ~150 LOC of race-condition hotspot. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ate guardrails - §6.4 flow selection: device-code preferred (matches existing CLI convention), paste-back universal fallback, browser laptop opt-in - §5.2 add login/complete_login meta-tool actions; login returns flow-tagged union - §6.1 spell out TokenStore persistence assumption + cold-start refresh - §6.3 device_authorization_endpoint extension point + RFC 8414 discovery - §3.8 stdio container-image caveat (interpreter required for Node/Python) - §5.7 max_concurrent_servers knob (default 10, see §7 for constrained tuning) - §7 Fargate 512MB/1GB OOM analysis + mitigations Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ntract - §6.4 RFC 8414 discovery disabled by default; opt-in requires oauth.discovery=true + oauth.discovery_allowlist (boot rejects otherwise). Rationale: awsvpc egress + SSRF surface in multi-tenant deployments. - §6.1 add RTR race warning: async persistence layers (Fargate S3 sync, eventually-consistent volumes) must flush new tokens to durable storage before Spot interruption, else cascade-revoke locks the user out. Contract: fsync(2) agent-side + mtime-event-driven sync deployment-side. - §6.3 expose oauth.discovery / oauth.discovery_allowlist on custom providers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Phase 1 foundation slice per ADR §5.4 + §5.6: - Add optional `rmcp 1.7` dep + `mcp` feature flag (default off) - Wire `#[cfg(feature = "mcp")] mod mcp;` into main - New `mcp/config.rs`: `McpConfig` / `ServerConfig` (Stdio | Http) / `ToolFilter` / `OAuthConfig`, global+project layered load, project precedence on name collision, `${env:VAR}` interpolation with per-server error context OAuth fields limited to `provider` + `scopes`; custom-provider endpoints (§6.3) deferred to Phase 2 auth slice to avoid dead schema. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

CI fmt-check found 8 formatting deltas in the Phase 1 scaffold. No logic change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

CI previously only built the default feature set, leaving the Phase 1 MCP scaffold (mcp/config.rs, gated by --features mcp) without compile, clippy, or test coverage. Adds explicit --features mcp invocations and watches workflow file changes so this gap is closed for the rest of the rollout (ADR §9). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Wires up the Phase 1 config loader so clippy --features mcp can see it. McpConfig::load() + ServerConfig::resolved() + serde pretty-print are all reachable from main, clearing the dead-code denial of compile under -D warnings. Output groups successful servers (✓) and failures (✗ with reason), sorted by name for deterministic display. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

sort_by_key over (name, _) is the cleaner form; no behavior change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Per Mira's security review on the Phase 1 thread: the previous 'mcp list' eagerly called ServerConfig::resolved(), substituting \${env:GITHUB_TOKEN} etc. into the printed output. Three leak paths matter — pasting CLI output into bug reports / chat, screen sharing, and stdout log collection. New behavior: - Default: print raw config; \${env:VAR} placeholders kept verbatim. Safe to paste publicly; reader still sees which env var feeds each field. - --resolve opts into substitution and prints a two-line warning banner. Useful for diagnosing missing-env startup failures. No CLI-shape break: 'mcp list' still works; --resolve is additive. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

println! puts the warning banner into stdout, which (a) gets swallowed by 'mcp list --resolve > dump.json' redirection so the user never sees the security notice, and (b) corrupts the JSON payload for downstream pipes like 'mcp list --resolve | jq'. Routing the banner through eprintln! keeps it visible regardless of redirection and keeps stdout pure JSON for piping. Standard Unix convention: data on stdout, diagnostics on stderr. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Introduces McpRuntimeManager owning one ServerHandle per configured server, each starting in ServerStatus::Disconnected per ADR §5.7 (lazy connect). Wires the manager via `mcp status` CLI so the types are exercised by clippy --features mcp; actual rmcp TokioChildProcess dial + Connected / Failed transitions land in the next slice to keep that risky bit isolated for bisecting.

`clippy --features mcp -- -D warnings` doesn't compile the test target, so `len()`'s only callers (the unit tests) didn't keep it alive. Tests already used `statuses().len()` in one place — switch the other two to match and drop the now-dead method. `is_empty()` stays because `cli_show_status` calls it.

…ct() ADR §5.7 lifecycle needs a `connect()` method that spawns a child process and awaits an rmcp handshake — both `Send` across `.await`. Plain `HashMap<String, ServerHandle>` is not `Sync`, and the background idle- eviction task will share the map with the foreground `mcp call` path, so the read-heavy / write-light access pattern wants `tokio::sync::RwLock`. This slice lands the lock migration only: - `handles: Arc<RwLock<HashMap<String, ServerHandle>>>` - `statuses()` and `is_empty()` become async; `cli_show_status` follows - `McpRuntimeManager` is now `Clone` (Arc bump) so the eviction task can hold its own handle - `connect(name)` transitions to `Connecting` and returns; the actual `rmcp::TokioChildProcess` dial + `Connected` / `Failed` transitions land in the next slice — keeping that bit isolated for bisecting 6 unit tests cover snapshot ordering, unknown-server error, transition, and clone-shares-state.

The chained `.statuses().await.into_iter().map(|(n, _)| n).collect()` fits on one line under rustfmt's default 100-char width — broke it preemptively last tick on the wrong side of the threshold.

`clippy --features mcp -D warnings` doesn't compile the test target, so `connect()`'s only callers (the unit tests) don't keep it alive. The real rmcp dial lands in the next slice and will be reachable from the meta-tool dispatch / `mcp call` CLI path; until then a targeted allow keeps clippy quiet without papering over the type-level dead_code on ServerStatus variants (which is already explicitly allowed).

connect() now spawns the configured stdio child via rmcp's TokioChildProcess, runs the JSON-RPC handshake (().serve(transport)), and parks the resulting RunningService on ServerHandle.client. Lock discipline uses a double-acquire: a brief write lock to mark Connecting, drop, run the dial without any lock held, then a second brief write lock to install the client or record Failed(msg). Holding the write lock across the spawn+handshake would block mcp status and the future idle-eviction scan for the entire dial latency. HTTP transport is rejected with a "phase 2" error so misconfigured entries surface cleanly without advancing state past Disconnected. Adds `openab-agent mcp connect <name>` as a smoke-test CLI for mcp.json entries; the RunningService is dropped on process exit. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…rver Per Mira's Tick 13 review: when two callers race on the same name, both pass the first-lock idempotency check (Disconnected), both spawn a child, both come back to acquire the install-lock. The second arrival must yield to the winner, otherwise it overwrites the installed client and silently drops the working RunningService. Adds a 3-line double-check at the second lock acquisition: if status is already Connected with a live client, return Ok(()) — `dial_result` falls out of scope here and RunningService::Drop kills the duplicate child. Cheaper than a Connecting-aware notify/condvar and adequate until the eviction-aware slice lands. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Three rustfmt collapses that I missed under default 100-char width: * import order: bare `rmcp::ServiceExt` sorts AFTER `rmcp::transport::*` (rustfmt puts sub-module paths before bare items within a crate). * `return Err(anyhow!(...))` for the HTTP phase-2 message fits on one line (~88 chars). * `let Self { command, args, env } = self;` fits on one line. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…patch Adds the single `mcp` tool the LLM sees (ADR §5.2). This slice lands: * `Action` enum tagged on `action`, matching the eight ADR actions, deserialized straight from the LLM's tool-call payload. * `dispatch(manager, action) -> Result<Value>` as the single entry point — both `agent.rs::execute_tool` (next slice) and tests go through this. * `help` returns the static usage doc. * `list_servers` returns `[{name, status, transport}]` via a new `McpRuntimeManager::snapshot()` that clones out under a read lock. * The four IO-bearing actions (`list_tools`, `describe_tool`, `call`, `status`) return a `not yet implemented` error so the surface area is visible to callers without pre-implementing the peer-borrow path. `login` / `complete_login` land with the Phase 2 OAuth slice. `#[allow(dead_code)]` on `Action` and `dispatch` because clippy --features mcp doesn't compile the test target and agent.rs wire-up lands next slice. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…LLM fallback hint Two of Mira's three Tick 15 findings (Arc-wrap for peer borrow lands with the actual list_tools/call slice — Arc without a consumer earns no keep). (1) `not_implemented_msg(action)` names the requested action, lists the actions that DO work (`help`, `list_servers`), and tells the LLM to fall back to native `read` / `write` / `edit` / `bash` rather than retrying. The old generic "not yet implemented" gave the model nothing to act on. (2) `ServerConfig::transport_label() -> &'static str` replaces the config clone inside `snapshot()`. The `Stdio` variant carries an `args: Vec<String>` and `env: HashMap<String, String>`; cloning that just to read the variant tag was wasted heap traffic on every `list_servers` call. `snapshot()` now yields `(String, ServerStatus, &'static str)`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

rustfmt's fn_call_width default (60) splits assert!() args when the inline args exceed it; the new "read, write, edit, bash" fallback assert tipped the third call from ~55 to ~62 chars. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Apply Mira's Tick 15 review (3) and ship the first IO-bearing meta-tool action. * `ServerHandle.client` is now `Option<Arc<RunningService<...>>>`. `connect()` wraps the dial result in `Arc::new`. * `McpRuntimeManager::arc_peer(name)` clones the Arc out under a short read lock, drops the guard, returns the handle. Callers `.await` on `peer.list_all_tools()` / `peer.call_tool()` with no runtime lock held — kills the writer-starvation risk Mira flagged and sidesteps `Future is not Send` from holding a guard across `.await`. * `Action::ListTools { server }` now wired: lazy `connect()` → `arc_peer()` → `peer.list_all_tools()` → `[{name, description}]`. `describe_tool` / `call` / `status` still stub. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Apply Mira's Tick 18 review (3) — lenient args parsing for the `call` action so LLMs that send `arguments: null` (or omit the field entirely) for no-arg tools don't bounce off a strict type check. * `Action::Call { server, tool, arguments }` is wired through `call_tool()`: validate arguments → lazy `connect()` → `arc_peer()` → `peer.call_tool(CallToolRequestParams::new(...).with_arguments(map))` → serialize `CallToolResult` to JSON. * Argument coercion: `Value::Object(map) → map`, `Value::Null → {}`, everything else (string, number, array, bool) is rejected with a message that names the actual type so the LLM can correct itself. * Tests cover both branches: non-object string args fail early at validation; null args pass validation and fail later at connect. * `describe_tool` and `status` still stub; supported-list message updated to include `call`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The CallToolRequestParams chain fits in exactly 100 chars; rustfmt keeps it on one line. Burned a tick on a defensive break. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… 1 surface describe_tool returns full input_schema for one tool (list+filter via list_all_tools — MCP spec has no single-tool query). status reports per-server {name, status, transport, last_error} with optional name filter; surfaces Failed's embedded error as last_error. Extract fetch_tools helper shared by list_tools + describe_tool — this is the natural insertion point for the planned tools_cache.

…in_width

Surfaces the `mcp` action enum to the LLM via a new ToolDef and routes calls through McpRuntimeManager when configured. AcpServer constructs the manager once at startup (warn-and-default on config parse failure) and clones the cheap Arc-backed handle into each session's Agent. Tools.rs stays stateless and feature-flag-free — the mcp routing arm lives in Agent::execute_tool_call where it has access to the manager. A shared MCP_TOOL_NAME const keeps mcp_tool_def() and the dispatch arm in sync.

The `mut` on `tools` and `base` is consumed only inside `#[cfg(feature = "mcp")]` blocks, so the no-feature build saw them as unused. Gate `allow(unused_mut)` behind `cfg_attr(not(feature = "mcp"), ...)` so both builds stay clean.

`test_session_new` and `test_session_new_missing_key` both mutate `ANTHROPIC_API_KEY`. cargo runs tests in parallel, so a `set` from one thread can be observed by the other before the first reads it. The pre-existing race was timing-tight enough to pass historically; adding `mcp::load_runtime_or_warn()` to `AcpServer::new()` widened the gap between set and read enough to flip the race deterministic. Guarding both tests with a shared `Mutex<()>` serializes them without adding a `serial_test` dep. Poisoned guard is recovered via `into_inner()` — we don't care if a prior assertion panicked.

…_status one-pass Simplify pass on 9bca3de: - auth.rs: PENDING_PREFIX demoted to private; pending_key() promoted to pub fn as the single public constructor. runtime.rs uses it instead of formatting against the const, so the namespace literal lives in exactly one place. - mcp/mod.rs: cli_show_status folds the orphan calculation into the print loop via pending.remove(name) — the leftover set IS the orphans, dropping the second configured: HashSet<&str> pass. - mcp/mod.rs: drop the cli_login "Security note" docstring — the same rationale lives on the Login clap subcommand doc in main.rs, which is the user-facing surface. - runtime.rs: trim the spawn_blocking rationale on pending_logins — moved next to list_pending_logins_at where the sync read actually lives. Net -10 LOC across 3 files; ~/ops/openab-agent-ci.sh --quick green.

Phase 2 / ADR §6.5: HTTP+oauth servers now dial with their cached access_token when one is present and not within REFRESH_SKEW_SECONDS of expiry. Refresh-token rotation for the expired-but-recoverable case is the next slice — for now, expired tokens fall through to NeedsAuth alongside missing tokens, so `mcp login` stays the single user-actionable path. - auth::is_expired now pub(crate) — single source of truth for the expiry+skew check across codex (get_valid_token) and mcp (connect). - Dial::Http carries Option<String> auth; Dial::run uses StreamableHttpClientTransportConfig::with_uri(url).auth_header(token) when present, falls back to from_uri for anonymous HTTP. - connect() oauth branch tries load_namespaced_token_at first; the Ok+!expired arm builds an authenticated Dial, all other arms bounce to NeedsAuth. Tests: - connect_oauth_with_valid_cached_token_attempts_dial_not_needs_auth proves the bearer-injection path reaches handshake (fails at the hermetic 127.0.0.1:1 dead address, surfaces Failed not NeedsAuth). - connect_oauth_with_expired_token_bounces_to_needs_auth pins the expiry guard against future refresh-flow regressions.

Phase 2 / ADR §6.6: HTTP+oauth servers with an expired cached token now attempt RFC 6749 §6 refresh-grant before bouncing to NeedsAuth. On success the rotated TokenStore is fsynced and the dial proceeds with the new bearer; on failure the bounce is preserved so `mcp login` stays the user-actionable path. - post_token_refresh: sibling of post_token_exchange for grant_type= refresh_token (public client, no client_secret). - try_refresh_oauth_token: resolves client_id from current config (so a rotated catalog entry is picked up), POSTs the refresh, builds a new TokenStore preserving the previous refresh_token if the provider omits one (Google-style rotation), persists via save_namespaced_token_at. - DialPlan two-phase enum lets the inside-lock branch tag "needs refresh" and defer the async refresh to outside the write lock — slow refresh must not block concurrent `mcp status` reads. Tests: - connect_oauth_expired_no_refresh_token_bounces_to_needs_auth pins the empty-refresh short-circuit (no POST attempted). - connect_oauth_expired_with_refresh_token_failed_refresh_bounces_to_needs_auth proves the refresh path runs and that any failure surfaces as user-actionable NeedsAuth (current fixture's custom provider plus dead token endpoint both fail the refresh). Custom-provider refresh shares the same gap as custom-provider login — resolve_paste_client errors before the refresh POST runs. Will be unblocked by the same future slice that wires custom-provider login.

…ected winner Mira + Kirin caught a race: two concurrent connect() calls on the same expired-token server can interleave so one wins the refresh + dial while the other's refresh fails — and the loser was unconditionally overwriting `Connected` with `NeedsAuth`, giving the user a false "please log in" signal on a healthy session. Guard the status mutation with `!matches!(h.status, Connected)` so a concurrent winner's terminal state is preserved.

Mirrors finish_login (oauth.rs:271) so a hostile/buggy provider returning a huge expires_in can't trigger a debug-build overflow panic in the refresh-token rotation path.

…d refresh post_token_exchange and post_token_refresh diverged only in the form payload — same Client::builder() boilerplate, same status/body folding, same response shape. Extract post_token_form so each grant call site is just the RFC 6749 form tuple + a grant_label for the error context. Also drops a narrating comment in connect() that recapped the three match arms below it without adding any non-obvious context.

Adds the §3.1 device authorization request + §3.4 polling tick HTTP helpers — pure data plumbing, no runtime wiring yet. The polling-tick response classifier is split out as a pure fn (classify_device_poll) so the §3.5 error-code → flow-state mapping (authorization_pending, slow_down, access_denied, expired_token) is unit-testable without a mock HTTP server. DeviceAuthResponse defaults interval to 5s per RFC 8628 §3.5 when the provider omits it. verification_uri_complete (§3.3.1 extension) is Option since verification_uri + user_code is the always-present fallback. Both POST fns + the two types are #[allow(dead_code)] — next slice wires start_device_login() and the polling loop into McpRuntimeManager.

…loop) - `start_device_login(name)`: POST §3.1 + spawn detached §3.4 polling task - `DeviceLoginStart`: public user-facing bundle (user_code / verification_uri / verification_uri_complete / expires_in) - `resolve_device_client`: Custom-provider-only gate with explicit error for missing `device_authorization_endpoint` / `client_id` - `run_device_poll_loop`: deadline check + SlowDown back-off (+5s) + success persistence + NeedsAuth on terminal failure - `finalize_device_login` / `mark_device_login_failed`: status transitions (Disconnected on success keeps MCP handshake out of detached task; next `connect()` picks up the cached token via the oauth-aware DialPlan) - 4 gating tests: unknown server / stdio / missing device endpoint / HTTP reach #[allow(dead_code)] on the new surface until the next slice wires `mcp login --device` CLI. Quick gate green.

- main.rs: add `--device` flag to `McpAction::Login` (conflicts_with `paste`) - mcp/mod.rs: `cli_login_device(name)` — prints verification_uri_complete (§3.3.1) when present, falls back to `verification_uri` + `user_code`; watches `statuses()` until Disconnected (success) / NeedsAuth (terminal failure) or the provider-declared `expires_in` deadline elapses - runtime.rs: drop `#[allow(dead_code)]` on the Tick-55 device-flow chain (`start_device_login` is now called from the CLI surface)

- config.rs: add `oauth.redirect_uri: Option<String>` to `OAuthConfig` for the paste-back branch on custom providers (built-ins pin their callback in `ProviderSpec`; ignored by the device-code branch) - oauth.rs: propagate `redirect_uri` through `ResolvedProvider::Custom` - runtime.rs: in `resolve_paste_client`, accept Custom + client_id + redirect_uri; emit precise per-missing-field error otherwise: - device endpoint present → "use device flow" (preserved) - missing client_id → "requires `oauth.client_id`" - missing redirect_uri → "requires `oauth.redirect_uri` (must match the redirect URL pre-registered with the provider)" - Tests: rename `_rejects_custom_provider_for_now` → `_rejects_custom_without_redirect_uri`; add `_rejects_custom_without_client_id` + success-path `_custom_with_client_id_and_redirect_uri_succeeds` Closes the last Phase 2 paste-back functional gap. ADR §6.3 enumerates optional custom fields but doesn't currently list `redirect_uri`; follow-up ADR amendment lands separately.

Adds the field to the §6.3 config-shape list and a follow-up paragraph explaining why custom paste-back requires it (no local listener; must match provider OAuth app registration) and why the device-code / built-in branches ignore it. Follow-up to 84837ee (custom-provider paste-back support); Mira + Kirin dual sign-off requested this as a non-blocking docs amendment.

- auth.rs: convert free function `is_expired(&TokenStore) -> bool` into `impl TokenStore { pub fn is_expired(&self) -> bool }`. Switch the skew addition to `saturating_add` for parity with the rest of the OAuth arithmetic (defends `u64::MAX` "never expires" sentinel + any near-overflow clock value). - auth.rs: fold `show_status`'s inline duplicate of the expiry formula into the new method. - mcp/runtime.rs: drop the free-function import + update the cached-token DialPlan branch to `store.is_expired()`. - Tests: rewrite the three existing `is_expired` cases to method form and add `_sentinel_u64_max` covering the saturating_add behavior. Mira Tick 50 non-blocking cleanup.

Drop stale `#[allow(dead_code)]` markers now that runtime/flow wire all items through prod paths. Reuse `reqwest::Client` across the device-poll loop so TLS / TCP handshakes are amortized across the dozens-to-hundreds of polls a 30-minute device-flow window can produce. Trim session-narrative parentheticals from doc comments. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Reviewer-pass findings: - Drop dead auth.rs wrappers (`load/save/remove_namespaced_token`) that were never wired; runtime/tests use the `_at` siblings directly. - Extract `build_token_store` helper to dedupe the `TokenStore` construction + `u64::MAX` sentinel + refresh-rotation logic across `finish_login`, `finalize_device_login`, and `try_refresh_oauth_token`. - Preserve underlying error context in `complete_login` via `.with_context()` instead of `.map_err(|_| anyhow!(...))`. - Trim session-narrative parentheticals ("Tick N", "Mira's review") from doc comments; keep the technical reason that future readers need. - Refresh stale doc strings that still said "next slice" for behaviour Phase 2 itself ships. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Concurrent connect() callers may each hit `try_refresh_oauth_token` with the same stale `refresh_token` (no per-server single-flight gate yet). The connect() race guard prevents state corruption; the worst case is N duplicate POSTs to the token endpoint. A Phase 3 follow-up will add per-server single-flight for providers that cascade-revoke on replayed refresh tokens. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Align ADR with implementation: the device-code polling loop persists tokens then sets status to Disconnected, so the next connect() reads the cached token via the oauth-aware dial path. Keeps the rmcp handshake out of the detached polling task. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Track the most-recent device-poll JoinHandle per server in `McpRuntimeManager.device_login_tasks`. On a fresh `start_device_login` the prior task is aborted before the new one is registered, so a retry after a transient failure can't leave two loops racing to finalize the same server (would risk double-persisting tokens / contradictory status transitions). Phase 2 follow-up openabdev#4 of 4. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Add `refresh_locks: HashMap<server, Arc<tokio::Mutex>>` and acquire the per-server lock before the refresh POST. After taking the lock, re-read the on-disk token; if a prior waiter already refreshed, return the cached store without a second POST. Closes the cascade-revoke window on providers (Google, ...) that invalidate every issued refresh_token when the same one is replayed concurrently. Phase 2 follow-up openabdev#1 of 4. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Replace the truncate+write+sync_all path with: write same-dir tmp (PID + atomic counter for uniqueness, mode 0o600 via create_new), fsync the tmp, rename(2) onto auth.json, fsync the parent directory. On crash, the file is either the prior version or the new version — never half-written. The old implementation truncated `auth.json` before writing, so a mid-write crash left a corrupt file that the next startup would fail to parse. On the ECS deployment that then re-restored a now-revoked refresh_token from S3 (ADR §6.1). Phase 2 follow-up openabdev#3 of 4. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

MCP is now an unconditional dep. Drops the `[features]` block in Cargo.toml, promotes `rmcp` from optional to required, strips every `#[cfg(feature = "mcp")]` / `#[cfg(not(feature = "mcp"))]` gate in src/{acp,agent,auth,main}.rs, and drops the no-mcp combo from the CI matrix (both .github/workflows/ci-openab-agent.yml and the local mirror script). Per ADR `docs/adr/openab-agent-mcp.md` §9 Phase 3 — the intermediate `default = ["mcp"]` step is skipped in favor of going straight to flag removal (decision recorded in tracking issue openabdev#966). Refs openabdev#966 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds `mcp/breaker.rs` per ADR §5.9 and the design decisions resolved in openabdev#966: - Q1 (a) Fixed cooldown 3-state breaker (Closed / Open / HalfOpen) - Q2 (a) Single consecutive-failure counter per server - Q3 (c) Lazy / piggyback probe (next call after cooldown = probe) API surface: pub enum Verdict { Allow, AllowProbe, Reject { retry_in_secs } } pub struct ServerBreaker { ... } impl ServerBreaker { pub fn new() -> Self; pub fn check(&self, server: &str) -> Verdict; pub fn record_success(&self, server: &str); pub fn record_failure(&self, server: &str); } Constants: `FAIL_THRESHOLD = 3`, `COOLDOWN = 60s`. Matches Hermes `tools/mcp_tool.py` (lines 1868-1912 + 2480-2510). ADR §5.9 mentions "3 fails in 30s" but Hermes itself uses pure consecutive count with no time window — going Hermes-simple here (any success resets). `#![allow(dead_code)]` carries the API surface until the next slice wires it into `McpRuntimeManager` (per the runbook's stub-without- prod-caller convention). Unit tests cover: unknown-server pass-through, under-threshold pass, threshold-trip, success-resets-count, cooldown-elapsed-allows-probe, probe-failure-rearms-cooldown, probe-success-closes-breaker, per- server isolation, retry_in_secs floor (1s when cooldown almost done). Refs openabdev#966 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Connects the per-server `ServerBreaker` introduced in 26c7d87 to the `McpRuntimeManager` connect path and the `meta_tool::call_tool` dispatch path (ADR §5.9 + openabdev#966 design decisions). - `connect()`: check breaker after the already-connected fast path; `Reject` short-circuits with a `retry in {n}s` hint. NeedsAuth bounces (missing/expired token, refresh failure) do NOT touch the breaker — those are auth-level, not transport-level. - Dial outcome at the tail of `connect()` records success/failure on the breaker, matching the single-counter / transport-only model resolved in openabdev#966 Q2. - `meta_tool::call_tool` wraps `peer.call_tool()`: wire-level `Err` = transport failure → trips the breaker; wire-level `Ok` (regardless of `CallToolResult.is_error`) resets it. `isError: true` content is protocol-normal payload, not a transport fault. - New `record_tool_call_outcome(name, ok)` is the only public surface the meta_tool path uses; the breaker stays private inside the manager. Tests: 2 new integration tests in `mcp::runtime::tests` — breaker_opens_after_threshold_consecutive_connect_failures (dead-port pattern, hermetic) and breaker_does_not_count_oauth_needs_auth_bounces (NeedsAuth isolation). Existing 9 breaker unit tests still pass.

Closes the Phase 3 scope from openabdev#966 — interactive diagnostic that runs a live `connect()` against every configured server, with three checks per server (short-circuit on first ✗): 1. Config resolution — `${env:VAR}` placeholders resolve against the live process env. Missing vars are flagged with the offending name and a "set the var(s)" hint. 2. OAuth state (HTTP + `oauth:` only) — cached `TokenStore` in `auth.json`. Missing → `mcp login <name>` hint; expired → noted but not fatal (connect will attempt refresh). 3. Live connect — `manager.connect(name).await`. Errors surface verbatim, including the §5.9 circuit-breaker `retry in {n}s` hint when the breaker is open. Exits non-zero when any server fails so CI / scripts can wrap with `openab-agent mcp doctor || alert`. Reuses the existing `McpRuntimeManager` and `load_namespaced_token_at` primitives — no new runtime state, no new fields on `McpRuntimeManager`. Wired into the existing `mcp` subcommand dispatch in main.rs alongside `list` / `status` / `connect` / `login`.

…F6/F9/F10 + F2/F7 doc) Code changes: - F6: gate `load_runtime_or_warn()` on `OPENAB_AGENT_MCP={1,true,yes,on}` env var. Compile-time feature flag is being removed in Phase 3 PR openabdev#969; this adds the runtime half so MCP stays dormant unless explicitly opted in, preventing accidental activation when mcp.json is present. - F4: integration test in `mcp::runtime::tests` exercising two concurrent `connect()` tasks against a guaranteed-failure server, asserts no stuck-Connecting state and that follow-up dials still attempt fresh. - F5: defensive guard in `Agent::execute_tool_call` for `mcp` tool calls when manager is unavailable — returns actionable error instead of silently falling through to fs-tool dispatch. - F9: regression test for missing-env-var error path; asserts the offender variable name + server name surface in the chained error message. - F10: `truncate_context` drain bounds clarified — `3.min(len)` instead of `(1 + 2).min(len)`, with comment explaining the preserved first message. Doc changes (ADR §5.4.1 added): - F2: rationale for `env_clear()` baseline + explicit `env:` allowlist — blocks token leakage through ambient env, matches MCP SDK behavior. - F7: rationale for single shared `McpRuntimeManager` per process — connection pooling, deduplicated handshakes, lifecycle owned by Agent.

shaun-agent · 2026-06-01T18:01:42Z

OpenAB PR Screening

This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Click 👍 if you find this useful. Human review will be done within 24 hours. We appreciate your support and contribution 🙏

Title: feat(openab-agent): MCP Phase 3 — circuit breaker + flag removal (stacks on feat(openab-agent): native MCP client (ADR + implementation) #959 + feat(openab-agent): MCP Phase 2 — OAuth 2.1 paste-back (stacks on #959) #962)
Source: feat(openab-agent): MCP Phase 3 — circuit breaker + flag removal (stacks on #959 + #962) #969
Status: moved to PR-Screening
Generated at: 2026-06-01T18:02:23.876Z
Discord thread: not available

Screening report

posted screening comment: https://github.com//pull/969#issuecomment-4595186764 project action: moved `PVTI_lADOEFbZWM4BUUALzguZj-g` to `PR-Screening` on https://github.com/orgs/openabdev/projects/1

Intent

Ship Phase 3 of the native openab-agent MCP client: resilience and operational diagnostics. The operator-visible problem is that MCP servers need predictable failure isolation, diagnostics, and a no-flag default path before this client can be treated as production-ready.

Feat

Feature work with some release-cleanup behavior. It adds a per-server MCP circuit breaker, wires transport-failure tracking into connect/tool-call paths, adds mcp doctor diagnostics with remediation hints, and removes the --features mcp Cargo flag so MCP support is unconditional.

Who It Serves

Primary beneficiaries are agent runtime operators and deployers. Maintainers also benefit from a clearer default build and from diagnostics that reduce support ambiguity when MCP configuration, OAuth, or live connectivity fails.

Rewritten Prompt

Implement MCP Phase 3 for openab-agent on top of #959 and #962. Add a per-server three-state circuit breaker with fixed 60-second cooldown and lazy probe behavior. Count only transport-level failures. Exclude auth bounces, OAuth refresh failures, JSON-RPC application errors, and tool isError: true results. Add tests for thresholds, rejection hints, half-open transitions, and auth exclusions. Add mcp doctor diagnostics with actionable remediation and non-zero failure exit. Remove the MCP feature flag and simplify CI. Rebase after #959 and #962 land so the diff contains only Phase 3.

Merge Pitch

This should move forward after stack dependencies are reviewed because it turns MCP from happy-path integration into something operators can debug and recover. Risk is moderate-high due to stacked diff noise and unconditional MCP build impact. Main reviewer concern: breaker failure classification.

Best-Practice Comparison

OpenClaw partially applies: isolated failures and retry/backoff patterns are relevant, but durable job persistence, delivery routing, and run logs are not introduced here.

Hermes Agent directly applies. The PR mirrors its three-state breaker model: fixed cooldown, lazy half-open probe, one per-server counter, and no background polling. Good fit.

Implementation Options

Conservative: keep MCP behind the feature flag for one release; merge breaker and doctor only.
Balanced: accept full Phase 3 after feat(openab-agent): native MCP client (ADR + implementation) #959/feat(openab-agent): MCP Phase 2 — OAuth 2.1 paste-back (stacks on #959) #962, then require rebase before review.
Ambitious: include persisted breaker state, structured run logs, and manual refresh now.

Comparison Table

Option	Speed	Complexity	Reliability	Maintainability	User Impact	Fit for OpenAB now
Conservative	Fastest	Low	Medium	Good short-term	Smaller	Acceptable
Balanced	Moderate	Medium	High	Best	Strong	Best fit
Ambitious	Slow	High	Potentially highest	Risky	Strong but delayed	Poor fit

Recommendation

Use the balanced path. Merge/review #959 and #962 first, then require #969 to rebase so Phase 3 is reviewable on its own. Main gates: breaker classification, mcp doctor exit behavior, and unconditional dependency/CI changes.

brettchien and others added 30 commits May 31, 2026 15:30

style(openab-agent/mcp): apply cargo fmt to config.rs

3d58865

CI fmt-check found 8 formatting deltas in the Phase 1 scaffold. No logic change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

fix(openab-agent/mcp): satisfy clippy::unnecessary_sort_by

82c43e5

sort_by_key over (name, _) is the cleaner form; no behavior change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

fix(openab-agent/mcp): satisfy cargo fmt --check on runtime test

dc1b0a6

The chained `.statuses().await.into_iter().map(|(n, _)| n).collect()` fits on one line under rustfmt's default 100-char width — broke it preemptively last tick on the wrong side of the threshold.

fix(openab-agent/mcp): collapse let-params chain to satisfy cargo fmt

2a859e6

The CallToolRequestParams chain fits in exactly 100 chars; rustfmt keeps it on one line. Burned a tick on a defensive break. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

fix(openab-agent/mcp): break dispatch().await.unwrap() chain past cha…

f36fb77

…in_width

fix(openab-agent/mcp): use is_none_or per clippy::unnecessary_map_or

6a3e9e0

brettchien and others added 22 commits June 1, 2026 10:34

fix(openab-agent/mcp): saturating_add for refresh expires_at

1e4863f

Mirrors finish_login (oauth.rs:271) so a hostile/buggy provider returning a huge expires_in can't trigger a debug-build overflow panic in the refresh-token rotation path.

This was referenced Jun 1, 2026

feat(openab-agent): native MCP client (ADR + implementation) #959

Open

feat(openab-agent): MCP Phase 2 — OAuth 2.1 paste-back (stacks on #959) #962

Open

brettchien marked this pull request as ready for review June 1, 2026 16:25

brettchien requested a review from thepagent as a code owner June 1, 2026 16:25

github-actions Bot added the pending-maintainer label Jun 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(openab-agent): MCP Phase 3 — circuit breaker + flag removal (stacks on #959 + #962)#969

feat(openab-agent): MCP Phase 3 — circuit breaker + flag removal (stacks on #959 + #962)#969
brettchien wants to merge 98 commits into
openabdev:mainfrom
brettchien:feat/openab-agent-mcp-resilience

brettchien commented Jun 1, 2026 •

edited

Loading

Uh oh!

shaun-agent commented Jun 1, 2026 •

edited

Loading

Intent

Feat

Who It Serves

Rewritten Prompt

Merge Pitch

Best-Practice Comparison

Implementation Options

Comparison Table

Recommendation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

brettchien commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Phase 3 commits (review only these)

#966 design decisions

Test plan

Out of scope (deferred follow-ups)

Uh oh!

shaun-agent commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

OpenAB PR Screening

Intent

Feat

Who It Serves

Rewritten Prompt

Merge Pitch

Best-Practice Comparison

Implementation Options

Comparison Table

Recommendation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

brettchien commented Jun 1, 2026 •

edited

Loading

shaun-agent commented Jun 1, 2026 •

edited

Loading