Skip to content

feat(openab-agent): MCP Phase 3 — circuit breaker + flag removal (stacks on #959 + #962)#969

Open
brettchien wants to merge 98 commits into
openabdev:mainfrom
brettchien:feat/openab-agent-mcp-resilience
Open

feat(openab-agent): MCP Phase 3 — circuit breaker + flag removal (stacks on #959 + #962)#969
brettchien wants to merge 98 commits into
openabdev:mainfrom
brettchien:feat/openab-agent-mcp-resilience

Conversation

@brettchien
Copy link
Copy Markdown
Contributor

@brettchien brettchien commented Jun 1, 2026

Discord Discussion URL: https://discord.com/channels/1490643539682529300/1510657283552706663


📚 PR stack — merge in this order:

  1. feat(openab-agent): native MCP client (ADR + implementation) #959 — Phase 1: Foundation (ADR + rmcp + stdio + meta-tool)
  2. feat(openab-agent): MCP Phase 2 — OAuth 2.1 paste-back (stacks on #959) #962 — Phase 2: OAuth 2.1 paste-back (stacks on feat(openab-agent): native MCP client (ADR + implementation) #959)
  3. feat(openab-agent): MCP Phase 3 — circuit breaker + flag removal (stacks on #959 + #962) #969 (this PR) — Phase 3: Resilience ← stacks on feat(openab-agent): MCP Phase 2 — OAuth 2.1 paste-back (stacks on #959) #962, merge last

97 commits appear in the diff because GitHub does not support cross-repo PRs with a non-main base (the upstream feat/openab-agent-mcp / feat/openab-agent-mcp-oauth branches only live on the fork). Once #959 + #962 merge, this branch rebases and the diff collapses to the Phase 3 commits below.

Reviewers can take all three in parallel; merges must serialize.

Closes #966.


Summary

Phase 3 of the openab-agent native MCP client (ADR docs/adr/openab-agent-mcp.md §5.9 + §8 + §9). Adds the circuit-breaker resilience layer, the mcp doctor diagnostic CLI, and retires the --features mcp Cargo flag.

Phase 3 commits (review only these)

commit content ADR ref
b3a310b Remove --features mcp flag — drop [features] block, promote rmcp to unconditional dep, strip all #[cfg(feature = "mcp")] gates, collapse CI matrix §9
26c7d87 Per-server circuit-breaker module (Hermes pattern: Closed / Open / HalfOpen, fixed 60s cooldown, lazy probe) + 9 unit tests §5.9
6772a51 Wire breaker into McpRuntimeManager::connect() + meta_tool::call_tool (transport-only counter; NeedsAuth bounces excluded; isError: true excluded) + 2 integration tests §5.9
bf62732 mcp doctor CLI — per-server env / OAuth / live-connect diagnostic with remediation hints; non-zero exit on any failure §8

#966 design decisions

Resolved in #966 issue body (2026-06-01); summary:

  • Q1 — Backoff curve: fixed cooldown 3-state breaker (vs per-attempt exponential). ADR-aligned + Hermes-validated; single tunable (60s).
  • Q2 — Counter scope: single consecutive_failures per server (vs split connect / tool-call). Hermes uses one counter; failure modes are homogeneous after JSON-RPC / isError exclusion.
  • Q3 — Reset trigger: lazy / piggyback probe (vs background poll / explicit-only). Industry-standard (Hystrix, Hermes); no background task.

Transport-level only: JSON-RPC error responses + tool isError: true content do not count. NeedsAuth and OAuth refresh-failure paths do not count either (auth-level, not transport).

Test plan

  • Breaker opens after 3 consecutive transport-level failures
  • Breaker Reject short-circuits with retry in {n}s hint
  • HalfOpen probe success closes; failure re-arms cooldown
  • NeedsAuth bounces do NOT increment counter
  • cargo test 130 passed, cargo clippy -- -D warnings clean, cargo fmt --check clean
  • mcp doctor runs per-server diagnostic with actionable remediation hints
  • mcp doctor exits non-zero when any server fails (composes existing tested primitives — connect() / load_namespaced_token_at / resolved())

Out of scope (deferred follow-ups)

  • Manual mcp connect <name> escape hatch (Hermes equivalent: /mcp refresh)
  • OAuth-recovery-success auto-close of breaker
  • Per-tool permission gates (post-Phase-3 opt-in flag)

🤖 Generated with Claude Code

brettchien and others added 30 commits May 31, 2026 15:30
Adds ADR for in-core rmcp + progressive-disclosure meta-tool, deferred
to symmetry with the Skills extension pattern from PR openabdev#955. Memory
analysis rules out the sidecar alternative; per-session config refresh
replaces file-watcher hot reload to drop ~150 LOC of race-condition
hotspot.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ate guardrails

- §6.4 flow selection: device-code preferred (matches existing CLI convention),
  paste-back universal fallback, browser laptop opt-in
- §5.2 add login/complete_login meta-tool actions; login returns flow-tagged union
- §6.1 spell out TokenStore persistence assumption + cold-start refresh
- §6.3 device_authorization_endpoint extension point + RFC 8414 discovery
- §3.8 stdio container-image caveat (interpreter required for Node/Python)
- §5.7 max_concurrent_servers knob (default 10, see §7 for constrained tuning)
- §7 Fargate 512MB/1GB OOM analysis + mitigations

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ntract

- §6.4 RFC 8414 discovery disabled by default; opt-in requires
  oauth.discovery=true + oauth.discovery_allowlist (boot rejects otherwise).
  Rationale: awsvpc egress + SSRF surface in multi-tenant deployments.
- §6.1 add RTR race warning: async persistence layers (Fargate S3 sync,
  eventually-consistent volumes) must flush new tokens to durable storage
  before Spot interruption, else cascade-revoke locks the user out.
  Contract: fsync(2) agent-side + mtime-event-driven sync deployment-side.
- §6.3 expose oauth.discovery / oauth.discovery_allowlist on custom providers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 1 foundation slice per ADR §5.4 + §5.6:

- Add optional `rmcp 1.7` dep + `mcp` feature flag (default off)
- Wire `#[cfg(feature = "mcp")] mod mcp;` into main
- New `mcp/config.rs`: `McpConfig` / `ServerConfig` (Stdio | Http) /
  `ToolFilter` / `OAuthConfig`, global+project layered load, project
  precedence on name collision, `${env:VAR}` interpolation with
  per-server error context

OAuth fields limited to `provider` + `scopes`; custom-provider
endpoints (§6.3) deferred to Phase 2 auth slice to avoid dead schema.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI fmt-check found 8 formatting deltas in the Phase 1 scaffold. No
logic change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI previously only built the default feature set, leaving the
Phase 1 MCP scaffold (mcp/config.rs, gated by --features mcp)
without compile, clippy, or test coverage. Adds explicit
--features mcp invocations and watches workflow file changes
so this gap is closed for the rest of the rollout (ADR §9).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires up the Phase 1 config loader so clippy --features mcp can
see it. McpConfig::load() + ServerConfig::resolved() + serde
pretty-print are all reachable from main, clearing the dead-code
denial of compile under -D warnings.

Output groups successful servers (✓) and failures (✗ with reason),
sorted by name for deterministic display.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
sort_by_key over (name, _) is the cleaner form; no behavior
change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per Mira's security review on the Phase 1 thread: the previous
'mcp list' eagerly called ServerConfig::resolved(), substituting
\${env:GITHUB_TOKEN} etc. into the printed output. Three leak
paths matter — pasting CLI output into bug reports / chat, screen
sharing, and stdout log collection.

New behavior:
- Default: print raw config; \${env:VAR} placeholders kept verbatim.
  Safe to paste publicly; reader still sees which env var feeds
  each field.
- --resolve opts into substitution and prints a two-line warning
  banner. Useful for diagnosing missing-env startup failures.

No CLI-shape break: 'mcp list' still works; --resolve is additive.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
println! puts the warning banner into stdout, which (a) gets
swallowed by 'mcp list --resolve > dump.json' redirection so the
user never sees the security notice, and (b) corrupts the JSON
payload for downstream pipes like 'mcp list --resolve | jq'.

Routing the banner through eprintln! keeps it visible regardless
of redirection and keeps stdout pure JSON for piping. Standard
Unix convention: data on stdout, diagnostics on stderr.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Introduces McpRuntimeManager owning one ServerHandle per configured server,
each starting in ServerStatus::Disconnected per ADR §5.7 (lazy connect).
Wires the manager via `mcp status` CLI so the types are exercised by
clippy --features mcp; actual rmcp TokioChildProcess dial + Connected /
Failed transitions land in the next slice to keep that risky bit isolated
for bisecting.
`clippy --features mcp -- -D warnings` doesn't compile the test target,
so `len()`'s only callers (the unit tests) didn't keep it alive. Tests
already used `statuses().len()` in one place — switch the other two to
match and drop the now-dead method. `is_empty()` stays because
`cli_show_status` calls it.
…ct()

ADR §5.7 lifecycle needs a `connect()` method that spawns a child process
and awaits an rmcp handshake — both `Send` across `.await`. Plain
`HashMap<String, ServerHandle>` is not `Sync`, and the background idle-
eviction task will share the map with the foreground `mcp call` path, so
the read-heavy / write-light access pattern wants `tokio::sync::RwLock`.

This slice lands the lock migration only:
- `handles: Arc<RwLock<HashMap<String, ServerHandle>>>`
- `statuses()` and `is_empty()` become async; `cli_show_status` follows
- `McpRuntimeManager` is now `Clone` (Arc bump) so the eviction task can
  hold its own handle
- `connect(name)` transitions to `Connecting` and returns; the actual
  `rmcp::TokioChildProcess` dial + `Connected` / `Failed` transitions
  land in the next slice — keeping that bit isolated for bisecting

6 unit tests cover snapshot ordering, unknown-server error, transition,
and clone-shares-state.
The chained `.statuses().await.into_iter().map(|(n, _)| n).collect()`
fits on one line under rustfmt's default 100-char width — broke it
preemptively last tick on the wrong side of the threshold.
`clippy --features mcp -D warnings` doesn't compile the test target, so
`connect()`'s only callers (the unit tests) don't keep it alive. The
real rmcp dial lands in the next slice and will be reachable from the
meta-tool dispatch / `mcp call` CLI path; until then a targeted allow
keeps clippy quiet without papering over the type-level dead_code on
ServerStatus variants (which is already explicitly allowed).
connect() now spawns the configured stdio child via rmcp's
TokioChildProcess, runs the JSON-RPC handshake (().serve(transport)),
and parks the resulting RunningService on ServerHandle.client.

Lock discipline uses a double-acquire: a brief write lock to mark
Connecting, drop, run the dial without any lock held, then a second
brief write lock to install the client or record Failed(msg). Holding
the write lock across the spawn+handshake would block mcp status and
the future idle-eviction scan for the entire dial latency.

HTTP transport is rejected with a "phase 2" error so misconfigured
entries surface cleanly without advancing state past Disconnected.

Adds `openab-agent mcp connect <name>` as a smoke-test CLI for
mcp.json entries; the RunningService is dropped on process exit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…rver

Per Mira's Tick 13 review: when two callers race on the same name, both
pass the first-lock idempotency check (Disconnected), both spawn a
child, both come back to acquire the install-lock. The second arrival
must yield to the winner, otherwise it overwrites the installed client
and silently drops the working RunningService.

Adds a 3-line double-check at the second lock acquisition: if status
is already Connected with a live client, return Ok(()) — `dial_result`
falls out of scope here and RunningService::Drop kills the duplicate
child. Cheaper than a Connecting-aware notify/condvar and adequate
until the eviction-aware slice lands.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three rustfmt collapses that I missed under default 100-char width:

* import order: bare `rmcp::ServiceExt` sorts AFTER `rmcp::transport::*`
  (rustfmt puts sub-module paths before bare items within a crate).
* `return Err(anyhow!(...))` for the HTTP phase-2 message fits on one
  line (~88 chars).
* `let Self { command, args, env } = self;` fits on one line.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…patch

Adds the single `mcp` tool the LLM sees (ADR §5.2). This slice lands:

* `Action` enum tagged on `action`, matching the eight ADR actions,
  deserialized straight from the LLM's tool-call payload.
* `dispatch(manager, action) -> Result<Value>` as the single entry
  point — both `agent.rs::execute_tool` (next slice) and tests go
  through this.
* `help` returns the static usage doc.
* `list_servers` returns `[{name, status, transport}]` via a new
  `McpRuntimeManager::snapshot()` that clones out under a read lock.
* The four IO-bearing actions (`list_tools`, `describe_tool`, `call`,
  `status`) return a `not yet implemented` error so the surface area
  is visible to callers without pre-implementing the peer-borrow
  path. `login` / `complete_login` land with the Phase 2 OAuth slice.

`#[allow(dead_code)]` on `Action` and `dispatch` because clippy
--features mcp doesn't compile the test target and agent.rs
wire-up lands next slice.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…LLM fallback hint

Two of Mira's three Tick 15 findings (Arc-wrap for peer borrow lands
with the actual list_tools/call slice — Arc without a consumer earns
no keep).

(1) `not_implemented_msg(action)` names the requested action, lists
the actions that DO work (`help`, `list_servers`), and tells the LLM
to fall back to native `read` / `write` / `edit` / `bash` rather than
retrying. The old generic "not yet implemented" gave the model
nothing to act on.

(2) `ServerConfig::transport_label() -> &'static str` replaces the
config clone inside `snapshot()`. The `Stdio` variant carries an
`args: Vec<String>` and `env: HashMap<String, String>`; cloning that
just to read the variant tag was wasted heap traffic on every
`list_servers` call. `snapshot()` now yields
`(String, ServerStatus, &'static str)`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
rustfmt's fn_call_width default (60) splits assert!() args when the
inline args exceed it; the new "read, write, edit, bash" fallback
assert tipped the third call from ~55 to ~62 chars.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Apply Mira's Tick 15 review (3) and ship the first IO-bearing
meta-tool action.

* `ServerHandle.client` is now `Option<Arc<RunningService<...>>>`.
  `connect()` wraps the dial result in `Arc::new`.
* `McpRuntimeManager::arc_peer(name)` clones the Arc out under a
  short read lock, drops the guard, returns the handle. Callers
  `.await` on `peer.list_all_tools()` / `peer.call_tool()` with no
  runtime lock held — kills the writer-starvation risk Mira flagged
  and sidesteps `Future is not Send` from holding a guard across
  `.await`.
* `Action::ListTools { server }` now wired: lazy `connect()` →
  `arc_peer()` → `peer.list_all_tools()` → `[{name, description}]`.
  `describe_tool` / `call` / `status` still stub.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Apply Mira's Tick 18 review (3) — lenient args parsing for the
`call` action so LLMs that send `arguments: null` (or omit the field
entirely) for no-arg tools don't bounce off a strict type check.

* `Action::Call { server, tool, arguments }` is wired through
  `call_tool()`: validate arguments → lazy `connect()` → `arc_peer()`
  → `peer.call_tool(CallToolRequestParams::new(...).with_arguments(map))`
  → serialize `CallToolResult` to JSON.
* Argument coercion: `Value::Object(map) → map`, `Value::Null → {}`,
  everything else (string, number, array, bool) is rejected with a
  message that names the actual type so the LLM can correct itself.
* Tests cover both branches: non-object string args fail early at
  validation; null args pass validation and fail later at connect.
* `describe_tool` and `status` still stub; supported-list message
  updated to include `call`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The CallToolRequestParams chain fits in exactly 100 chars; rustfmt
keeps it on one line. Burned a tick on a defensive break.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… 1 surface

describe_tool returns full input_schema for one tool (list+filter via
list_all_tools — MCP spec has no single-tool query). status reports
per-server {name, status, transport, last_error} with optional name
filter; surfaces Failed's embedded error as last_error.

Extract fetch_tools helper shared by list_tools + describe_tool —
this is the natural insertion point for the planned tools_cache.
Surfaces the `mcp` action enum to the LLM via a new ToolDef and routes
calls through McpRuntimeManager when configured. AcpServer constructs
the manager once at startup (warn-and-default on config parse failure)
and clones the cheap Arc-backed handle into each session's Agent.

Tools.rs stays stateless and feature-flag-free — the mcp routing arm
lives in Agent::execute_tool_call where it has access to the manager.
A shared MCP_TOOL_NAME const keeps mcp_tool_def() and the dispatch arm
in sync.
The `mut` on `tools` and `base` is consumed only inside `#[cfg(feature = "mcp")]`
blocks, so the no-feature build saw them as unused. Gate `allow(unused_mut)`
behind `cfg_attr(not(feature = "mcp"), ...)` so both builds stay clean.
`test_session_new` and `test_session_new_missing_key` both mutate
`ANTHROPIC_API_KEY`. cargo runs tests in parallel, so a `set` from
one thread can be observed by the other before the first reads it.
The pre-existing race was timing-tight enough to pass historically;
adding `mcp::load_runtime_or_warn()` to `AcpServer::new()` widened
the gap between set and read enough to flip the race deterministic.

Guarding both tests with a shared `Mutex<()>` serializes them without
adding a `serial_test` dep. Poisoned guard is recovered via
`into_inner()` — we don't care if a prior assertion panicked.
brettchien and others added 22 commits June 1, 2026 10:34
…_status one-pass

Simplify pass on 9bca3de:

- auth.rs: PENDING_PREFIX demoted to private; pending_key() promoted to
  pub fn as the single public constructor. runtime.rs uses it instead
  of formatting against the const, so the namespace literal lives in
  exactly one place.
- mcp/mod.rs: cli_show_status folds the orphan calculation into the
  print loop via pending.remove(name) — the leftover set IS the
  orphans, dropping the second configured: HashSet<&str> pass.
- mcp/mod.rs: drop the cli_login "Security note" docstring — the same
  rationale lives on the Login clap subcommand doc in main.rs, which
  is the user-facing surface.
- runtime.rs: trim the spawn_blocking rationale on pending_logins —
  moved next to list_pending_logins_at where the sync read actually
  lives.

Net -10 LOC across 3 files; ~/ops/openab-agent-ci.sh --quick green.
Phase 2 / ADR §6.5: HTTP+oauth servers now dial with their cached
access_token when one is present and not within REFRESH_SKEW_SECONDS
of expiry. Refresh-token rotation for the expired-but-recoverable
case is the next slice — for now, expired tokens fall through to
NeedsAuth alongside missing tokens, so `mcp login` stays the single
user-actionable path.

- auth::is_expired now pub(crate) — single source of truth for the
  expiry+skew check across codex (get_valid_token) and mcp (connect).
- Dial::Http carries Option<String> auth; Dial::run uses
  StreamableHttpClientTransportConfig::with_uri(url).auth_header(token)
  when present, falls back to from_uri for anonymous HTTP.
- connect() oauth branch tries load_namespaced_token_at first; the
  Ok+!expired arm builds an authenticated Dial, all other arms bounce
  to NeedsAuth.

Tests:
- connect_oauth_with_valid_cached_token_attempts_dial_not_needs_auth
  proves the bearer-injection path reaches handshake (fails at the
  hermetic 127.0.0.1:1 dead address, surfaces Failed not NeedsAuth).
- connect_oauth_with_expired_token_bounces_to_needs_auth pins the
  expiry guard against future refresh-flow regressions.
Phase 2 / ADR §6.6: HTTP+oauth servers with an expired cached token now
attempt RFC 6749 §6 refresh-grant before bouncing to NeedsAuth. On
success the rotated TokenStore is fsynced and the dial proceeds with
the new bearer; on failure the bounce is preserved so `mcp login`
stays the user-actionable path.

- post_token_refresh: sibling of post_token_exchange for grant_type=
  refresh_token (public client, no client_secret).
- try_refresh_oauth_token: resolves client_id from current config (so
  a rotated catalog entry is picked up), POSTs the refresh, builds a
  new TokenStore preserving the previous refresh_token if the provider
  omits one (Google-style rotation), persists via
  save_namespaced_token_at.
- DialPlan two-phase enum lets the inside-lock branch tag "needs
  refresh" and defer the async refresh to outside the write lock —
  slow refresh must not block concurrent `mcp status` reads.

Tests:
- connect_oauth_expired_no_refresh_token_bounces_to_needs_auth pins
  the empty-refresh short-circuit (no POST attempted).
- connect_oauth_expired_with_refresh_token_failed_refresh_bounces_to_needs_auth
  proves the refresh path runs and that any failure surfaces as
  user-actionable NeedsAuth (current fixture's custom provider plus
  dead token endpoint both fail the refresh).

Custom-provider refresh shares the same gap as custom-provider login
— resolve_paste_client errors before the refresh POST runs. Will be
unblocked by the same future slice that wires custom-provider login.
…ected winner

Mira + Kirin caught a race: two concurrent connect() calls on the same
expired-token server can interleave so one wins the refresh + dial
while the other's refresh fails — and the loser was unconditionally
overwriting `Connected` with `NeedsAuth`, giving the user a false
"please log in" signal on a healthy session.

Guard the status mutation with `!matches!(h.status, Connected)` so a
concurrent winner's terminal state is preserved.
Mirrors finish_login (oauth.rs:271) so a hostile/buggy provider returning
a huge expires_in can't trigger a debug-build overflow panic in the
refresh-token rotation path.
…d refresh

post_token_exchange and post_token_refresh diverged only in the form
payload — same Client::builder() boilerplate, same status/body folding,
same response shape. Extract post_token_form so each grant call site is
just the RFC 6749 form tuple + a grant_label for the error context.

Also drops a narrating comment in connect() that recapped the three
match arms below it without adding any non-obvious context.
Adds the §3.1 device authorization request + §3.4 polling tick HTTP
helpers — pure data plumbing, no runtime wiring yet. The polling-tick
response classifier is split out as a pure fn (classify_device_poll) so
the §3.5 error-code → flow-state mapping (authorization_pending,
slow_down, access_denied, expired_token) is unit-testable without a
mock HTTP server.

DeviceAuthResponse defaults interval to 5s per RFC 8628 §3.5 when the
provider omits it. verification_uri_complete (§3.3.1 extension) is
Option since verification_uri + user_code is the always-present
fallback.

Both POST fns + the two types are #[allow(dead_code)] — next slice wires
start_device_login() and the polling loop into McpRuntimeManager.
…loop)

- `start_device_login(name)`: POST §3.1 + spawn detached §3.4 polling task
- `DeviceLoginStart`: public user-facing bundle (user_code / verification_uri / verification_uri_complete / expires_in)
- `resolve_device_client`: Custom-provider-only gate with explicit error
  for missing `device_authorization_endpoint` / `client_id`
- `run_device_poll_loop`: deadline check + SlowDown back-off (+5s) +
  success persistence + NeedsAuth on terminal failure
- `finalize_device_login` / `mark_device_login_failed`: status transitions
  (Disconnected on success keeps MCP handshake out of detached task;
   next `connect()` picks up the cached token via the oauth-aware DialPlan)
- 4 gating tests: unknown server / stdio / missing device endpoint / HTTP reach

#[allow(dead_code)] on the new surface until the next slice wires
`mcp login --device` CLI. Quick gate green.
- main.rs: add `--device` flag to `McpAction::Login` (conflicts_with `paste`)
- mcp/mod.rs: `cli_login_device(name)` — prints verification_uri_complete
  (§3.3.1) when present, falls back to `verification_uri` + `user_code`;
  watches `statuses()` until Disconnected (success) / NeedsAuth (terminal
  failure) or the provider-declared `expires_in` deadline elapses
- runtime.rs: drop `#[allow(dead_code)]` on the Tick-55 device-flow chain
  (`start_device_login` is now called from the CLI surface)
- config.rs: add `oauth.redirect_uri: Option<String>` to `OAuthConfig`
  for the paste-back branch on custom providers (built-ins pin their
  callback in `ProviderSpec`; ignored by the device-code branch)
- oauth.rs: propagate `redirect_uri` through `ResolvedProvider::Custom`
- runtime.rs: in `resolve_paste_client`, accept Custom + client_id +
  redirect_uri; emit precise per-missing-field error otherwise:
  - device endpoint present → "use device flow" (preserved)
  - missing client_id → "requires `oauth.client_id`"
  - missing redirect_uri → "requires `oauth.redirect_uri` (must match
    the redirect URL pre-registered with the provider)"
- Tests: rename `_rejects_custom_provider_for_now` →
  `_rejects_custom_without_redirect_uri`; add
  `_rejects_custom_without_client_id` + success-path
  `_custom_with_client_id_and_redirect_uri_succeeds`

Closes the last Phase 2 paste-back functional gap. ADR §6.3 enumerates
optional custom fields but doesn't currently list `redirect_uri`;
follow-up ADR amendment lands separately.
Adds the field to the §6.3 config-shape list and a follow-up paragraph
explaining why custom paste-back requires it (no local listener; must
match provider OAuth app registration) and why the device-code / built-in
branches ignore it.

Follow-up to 84837ee (custom-provider paste-back support); Mira + Kirin
dual sign-off requested this as a non-blocking docs amendment.
- auth.rs: convert free function `is_expired(&TokenStore) -> bool` into
  `impl TokenStore { pub fn is_expired(&self) -> bool }`. Switch the skew
  addition to `saturating_add` for parity with the rest of the OAuth
  arithmetic (defends `u64::MAX` "never expires" sentinel + any
  near-overflow clock value).
- auth.rs: fold `show_status`'s inline duplicate of the expiry formula
  into the new method.
- mcp/runtime.rs: drop the free-function import + update the cached-token
  DialPlan branch to `store.is_expired()`.
- Tests: rewrite the three existing `is_expired` cases to method form and
  add `_sentinel_u64_max` covering the saturating_add behavior.

Mira Tick 50 non-blocking cleanup.
Drop stale `#[allow(dead_code)]` markers now that runtime/flow wire all
items through prod paths. Reuse `reqwest::Client` across the device-poll
loop so TLS / TCP handshakes are amortized across the dozens-to-hundreds
of polls a 30-minute device-flow window can produce. Trim session-narrative
parentheticals from doc comments.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reviewer-pass findings:

- Drop dead auth.rs wrappers (`load/save/remove_namespaced_token`) that
  were never wired; runtime/tests use the `_at` siblings directly.
- Extract `build_token_store` helper to dedupe the `TokenStore`
  construction + `u64::MAX` sentinel + refresh-rotation logic across
  `finish_login`, `finalize_device_login`, and `try_refresh_oauth_token`.
- Preserve underlying error context in `complete_login` via
  `.with_context()` instead of `.map_err(|_| anyhow!(...))`.
- Trim session-narrative parentheticals ("Tick N", "Mira's review")
  from doc comments; keep the technical reason that future readers need.
- Refresh stale doc strings that still said "next slice" for behaviour
  Phase 2 itself ships.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Concurrent connect() callers may each hit `try_refresh_oauth_token`
with the same stale `refresh_token` (no per-server single-flight gate
yet). The connect() race guard prevents state corruption; the worst
case is N duplicate POSTs to the token endpoint. A Phase 3 follow-up
will add per-server single-flight for providers that cascade-revoke
on replayed refresh tokens.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Align ADR with implementation: the device-code polling loop persists
tokens then sets status to Disconnected, so the next connect() reads
the cached token via the oauth-aware dial path. Keeps the rmcp handshake
out of the detached polling task.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Track the most-recent device-poll JoinHandle per server in
`McpRuntimeManager.device_login_tasks`. On a fresh `start_device_login`
the prior task is aborted before the new one is registered, so a retry
after a transient failure can't leave two loops racing to finalize the
same server (would risk double-persisting tokens / contradictory status
transitions).

Phase 2 follow-up openabdev#4 of 4.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add `refresh_locks: HashMap<server, Arc<tokio::Mutex>>` and acquire the
per-server lock before the refresh POST. After taking the lock, re-read
the on-disk token; if a prior waiter already refreshed, return the
cached store without a second POST.

Closes the cascade-revoke window on providers (Google, ...) that
invalidate every issued refresh_token when the same one is replayed
concurrently. Phase 2 follow-up openabdev#1 of 4.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the truncate+write+sync_all path with: write same-dir tmp
(PID + atomic counter for uniqueness, mode 0o600 via create_new),
fsync the tmp, rename(2) onto auth.json, fsync the parent directory.
On crash, the file is either the prior version or the new version —
never half-written.

The old implementation truncated `auth.json` before writing, so a
mid-write crash left a corrupt file that the next startup would fail
to parse. On the ECS deployment that then re-restored a now-revoked
refresh_token from S3 (ADR §6.1). Phase 2 follow-up openabdev#3 of 4.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
MCP is now an unconditional dep. Drops the `[features]` block in
Cargo.toml, promotes `rmcp` from optional to required, strips every
`#[cfg(feature = "mcp")]` / `#[cfg(not(feature = "mcp"))]` gate in
src/{acp,agent,auth,main}.rs, and drops the no-mcp combo from the CI
matrix (both .github/workflows/ci-openab-agent.yml and the local
mirror script).

Per ADR `docs/adr/openab-agent-mcp.md` §9 Phase 3 — the intermediate
`default = ["mcp"]` step is skipped in favor of going straight to
flag removal (decision recorded in tracking issue openabdev#966).

Refs openabdev#966

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds `mcp/breaker.rs` per ADR §5.9 and the design decisions resolved
in openabdev#966:

- Q1 (a) Fixed cooldown 3-state breaker (Closed / Open / HalfOpen)
- Q2 (a) Single consecutive-failure counter per server
- Q3 (c) Lazy / piggyback probe (next call after cooldown = probe)

API surface:

    pub enum Verdict { Allow, AllowProbe, Reject { retry_in_secs } }
    pub struct ServerBreaker { ... }
    impl ServerBreaker {
        pub fn new() -> Self;
        pub fn check(&self, server: &str) -> Verdict;
        pub fn record_success(&self, server: &str);
        pub fn record_failure(&self, server: &str);
    }

Constants: `FAIL_THRESHOLD = 3`, `COOLDOWN = 60s`. Matches Hermes
`tools/mcp_tool.py` (lines 1868-1912 + 2480-2510). ADR §5.9 mentions
"3 fails in 30s" but Hermes itself uses pure consecutive count with
no time window — going Hermes-simple here (any success resets).

`#![allow(dead_code)]` carries the API surface until the next slice
wires it into `McpRuntimeManager` (per the runbook's stub-without-
prod-caller convention).

Unit tests cover: unknown-server pass-through, under-threshold pass,
threshold-trip, success-resets-count, cooldown-elapsed-allows-probe,
probe-failure-rearms-cooldown, probe-success-closes-breaker, per-
server isolation, retry_in_secs floor (1s when cooldown almost done).

Refs openabdev#966

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Connects the per-server `ServerBreaker` introduced in 26c7d87 to the
`McpRuntimeManager` connect path and the `meta_tool::call_tool`
dispatch path (ADR §5.9 + openabdev#966 design decisions).

- `connect()`: check breaker after the already-connected fast path;
  `Reject` short-circuits with a `retry in {n}s` hint. NeedsAuth bounces
  (missing/expired token, refresh failure) do NOT touch the breaker —
  those are auth-level, not transport-level.
- Dial outcome at the tail of `connect()` records success/failure on
  the breaker, matching the single-counter / transport-only model
  resolved in openabdev#966 Q2.
- `meta_tool::call_tool` wraps `peer.call_tool()`: wire-level `Err` =
  transport failure → trips the breaker; wire-level `Ok` (regardless
  of `CallToolResult.is_error`) resets it. `isError: true` content is
  protocol-normal payload, not a transport fault.
- New `record_tool_call_outcome(name, ok)` is the only public surface
  the meta_tool path uses; the breaker stays private inside the manager.

Tests: 2 new integration tests in `mcp::runtime::tests` —
breaker_opens_after_threshold_consecutive_connect_failures (dead-port
pattern, hermetic) and breaker_does_not_count_oauth_needs_auth_bounces
(NeedsAuth isolation). Existing 9 breaker unit tests still pass.
Closes the Phase 3 scope from openabdev#966 — interactive diagnostic that runs
a live `connect()` against every configured server, with three checks
per server (short-circuit on first ✗):

1. Config resolution — `${env:VAR}` placeholders resolve against the
   live process env. Missing vars are flagged with the offending name
   and a "set the var(s)" hint.
2. OAuth state (HTTP + `oauth:` only) — cached `TokenStore` in
   `auth.json`. Missing → `mcp login <name>` hint; expired → noted
   but not fatal (connect will attempt refresh).
3. Live connect — `manager.connect(name).await`. Errors surface
   verbatim, including the §5.9 circuit-breaker `retry in {n}s` hint
   when the breaker is open.

Exits non-zero when any server fails so CI / scripts can wrap with
`openab-agent mcp doctor || alert`. Reuses the existing
`McpRuntimeManager` and `load_namespaced_token_at` primitives — no
new runtime state, no new fields on `McpRuntimeManager`.

Wired into the existing `mcp` subcommand dispatch in main.rs
alongside `list` / `status` / `connect` / `login`.
@brettchien brettchien marked this pull request as ready for review June 1, 2026 16:25
@brettchien brettchien requested a review from thepagent as a code owner June 1, 2026 16:25
brettchien added a commit to brettchien/openab that referenced this pull request Jun 1, 2026
…F6/F9/F10 + F2/F7 doc)

Code changes:
- F6: gate `load_runtime_or_warn()` on `OPENAB_AGENT_MCP={1,true,yes,on}`
  env var. Compile-time feature flag is being removed in Phase 3 PR openabdev#969;
  this adds the runtime half so MCP stays dormant unless explicitly opted
  in, preventing accidental activation when mcp.json is present.
- F4: integration test in `mcp::runtime::tests` exercising two concurrent
  `connect()` tasks against a guaranteed-failure server, asserts no
  stuck-Connecting state and that follow-up dials still attempt fresh.
- F5: defensive guard in `Agent::execute_tool_call` for `mcp` tool calls
  when manager is unavailable — returns actionable error instead of
  silently falling through to fs-tool dispatch.
- F9: regression test for missing-env-var error path; asserts the offender
  variable name + server name surface in the chained error message.
- F10: `truncate_context` drain bounds clarified — `3.min(len)` instead of
  `(1 + 2).min(len)`, with comment explaining the preserved first message.

Doc changes (ADR §5.4.1 added):
- F2: rationale for `env_clear()` baseline + explicit `env:` allowlist —
  blocks token leakage through ambient env, matches MCP SDK behavior.
- F7: rationale for single shared `McpRuntimeManager` per process —
  connection pooling, deduplicated handshakes, lifecycle owned by Agent.
@shaun-agent
Copy link
Copy Markdown
Contributor

shaun-agent commented Jun 1, 2026

OpenAB PR Screening

This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Click 👍 if you find this useful. Human review will be done within 24 hours. We appreciate your support and contribution 🙏

Screening report posted screening comment: https://github.com//pull/969#issuecomment-4595186764 project action: moved `PVTI_lADOEFbZWM4BUUALzguZj-g` to `PR-Screening` on https://github.com/orgs/openabdev/projects/1

Intent

Ship Phase 3 of the native openab-agent MCP client: resilience and operational diagnostics. The operator-visible problem is that MCP servers need predictable failure isolation, diagnostics, and a no-flag default path before this client can be treated as production-ready.

Feat

Feature work with some release-cleanup behavior. It adds a per-server MCP circuit breaker, wires transport-failure tracking into connect/tool-call paths, adds mcp doctor diagnostics with remediation hints, and removes the --features mcp Cargo flag so MCP support is unconditional.

Who It Serves

Primary beneficiaries are agent runtime operators and deployers. Maintainers also benefit from a clearer default build and from diagnostics that reduce support ambiguity when MCP configuration, OAuth, or live connectivity fails.

Rewritten Prompt

Implement MCP Phase 3 for openab-agent on top of #959 and #962. Add a per-server three-state circuit breaker with fixed 60-second cooldown and lazy probe behavior. Count only transport-level failures. Exclude auth bounces, OAuth refresh failures, JSON-RPC application errors, and tool isError: true results. Add tests for thresholds, rejection hints, half-open transitions, and auth exclusions. Add mcp doctor diagnostics with actionable remediation and non-zero failure exit. Remove the MCP feature flag and simplify CI. Rebase after #959 and #962 land so the diff contains only Phase 3.

Merge Pitch

This should move forward after stack dependencies are reviewed because it turns MCP from happy-path integration into something operators can debug and recover. Risk is moderate-high due to stacked diff noise and unconditional MCP build impact. Main reviewer concern: breaker failure classification.

Best-Practice Comparison

OpenClaw partially applies: isolated failures and retry/backoff patterns are relevant, but durable job persistence, delivery routing, and run logs are not introduced here.

Hermes Agent directly applies. The PR mirrors its three-state breaker model: fixed cooldown, lazy half-open probe, one per-server counter, and no background polling. Good fit.

Implementation Options

  1. Conservative: keep MCP behind the feature flag for one release; merge breaker and doctor only.
  2. Balanced: accept full Phase 3 after feat(openab-agent): native MCP client (ADR + implementation) #959/feat(openab-agent): MCP Phase 2 — OAuth 2.1 paste-back (stacks on #959) #962, then require rebase before review.
  3. Ambitious: include persisted breaker state, structured run logs, and manual refresh now.

Comparison Table

Option Speed Complexity Reliability Maintainability User Impact Fit for OpenAB now
Conservative Fastest Low Medium Good short-term Smaller Acceptable
Balanced Moderate Medium High Best Strong Best fit
Ambitious Slow High Potentially highest Risky Strong but delayed Poor fit

Recommendation

Use the balanced path. Merge/review #959 and #962 first, then require #969 to rebase so Phase 3 is reviewable on its own. Main gates: breaker classification, mcp doctor exit behavior, and unconditional dependency/CI changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tracking: openab-agent native MCP — Phase 3 (Resilience)

2 participants