feat(openab-agent): MCP Phase 3 — circuit breaker + flag removal (stacks on #959 + #962)#969
feat(openab-agent): MCP Phase 3 — circuit breaker + flag removal (stacks on #959 + #962)#969brettchien wants to merge 98 commits into
Conversation
Adds ADR for in-core rmcp + progressive-disclosure meta-tool, deferred to symmetry with the Skills extension pattern from PR openabdev#955. Memory analysis rules out the sidecar alternative; per-session config refresh replaces file-watcher hot reload to drop ~150 LOC of race-condition hotspot. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ate guardrails - §6.4 flow selection: device-code preferred (matches existing CLI convention), paste-back universal fallback, browser laptop opt-in - §5.2 add login/complete_login meta-tool actions; login returns flow-tagged union - §6.1 spell out TokenStore persistence assumption + cold-start refresh - §6.3 device_authorization_endpoint extension point + RFC 8414 discovery - §3.8 stdio container-image caveat (interpreter required for Node/Python) - §5.7 max_concurrent_servers knob (default 10, see §7 for constrained tuning) - §7 Fargate 512MB/1GB OOM analysis + mitigations Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ntract - §6.4 RFC 8414 discovery disabled by default; opt-in requires oauth.discovery=true + oauth.discovery_allowlist (boot rejects otherwise). Rationale: awsvpc egress + SSRF surface in multi-tenant deployments. - §6.1 add RTR race warning: async persistence layers (Fargate S3 sync, eventually-consistent volumes) must flush new tokens to durable storage before Spot interruption, else cascade-revoke locks the user out. Contract: fsync(2) agent-side + mtime-event-driven sync deployment-side. - §6.3 expose oauth.discovery / oauth.discovery_allowlist on custom providers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 1 foundation slice per ADR §5.4 + §5.6:
- Add optional `rmcp 1.7` dep + `mcp` feature flag (default off)
- Wire `#[cfg(feature = "mcp")] mod mcp;` into main
- New `mcp/config.rs`: `McpConfig` / `ServerConfig` (Stdio | Http) /
`ToolFilter` / `OAuthConfig`, global+project layered load, project
precedence on name collision, `${env:VAR}` interpolation with
per-server error context
OAuth fields limited to `provider` + `scopes`; custom-provider
endpoints (§6.3) deferred to Phase 2 auth slice to avoid dead schema.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI fmt-check found 8 formatting deltas in the Phase 1 scaffold. No logic change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI previously only built the default feature set, leaving the Phase 1 MCP scaffold (mcp/config.rs, gated by --features mcp) without compile, clippy, or test coverage. Adds explicit --features mcp invocations and watches workflow file changes so this gap is closed for the rest of the rollout (ADR §9). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires up the Phase 1 config loader so clippy --features mcp can see it. McpConfig::load() + ServerConfig::resolved() + serde pretty-print are all reachable from main, clearing the dead-code denial of compile under -D warnings. Output groups successful servers (✓) and failures (✗ with reason), sorted by name for deterministic display. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
sort_by_key over (name, _) is the cleaner form; no behavior change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per Mira's security review on the Phase 1 thread: the previous
'mcp list' eagerly called ServerConfig::resolved(), substituting
\${env:GITHUB_TOKEN} etc. into the printed output. Three leak
paths matter — pasting CLI output into bug reports / chat, screen
sharing, and stdout log collection.
New behavior:
- Default: print raw config; \${env:VAR} placeholders kept verbatim.
Safe to paste publicly; reader still sees which env var feeds
each field.
- --resolve opts into substitution and prints a two-line warning
banner. Useful for diagnosing missing-env startup failures.
No CLI-shape break: 'mcp list' still works; --resolve is additive.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
println! puts the warning banner into stdout, which (a) gets swallowed by 'mcp list --resolve > dump.json' redirection so the user never sees the security notice, and (b) corrupts the JSON payload for downstream pipes like 'mcp list --resolve | jq'. Routing the banner through eprintln! keeps it visible regardless of redirection and keeps stdout pure JSON for piping. Standard Unix convention: data on stdout, diagnostics on stderr. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Introduces McpRuntimeManager owning one ServerHandle per configured server, each starting in ServerStatus::Disconnected per ADR §5.7 (lazy connect). Wires the manager via `mcp status` CLI so the types are exercised by clippy --features mcp; actual rmcp TokioChildProcess dial + Connected / Failed transitions land in the next slice to keep that risky bit isolated for bisecting.
`clippy --features mcp -- -D warnings` doesn't compile the test target, so `len()`'s only callers (the unit tests) didn't keep it alive. Tests already used `statuses().len()` in one place — switch the other two to match and drop the now-dead method. `is_empty()` stays because `cli_show_status` calls it.
…ct() ADR §5.7 lifecycle needs a `connect()` method that spawns a child process and awaits an rmcp handshake — both `Send` across `.await`. Plain `HashMap<String, ServerHandle>` is not `Sync`, and the background idle- eviction task will share the map with the foreground `mcp call` path, so the read-heavy / write-light access pattern wants `tokio::sync::RwLock`. This slice lands the lock migration only: - `handles: Arc<RwLock<HashMap<String, ServerHandle>>>` - `statuses()` and `is_empty()` become async; `cli_show_status` follows - `McpRuntimeManager` is now `Clone` (Arc bump) so the eviction task can hold its own handle - `connect(name)` transitions to `Connecting` and returns; the actual `rmcp::TokioChildProcess` dial + `Connected` / `Failed` transitions land in the next slice — keeping that bit isolated for bisecting 6 unit tests cover snapshot ordering, unknown-server error, transition, and clone-shares-state.
The chained `.statuses().await.into_iter().map(|(n, _)| n).collect()` fits on one line under rustfmt's default 100-char width — broke it preemptively last tick on the wrong side of the threshold.
`clippy --features mcp -D warnings` doesn't compile the test target, so `connect()`'s only callers (the unit tests) don't keep it alive. The real rmcp dial lands in the next slice and will be reachable from the meta-tool dispatch / `mcp call` CLI path; until then a targeted allow keeps clippy quiet without papering over the type-level dead_code on ServerStatus variants (which is already explicitly allowed).
connect() now spawns the configured stdio child via rmcp's TokioChildProcess, runs the JSON-RPC handshake (().serve(transport)), and parks the resulting RunningService on ServerHandle.client. Lock discipline uses a double-acquire: a brief write lock to mark Connecting, drop, run the dial without any lock held, then a second brief write lock to install the client or record Failed(msg). Holding the write lock across the spawn+handshake would block mcp status and the future idle-eviction scan for the entire dial latency. HTTP transport is rejected with a "phase 2" error so misconfigured entries surface cleanly without advancing state past Disconnected. Adds `openab-agent mcp connect <name>` as a smoke-test CLI for mcp.json entries; the RunningService is dropped on process exit. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…rver Per Mira's Tick 13 review: when two callers race on the same name, both pass the first-lock idempotency check (Disconnected), both spawn a child, both come back to acquire the install-lock. The second arrival must yield to the winner, otherwise it overwrites the installed client and silently drops the working RunningService. Adds a 3-line double-check at the second lock acquisition: if status is already Connected with a live client, return Ok(()) — `dial_result` falls out of scope here and RunningService::Drop kills the duplicate child. Cheaper than a Connecting-aware notify/condvar and adequate until the eviction-aware slice lands. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three rustfmt collapses that I missed under default 100-char width:
* import order: bare `rmcp::ServiceExt` sorts AFTER `rmcp::transport::*`
(rustfmt puts sub-module paths before bare items within a crate).
* `return Err(anyhow!(...))` for the HTTP phase-2 message fits on one
line (~88 chars).
* `let Self { command, args, env } = self;` fits on one line.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…patch
Adds the single `mcp` tool the LLM sees (ADR §5.2). This slice lands:
* `Action` enum tagged on `action`, matching the eight ADR actions,
deserialized straight from the LLM's tool-call payload.
* `dispatch(manager, action) -> Result<Value>` as the single entry
point — both `agent.rs::execute_tool` (next slice) and tests go
through this.
* `help` returns the static usage doc.
* `list_servers` returns `[{name, status, transport}]` via a new
`McpRuntimeManager::snapshot()` that clones out under a read lock.
* The four IO-bearing actions (`list_tools`, `describe_tool`, `call`,
`status`) return a `not yet implemented` error so the surface area
is visible to callers without pre-implementing the peer-borrow
path. `login` / `complete_login` land with the Phase 2 OAuth slice.
`#[allow(dead_code)]` on `Action` and `dispatch` because clippy
--features mcp doesn't compile the test target and agent.rs
wire-up lands next slice.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…LLM fallback hint Two of Mira's three Tick 15 findings (Arc-wrap for peer borrow lands with the actual list_tools/call slice — Arc without a consumer earns no keep). (1) `not_implemented_msg(action)` names the requested action, lists the actions that DO work (`help`, `list_servers`), and tells the LLM to fall back to native `read` / `write` / `edit` / `bash` rather than retrying. The old generic "not yet implemented" gave the model nothing to act on. (2) `ServerConfig::transport_label() -> &'static str` replaces the config clone inside `snapshot()`. The `Stdio` variant carries an `args: Vec<String>` and `env: HashMap<String, String>`; cloning that just to read the variant tag was wasted heap traffic on every `list_servers` call. `snapshot()` now yields `(String, ServerStatus, &'static str)`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
rustfmt's fn_call_width default (60) splits assert!() args when the inline args exceed it; the new "read, write, edit, bash" fallback assert tipped the third call from ~55 to ~62 chars. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Apply Mira's Tick 15 review (3) and ship the first IO-bearing
meta-tool action.
* `ServerHandle.client` is now `Option<Arc<RunningService<...>>>`.
`connect()` wraps the dial result in `Arc::new`.
* `McpRuntimeManager::arc_peer(name)` clones the Arc out under a
short read lock, drops the guard, returns the handle. Callers
`.await` on `peer.list_all_tools()` / `peer.call_tool()` with no
runtime lock held — kills the writer-starvation risk Mira flagged
and sidesteps `Future is not Send` from holding a guard across
`.await`.
* `Action::ListTools { server }` now wired: lazy `connect()` →
`arc_peer()` → `peer.list_all_tools()` → `[{name, description}]`.
`describe_tool` / `call` / `status` still stub.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Apply Mira's Tick 18 review (3) — lenient args parsing for the
`call` action so LLMs that send `arguments: null` (or omit the field
entirely) for no-arg tools don't bounce off a strict type check.
* `Action::Call { server, tool, arguments }` is wired through
`call_tool()`: validate arguments → lazy `connect()` → `arc_peer()`
→ `peer.call_tool(CallToolRequestParams::new(...).with_arguments(map))`
→ serialize `CallToolResult` to JSON.
* Argument coercion: `Value::Object(map) → map`, `Value::Null → {}`,
everything else (string, number, array, bool) is rejected with a
message that names the actual type so the LLM can correct itself.
* Tests cover both branches: non-object string args fail early at
validation; null args pass validation and fail later at connect.
* `describe_tool` and `status` still stub; supported-list message
updated to include `call`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The CallToolRequestParams chain fits in exactly 100 chars; rustfmt keeps it on one line. Burned a tick on a defensive break. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… 1 surface
describe_tool returns full input_schema for one tool (list+filter via
list_all_tools — MCP spec has no single-tool query). status reports
per-server {name, status, transport, last_error} with optional name
filter; surfaces Failed's embedded error as last_error.
Extract fetch_tools helper shared by list_tools + describe_tool —
this is the natural insertion point for the planned tools_cache.
Surfaces the `mcp` action enum to the LLM via a new ToolDef and routes calls through McpRuntimeManager when configured. AcpServer constructs the manager once at startup (warn-and-default on config parse failure) and clones the cheap Arc-backed handle into each session's Agent. Tools.rs stays stateless and feature-flag-free — the mcp routing arm lives in Agent::execute_tool_call where it has access to the manager. A shared MCP_TOOL_NAME const keeps mcp_tool_def() and the dispatch arm in sync.
The `mut` on `tools` and `base` is consumed only inside `#[cfg(feature = "mcp")]` blocks, so the no-feature build saw them as unused. Gate `allow(unused_mut)` behind `cfg_attr(not(feature = "mcp"), ...)` so both builds stay clean.
`test_session_new` and `test_session_new_missing_key` both mutate `ANTHROPIC_API_KEY`. cargo runs tests in parallel, so a `set` from one thread can be observed by the other before the first reads it. The pre-existing race was timing-tight enough to pass historically; adding `mcp::load_runtime_or_warn()` to `AcpServer::new()` widened the gap between set and read enough to flip the race deterministic. Guarding both tests with a shared `Mutex<()>` serializes them without adding a `serial_test` dep. Poisoned guard is recovered via `into_inner()` — we don't care if a prior assertion panicked.
…_status one-pass Simplify pass on 9bca3de: - auth.rs: PENDING_PREFIX demoted to private; pending_key() promoted to pub fn as the single public constructor. runtime.rs uses it instead of formatting against the const, so the namespace literal lives in exactly one place. - mcp/mod.rs: cli_show_status folds the orphan calculation into the print loop via pending.remove(name) — the leftover set IS the orphans, dropping the second configured: HashSet<&str> pass. - mcp/mod.rs: drop the cli_login "Security note" docstring — the same rationale lives on the Login clap subcommand doc in main.rs, which is the user-facing surface. - runtime.rs: trim the spawn_blocking rationale on pending_logins — moved next to list_pending_logins_at where the sync read actually lives. Net -10 LOC across 3 files; ~/ops/openab-agent-ci.sh --quick green.
Phase 2 / ADR §6.5: HTTP+oauth servers now dial with their cached access_token when one is present and not within REFRESH_SKEW_SECONDS of expiry. Refresh-token rotation for the expired-but-recoverable case is the next slice — for now, expired tokens fall through to NeedsAuth alongside missing tokens, so `mcp login` stays the single user-actionable path. - auth::is_expired now pub(crate) — single source of truth for the expiry+skew check across codex (get_valid_token) and mcp (connect). - Dial::Http carries Option<String> auth; Dial::run uses StreamableHttpClientTransportConfig::with_uri(url).auth_header(token) when present, falls back to from_uri for anonymous HTTP. - connect() oauth branch tries load_namespaced_token_at first; the Ok+!expired arm builds an authenticated Dial, all other arms bounce to NeedsAuth. Tests: - connect_oauth_with_valid_cached_token_attempts_dial_not_needs_auth proves the bearer-injection path reaches handshake (fails at the hermetic 127.0.0.1:1 dead address, surfaces Failed not NeedsAuth). - connect_oauth_with_expired_token_bounces_to_needs_auth pins the expiry guard against future refresh-flow regressions.
Phase 2 / ADR §6.6: HTTP+oauth servers with an expired cached token now attempt RFC 6749 §6 refresh-grant before bouncing to NeedsAuth. On success the rotated TokenStore is fsynced and the dial proceeds with the new bearer; on failure the bounce is preserved so `mcp login` stays the user-actionable path. - post_token_refresh: sibling of post_token_exchange for grant_type= refresh_token (public client, no client_secret). - try_refresh_oauth_token: resolves client_id from current config (so a rotated catalog entry is picked up), POSTs the refresh, builds a new TokenStore preserving the previous refresh_token if the provider omits one (Google-style rotation), persists via save_namespaced_token_at. - DialPlan two-phase enum lets the inside-lock branch tag "needs refresh" and defer the async refresh to outside the write lock — slow refresh must not block concurrent `mcp status` reads. Tests: - connect_oauth_expired_no_refresh_token_bounces_to_needs_auth pins the empty-refresh short-circuit (no POST attempted). - connect_oauth_expired_with_refresh_token_failed_refresh_bounces_to_needs_auth proves the refresh path runs and that any failure surfaces as user-actionable NeedsAuth (current fixture's custom provider plus dead token endpoint both fail the refresh). Custom-provider refresh shares the same gap as custom-provider login — resolve_paste_client errors before the refresh POST runs. Will be unblocked by the same future slice that wires custom-provider login.
…ected winner Mira + Kirin caught a race: two concurrent connect() calls on the same expired-token server can interleave so one wins the refresh + dial while the other's refresh fails — and the loser was unconditionally overwriting `Connected` with `NeedsAuth`, giving the user a false "please log in" signal on a healthy session. Guard the status mutation with `!matches!(h.status, Connected)` so a concurrent winner's terminal state is preserved.
Mirrors finish_login (oauth.rs:271) so a hostile/buggy provider returning a huge expires_in can't trigger a debug-build overflow panic in the refresh-token rotation path.
…d refresh post_token_exchange and post_token_refresh diverged only in the form payload — same Client::builder() boilerplate, same status/body folding, same response shape. Extract post_token_form so each grant call site is just the RFC 6749 form tuple + a grant_label for the error context. Also drops a narrating comment in connect() that recapped the three match arms below it without adding any non-obvious context.
Adds the §3.1 device authorization request + §3.4 polling tick HTTP helpers — pure data plumbing, no runtime wiring yet. The polling-tick response classifier is split out as a pure fn (classify_device_poll) so the §3.5 error-code → flow-state mapping (authorization_pending, slow_down, access_denied, expired_token) is unit-testable without a mock HTTP server. DeviceAuthResponse defaults interval to 5s per RFC 8628 §3.5 when the provider omits it. verification_uri_complete (§3.3.1 extension) is Option since verification_uri + user_code is the always-present fallback. Both POST fns + the two types are #[allow(dead_code)] — next slice wires start_device_login() and the polling loop into McpRuntimeManager.
…loop) - `start_device_login(name)`: POST §3.1 + spawn detached §3.4 polling task - `DeviceLoginStart`: public user-facing bundle (user_code / verification_uri / verification_uri_complete / expires_in) - `resolve_device_client`: Custom-provider-only gate with explicit error for missing `device_authorization_endpoint` / `client_id` - `run_device_poll_loop`: deadline check + SlowDown back-off (+5s) + success persistence + NeedsAuth on terminal failure - `finalize_device_login` / `mark_device_login_failed`: status transitions (Disconnected on success keeps MCP handshake out of detached task; next `connect()` picks up the cached token via the oauth-aware DialPlan) - 4 gating tests: unknown server / stdio / missing device endpoint / HTTP reach #[allow(dead_code)] on the new surface until the next slice wires `mcp login --device` CLI. Quick gate green.
- main.rs: add `--device` flag to `McpAction::Login` (conflicts_with `paste`) - mcp/mod.rs: `cli_login_device(name)` — prints verification_uri_complete (§3.3.1) when present, falls back to `verification_uri` + `user_code`; watches `statuses()` until Disconnected (success) / NeedsAuth (terminal failure) or the provider-declared `expires_in` deadline elapses - runtime.rs: drop `#[allow(dead_code)]` on the Tick-55 device-flow chain (`start_device_login` is now called from the CLI surface)
- config.rs: add `oauth.redirect_uri: Option<String>` to `OAuthConfig`
for the paste-back branch on custom providers (built-ins pin their
callback in `ProviderSpec`; ignored by the device-code branch)
- oauth.rs: propagate `redirect_uri` through `ResolvedProvider::Custom`
- runtime.rs: in `resolve_paste_client`, accept Custom + client_id +
redirect_uri; emit precise per-missing-field error otherwise:
- device endpoint present → "use device flow" (preserved)
- missing client_id → "requires `oauth.client_id`"
- missing redirect_uri → "requires `oauth.redirect_uri` (must match
the redirect URL pre-registered with the provider)"
- Tests: rename `_rejects_custom_provider_for_now` →
`_rejects_custom_without_redirect_uri`; add
`_rejects_custom_without_client_id` + success-path
`_custom_with_client_id_and_redirect_uri_succeeds`
Closes the last Phase 2 paste-back functional gap. ADR §6.3 enumerates
optional custom fields but doesn't currently list `redirect_uri`;
follow-up ADR amendment lands separately.
Adds the field to the §6.3 config-shape list and a follow-up paragraph explaining why custom paste-back requires it (no local listener; must match provider OAuth app registration) and why the device-code / built-in branches ignore it. Follow-up to 84837ee (custom-provider paste-back support); Mira + Kirin dual sign-off requested this as a non-blocking docs amendment.
- auth.rs: convert free function `is_expired(&TokenStore) -> bool` into
`impl TokenStore { pub fn is_expired(&self) -> bool }`. Switch the skew
addition to `saturating_add` for parity with the rest of the OAuth
arithmetic (defends `u64::MAX` "never expires" sentinel + any
near-overflow clock value).
- auth.rs: fold `show_status`'s inline duplicate of the expiry formula
into the new method.
- mcp/runtime.rs: drop the free-function import + update the cached-token
DialPlan branch to `store.is_expired()`.
- Tests: rewrite the three existing `is_expired` cases to method form and
add `_sentinel_u64_max` covering the saturating_add behavior.
Mira Tick 50 non-blocking cleanup.
Drop stale `#[allow(dead_code)]` markers now that runtime/flow wire all items through prod paths. Reuse `reqwest::Client` across the device-poll loop so TLS / TCP handshakes are amortized across the dozens-to-hundreds of polls a 30-minute device-flow window can produce. Trim session-narrative parentheticals from doc comments. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reviewer-pass findings:
- Drop dead auth.rs wrappers (`load/save/remove_namespaced_token`) that
were never wired; runtime/tests use the `_at` siblings directly.
- Extract `build_token_store` helper to dedupe the `TokenStore`
construction + `u64::MAX` sentinel + refresh-rotation logic across
`finish_login`, `finalize_device_login`, and `try_refresh_oauth_token`.
- Preserve underlying error context in `complete_login` via
`.with_context()` instead of `.map_err(|_| anyhow!(...))`.
- Trim session-narrative parentheticals ("Tick N", "Mira's review")
from doc comments; keep the technical reason that future readers need.
- Refresh stale doc strings that still said "next slice" for behaviour
Phase 2 itself ships.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Concurrent connect() callers may each hit `try_refresh_oauth_token` with the same stale `refresh_token` (no per-server single-flight gate yet). The connect() race guard prevents state corruption; the worst case is N duplicate POSTs to the token endpoint. A Phase 3 follow-up will add per-server single-flight for providers that cascade-revoke on replayed refresh tokens. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Align ADR with implementation: the device-code polling loop persists tokens then sets status to Disconnected, so the next connect() reads the cached token via the oauth-aware dial path. Keeps the rmcp handshake out of the detached polling task. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Track the most-recent device-poll JoinHandle per server in `McpRuntimeManager.device_login_tasks`. On a fresh `start_device_login` the prior task is aborted before the new one is registered, so a retry after a transient failure can't leave two loops racing to finalize the same server (would risk double-persisting tokens / contradictory status transitions). Phase 2 follow-up openabdev#4 of 4. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add `refresh_locks: HashMap<server, Arc<tokio::Mutex>>` and acquire the per-server lock before the refresh POST. After taking the lock, re-read the on-disk token; if a prior waiter already refreshed, return the cached store without a second POST. Closes the cascade-revoke window on providers (Google, ...) that invalidate every issued refresh_token when the same one is replayed concurrently. Phase 2 follow-up openabdev#1 of 4. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the truncate+write+sync_all path with: write same-dir tmp (PID + atomic counter for uniqueness, mode 0o600 via create_new), fsync the tmp, rename(2) onto auth.json, fsync the parent directory. On crash, the file is either the prior version or the new version — never half-written. The old implementation truncated `auth.json` before writing, so a mid-write crash left a corrupt file that the next startup would fail to parse. On the ECS deployment that then re-restored a now-revoked refresh_token from S3 (ADR §6.1). Phase 2 follow-up openabdev#3 of 4. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
MCP is now an unconditional dep. Drops the `[features]` block in
Cargo.toml, promotes `rmcp` from optional to required, strips every
`#[cfg(feature = "mcp")]` / `#[cfg(not(feature = "mcp"))]` gate in
src/{acp,agent,auth,main}.rs, and drops the no-mcp combo from the CI
matrix (both .github/workflows/ci-openab-agent.yml and the local
mirror script).
Per ADR `docs/adr/openab-agent-mcp.md` §9 Phase 3 — the intermediate
`default = ["mcp"]` step is skipped in favor of going straight to
flag removal (decision recorded in tracking issue openabdev#966).
Refs openabdev#966
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds `mcp/breaker.rs` per ADR §5.9 and the design decisions resolved in openabdev#966: - Q1 (a) Fixed cooldown 3-state breaker (Closed / Open / HalfOpen) - Q2 (a) Single consecutive-failure counter per server - Q3 (c) Lazy / piggyback probe (next call after cooldown = probe) API surface: pub enum Verdict { Allow, AllowProbe, Reject { retry_in_secs } } pub struct ServerBreaker { ... } impl ServerBreaker { pub fn new() -> Self; pub fn check(&self, server: &str) -> Verdict; pub fn record_success(&self, server: &str); pub fn record_failure(&self, server: &str); } Constants: `FAIL_THRESHOLD = 3`, `COOLDOWN = 60s`. Matches Hermes `tools/mcp_tool.py` (lines 1868-1912 + 2480-2510). ADR §5.9 mentions "3 fails in 30s" but Hermes itself uses pure consecutive count with no time window — going Hermes-simple here (any success resets). `#![allow(dead_code)]` carries the API surface until the next slice wires it into `McpRuntimeManager` (per the runbook's stub-without- prod-caller convention). Unit tests cover: unknown-server pass-through, under-threshold pass, threshold-trip, success-resets-count, cooldown-elapsed-allows-probe, probe-failure-rearms-cooldown, probe-success-closes-breaker, per- server isolation, retry_in_secs floor (1s when cooldown almost done). Refs openabdev#966 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Connects the per-server `ServerBreaker` introduced in 26c7d87 to the `McpRuntimeManager` connect path and the `meta_tool::call_tool` dispatch path (ADR §5.9 + openabdev#966 design decisions). - `connect()`: check breaker after the already-connected fast path; `Reject` short-circuits with a `retry in {n}s` hint. NeedsAuth bounces (missing/expired token, refresh failure) do NOT touch the breaker — those are auth-level, not transport-level. - Dial outcome at the tail of `connect()` records success/failure on the breaker, matching the single-counter / transport-only model resolved in openabdev#966 Q2. - `meta_tool::call_tool` wraps `peer.call_tool()`: wire-level `Err` = transport failure → trips the breaker; wire-level `Ok` (regardless of `CallToolResult.is_error`) resets it. `isError: true` content is protocol-normal payload, not a transport fault. - New `record_tool_call_outcome(name, ok)` is the only public surface the meta_tool path uses; the breaker stays private inside the manager. Tests: 2 new integration tests in `mcp::runtime::tests` — breaker_opens_after_threshold_consecutive_connect_failures (dead-port pattern, hermetic) and breaker_does_not_count_oauth_needs_auth_bounces (NeedsAuth isolation). Existing 9 breaker unit tests still pass.
Closes the Phase 3 scope from openabdev#966 — interactive diagnostic that runs a live `connect()` against every configured server, with three checks per server (short-circuit on first ✗): 1. Config resolution — `${env:VAR}` placeholders resolve against the live process env. Missing vars are flagged with the offending name and a "set the var(s)" hint. 2. OAuth state (HTTP + `oauth:` only) — cached `TokenStore` in `auth.json`. Missing → `mcp login <name>` hint; expired → noted but not fatal (connect will attempt refresh). 3. Live connect — `manager.connect(name).await`. Errors surface verbatim, including the §5.9 circuit-breaker `retry in {n}s` hint when the breaker is open. Exits non-zero when any server fails so CI / scripts can wrap with `openab-agent mcp doctor || alert`. Reuses the existing `McpRuntimeManager` and `load_namespaced_token_at` primitives — no new runtime state, no new fields on `McpRuntimeManager`. Wired into the existing `mcp` subcommand dispatch in main.rs alongside `list` / `status` / `connect` / `login`.
…F6/F9/F10 + F2/F7 doc)
Code changes:
- F6: gate `load_runtime_or_warn()` on `OPENAB_AGENT_MCP={1,true,yes,on}`
env var. Compile-time feature flag is being removed in Phase 3 PR openabdev#969;
this adds the runtime half so MCP stays dormant unless explicitly opted
in, preventing accidental activation when mcp.json is present.
- F4: integration test in `mcp::runtime::tests` exercising two concurrent
`connect()` tasks against a guaranteed-failure server, asserts no
stuck-Connecting state and that follow-up dials still attempt fresh.
- F5: defensive guard in `Agent::execute_tool_call` for `mcp` tool calls
when manager is unavailable — returns actionable error instead of
silently falling through to fs-tool dispatch.
- F9: regression test for missing-env-var error path; asserts the offender
variable name + server name surface in the chained error message.
- F10: `truncate_context` drain bounds clarified — `3.min(len)` instead of
`(1 + 2).min(len)`, with comment explaining the preserved first message.
Doc changes (ADR §5.4.1 added):
- F2: rationale for `env_clear()` baseline + explicit `env:` allowlist —
blocks token leakage through ambient env, matches MCP SDK behavior.
- F7: rationale for single shared `McpRuntimeManager` per process —
connection pooling, deduplicated handshakes, lifecycle owned by Agent.
OpenAB PR ScreeningThis is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Screening reportposted screening comment: https://github.com//pull/969#issuecomment-4595186764 project action: moved `PVTI_lADOEFbZWM4BUUALzguZj-g` to `PR-Screening` on https://github.com/orgs/openabdev/projects/1IntentShip Phase 3 of the native FeatFeature work with some release-cleanup behavior. It adds a per-server MCP circuit breaker, wires transport-failure tracking into connect/tool-call paths, adds Who It ServesPrimary beneficiaries are agent runtime operators and deployers. Maintainers also benefit from a clearer default build and from diagnostics that reduce support ambiguity when MCP configuration, OAuth, or live connectivity fails. Rewritten PromptImplement MCP Phase 3 for Merge PitchThis should move forward after stack dependencies are reviewed because it turns MCP from happy-path integration into something operators can debug and recover. Risk is moderate-high due to stacked diff noise and unconditional MCP build impact. Main reviewer concern: breaker failure classification. Best-Practice ComparisonOpenClaw partially applies: isolated failures and retry/backoff patterns are relevant, but durable job persistence, delivery routing, and run logs are not introduced here. Hermes Agent directly applies. The PR mirrors its three-state breaker model: fixed cooldown, lazy half-open probe, one per-server counter, and no background polling. Good fit. Implementation Options
Comparison Table
RecommendationUse the balanced path. Merge/review #959 and #962 first, then require #969 to rebase so Phase 3 is reviewable on its own. Main gates: breaker classification, |
Discord Discussion URL: https://discord.com/channels/1490643539682529300/1510657283552706663
📚 PR stack — merge in this order:
97 commits appear in the diff because GitHub does not support cross-repo PRs with a non-
mainbase (the upstreamfeat/openab-agent-mcp/feat/openab-agent-mcp-oauthbranches only live on the fork). Once #959 + #962 merge, this branch rebases and the diff collapses to the Phase 3 commits below.Reviewers can take all three in parallel; merges must serialize.
Closes #966.
Summary
Phase 3 of the openab-agent native MCP client (ADR
docs/adr/openab-agent-mcp.md§5.9 + §8 + §9). Adds the circuit-breaker resilience layer, themcp doctordiagnostic CLI, and retires the--features mcpCargo flag.Phase 3 commits (review only these)
b3a310b--features mcpflag — drop[features]block, promotermcpto unconditional dep, strip all#[cfg(feature = "mcp")]gates, collapse CI matrix26c7d876772a51McpRuntimeManager::connect()+meta_tool::call_tool(transport-only counter; NeedsAuth bounces excluded;isError: trueexcluded) + 2 integration testsbf62732mcp doctorCLI — per-server env / OAuth / live-connect diagnostic with remediation hints; non-zero exit on any failure#966 design decisions
Resolved in #966 issue body (2026-06-01); summary:
consecutive_failuresper server (vs split connect / tool-call). Hermes uses one counter; failure modes are homogeneous after JSON-RPC /isErrorexclusion.Transport-level only: JSON-RPC
errorresponses + toolisError: truecontent do not count. NeedsAuth and OAuth refresh-failure paths do not count either (auth-level, not transport).Test plan
Rejectshort-circuits withretry in {n}shintcargo test130 passed,cargo clippy -- -D warningsclean,cargo fmt --checkcleanmcp doctorruns per-server diagnostic with actionable remediation hintsmcp doctorexits non-zero when any server fails (composes existing tested primitives —connect()/load_namespaced_token_at/resolved())Out of scope (deferred follow-ups)
mcp connect <name>escape hatch (Hermes equivalent:/mcp refresh)🤖 Generated with Claude Code