feat(openab-agent): MCP Phase 2 — OAuth 2.1 paste-back (stacks on #959)#962
feat(openab-agent): MCP Phase 2 — OAuth 2.1 paste-back (stacks on #959)#962brettchien wants to merge 94 commits into
Conversation
Adds ADR for in-core rmcp + progressive-disclosure meta-tool, deferred to symmetry with the Skills extension pattern from PR openabdev#955. Memory analysis rules out the sidecar alternative; per-session config refresh replaces file-watcher hot reload to drop ~150 LOC of race-condition hotspot. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ate guardrails - §6.4 flow selection: device-code preferred (matches existing CLI convention), paste-back universal fallback, browser laptop opt-in - §5.2 add login/complete_login meta-tool actions; login returns flow-tagged union - §6.1 spell out TokenStore persistence assumption + cold-start refresh - §6.3 device_authorization_endpoint extension point + RFC 8414 discovery - §3.8 stdio container-image caveat (interpreter required for Node/Python) - §5.7 max_concurrent_servers knob (default 10, see §7 for constrained tuning) - §7 Fargate 512MB/1GB OOM analysis + mitigations Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ntract - §6.4 RFC 8414 discovery disabled by default; opt-in requires oauth.discovery=true + oauth.discovery_allowlist (boot rejects otherwise). Rationale: awsvpc egress + SSRF surface in multi-tenant deployments. - §6.1 add RTR race warning: async persistence layers (Fargate S3 sync, eventually-consistent volumes) must flush new tokens to durable storage before Spot interruption, else cascade-revoke locks the user out. Contract: fsync(2) agent-side + mtime-event-driven sync deployment-side. - §6.3 expose oauth.discovery / oauth.discovery_allowlist on custom providers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 1 foundation slice per ADR §5.4 + §5.6:
- Add optional `rmcp 1.7` dep + `mcp` feature flag (default off)
- Wire `#[cfg(feature = "mcp")] mod mcp;` into main
- New `mcp/config.rs`: `McpConfig` / `ServerConfig` (Stdio | Http) /
`ToolFilter` / `OAuthConfig`, global+project layered load, project
precedence on name collision, `${env:VAR}` interpolation with
per-server error context
OAuth fields limited to `provider` + `scopes`; custom-provider
endpoints (§6.3) deferred to Phase 2 auth slice to avoid dead schema.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI fmt-check found 8 formatting deltas in the Phase 1 scaffold. No logic change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI previously only built the default feature set, leaving the Phase 1 MCP scaffold (mcp/config.rs, gated by --features mcp) without compile, clippy, or test coverage. Adds explicit --features mcp invocations and watches workflow file changes so this gap is closed for the rest of the rollout (ADR §9). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires up the Phase 1 config loader so clippy --features mcp can see it. McpConfig::load() + ServerConfig::resolved() + serde pretty-print are all reachable from main, clearing the dead-code denial of compile under -D warnings. Output groups successful servers (✓) and failures (✗ with reason), sorted by name for deterministic display. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
sort_by_key over (name, _) is the cleaner form; no behavior change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per Mira's security review on the Phase 1 thread: the previous
'mcp list' eagerly called ServerConfig::resolved(), substituting
\${env:GITHUB_TOKEN} etc. into the printed output. Three leak
paths matter — pasting CLI output into bug reports / chat, screen
sharing, and stdout log collection.
New behavior:
- Default: print raw config; \${env:VAR} placeholders kept verbatim.
Safe to paste publicly; reader still sees which env var feeds
each field.
- --resolve opts into substitution and prints a two-line warning
banner. Useful for diagnosing missing-env startup failures.
No CLI-shape break: 'mcp list' still works; --resolve is additive.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
println! puts the warning banner into stdout, which (a) gets swallowed by 'mcp list --resolve > dump.json' redirection so the user never sees the security notice, and (b) corrupts the JSON payload for downstream pipes like 'mcp list --resolve | jq'. Routing the banner through eprintln! keeps it visible regardless of redirection and keeps stdout pure JSON for piping. Standard Unix convention: data on stdout, diagnostics on stderr. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Introduces McpRuntimeManager owning one ServerHandle per configured server, each starting in ServerStatus::Disconnected per ADR §5.7 (lazy connect). Wires the manager via `mcp status` CLI so the types are exercised by clippy --features mcp; actual rmcp TokioChildProcess dial + Connected / Failed transitions land in the next slice to keep that risky bit isolated for bisecting.
`clippy --features mcp -- -D warnings` doesn't compile the test target, so `len()`'s only callers (the unit tests) didn't keep it alive. Tests already used `statuses().len()` in one place — switch the other two to match and drop the now-dead method. `is_empty()` stays because `cli_show_status` calls it.
…ct() ADR §5.7 lifecycle needs a `connect()` method that spawns a child process and awaits an rmcp handshake — both `Send` across `.await`. Plain `HashMap<String, ServerHandle>` is not `Sync`, and the background idle- eviction task will share the map with the foreground `mcp call` path, so the read-heavy / write-light access pattern wants `tokio::sync::RwLock`. This slice lands the lock migration only: - `handles: Arc<RwLock<HashMap<String, ServerHandle>>>` - `statuses()` and `is_empty()` become async; `cli_show_status` follows - `McpRuntimeManager` is now `Clone` (Arc bump) so the eviction task can hold its own handle - `connect(name)` transitions to `Connecting` and returns; the actual `rmcp::TokioChildProcess` dial + `Connected` / `Failed` transitions land in the next slice — keeping that bit isolated for bisecting 6 unit tests cover snapshot ordering, unknown-server error, transition, and clone-shares-state.
The chained `.statuses().await.into_iter().map(|(n, _)| n).collect()` fits on one line under rustfmt's default 100-char width — broke it preemptively last tick on the wrong side of the threshold.
`clippy --features mcp -D warnings` doesn't compile the test target, so `connect()`'s only callers (the unit tests) don't keep it alive. The real rmcp dial lands in the next slice and will be reachable from the meta-tool dispatch / `mcp call` CLI path; until then a targeted allow keeps clippy quiet without papering over the type-level dead_code on ServerStatus variants (which is already explicitly allowed).
connect() now spawns the configured stdio child via rmcp's TokioChildProcess, runs the JSON-RPC handshake (().serve(transport)), and parks the resulting RunningService on ServerHandle.client. Lock discipline uses a double-acquire: a brief write lock to mark Connecting, drop, run the dial without any lock held, then a second brief write lock to install the client or record Failed(msg). Holding the write lock across the spawn+handshake would block mcp status and the future idle-eviction scan for the entire dial latency. HTTP transport is rejected with a "phase 2" error so misconfigured entries surface cleanly without advancing state past Disconnected. Adds `openab-agent mcp connect <name>` as a smoke-test CLI for mcp.json entries; the RunningService is dropped on process exit. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…rver Per Mira's Tick 13 review: when two callers race on the same name, both pass the first-lock idempotency check (Disconnected), both spawn a child, both come back to acquire the install-lock. The second arrival must yield to the winner, otherwise it overwrites the installed client and silently drops the working RunningService. Adds a 3-line double-check at the second lock acquisition: if status is already Connected with a live client, return Ok(()) — `dial_result` falls out of scope here and RunningService::Drop kills the duplicate child. Cheaper than a Connecting-aware notify/condvar and adequate until the eviction-aware slice lands. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three rustfmt collapses that I missed under default 100-char width:
* import order: bare `rmcp::ServiceExt` sorts AFTER `rmcp::transport::*`
(rustfmt puts sub-module paths before bare items within a crate).
* `return Err(anyhow!(...))` for the HTTP phase-2 message fits on one
line (~88 chars).
* `let Self { command, args, env } = self;` fits on one line.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…patch
Adds the single `mcp` tool the LLM sees (ADR §5.2). This slice lands:
* `Action` enum tagged on `action`, matching the eight ADR actions,
deserialized straight from the LLM's tool-call payload.
* `dispatch(manager, action) -> Result<Value>` as the single entry
point — both `agent.rs::execute_tool` (next slice) and tests go
through this.
* `help` returns the static usage doc.
* `list_servers` returns `[{name, status, transport}]` via a new
`McpRuntimeManager::snapshot()` that clones out under a read lock.
* The four IO-bearing actions (`list_tools`, `describe_tool`, `call`,
`status`) return a `not yet implemented` error so the surface area
is visible to callers without pre-implementing the peer-borrow
path. `login` / `complete_login` land with the Phase 2 OAuth slice.
`#[allow(dead_code)]` on `Action` and `dispatch` because clippy
--features mcp doesn't compile the test target and agent.rs
wire-up lands next slice.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…LLM fallback hint Two of Mira's three Tick 15 findings (Arc-wrap for peer borrow lands with the actual list_tools/call slice — Arc without a consumer earns no keep). (1) `not_implemented_msg(action)` names the requested action, lists the actions that DO work (`help`, `list_servers`), and tells the LLM to fall back to native `read` / `write` / `edit` / `bash` rather than retrying. The old generic "not yet implemented" gave the model nothing to act on. (2) `ServerConfig::transport_label() -> &'static str` replaces the config clone inside `snapshot()`. The `Stdio` variant carries an `args: Vec<String>` and `env: HashMap<String, String>`; cloning that just to read the variant tag was wasted heap traffic on every `list_servers` call. `snapshot()` now yields `(String, ServerStatus, &'static str)`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
rustfmt's fn_call_width default (60) splits assert!() args when the inline args exceed it; the new "read, write, edit, bash" fallback assert tipped the third call from ~55 to ~62 chars. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Apply Mira's Tick 15 review (3) and ship the first IO-bearing
meta-tool action.
* `ServerHandle.client` is now `Option<Arc<RunningService<...>>>`.
`connect()` wraps the dial result in `Arc::new`.
* `McpRuntimeManager::arc_peer(name)` clones the Arc out under a
short read lock, drops the guard, returns the handle. Callers
`.await` on `peer.list_all_tools()` / `peer.call_tool()` with no
runtime lock held — kills the writer-starvation risk Mira flagged
and sidesteps `Future is not Send` from holding a guard across
`.await`.
* `Action::ListTools { server }` now wired: lazy `connect()` →
`arc_peer()` → `peer.list_all_tools()` → `[{name, description}]`.
`describe_tool` / `call` / `status` still stub.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Apply Mira's Tick 18 review (3) — lenient args parsing for the
`call` action so LLMs that send `arguments: null` (or omit the field
entirely) for no-arg tools don't bounce off a strict type check.
* `Action::Call { server, tool, arguments }` is wired through
`call_tool()`: validate arguments → lazy `connect()` → `arc_peer()`
→ `peer.call_tool(CallToolRequestParams::new(...).with_arguments(map))`
→ serialize `CallToolResult` to JSON.
* Argument coercion: `Value::Object(map) → map`, `Value::Null → {}`,
everything else (string, number, array, bool) is rejected with a
message that names the actual type so the LLM can correct itself.
* Tests cover both branches: non-object string args fail early at
validation; null args pass validation and fail later at connect.
* `describe_tool` and `status` still stub; supported-list message
updated to include `call`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The CallToolRequestParams chain fits in exactly 100 chars; rustfmt keeps it on one line. Burned a tick on a defensive break. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… 1 surface
describe_tool returns full input_schema for one tool (list+filter via
list_all_tools — MCP spec has no single-tool query). status reports
per-server {name, status, transport, last_error} with optional name
filter; surfaces Failed's embedded error as last_error.
Extract fetch_tools helper shared by list_tools + describe_tool —
this is the natural insertion point for the planned tools_cache.
Surfaces the `mcp` action enum to the LLM via a new ToolDef and routes calls through McpRuntimeManager when configured. AcpServer constructs the manager once at startup (warn-and-default on config parse failure) and clones the cheap Arc-backed handle into each session's Agent. Tools.rs stays stateless and feature-flag-free — the mcp routing arm lives in Agent::execute_tool_call where it has access to the manager. A shared MCP_TOOL_NAME const keeps mcp_tool_def() and the dispatch arm in sync.
The `mut` on `tools` and `base` is consumed only inside `#[cfg(feature = "mcp")]` blocks, so the no-feature build saw them as unused. Gate `allow(unused_mut)` behind `cfg_attr(not(feature = "mcp"), ...)` so both builds stay clean.
`test_session_new` and `test_session_new_missing_key` both mutate `ANTHROPIC_API_KEY`. cargo runs tests in parallel, so a `set` from one thread can be observed by the other before the first reads it. The pre-existing race was timing-tight enough to pass historically; adding `mcp::load_runtime_or_warn()` to `AcpServer::new()` widened the gap between set and read enough to flip the race deterministic. Guarding both tests with a shared `Mutex<()>` serializes them without adding a `serial_test` dep. Poisoned guard is recovered via `into_inner()` — we don't care if a prior assertion panicked.
Mira Tick 46 catch: when a provider omits both refresh_token and expires_in (Figma, Sentry, xAI), the prior `now + 0` fallback set the token "already expired", causing the next connect() to enter the refresh path, fail closed on the empty refresh_token, and bounce the user back to NeedsAuth seconds after a successful login. `expires_in: None` now uses the u64::MAX sentinel — `is_expired` returns false until the provider eventually 401s on use (at which point `mcp login` re-runs is the correct UX for non-refreshable tokens). Test now asserts the sentinel directly so the regression can't sneak back in. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`openab-agent mcp login <name> [--paste URL]` drives the whole paste-back loop: start_paste_login pins PKCE state, the CLI shows the authorize URL, blocks on stdin (or --paste for scripted use) for the redirect URL, then complete_login validates state + exchanges the code + persists the TokenStore. --paste exists for CI smoke tests and scripted setups; without it the CLI is interactive (read_redirect_from_stdin). State-mismatch or network failure leaves the pending entry intact so the user can retry with a fresh paste without re-issuing `mcp login`. Drops the #[allow(dead_code)] markers from start_paste_login, complete_login, save_namespaced_token_at, remove_pending_login and PasteLoginStart — all reachable from main via cli_login now, so dead_code propagates through the call graph (skill rule). Phase 2 user-facing surface is now complete: list / status / connect / login. mcp status surfacing of NeedsAuth + the device-code flow remain for the next slices. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… --paste Mira's PR openabdev#962 review (Q2) noted that --paste echoes the redirect URL into shell history and `ps`. PKCE makes either route theoretically safe, but `echo "<url>" | openab-agent mcp login <name>` leaves no trace — preferred for CI / scripts.
`auth::list_pending_logins_at` enumerates `mcp-pending:<name>` entries; `McpRuntimeManager::pending_logins` delegates via injected auth_path. `cli_show_status` cross-references against configured servers to flag both "login pending — finish via mcp login <name>" and orphaned entries with no matching mcp.json server. Introduces `PENDING_PREFIX` const so the read/write sides can't drift on the literal.
…_status one-pass Simplify pass on 9bca3de: - auth.rs: PENDING_PREFIX demoted to private; pending_key() promoted to pub fn as the single public constructor. runtime.rs uses it instead of formatting against the const, so the namespace literal lives in exactly one place. - mcp/mod.rs: cli_show_status folds the orphan calculation into the print loop via pending.remove(name) — the leftover set IS the orphans, dropping the second configured: HashSet<&str> pass. - mcp/mod.rs: drop the cli_login "Security note" docstring — the same rationale lives on the Login clap subcommand doc in main.rs, which is the user-facing surface. - runtime.rs: trim the spawn_blocking rationale on pending_logins — moved next to list_pending_logins_at where the sync read actually lives. Net -10 LOC across 3 files; ~/ops/openab-agent-ci.sh --quick green.
Phase 2 / ADR §6.5: HTTP+oauth servers now dial with their cached access_token when one is present and not within REFRESH_SKEW_SECONDS of expiry. Refresh-token rotation for the expired-but-recoverable case is the next slice — for now, expired tokens fall through to NeedsAuth alongside missing tokens, so `mcp login` stays the single user-actionable path. - auth::is_expired now pub(crate) — single source of truth for the expiry+skew check across codex (get_valid_token) and mcp (connect). - Dial::Http carries Option<String> auth; Dial::run uses StreamableHttpClientTransportConfig::with_uri(url).auth_header(token) when present, falls back to from_uri for anonymous HTTP. - connect() oauth branch tries load_namespaced_token_at first; the Ok+!expired arm builds an authenticated Dial, all other arms bounce to NeedsAuth. Tests: - connect_oauth_with_valid_cached_token_attempts_dial_not_needs_auth proves the bearer-injection path reaches handshake (fails at the hermetic 127.0.0.1:1 dead address, surfaces Failed not NeedsAuth). - connect_oauth_with_expired_token_bounces_to_needs_auth pins the expiry guard against future refresh-flow regressions.
Phase 2 / ADR §6.6: HTTP+oauth servers with an expired cached token now attempt RFC 6749 §6 refresh-grant before bouncing to NeedsAuth. On success the rotated TokenStore is fsynced and the dial proceeds with the new bearer; on failure the bounce is preserved so `mcp login` stays the user-actionable path. - post_token_refresh: sibling of post_token_exchange for grant_type= refresh_token (public client, no client_secret). - try_refresh_oauth_token: resolves client_id from current config (so a rotated catalog entry is picked up), POSTs the refresh, builds a new TokenStore preserving the previous refresh_token if the provider omits one (Google-style rotation), persists via save_namespaced_token_at. - DialPlan two-phase enum lets the inside-lock branch tag "needs refresh" and defer the async refresh to outside the write lock — slow refresh must not block concurrent `mcp status` reads. Tests: - connect_oauth_expired_no_refresh_token_bounces_to_needs_auth pins the empty-refresh short-circuit (no POST attempted). - connect_oauth_expired_with_refresh_token_failed_refresh_bounces_to_needs_auth proves the refresh path runs and that any failure surfaces as user-actionable NeedsAuth (current fixture's custom provider plus dead token endpoint both fail the refresh). Custom-provider refresh shares the same gap as custom-provider login — resolve_paste_client errors before the refresh POST runs. Will be unblocked by the same future slice that wires custom-provider login.
…ected winner Mira + Kirin caught a race: two concurrent connect() calls on the same expired-token server can interleave so one wins the refresh + dial while the other's refresh fails — and the loser was unconditionally overwriting `Connected` with `NeedsAuth`, giving the user a false "please log in" signal on a healthy session. Guard the status mutation with `!matches!(h.status, Connected)` so a concurrent winner's terminal state is preserved.
Mirrors finish_login (oauth.rs:271) so a hostile/buggy provider returning a huge expires_in can't trigger a debug-build overflow panic in the refresh-token rotation path.
…d refresh post_token_exchange and post_token_refresh diverged only in the form payload — same Client::builder() boilerplate, same status/body folding, same response shape. Extract post_token_form so each grant call site is just the RFC 6749 form tuple + a grant_label for the error context. Also drops a narrating comment in connect() that recapped the three match arms below it without adding any non-obvious context.
Adds the §3.1 device authorization request + §3.4 polling tick HTTP helpers — pure data plumbing, no runtime wiring yet. The polling-tick response classifier is split out as a pure fn (classify_device_poll) so the §3.5 error-code → flow-state mapping (authorization_pending, slow_down, access_denied, expired_token) is unit-testable without a mock HTTP server. DeviceAuthResponse defaults interval to 5s per RFC 8628 §3.5 when the provider omits it. verification_uri_complete (§3.3.1 extension) is Option since verification_uri + user_code is the always-present fallback. Both POST fns + the two types are #[allow(dead_code)] — next slice wires start_device_login() and the polling loop into McpRuntimeManager.
…loop) - `start_device_login(name)`: POST §3.1 + spawn detached §3.4 polling task - `DeviceLoginStart`: public user-facing bundle (user_code / verification_uri / verification_uri_complete / expires_in) - `resolve_device_client`: Custom-provider-only gate with explicit error for missing `device_authorization_endpoint` / `client_id` - `run_device_poll_loop`: deadline check + SlowDown back-off (+5s) + success persistence + NeedsAuth on terminal failure - `finalize_device_login` / `mark_device_login_failed`: status transitions (Disconnected on success keeps MCP handshake out of detached task; next `connect()` picks up the cached token via the oauth-aware DialPlan) - 4 gating tests: unknown server / stdio / missing device endpoint / HTTP reach #[allow(dead_code)] on the new surface until the next slice wires `mcp login --device` CLI. Quick gate green.
- main.rs: add `--device` flag to `McpAction::Login` (conflicts_with `paste`) - mcp/mod.rs: `cli_login_device(name)` — prints verification_uri_complete (§3.3.1) when present, falls back to `verification_uri` + `user_code`; watches `statuses()` until Disconnected (success) / NeedsAuth (terminal failure) or the provider-declared `expires_in` deadline elapses - runtime.rs: drop `#[allow(dead_code)]` on the Tick-55 device-flow chain (`start_device_login` is now called from the CLI surface)
- config.rs: add `oauth.redirect_uri: Option<String>` to `OAuthConfig`
for the paste-back branch on custom providers (built-ins pin their
callback in `ProviderSpec`; ignored by the device-code branch)
- oauth.rs: propagate `redirect_uri` through `ResolvedProvider::Custom`
- runtime.rs: in `resolve_paste_client`, accept Custom + client_id +
redirect_uri; emit precise per-missing-field error otherwise:
- device endpoint present → "use device flow" (preserved)
- missing client_id → "requires `oauth.client_id`"
- missing redirect_uri → "requires `oauth.redirect_uri` (must match
the redirect URL pre-registered with the provider)"
- Tests: rename `_rejects_custom_provider_for_now` →
`_rejects_custom_without_redirect_uri`; add
`_rejects_custom_without_client_id` + success-path
`_custom_with_client_id_and_redirect_uri_succeeds`
Closes the last Phase 2 paste-back functional gap. ADR §6.3 enumerates
optional custom fields but doesn't currently list `redirect_uri`;
follow-up ADR amendment lands separately.
Adds the field to the §6.3 config-shape list and a follow-up paragraph explaining why custom paste-back requires it (no local listener; must match provider OAuth app registration) and why the device-code / built-in branches ignore it. Follow-up to 84837ee (custom-provider paste-back support); Mira + Kirin dual sign-off requested this as a non-blocking docs amendment.
- auth.rs: convert free function `is_expired(&TokenStore) -> bool` into
`impl TokenStore { pub fn is_expired(&self) -> bool }`. Switch the skew
addition to `saturating_add` for parity with the rest of the OAuth
arithmetic (defends `u64::MAX` "never expires" sentinel + any
near-overflow clock value).
- auth.rs: fold `show_status`'s inline duplicate of the expiry formula
into the new method.
- mcp/runtime.rs: drop the free-function import + update the cached-token
DialPlan branch to `store.is_expired()`.
- Tests: rewrite the three existing `is_expired` cases to method form and
add `_sentinel_u64_max` covering the saturating_add behavior.
Mira Tick 50 non-blocking cleanup.
93da300 to
e5ea59f
Compare
Drop stale `#[allow(dead_code)]` markers now that runtime/flow wire all items through prod paths. Reuse `reqwest::Client` across the device-poll loop so TLS / TCP handshakes are amortized across the dozens-to-hundreds of polls a 30-minute device-flow window can produce. Trim session-narrative parentheticals from doc comments. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reviewer-pass findings:
- Drop dead auth.rs wrappers (`load/save/remove_namespaced_token`) that
were never wired; runtime/tests use the `_at` siblings directly.
- Extract `build_token_store` helper to dedupe the `TokenStore`
construction + `u64::MAX` sentinel + refresh-rotation logic across
`finish_login`, `finalize_device_login`, and `try_refresh_oauth_token`.
- Preserve underlying error context in `complete_login` via
`.with_context()` instead of `.map_err(|_| anyhow!(...))`.
- Trim session-narrative parentheticals ("Tick N", "Mira's review")
from doc comments; keep the technical reason that future readers need.
- Refresh stale doc strings that still said "next slice" for behaviour
Phase 2 itself ships.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
OpenAB PR ScreeningThis is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Screening reportposted screening and moved the project item to `PR-Screening`.GitHub comment: #962 (comment) IntentPR #962 is trying to add Phase 2 of the FeatFeature work. It introduces provider resolution, OAuth config/discovery checks, PKCE/state primitives, pending paste-login persistence, token-store namespacing/fsync behavior, and runtime Who It ServesPrimary beneficiaries: agent runtime operators and deployers who need OpenAB agents to connect to protected MCP servers. Secondary beneficiaries: maintainers who need the MCP auth model split into reviewable phases before exposing Rewritten PromptImplement the OAuth 2.1 paste-back foundation for Merge PitchThis should move forward as the next MCP capability slice because it builds the security-critical auth substrate before user-facing command wiring. Risk is medium-high: OAuth state, token persistence, and secret redaction are easy places to create durable security bugs. The likely reviewer concern is not whether OAuth is needed, but whether this draft can be reviewed while stacked on #959 and whether the implementation has enough tests around CSRF/state, PKCE, persistence, and redaction before it becomes the base for CLI login. Best-Practice ComparisonOpenClaw is relevant mainly as an operational reference: durable job/state persistence, explicit routing, retries, and run logs map to the need for durable auth state and clear auth/runtime transitions. This PR appears aligned on durable persisted state, but should make failure and recovery behavior explicit before CLI UX lands. Hermes Agent is also relevant: its fresh-session scheduled runs and atomic persisted state are a useful bar for Implementation OptionsConservative: wait for #959 to merge, rebase, then review only the OAuth commits with no new scope. Require focused unit tests for PKCE, state mismatch, provider resolution, pending-login serialization, and secret redaction before marking ready. Balanced: after rebase, merge the OAuth foundation behind the existing MCP feature gates, but split any large runtime behavior into a follow-up if it is not directly needed for paste-back state. Add a small integration-style test around protected HTTP server auth state without requiring a live vendor server. Ambitious: rebase and complete the full paste-back user path in one PR, including Comparison Table
RecommendationUse the balanced path after #959 lands: rebase, collapse the diff to OAuth-only commits, keep |
Concurrent connect() callers may each hit `try_refresh_oauth_token` with the same stale `refresh_token` (no per-server single-flight gate yet). The connect() race guard prevents state corruption; the worst case is N duplicate POSTs to the token endpoint. A Phase 3 follow-up will add per-server single-flight for providers that cascade-revoke on replayed refresh tokens. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Align ADR with implementation: the device-code polling loop persists tokens then sets status to Disconnected, so the next connect() reads the cached token via the oauth-aware dial path. Keeps the rmcp handshake out of the detached polling task. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Track the most-recent device-poll JoinHandle per server in `McpRuntimeManager.device_login_tasks`. On a fresh `start_device_login` the prior task is aborted before the new one is registered, so a retry after a transient failure can't leave two loops racing to finalize the same server (would risk double-persisting tokens / contradictory status transitions). Phase 2 follow-up openabdev#4 of 4. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add `refresh_locks: HashMap<server, Arc<tokio::Mutex>>` and acquire the per-server lock before the refresh POST. After taking the lock, re-read the on-disk token; if a prior waiter already refreshed, return the cached store without a second POST. Closes the cascade-revoke window on providers (Google, ...) that invalidate every issued refresh_token when the same one is replayed concurrently. Phase 2 follow-up openabdev#1 of 4. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the truncate+write+sync_all path with: write same-dir tmp (PID + atomic counter for uniqueness, mode 0o600 via create_new), fsync the tmp, rename(2) onto auth.json, fsync the parent directory. On crash, the file is either the prior version or the new version — never half-written. The old implementation truncated `auth.json` before writing, so a mid-write crash left a corrupt file that the next startup would fail to parse. On the ECS deployment that then re-restored a now-revoked refresh_token from S3 (ADR §6.1). Phase 2 follow-up openabdev#3 of 4. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Discord Discussion URL: https://discord.com/channels/1490643539682529300/1510657283552706663/1510690101120995369
📚 PR stack — merge in this order:
mcp doctor, stacks on feat(openab-agent): MCP Phase 2 — OAuth 2.1 paste-back (stacks on #959) #962)Once #959 merges, this branch rebases and the diff collapses to only the 18 OAuth-specific commits. Reviewers can take all three in parallel; merges must serialize.
Summary
Phase 2 of the openab-agent native MCP client (ADR
docs/adr/openab-agent-mcp.md§6). Adds OAuth 2.1 paste-back authentication for Streamable HTTP servers.OAuth commits
2a69bf9TokenStore+ fsync4c77595da86d5bOAuthConfigfields + discovery boot check65f1c4d04b84f9ResolvedProvideras enum{Builtin, Custom}240014495f33bcNeedsAuthruntime state for protected HTTP serversaf3a25dstart_paste_login+ builtin client_id resolver7dd718cauth.jsonintoTokenStore | PendingPasteLogin79b1e2eauth.jsonbc006e2parse_paste_callbackURL helper (state validation + CSRF)Plus fmt fixups:
06acde6,7ea3e59,f61b498,7f4ef17,32232c0,ca91f1d,111f43a.Scope
OAuthConfigmcp loginCLI subcommand: next slice (will wirestart_paste_login+parse_paste_callback+complete_login)mcp doctordeferred to Phase 3 (feat(openab-agent): MCP Phase 3 — circuit breaker + flag removal (stacks on #959 + #962) #969)Test plan
mcp login <server>happy path against vendor-hosted MCP (Linear, etc.) in headless containermcp listredactsclient_secret(regression on Phase 1 secret-redaction)🤖 Generated with Claude Code