feat(core,client): restore OAuth error subclasses and transient-error classification#2242
feat(core,client): restore OAuth error subclasses and transient-error classification#2242felixweinberger wants to merge 5 commits into
Conversation
🦋 Changeset detectedLatest commit: 3ea50ff The changes in this PR will be included in the next version bump. This PR includes changesets to release 2 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
@modelcontextprotocol/client
@modelcontextprotocol/codemod
@modelcontextprotocol/server
@modelcontextprotocol/server-legacy
@modelcontextprotocol/express
@modelcontextprotocol/fastify
@modelcontextprotocol/hono
@modelcontextprotocol/node
commit: |
| /** | ||
| * Returns `true` when an OAuth error indicates a transient condition that may succeed on | ||
| * retry: `server_error`, `temporarily_unavailable`, `too_many_requests` — or any error code | ||
| * not defined in {@linkcode OAuthErrorCode}. | ||
| * | ||
| * Treating unknown codes as transient matches SDK 1.x, which collapsed unrecognized error | ||
| * codes into `ServerError`. Authorization servers in the wild return non-standard codes | ||
| * (e.g. `invalid_refresh_token`) for conditions that are not permanent; treating them as | ||
| * permanent failures would stop retry/refresh loops that previously recovered. | ||
| */ | ||
| export function isTransientOAuthError(error: unknown): boolean { | ||
| if (!(error instanceof OAuthError)) { | ||
| return false; | ||
| } | ||
| return TRANSIENT_OAUTH_ERROR_CODES.has(error.code) || !KNOWN_OAUTH_ERROR_CODES.has(error.code); | ||
| } |
There was a problem hiding this comment.
🔴 The SDK's own auth() refresh-failure fallback in packages/client/src/client/auth.ts (lines 780–788) still keys on error.code === OAuthErrorCode.ServerError, so an unknown token-endpoint code like invalid_refresh_token — which this PR's own tests assert is now thrown as a plain OAuthError with the raw code and classified transient — is re-thrown instead of falling through to a fresh authorization, exactly the "refresh loop that recovered in v1 dead-ends in v2" break (Break 2) the PR description says it fixes. Apply isTransientOAuthError(error) (or equivalent unknown-code handling) at that internal call site, or explicitly scope the internal flow out of the PR's v1-parity claim.
Extended reasoning...
What the bug is. The PR adds isTransientOAuthError() to restore v1 retry semantics for unknown OAuth error codes, but the SDK's own authInternal() refresh-failure handling is not updated to use it. At packages/client/src/client/auth.ts:780-788 the catch reads:
} catch (error) {
// If this is a ServerError, or an unknown type, log it out and try to continue. Otherwise, escalate so we can fix things and retry.
if (!(error instanceof OAuthError) || error.code === OAuthErrorCode.ServerError) {
// Could not refresh OAuth tokens
} else {
// Refresh failed for another reason, re-throw
throw error;
}
}The comment's intent ("or an unknown type … try to continue") held in SDK 1.x because parseErrorResponse collapsed unrecognized token-endpoint codes into ServerError. In v2 the raw code is preserved, so an unknown code matches neither swallow condition and is re-thrown.
Code path that triggers it. auth() → authInternal() finds stored tokens with a refresh_token → calls refreshAuthorization() → token endpoint returns {"error": "invalid_refresh_token"} → executeTokenRequest throws via parseErrorResponse → with this PR, oauthErrorFromCode('invalid_refresh_token', …) produces a plain OAuthError carrying the raw code (the new authErrorCompat.test.ts test "preserves unknown token-endpoint error codes and classifies them transient" asserts exactly this) → the catch at line 782 sees error instanceof OAuthError === true and code !== 'server_error', so it re-throws → the outer auth() wrapper at auth.ts:553-563 only recovers for InvalidClient/UnauthorizedClient/InvalidGrant, so the unknown-code error propagates to the caller and the auth flow dead-ends instead of falling through to startAuthorization() → REDIRECT.
Step-by-step proof (v1 vs v2 with this PR). Suppose a client has stored tokens whose refresh token was rotated server-side, and the AS returns HTTP 400 {"error": "invalid_refresh_token"}:
- SDK 1.x:
parseErrorResponsedoesn't recognize the code → constructsServerError→ the equivalent catch (error instanceof ServerError) swallows it →authInternalcontinues tostartAuthorization→ returnsREDIRECT→ the user re-authorizes and the client recovers. - v2 with this PR:
oauthErrorFromCode('invalid_refresh_token', …)returns a plainOAuthErrorwithcode === 'invalid_refresh_token'→ line 782:error instanceof OAuthErroris true,code === OAuthErrorCode.ServerErroris false →elsebranch re-throws →auth()'s outer catch matches none ofinvalid_client/unauthorized_client/invalid_grant→ the error escapesauth()→handleOAuthUnauthorized/ the transport surfaces a failure instead of redirecting to re-authorization. - Meanwhile
isTransientOAuthError(error)would have returnedtruefor this exact error — the PR's tests assert it — but nothing in the SDK's own flow consults it.
Why existing code doesn't prevent it. The new helper, the subclass factory, and the docs (the "UNKNOWN-CODE TRAP" section in migration-SKILL.md) all target userland migrations. The PR even modifies other throw sites in this same auth.ts file (switching them to oauthErrorFromCode), but the one internal consumer where v1's collapse-to-ServerError semantics actually mattered keeps the strict code === ServerError check, so the restored retry semantics never apply to the SDK's own orchestration.
Impact. Any MCP client using the SDK's auth() orchestrator against an authorization server that returns non-standard refresh-failure codes (e.g. invalid_refresh_token, observed from real-world AS implementations per the PR description) will hard-fail the auth flow once the refresh token goes stale, instead of transparently falling back to a new authorization — the precise silent regression ("Break 2") this PR is motivated by.
How to fix. Replace the condition at line 782 with the new classification, e.g. if (!(error instanceof OAuthError) || isTransientOAuthError(error) || !KNOWN_OAUTH_ERROR_CODES.has(error.code)) — or more simply if (!(error instanceof OAuthError) || isTransientOAuthError(error)), which swallows server_error/temporarily_unavailable/too_many_requests/unknown codes and still escalates invalid_grant/invalid_client/unauthorized_client to the outer auth() recovery logic. Alternatively, if keeping the internal flow strict is intentional, the PR description and migration docs should narrow the claim that v1 retry semantics are restored, since they currently are restored only for consumers who adopt the helper themselves.
Note on scope. Lines 780–788 are pre-existing v2 code not touched by this diff, so this is not a regression introduced by the PR; it is flagged because the PR adds the exact classification helper for this scenario, modifies sibling throw sites in the same file, and explicitly claims to fix this break — leaving the SDK's own call site as the one place the fix doesn't reach.
| non-standard codes (e.g. `invalid_refresh_token`). Use `isTransientOAuthError(error)` instead — it returns true for `server_error`, `temporarily_unavailable`, `too_many_requests`, AND any code not in `OAuthErrorCode` (v1 parity). | ||
|
|
||
| The variadic `.tool()`, `.prompt()`, `.resource()` methods are removed. Use the `register*` methods with a config object. | ||
| ```typescript | ||
| // v1 | ||
| if (error instanceof ServerError) { | ||
| scheduleRetry(); | ||
| } | ||
|
|
||
| **IMPORTANT**: v2 requires schema objects implementing [Standard Schema](https://standardschema.dev/) — raw shapes like `{ name: z.string() }` are no longer supported. Wrap with `z.object()` (Zod v4), or use ArkType's `type({...})`, or Valibot. For raw JSON Schema, wrap with | ||
| `fromJsonSchema(schema)` from `@modelcontextprotocol/server` (validator defaults automatically; pass an explicit validator for custom configurations). Applies to `inputSchema`, `outputSchema`, and `argsSchema`. | ||
| // v2 — equivalent semantics | ||
| import { isTransientOAuthError } from '@modelcontextprotocol/client'; | ||
| if (isTransientOAuthError(error)) { | ||
| scheduleRetry(); | ||
| } | ||
| ``` |
There was a problem hiding this comment.
🔴 The OAuth-section rewrite in docs/migration-SKILL.md accidentally deleted unrelated migration content: the ## 6. McpServer API Changes heading, the variadic .tool()/.prompt()/.resource() removal note, the "IMPORTANT: v2 requires Standard Schema objects (wrap with z.object())" paragraph, and the "Unchanged APIs" paragraph. The file's numbering now jumps from section 5 to section 7, the ### Tools/### Prompts/### Resources subsections are orphaned, and step 4 of section 15 still references the now-missing "section 6" — please restore the deleted heading and paragraphs alongside the new OAuth content.
Extended reasoning...
What happened: The diff hunk in docs/migration-SKILL.md (lines ~178–230) replaces the old OAuth-error table with the new compatibility-focused prose. But the deleted span extends past the OAuth section boundary: it also removes (1) the **Unchanged APIs** (only import paths changed)… paragraph that closed the error-class section, (2) the ## 6. McpServer API Changes heading, (3) the sentence "The variadic .tool(), .prompt(), .resource() methods are removed. Use the register* methods with a config object.", and (4) the **IMPORTANT**: v2 requires schema objects implementing Standard Schema… paragraph (raw shapes no longer supported, wrap with z.object(), fromJsonSchema guidance). None of this content is re-added anywhere in the new text — the added lines are exclusively about OAuth errors.\n\nHow it manifests in the post-PR file: The top-level numbering jumps from ## 5. Removed / Renamed Type Aliases and Symbols (line 82) directly to ## 7. Headers API (line 294) — there is no section 6 anywhere in the file. The ### Tools, ### Prompts, ### Resources, ### Schema Migration Quick Reference, and ### Removed core exports subsections that previously belonged under "McpServer API Changes" now sit orphaned directly after the OAuth-error subsection, structurally nested under section 5. Grepping the post-PR file for McpServer API Changes, variadic, or Unchanged APIs returns nothing, confirming the content was deleted rather than moved.\n\nStep-by-step proof:\n1. Pre-PR, the file's structure was: ### OAuth error consolidation (table + rewrite example) → **Unchanged APIs** paragraph → ## 6. McpServer API Changes → variadic-removal sentence → IMPORTANT/Standard-Schema paragraph → ### Tools → … → ## 7. Headers API.\n2. The diff hunk @@ -178,52 +178,34 @@ removes everything from the old OAuth table through the IMPORTANT paragraph and inserts only the new OAuth compatibility prose, ending right before ### Tools.\n3. Post-PR, line 527 (## 15. Migration Steps, step 4) still reads "Replace .tool() / .prompt() / .resource() calls with registerTool / registerPrompt / registerResource per section 6" — a dangling reference to a section that no longer exists.\n4. The PR description scopes the change to OAuth error compatibility; nothing in it claims to restructure the McpServer API section, so this is collateral damage from the rewrite, not an intentional doc change.\n\nWhy it matters: This file is the LLM-targeted migration skill (migrate-v1-to-v2). The deleted variadic-method-removal statement and the Standard-Schema/z.object() requirement are core migration directives that an LLM consuming the skill relies on; some of that guidance partially survives in step 5 of section 15 and in the Schema Migration Quick Reference table, but the section heading, the variadic-removal statement, and the Unchanged-APIs paragraph are simply gone, and the broken numbering/orphaned subsections degrade the skill's structure.\n\nHow to fix: Re-insert, between the new OAuth code example and the ### Tools subsection, the previously existing content: the **Unchanged APIs** paragraph, the ## 6. McpServer API Changes heading, the variadic-removal sentence, and the IMPORTANT Standard-Schema paragraph. That restores the section numbering (5 → 6 → 7), re-parents the ### Tools/### Prompts/### Resources subsections, and makes the "per section 6" reference in section 15 valid again.
| /** | ||
| * Creates an {@linkcode OAuthError} from an error code, returning the specific subclass for | ||
| * known codes so that `instanceof` checks (e.g. `error instanceof InvalidGrantError`) work. | ||
| * | ||
| * Unknown / non-standard codes produce a plain {@linkcode OAuthError} that preserves the | ||
| * raw code. | ||
| */ | ||
| export function oauthErrorFromCode(code: OAuthErrorCode | string, message: string, errorUri?: string): OAuthError { | ||
| const ErrorClass = OAUTH_ERRORS[code]; | ||
| if (ErrorClass) { | ||
| return new ErrorClass(message, errorUri); | ||
| } | ||
| return new OAuthError(code, message, errorUri); | ||
| } |
There was a problem hiding this comment.
🟡 OAUTH_ERRORS is a plain object literal, so the OAUTH_ERRORS[code] lookup in oauthErrorFromCode walks the prototype chain — and code comes from untrusted authorization-server responses via OAuthError.fromResponse / parseErrorResponse. A server returning {"error": "constructor"} makes the function return new Object(message) (a String wrapper that is neither an OAuthError nor an Error), while codes like "toString" or "proto" make new ErrorClass(...) throw. Guard the lookup with Object.hasOwn(OAUTH_ERRORS, code), or build the map with a null prototype / a Map.
Extended reasoning...
What the bug is. oauthErrorFromCode (packages/core/src/auth/errors.ts:343-349) does const ErrorClass = OAUTH_ERRORS[code]; if (ErrorClass) return new ErrorClass(message, errorUri);. OAUTH_ERRORS is declared as a plain object literal, so the bracket lookup is not restricted to the 17 keys defined in it — it also resolves inherited properties from Object.prototype (constructor, toString, hasOwnProperty, valueOf, __proto__, …). For those keys ErrorClass is truthy but is not one of the OAuth error subclasses.
The code path that triggers it. The code value is fully server-controlled: OAuthErrorResponseSchema only requires error: z.string(), and parseErrorResponse → OAuthError.fromResponse → oauthErrorFromCode passes response.error straight through. So any authorization-server error body reaches this lookup unfiltered. executeTokenRequest calls parseErrorResponse both on !response.ok and on the HTTP-200-with-error fallback, and registerClient does the same.
Step-by-step proof (verified empirically in Node):
- AS returns
400with body{"error": "constructor", "error_description": "boom"}. parseErrorResponseparses it and callsOAuthError.fromResponse({ error: 'constructor', ... }).oauthErrorFromCode('constructor', 'boom')evaluatesOAUTH_ERRORS['constructor'], which resolves via the prototype chain to the globalObjectconstructor — truthy.new Object('boom')returns[String: 'boom']— a String wrapper object.instanceof OAuthErroris false;instanceof Erroris false.executeTokenRequestthenthrows this non-Error value to the consumer.instanceof OAuthErrorchecks, theauth()invalid_client/invalid_grant recovery branches, andisTransientOAuthErrorall silently stop matching, and the declaredOAuthErrorreturn type of bothoauthErrorFromCodeandparseErrorResponseis violated.
For codes like 'toString' or '__proto__', the resolved value is not constructible, so new ErrorClass(...) throws TypeError: ... is not a constructor. Inside parseErrorResponse that TypeError happens to be masked by the surrounding try/catch into the generic ServerError fallback (wrong message, original error_description lost), but direct callers of OAuthError.fromResponse / oauthErrorFromCode get the raw TypeError.
Why nothing prevents it today. The schema validation accepts any string, and there is no hasOwnProperty/Object.hasOwn guard before constructing. This is also a (small) regression relative to the pre-PR code path: previously fromResponse always did new OAuthError(response.error, ...), which produced a proper OAuthError regardless of the code string.
Impact. Limited but real: it requires a pathological or hostile authorization server emitting an error code that happens to collide with an Object.prototype property name, and the worst case is a mis-typed thrown value / degraded error classification — no prototype pollution and no code execution (the lookup is read-only). That keeps this at nit/defensive-hardening level rather than blocking, but since it sits on an untrusted-input path and the SDK now constructs from this map for every server error, it's worth closing while the PR is fresh.
How to fix — any one of these one-liners:
- Guard the lookup:
const ErrorClass = Object.hasOwn(OAUTH_ERRORS, code) ? OAUTH_ERRORS[code] : undefined; - Give the map a null prototype:
Object.assign(Object.create(null), { ... }) - Use a
Map<string, OAuthErrorSubclass>and.get(code).
The v1->v2 consolidation of the per-code OAuth error subclasses into a single OAuthError class silently breaks two common consumer patterns: - instanceof classification (e.g. catching InvalidGrantError to drop stored tokens) stops matching with no compile error in loosely typed code paths. - retry classification keyed on instanceof ServerError loses unknown error codes: v1 collapsed unrecognized codes into ServerError (hence retryable), while v2 preserves the raw code, so a mechanical rewrite to code === OAuthErrorCode.ServerError silently drops retries for non-standard codes like invalid_refresh_token. Re-add the subclasses as deprecated wrappers, construct them from oauthErrorFromCode() so SDK-produced errors satisfy instanceof checks, restore the deprecated OAUTH_ERRORS map and an errorCode alias, and add isTransientOAuthError() encoding the v1 retry semantics including the unknown-code case. The OAuth InvalidRequestError could not keep its name (taken by the JSON-RPC protocol type) and returns as OAuthInvalidRequestError.
parseErrorResponse and the client-metadata validation throw sites now produce the specific deprecated subclass for known error codes, so instanceof checks written against SDK 1.x keep working for errors thrown by the SDK itself. Unparsable error bodies fall back to a ServerError instance, matching 1.x.
Exercise refreshAuthorization with an injected fetchFn returning error responses, asserting consumer-visible error classes and transient classification through the full request path rather than only via parseErrorResponse directly.
…d OAUTH_ERRORS lookup; restore deleted McpServer section in migration-SKILL.md
e3c3afd to
3ea50ff
Compare
Restore OAuth error backwards compatibility: deprecated v1 subclasses,
instanceof-compatible construction, and anisTransientOAuthError()helper that preserves v1 retry semantics for unknown error codes.Motivation and Context
This fixes two silent backwards-compatibility breaks that v1→v2 migrations hit in OAuth error handling. Both were observed while migrating a large production MCP host application (client + server + custom transports) from SDK v1 to v2.
Break 1 —
instanceofclassification stops matching. v1 exposed a subclass per OAuth error code, and client code classified errors withinstanceof:In v2 today these classes don't exist; every check silently stops matching (no compile error in loosely typed paths, and
@modelcontextprotocol/server-legacy/authclasses are different classes, so importing them doesn't help against errors thrown by the v2 client). Token-refresh invalidation and retry classification degrade silently.Break 2 — unknown error codes silently lose retry behavior. v1's
parseErrorResponsecollapsed unrecognized error codes intoServerError— making them transient/retryable. v2 preserves the raw code. That's strictly better data, but it means the "obvious" migration rewrite:silently drops retries for non-standard codes that real authorization servers return (e.g.
invalid_refresh_token). A refresh loop that recovered in v1 dead-ends in v2.This PR makes the v1 patterns keep working while keeping v2's better model (raw codes preserved, single base class,
error.codeas the primary API).What changed
CustomOAuthErrorare exported again as thin@deprecatedwrappers aroundOAuthError(constructor signature unchanged from v1:(message, errorUri?)).OAuthError.fromResponse()/parseErrorResponse()and the client throw sites construct the matching subclass via a newoauthErrorFromCode(code, message, errorUri?), soinstanceofworks for SDK-produced errors. Unknown codes still produce a plainOAuthErrorwith the raw code preserved.isTransientOAuthError(error): true forserver_error,temporarily_unavailable,too_many_requests, and any code not inOAuthErrorCode— documented as the v1-parity replacement forinstanceof ServerErrorretry checks.errorCodegetter alias forcode(v1 property name).OAUTH_ERRORScode→class map restored.InvalidRequestErrorcould not keep its v1 name — v2 already exports a JSON-RPCInvalidRequestErrorinterface from the protocol types — so it returns asOAuthInvalidRequestError(documented in both migration guides).error.namestays'OAuthError'for all subclasses (matching the v2 base class), so v2-styleerror.namechecks are unaffected; this differs from v1, which used the subclass name, and is called out in the docs.How Has This Been Tested?
fromResponsesubclass behavior,errorCodealias,OAUTH_ERRORScompleteness against the enum,isTransientOAuthErrortruth table including the unknown-code case, andparseErrorResponseend-to-end (JSON error body → subclass; unparsable body →ServerError, matching v1).refreshAuthorizationdriven with an injectedfetchFnreturning real errorResponses, asserting the consumer-visible thrown class and transient classification through the full request path (known code →InvalidGrantError; unknowninvalid_refresh_token→ baseOAuthError, transient; 503temporarily_unavailable→TemporarilyUnavailableError).fromResponsewiring, or the unparsable-body fallback each makes at least one test fail.auth()-level fallback tests: a refresh failure with an unknown token-endpoint code falls through to a fresh authorization flow (returnsREDIRECT), while a known non-transient code (invalid_scope) still escalates to the caller — pinning the v1-parity semantics inside the SDK's own flow, not just in the helper.oauthErrorFromCodeis guarded against prototype-chain lookups (a server returningconstructor,__proto__, ortoStringas the error code gets a plainOAuthErrorwith the raw code preserved), with tests for each.core(570) andclient(375) test suites pass; typecheck and lint clean.Breaking Changes
None. All additions are backwards compatible; v2 code written against
OAuthError/error.codeis unaffected.Types of changes
Checklist
Additional context
isTransientOAuthErrordeliberately classifies unknown codes as transient. That mirrors v1's collapse-to-ServerErrorbehavior and errs toward retrying; consumers who want strict RFC-only transience can check the three codes directly.server-legacy/autherror classes remain distinct types (frozen v1 copies for the legacy AS router); the migration guide now warns against mixing them with@modelcontextprotocol/clientinstanceof checks.