Skip to content

fix: reclassify miscategorized network errors as SYSTEM_ERROR for retry/failover#911

Closed
KevinShiCN wants to merge 1 commit intoding113:devfrom
KevinShiCN:fix/socket-error-type-guard
Closed

fix: reclassify miscategorized network errors as SYSTEM_ERROR for retry/failover#911
KevinShiCN wants to merge 1 commit intoding113:devfrom
KevinShiCN:fix/socket-error-type-guard

Conversation

@KevinShiCN
Copy link

@KevinShiCN KevinShiCN commented Mar 13, 2026

Problem

When an upstream provider connection drops (e.g. SocketError from Node.js undici), the error can be miscategorized as NON_RETRYABLE_CLIENT_ERROR by ErrorCategorizer.

The dev branch already added instanceof ProxyError guards to prevent the TypeError: getDetailedErrorMessage is not a function crash. However, non-ProxyError network errors (SocketError, ECONNRESET, etc.) in the else branch still get thrown immediately without retry or failover, causing:

  1. Streaming response truncation mid-output (terrible UX)
  2. No failover to the next provider
  3. User sees abruptly cut-off output with no error message

Fix

In the else branch of the NON_RETRYABLE_CLIENT_ERROR handler, add network error detection before throwing:

  • SocketError, ECONNRESET, ECONNREFUSED, ETIMEDOUT, ENOTFOUND, and transport message signatures (fetch failed, other side closed) are detected
  • These are reclassified as SYSTEM_ERROR via continue, entering the retry/failover path
  • Only genuine non-network client errors continue to throw immediately

This is consistent with how the existing SYSTEM_ERROR handler works — retry current provider once, then failover to next provider.

Changes

  • src/app/v1/_lib/proxy/forwarder.ts: Add isNetworkError check in the else (non-ProxyError) branch of NON_RETRYABLE_CLIENT_ERROR handler (+30 lines)

Testing

  • The fix is minimal and defensive — existing ProxyError handling is completely unchanged
  • Network errors now correctly trigger provider failover instead of truncating output
  • Non-network plain errors still throw immediately as before

Fixes #909

Greptile Summary

This PR adds a defensive network-error reclassification guard inside the NON_RETRYABLE_CLIENT_ERROR handler's else branch (non-ProxyError path) in ProxyForwarder.send(). When a SocketError or other transport-level error escapes isTransportError() classification (e.g. because a user-configured error rule matched its message content before the transport check could short-circuit it), the error previously caused immediate throw-and-abandon behaviour; the fix correctly reclassifies it as SYSTEM_ERROR, performs all required bookkeeping, and routes it into the retry/failover path.

  • The isNetworkError detection set (SocketError name, TRANSPORT_CODES, fetch failed / other side closed message signatures) correctly mirrors the existing isTransportError() implementation in errors.ts, including error.cause?.code wrapping, EAI_AGAIN, and all undici-specific codes.
  • All bookkeeping that the native SYSTEM_ERROR handler performs is explicitly replicated: session.addProviderToChain(reason: "system_error"), recordEndpointFailure, currentEndpointIndex++, failedProviderIds.push, conditional recordFailure guarded by ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS.
  • The continue/break placement correctly bypasses the throw lastError at the end of the NON_RETRYABLE_CLIENT_ERROR block.
  • Minor: TRANSPORT_CODES is a new Set(...) literal allocated inside the catch block on every invocation rather than a module-level or shared constant, creating a silent drift risk if errors.ts gains new transport codes without updating this copy.

Confidence Score: 4/5

  • Safe to merge — the fix is correct and well-structured with a single minor maintainability concern.
  • The retry/failover semantics are correctly implemented, all required bookkeeping calls are present, and the detection logic faithfully mirrors the existing isTransportError() function. The only notable issue is a duplicated inline TRANSPORT_CODES Set that could drift from errors.ts over time, which is a maintainability concern rather than a runtime bug.
  • src/app/v1/_lib/proxy/forwarder.ts — the inline TRANSPORT_CODES constant should ideally be unified with the one in errors.ts.

Important Files Changed

Filename Overview
src/app/v1/_lib/proxy/forwarder.ts Adds an isNetworkError guard inside the NON_RETRYABLE_CLIENT_ERROR else-branch that correctly replicates the isTransportError() detection logic from errors.ts and performs all required bookkeeping (session.addProviderToChain, recordEndpointFailure, failedProviderIds.push) before issuing continue/break. The retry/failover semantics are correct. Minor: TRANSPORT_CODES is a duplicated Set literal allocated inline rather than referencing the canonical constant from errors.ts.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Inner retry loop\nattemptCount++] --> B[HTTP request]
    B -->|success| C[Return response]
    B -->|error| D[categorizeErrorAsync]

    D -->|CLIENT_ABORT| E[throw immediately]
    D -->|SYSTEM_ERROR| F[SYSTEM_ERROR handler\nrecordChain + recordEndpointFailure]
    D -->|NON_RETRYABLE_CLIENT_ERROR| G{instanceof ProxyError?}
    D -->|RESOURCE_NOT_FOUND| H[retry / failover]
    D -->|PROVIDER_ERROR| I[circuit breaker + failover]

    G -->|yes| J[record chain\nthrow immediately]
    G -->|no| K{isNetworkError?\nSocketError / ECONNRESET\nECONNREFUSED / etc.}

    K -->|yes 🆕| L[record chain\nrecordEndpointFailure]
    K -->|no| M[log + record chain\nthrow immediately]

    L --> N{attemptCount <\nmaxAttempts?}
    N -->|yes| O[currentEndpointIndex++\ncontinue → retry]
    N -->|no| P[failedProviderIds.push\nrecordFailure if enabled\nbreak → provider failover]

    F --> N

    style K fill:#d4edda,stroke:#28a745
    style L fill:#d4edda,stroke:#28a745
    style N fill:#d4edda,stroke:#28a745
    style O fill:#d4edda,stroke:#28a745
    style P fill:#d4edda,stroke:#28a745
Loading

Comments Outside Diff (1)

  1. src/app/v1/_lib/proxy/forwarder.ts, line 1566 (link)

    Incomplete fix — two unsafe as ProxyError casts remain

    The PR description states:

    "Add instanceof ProxyError guards at all three unsafe cast sites in forwarder.ts"

    However, the diff only added a guard to the NON_RETRYABLE_CLIENT_ERROR handler. This line and the analogous cast at line 1672 still contain unguarded as ProxyError casts:

    • Line 1566 (RESOURCE_NOT_FOUND handler): const proxyError = lastError as ProxyError; — if a non-ProxyError is categorized here, proxyError.upstreamError?.statusCodeInferred is safe due to optional chaining, but proxyError.message would return the plain error message rather than the ProxyError-formatted one, silently producing incorrect telemetry in session.addProviderToChain.

    • Line 1672 (PROVIDER_ERROR handler): const proxyError = lastError as ProxyError; followed immediately by const statusCode = proxyError.statusCode;statusCode will be undefined for any non-ProxyError. The statusCode === 524 guard at line 1678 evaluates undefined === 524 → false (safe), but statusCode: statusCode stored in the chain audit at line 1752 will be undefined, corrupting the audit trail. Additionally proxyError.upstreamError?.body/parsed will be undefined instead of actual upstream body data.

    Both sites need the same instanceof ProxyError guard that was applied to the NON_RETRYABLE_CLIENT_ERROR handler.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: src/app/v1/_lib/proxy/forwarder.ts
    Line: 1566
    
    Comment:
    **Incomplete fix — two unsafe `as ProxyError` casts remain**
    
    The PR description states:
    > "Add `instanceof ProxyError` guards at all three unsafe cast sites in `forwarder.ts`"
    
    However, the diff only added a guard to the `NON_RETRYABLE_CLIENT_ERROR` handler. This line and the analogous cast at line 1672 still contain unguarded `as ProxyError` casts:
    
    - **Line 1566** (`RESOURCE_NOT_FOUND` handler): `const proxyError = lastError as ProxyError;` — if a non-`ProxyError` is categorized here, `proxyError.upstreamError?.statusCodeInferred` is safe due to optional chaining, but `proxyError.message` would return the plain error message rather than the ProxyError-formatted one, silently producing incorrect telemetry in `session.addProviderToChain`.
    
    - **Line 1672** (`PROVIDER_ERROR` handler): `const proxyError = lastError as ProxyError;` followed immediately by `const statusCode = proxyError.statusCode;``statusCode` will be `undefined` for any non-`ProxyError`. The `statusCode === 524` guard at line 1678 evaluates `undefined === 524 → false` (safe), but `statusCode: statusCode` stored in the chain audit at line 1752 will be `undefined`, corrupting the audit trail. Additionally `proxyError.upstreamError?.body/parsed` will be `undefined` instead of actual upstream body data.
    
    Both sites need the same `instanceof ProxyError` guard that was applied to the `NON_RETRYABLE_CLIENT_ERROR` handler.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 1419-1424

Comment:
**`TRANSPORT_CODES` Set allocated on every catch**

`TRANSPORT_CODES` is a `new Set(...)` literal inside the catch block, meaning it is reconstructed on every error that reaches this path. Since the same set is already defined as a module-level implementation detail inside `isTransportError()` in `errors.ts`, these two definitions are now diverged copies. Consider either:

1. Exporting a shared constant (or re-exporting the existing `isTransportError` function) from `errors.ts` and calling it here, or
2. At minimum, hoisting `TRANSPORT_CODES` to module scope in `forwarder.ts`.

This is a minor performance concern in an error path, but more importantly, keeping two separate copies risks them drifting if `errors.ts` adds new codes in the future.

```suggestion
              const TRANSPORT_CODES = NETWORK_ERROR_TRANSPORT_CODES; // import module-level constant
```
or simply call the exported helper directly:
```ts
// In errors.ts — export the existing private function:
export { isTransportError };

// In forwarder.ts — replace the inline block with:
const isNetworkError = isTransportError(lastError);
```

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 01467ea

@coderabbitai
Copy link

coderabbitai bot commented Mar 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

在 ProxyForwarder 的非 ProxyError 分支中新增对网络/传输错误(如 SocketError、ECONNRESET、ECONNREFUSED、ETIMEDOUT、ENOTFOUND、"fetch failed"、"other side closed" 等)的检测;识别后将 errorCategory 重新分类为 SYSTEM_ERROR、记录警告并进入 SYSTEM_ERROR 的重试/切换逻辑,其它客户端错误保持原行为。

Changes

Cohort / File(s) Summary
Proxy forwarder error handling
src/app/v1/_lib/proxy/forwarder.ts
在捕获到非 ProxyError 的错误分支中加入传输/网络错误识别(通过 err.code/err.cause.codeTRANSPORT_CODESerr.nameerr.message 检查);将匹配到的错误归类为 SYSTEM_ERROR、记录详细审计/上下文、可能推进 currentEndpointIndex 并在端点耗尽时按配置记录/触发熔断器或切换 provider;保留对真正客户端错误的既有日志与抛出行为。

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 inconclusive)

Check name Status Explanation Resolution
Linked Issues check ❓ Inconclusive 变更部分满足#909的目标,但存在实现缺陷:isNetworkError检测不完整(缺少cause?.code、EAI_AGAIN和undici特定代码),可能无法捕获所有网络错误场景。 补充isNetworkError检测逻辑以匹配完整的isTransportError实现;导出并复用errors.ts中的isTransportError函数或对齐所有信号检测。
Out of Scope Changes check ❓ Inconclusive 发现两个范围外遗留问题:RESOURCE_NOT_FOUND和PROVIDER_ERROR处理器中的不安全ProxyError强制转换(行1566和1672)未应用instanceof保护,与PR目标的完整性不符。 在RESOURCE_NOT_FOUND和PROVIDER_ERROR处理器中应用相同的instanceof ProxyError保护,并在system errorDetails中补充errorStack字段保持审计一致性。
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed 标题清晰准确地描述了主要变更:重新分类网络错误为SYSTEM_ERROR以支持重试/故障转移。
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check ✅ Passed PR描述清晰地说明了问题(网络错误被误分类导致流截断)、修复方案(在NON_RETRYABLE_CLIENT_ERROR处理器中添加网络错误检测)和相关变更。

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions bot added size/XS Extra Small PR (< 50 lines) bug Something isn't working area:Error Rule area:provider labels Mar 13, 2026
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical flaw in the error handling mechanism where non-ProxyError types, such as SocketError from dropped connections, were incorrectly categorized and subsequently cast to ProxyError. This led to TypeError exceptions, crashing the error handling flow, truncating streaming responses, and preventing proper failover. The changes introduce defensive instanceof ProxyError checks at key error handling sites, reclassifying non-ProxyError instances as SYSTEM_ERROR to ensure they correctly enter the retry and failover path, thereby enhancing the system's stability and user experience.

Highlights

  • Error Handling Robustness: Implemented instanceof ProxyError guards in forwarder.ts to prevent TypeError when non-ProxyError types (like SocketError) are incorrectly cast, ensuring more robust error handling.
  • Error Reclassification: Reclassified miscategorized non-ProxyError errors as SYSTEM_ERROR to correctly trigger the retry and failover mechanisms, improving system resilience and preventing crashes.
  • Improved User Experience: Prevented crashes in the error handling path and truncation of streaming responses, leading to a more stable and reliable user experience during upstream connection issues.
Changelog
  • src/app/v1/_lib/proxy/forwarder.ts
    • Added an instanceof ProxyError guard within the NON_RETRYABLE_CLIENT_ERROR handling block to reclassify non-ProxyError instances as SYSTEM_ERROR.
    • Introduced an instanceof ProxyError guard in the RESOURCE_NOT_FOUND handling block to ensure proper error type handling.
    • Applied an instanceof ProxyError guard to the default ProxyError handling section for consistent error reclassification.
Activity
  • No specific activity has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly adds instanceof ProxyError guards to prevent a TypeError when a non-ProxyError is miscategorized, which is a solid defensive fix. My review includes suggestions to improve logging consistency across the new error handling blocks for better diagnostics. I've also pointed out a medium-severity concern with the current retry implementation using continue, as it bypasses some important logic in the main SYSTEM_ERROR handler. Addressing this in the future would make the retry mechanism more robust.

Comment on lines +1367 to +1368
errorCategory = ErrorCategory.SYSTEM_ERROR;
continue; // Re-enter the category handling loop as SYSTEM_ERROR
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This correctly handles the case where a non-ProxyError is miscategorized. However, using continue here triggers a retry but bypasses the main SYSTEM_ERROR handler. This means important logic like logging to the provider chain and advancing to the next endpoint (currentEndpointIndex++) is skipped. This could lead to retrying on the same faulty endpoint and reduced observability. A future refactor to allow fall-through to the SYSTEM_ERROR handler would make the retry mechanism more robust.

Comment on lines +1528 to +1532
logger.warn("ProxyForwarder: Non-ProxyError in RESOURCE_NOT_FOUND, reclassifying as SYSTEM_ERROR", {
providerId: currentProvider.id,
errorType: lastError?.constructor?.name,
errorMessage: lastError?.message,
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For consistency with the other instanceof checks and for better diagnostics, consider enhancing this log message to include providerName and originalCategory.

Suggested change
logger.warn("ProxyForwarder: Non-ProxyError in RESOURCE_NOT_FOUND, reclassifying as SYSTEM_ERROR", {
providerId: currentProvider.id,
errorType: lastError?.constructor?.name,
errorMessage: lastError?.message,
});
logger.warn("ProxyForwarder: Non-ProxyError in RESOURCE_NOT_FOUND, reclassifying as SYSTEM_ERROR for retry", {
providerId: currentProvider.id,
providerName: currentProvider.name,
errorType: lastError?.constructor?.name,
errorMessage: lastError?.message,
originalCategory: errorCategory,
});

Comment on lines +1643 to +1647
logger.warn("ProxyForwarder: Non-ProxyError in default handler, reclassifying as SYSTEM_ERROR", {
providerId: currentProvider.id,
errorType: lastError?.constructor?.name,
errorMessage: lastError?.message,
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To maintain consistency with the other new checks and improve logging, it would be beneficial to include providerName and originalCategory here as well. Also, specifying PROVIDER_ERROR in the message instead of 'default handler' would be more precise.

Suggested change
logger.warn("ProxyForwarder: Non-ProxyError in default handler, reclassifying as SYSTEM_ERROR", {
providerId: currentProvider.id,
errorType: lastError?.constructor?.name,
errorMessage: lastError?.message,
});
logger.warn("ProxyForwarder: Non-ProxyError in PROVIDER_ERROR handler, reclassifying as SYSTEM_ERROR for retry", {
providerId: currentProvider.id,
providerName: currentProvider.name,
errorType: lastError?.constructor?.name,
errorMessage: lastError?.message,
originalCategory: errorCategory,
});

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

This PR adds defensive type guards at three locations in to prevent TypeError crashes when non-ProxyError errors (like SocketError) are miscategorized as by the error categorizer. The fix correctly reclassifies such errors to and continues the retry/failover path.

PR Size: XS

  • Lines changed: 35
  • Files changed: 1

Review Coverage

  • Logic and correctness - The fix logic is sound: when a non-ProxyError is miscategorized, it's reclassified to SYSTEM_ERROR and continues to the retry path
  • Security (OWASP Top 10) - No security issues
  • Error handling - The fix improves error handling by adding guards and logging warnings with context (errorType, errorMessage)
  • Type safety - Uses proper checks, consistent with existing codebase patterns (lines 1025, 1029, 1171, 1304)
  • Documentation accuracy - Comments added accurately explain the bug and its fix
  • Test coverage - While no new test was added, this is a defensive bug fix (not a new feature) that follows existing patterns
  • Code clarity - The fix is minimal, focused, and easy to understand

Automated review by Claude AI

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

This PR adds defensive instanceof ProxyError type guards at three locations in forwarder.ts to prevent TypeError crashes when non-ProxyError errors (like SocketError) are miscategorized as NON_RETRYABLE_CLIENT_ERROR by the error categorizer. The fix correctly reclassifies such errors to SYSTEM_ERROR and continues the retry/failover path.

PR Size: XS

  • Lines changed: 35
  • Files changed: 1

Review Coverage

  • Logic and correctness - The fix logic is sound: when a non-ProxyError is miscategorized, it is reclassified to SYSTEM_ERROR and continues to the retry path
  • Security (OWASP Top 10) - No security issues
  • Error handling - The fix improves error handling by adding guards and logging warnings with context (errorType, errorMessage)
  • Type safety - Uses proper instanceof ProxyError checks, consistent with existing codebase patterns
  • Documentation accuracy - Comments added accurately explain the bug and its fix
  • Test coverage - While no new test was added, this is a defensive bug fix (not a new feature) that follows existing patterns
  • Code clarity - The fix is minimal, focused, and easy to understand

Automated review by Claude AI

Comment on lines +1359 to +1369
if (!(lastError instanceof ProxyError)) {
logger.warn("ProxyForwarder: Non-ProxyError miscategorized as NON_RETRYABLE_CLIENT_ERROR, reclassifying as SYSTEM_ERROR for retry", {
providerId: currentProvider.id,
providerName: currentProvider.name,
errorType: lastError?.constructor?.name,
errorMessage: lastError?.message,
originalCategory: errorCategory,
});
errorCategory = ErrorCategory.SYSTEM_ERROR;
continue; // Re-enter the category handling loop as SYSTEM_ERROR
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

continue bypasses SYSTEM_ERROR bookkeeping, breaking provider failover

When continue is executed here (and in the two analogous guards at lines ~1527 and ~1642), the loop jumps directly to the next inner-loop iteration, completely skipping the SYSTEM_ERROR handler (lines ~1432–1522). That handler is responsible for:

  1. Recording the failed attempt via session.addProviderToChain(...).
  2. Advancing currentEndpointIndex++ on the first failure.
  3. Calling failedProviderIds.push(currentProvider.id) once attemptCount >= maxAttemptsPerProvider.
  4. Triggering break so the outer loop can switch providers.

Because none of those steps happen, when all maxAttemptsPerProvider inner-loop iterations exhaust through this guard (the same SocketError keeps matching isNonRetryableClientErrorAsync on every retry), the inner loop exits naturally with failedProviderIds still empty for the current provider. selectAlternative(session, failedProviderIds) then has no reason to exclude the broken provider and may re-select it in the next outer-loop iteration — the very failover the PR description promises.

The comment // Re-enter the category handling loop as SYSTEM_ERROR suggests the intent was to fall through to the SYSTEM_ERROR handler within the same iteration, but continue in JavaScript/TypeScript always starts the next iteration.

A minimal correction would be to add failedProviderIds.push(currentProvider.id) before the continue, and to handle the first-vs-last-attempt distinction (endpoint advancement vs. provider switch) similarly to what the SYSTEM_ERROR handler already does. Alternatively, restructure so the non-ProxyError path falls through to the existing SYSTEM_ERROR handler rather than using continue.

// Example correction for NON_RETRYABLE guard:
if (!(lastError instanceof ProxyError)) {
  logger.warn("ProxyForwarder: Non-ProxyError miscategorized as NON_RETRYABLE_CLIENT_ERROR, reclassifying as SYSTEM_ERROR for retry", {
    providerId: currentProvider.id,
    providerName: currentProvider.name,
    errorType: lastError?.constructor?.name,
    errorMessage: lastError?.message,
    originalCategory: errorCategory,
  });
  errorCategory = ErrorCategory.SYSTEM_ERROR;
  // ⚠️ Do NOT `continue` here — fall through so the SYSTEM_ERROR handler
  // below performs endpoint advancement, session chain recording, and
  // failedProviderIds tracking.
}

The same fix applies to the RESOURCE_NOT_FOUND guard (~line 1527) and the default PROVIDER_ERROR guard (~line 1642).

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 1359-1369

Comment:
**`continue` bypasses SYSTEM_ERROR bookkeeping, breaking provider failover**

When `continue` is executed here (and in the two analogous guards at lines ~1527 and ~1642), the loop jumps directly to the *next inner-loop iteration*, completely skipping the `SYSTEM_ERROR` handler (lines ~1432–1522). That handler is responsible for:

1. Recording the failed attempt via `session.addProviderToChain(...)`.
2. Advancing `currentEndpointIndex++` on the first failure.
3. Calling `failedProviderIds.push(currentProvider.id)` once `attemptCount >= maxAttemptsPerProvider`.
4. Triggering `break` so the outer loop can switch providers.

Because none of those steps happen, when all `maxAttemptsPerProvider` inner-loop iterations exhaust through this guard (the same SocketError keeps matching `isNonRetryableClientErrorAsync` on every retry), the inner loop exits *naturally* with `failedProviderIds` still empty for the current provider. `selectAlternative(session, failedProviderIds)` then has no reason to exclude the broken provider and may re-select it in the next outer-loop iteration — the very failover the PR description promises.

The comment `// Re-enter the category handling loop as SYSTEM_ERROR` suggests the intent was to fall through to the `SYSTEM_ERROR` handler *within the same iteration*, but `continue` in JavaScript/TypeScript always starts the **next** iteration.

A minimal correction would be to add `failedProviderIds.push(currentProvider.id)` before the `continue`, and to handle the first-vs-last-attempt distinction (endpoint advancement vs. provider switch) similarly to what the `SYSTEM_ERROR` handler already does. Alternatively, restructure so the non-ProxyError path falls through to the existing `SYSTEM_ERROR` handler rather than using `continue`.

```typescript
// Example correction for NON_RETRYABLE guard:
if (!(lastError instanceof ProxyError)) {
  logger.warn("ProxyForwarder: Non-ProxyError miscategorized as NON_RETRYABLE_CLIENT_ERROR, reclassifying as SYSTEM_ERROR for retry", {
    providerId: currentProvider.id,
    providerName: currentProvider.name,
    errorType: lastError?.constructor?.name,
    errorMessage: lastError?.message,
    originalCategory: errorCategory,
  });
  errorCategory = ErrorCategory.SYSTEM_ERROR;
  // ⚠️ Do NOT `continue` here — fall through so the SYSTEM_ERROR handler
  // below performs endpoint advancement, session chain recording, and
  // failedProviderIds tracking.
}
```
The same fix applies to the `RESOURCE_NOT_FOUND` guard (~line 1527) and the default `PROVIDER_ERROR` guard (~line 1642).

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +1642 to +1650
if (!(lastError instanceof ProxyError)) {
logger.warn("ProxyForwarder: Non-ProxyError in default handler, reclassifying as SYSTEM_ERROR", {
providerId: currentProvider.id,
errorType: lastError?.constructor?.name,
errorMessage: lastError?.message,
});
errorCategory = ErrorCategory.SYSTEM_ERROR;
continue;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing session.addProviderToChain call before reclassification

Like the NON_RETRYABLE_CLIENT_ERROR guard above, this guard and the RESOURCE_NOT_FOUND guard (~line 1527) all use continue without ever recording the failed attempt in the provider chain (session.addProviderToChain). When the inner loop exhausts entirely via these guards, the session audit trail for all those attempts will be silently missing, making it impossible to diagnose repeated SocketError failures after the fact.

The SYSTEM_ERROR handler (the intended destination) would have recorded each attempt as reason: "system_error" with structured errorDetails. That telemetry is lost with the current continue-based approach.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 1642-1650

Comment:
**Missing `session.addProviderToChain` call before reclassification**

Like the `NON_RETRYABLE_CLIENT_ERROR` guard above, this guard and the `RESOURCE_NOT_FOUND` guard (~line 1527) all use `continue` without ever recording the failed attempt in the provider chain (`session.addProviderToChain`). When the inner loop exhausts entirely via these guards, the session audit trail for all those attempts will be silently missing, making it impossible to diagnose repeated SocketError failures after the fact.

The `SYSTEM_ERROR` handler (the intended destination) would have recorded each attempt as `reason: "system_error"` with structured `errorDetails`. That telemetry is lost with the current `continue`-based approach.

How can I resolve this? If you propose a fix, please make it concise.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/app/v1/_lib/proxy/forwarder.ts (1)

1642-1650: 关键修复位置,建议增强日志信息

这个 guard 位于 PROVIDER_ERROR 的默认处理逻辑中(EmptyResponseError 已在上方处理),是修复 issue #909 的关键位置。当 undici 的 SocketError 或其他网络层错误被错误分类为 PROVIDER_ERROR 时,这个 guard 可以正确地将其重新分类为 SYSTEM_ERROR 并触发重试。

建议:为保持日志信息的一致性,可以补充 providerName 字段(与第一个 guard 保持一致):

建议的改进
 if (!(lastError instanceof ProxyError)) {
   logger.warn("ProxyForwarder: Non-ProxyError in default handler, reclassifying as SYSTEM_ERROR", {
     providerId: currentProvider.id,
+    providerName: currentProvider.name,
     errorType: lastError?.constructor?.name,
     errorMessage: lastError?.message,
   });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 1642 - 1650, The log in the
default PROVIDER_ERROR handler reclassifies non-ProxyError as SYSTEM_ERROR but
omits providerName; update the logger.warn call inside the guard (the block that
checks "if (!(lastError instanceof ProxyError))" in forwarder.ts/ProxyForwarder)
to include providerName alongside providerId, errorType, and errorMessage so it
matches the earlier guard's fields; leave the errorCategory reassignment to
ErrorCategory.SYSTEM_ERROR and the continue intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/app/v1/_lib/proxy/forwarder.ts`:
- Around line 1642-1650: The log in the default PROVIDER_ERROR handler
reclassifies non-ProxyError as SYSTEM_ERROR but omits providerName; update the
logger.warn call inside the guard (the block that checks "if (!(lastError
instanceof ProxyError))" in forwarder.ts/ProxyForwarder) to include providerName
alongside providerId, errorType, and errorMessage so it matches the earlier
guard's fields; leave the errorCategory reassignment to
ErrorCategory.SYSTEM_ERROR and the continue intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2709196c-ce7d-4213-8e23-33f3acfb7b18

📥 Commits

Reviewing files that changed from the base of the PR and between fdba211 and 70b4378.

📒 Files selected for processing (1)
  • src/app/v1/_lib/proxy/forwarder.ts

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

This PR adds defensive instanceof ProxyError type guards at three locations in forwarder.ts to prevent TypeError crashes when non-ProxyError errors (like SocketError) are miscategorized as NON_RETRYABLE_CLIENT_ERROR by the error categorizer. The fix correctly reclassifies such errors to SYSTEM_ERROR and continues the retry/failover path.

PR Size: XS

  • Lines changed: 35
  • Files changed: 1

Review Coverage

  • Logic and correctness - The fix logic is sound
  • Security - No security issues
  • Error handling - The fix improves error handling
  • Type safety - Uses proper instanceof checks
  • Documentation accuracy - Comments are accurate
  • Test coverage - Defensive bug fix
  • Code clarity - Minimal and focused

Automated review by Claude AI

// which truncates streaming responses. Reclassify as SYSTEM_ERROR
// so it enters the retry/failover path instead.
// See: https://github.com/ding113/claude-code-hub/issues/909
if (!(lastError instanceof ProxyError)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[High] continue bypasses SYSTEM_ERROR bookkeeping, breaking provider failover

Why this is a problem: When continue is executed (lines 1359-1369, 1527-1535, 1642-1650), it jumps to the next iteration of the inner retry loop, completely skipping the SYSTEM_ERROR handler (lines 1416-1506). That handler is responsible for:

  1. Recording the failed attempt via session.addProviderToChain(...)
  2. Advancing currentEndpointIndex++ on first failure
  3. Calling failedProviderIds.push(currentProvider.id) when exhausted
  4. Triggering break to switch providers

Because none of these happen, when all maxAttemptsPerProvider iterations exhaust through this guard, the inner loop exits naturally with failedProviderIds still empty. selectAlternative(session, failedProviderIds) then has no reason to exclude the broken provider and may re-select it.

Additionally, the error categorization result is cached (errors.ts:576), so categorizeErrorAsync will return the same NON_RETRYABLE_CLIENT_ERROR on every retry, causing the fix to reclassify and continue without ever executing the SYSTEM_ERROR path.

Suggested fix: Instead of continue, re-throw as a ProxyError that will be re-categorized as SYSTEM_ERROR, or manually add the bookkeeping before continuing.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

This PR adds defensive instanceof ProxyError type guards to prevent TypeError crashes when non-ProxyError errors (like SocketError) are miscategorized as NON_RETRYABLE_CLIENT_ERROR. However, there is a High priority issue with the implementation that could break provider failover.

PR Size: XS

  • Lines changed: 35
  • Files changed: 1

Issues Found

Category Critical High Medium Low
Logic/Bugs 0 1 0 0
Error Handling 0 1 0 0

High Priority Issues (Should Fix)

[High] continue bypasses SYSTEM_ERROR bookkeeping, breaking provider failover

The fix uses continue after reclassifying to SYSTEM_ERROR, which jumps to the next iteration and completely skips the SYSTEM_ERROR handler. This handler is responsible for:

  • Recording to session chain (session.addProviderToChain)
  • Tracking failed providers (failedProviderIds.push)
  • Advancing endpoints and triggering provider switch

When all retry attempts exhaust through this guard path, failedProviderIds remains empty, so selectAlternative cannot exclude the broken provider. Additionally, error categorization is cached, so subsequent retries will keep returning the same category, causing an infinite loop through the guard without proper bookkeeping.

Review Coverage

  • Logic and correctness - Issue found
  • Security (OWASP Top 10) - Clean
  • Error handling - Issue found (continue breaks error handling flow)
  • Type safety - Clean (uses instanceof correctly)
  • Documentation accuracy - Clean
  • Test coverage - N/A (defensive fix)
  • Code clarity - Clean

Automated review by Claude AI

@ding113 ding113 changed the base branch from main to dev March 13, 2026 14:25
@KevinShiCN KevinShiCN force-pushed the fix/socket-error-type-guard branch from 70b4378 to 6bd3472 Compare March 13, 2026 14:37
@KevinShiCN KevinShiCN changed the title fix: add instanceof ProxyError guard to prevent TypeError on SocketError fix: reclassify miscategorized network errors as SYSTEM_ERROR for retry/failover Mar 13, 2026
Comment on lines +1435 to +1436
errorCategory = ErrorCategory.SYSTEM_ERROR;
continue; // Re-enter the loop to hit SYSTEM_ERROR handler
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errorCategory reassignment is a dead write — SYSTEM_ERROR handler is never reached

errorCategory is declared with let inside the catch block at line 1023:

let errorCategory = await categorizeErrorAsync(lastError);

Because it is block-scoped to each catch execution, the assignment errorCategory = ErrorCategory.SYSTEM_ERROR here only modifies the current iteration's local copy. When continue jumps to the next loop iteration and the request fails again with the same SocketError, the catch block re-executes and a fresh let errorCategory is assigned from categorizeErrorAsync — which will again return NON_RETRYABLE_CLIENT_ERROR for the same error. The SYSTEM_ERROR handler (lines ~1471–1562) is therefore never reached for this error path.

This means:

  • The comment // Re-enter the loop to hit SYSTEM_ERROR handler is incorrect.
  • All maxAttemptsPerProvider iterations are consumed by the isNetworkError guard, each time continue-ing back into the same path.
  • The SYSTEM_ERROR handler's bookkeeping (session.addProviderToChain, failedProviderIds.push, break) is never executed, which is the root cause of the failover not working (as also noted in the existing review thread at line 1359).

To actually route into the SYSTEM_ERROR handler, the reclassification must persist across iterations — for example by hoisting a flag variable outside the catch block and checking it at the start of the catch block before calling categorizeErrorAsync.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 1435-1436

Comment:
**`errorCategory` reassignment is a dead write — SYSTEM_ERROR handler is never reached**

`errorCategory` is declared with `let` *inside* the `catch` block at line 1023:

```typescript
let errorCategory = await categorizeErrorAsync(lastError);
```

Because it is block-scoped to each catch execution, the assignment `errorCategory = ErrorCategory.SYSTEM_ERROR` here only modifies the current iteration's local copy. When `continue` jumps to the next loop iteration and the request fails again with the same `SocketError`, the catch block re-executes and a **fresh** `let errorCategory` is assigned from `categorizeErrorAsync` — which will again return `NON_RETRYABLE_CLIENT_ERROR` for the same error. The SYSTEM_ERROR handler (lines ~1471–1562) is therefore *never reached* for this error path.

This means:
- The comment `// Re-enter the loop to hit SYSTEM_ERROR handler` is incorrect.
- All `maxAttemptsPerProvider` iterations are consumed by the `isNetworkError` guard, each time `continue`-ing back into the same path.
- The SYSTEM_ERROR handler's bookkeeping (`session.addProviderToChain`, `failedProviderIds.push`, `break`) is never executed, which is the root cause of the failover not working (as also noted in the existing review thread at line 1359).

To actually route into the SYSTEM_ERROR handler, the reclassification must persist *across* iterations — for example by hoisting a flag variable outside the catch block and checking it at the start of the catch block before calling `categorizeErrorAsync`.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +1423 to +1474
if (isNetworkError) {
const err = lastError as Error & {
code?: string;
errno?: number;
syscall?: string;
};

logger.warn(
"ProxyForwarder: Network error miscategorized as NON_RETRYABLE_CLIENT_ERROR, handling as SYSTEM_ERROR for retry/failover",
{
providerId: currentProvider.id,
providerName: currentProvider.name,
errorType: err.constructor.name,
errorMessage: err.message,
errorCode: err.code,
attemptNumber: attemptCount,
totalProvidersAttempted,
willRetry: attemptCount < maxAttemptsPerProvider,
originalCategory: errorCategory,
}
);

session.addProviderToChain(currentProvider, {
...endpointAudit,
reason: "system_error",
circuitState: getCircuitState(currentProvider.id),
attemptNumber: attemptCount,
errorMessage: errorMessage,
errorDetails: {
system: {
errorType: err.constructor.name,
errorName: err.name,
errorMessage: err.message || err.name || "Unknown error",
errorCode: err.code,
errorSyscall: err.syscall,
},
request: buildRequestDetails(session),
},
});

if (attemptCount < maxAttemptsPerProvider) {
currentEndpointIndex++;
logger.debug("ProxyForwarder: Advancing endpoint index due to miscategorized network error", {
providerId: currentProvider.id,
previousEndpointIndex: currentEndpointIndex - 1,
newEndpointIndex: currentEndpointIndex,
maxEndpointIndex: endpointCandidates.length - 1,
});

await new Promise((resolve) => setTimeout(resolve, 100));
continue;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recordEndpointFailure is skipped for reclassified network errors

The endpoint-level circuit breaker call at lines 1037–1041 is gated on errorCategory === ErrorCategory.SYSTEM_ERROR:

if (activeEndpoint.endpointId != null) {
  if (isTimeoutError || errorCategory === ErrorCategory.SYSTEM_ERROR) {
    await recordEndpointFailure(activeEndpoint.endpointId, lastError);
  }
}

Because categorizeErrorAsync classifies a SocketError as NON_RETRYABLE_CLIENT_ERROR (that is the root cause this PR is fixing), errorCategory is never SYSTEM_ERROR when this guard runs. As a result, recordEndpointFailure is never called for the network errors handled by this new block.

Genuine SYSTEM_ERROR-classified network errors correctly record a per-endpoint failure. The new reclassified network-error path silently skips it, so degraded endpoints that keep producing SocketError are not penalised in the endpoint circuit breaker and will remain eligible for selection on every subsequent retry—potentially causing more SocketErrors on the same endpoint.

A minimal fix is to call recordEndpointFailure inside the isNetworkError branch before continue/break, mirroring what already happens at line 1039:

if (isNetworkError) {
  // ... existing session.addProviderToChain call ...

  // Mirror the endpoint-level circuit breaker recording that SYSTEM_ERROR gets at line 1039
  if (activeEndpoint.endpointId != null) {
    await recordEndpointFailure(activeEndpoint.endpointId, lastError);
  }

  if (attemptCount < maxAttemptsPerProvider) {
    // ...continue path...
  }
  // ...break path...
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 1423-1474

Comment:
**`recordEndpointFailure` is skipped for reclassified network errors**

The endpoint-level circuit breaker call at lines 1037–1041 is gated on `errorCategory === ErrorCategory.SYSTEM_ERROR`:

```typescript
if (activeEndpoint.endpointId != null) {
  if (isTimeoutError || errorCategory === ErrorCategory.SYSTEM_ERROR) {
    await recordEndpointFailure(activeEndpoint.endpointId, lastError);
  }
}
```

Because `categorizeErrorAsync` classifies a `SocketError` as `NON_RETRYABLE_CLIENT_ERROR` (that is the root cause this PR is fixing), `errorCategory` is never `SYSTEM_ERROR` when this guard runs. As a result, `recordEndpointFailure` is **never called** for the network errors handled by this new block.

Genuine `SYSTEM_ERROR`-classified network errors correctly record a per-endpoint failure. The new reclassified network-error path silently skips it, so degraded endpoints that keep producing `SocketError` are not penalised in the endpoint circuit breaker and will remain eligible for selection on every subsequent retry—potentially causing more SocketErrors on the same endpoint.

A minimal fix is to call `recordEndpointFailure` inside the `isNetworkError` branch before `continue`/`break`, mirroring what already happens at line 1039:

```typescript
if (isNetworkError) {
  // ... existing session.addProviderToChain call ...

  // Mirror the endpoint-level circuit breaker recording that SYSTEM_ERROR gets at line 1039
  if (activeEndpoint.endpointId != null) {
    await recordEndpointFailure(activeEndpoint.endpointId, lastError);
  }

  if (attemptCount < maxAttemptsPerProvider) {
    // ...continue path...
  }
  // ...break path...
}
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
src/app/v1/_lib/proxy/forwarder.ts (2)

1410-1508: 核心修复逻辑正确,网络错误现在能正确触发重试/故障转移

该修复正确识别了被误分类为 NON_RETRYABLE_CLIENT_ERROR 的网络错误,并内联处理为 SYSTEM_ERROR 路径。与之前使用 continue 跳转导致的死写问题相比,当前实现在 continue/break 之前完成了所有必要的记账操作(provider chain、endpoint 推进、failedProviderIds)。

有两个小的改进建议:

  1. Line 1482 重复声明 envenv 已在 Line 536 声明,此处无需重新获取。

  2. Line 1451-1460 缺少 errorStack:与 SYSTEM_ERROR handler(Line 1573)相比,审计数据缺少 errorStack 字段,这会影响调试时的信息完整性。

建议的改进
@@ -1448,6 +1448,7 @@
                   errorDetails: {
                     system: {
                       errorType: err.constructor.name,
                       errorName: err.name,
                       errorMessage: err.message || err.name || "Unknown error",
                       errorCode: err.code,
                       errorSyscall: err.syscall,
+                      errorStack: err.stack?.split("\n").slice(0, 3).join("\n"),
                     },
                     request: buildRequestDetails(session),
                   },
@@ -1479,7 +1479,7 @@
                   totalProvidersAttempted,
                 });
 
-                const env = getEnvConfig();
+                // env already declared at line 536, reuse it
                 failedProviderIds.push(currentProvider.id);
 
                 if (env.ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 1410 - 1508, Remove the
redundant getEnvConfig() call inside the miscategorized-network-error branch
(it’s already retrieved earlier) and include the error stack in the audit
payload so it matches the SYSTEM_ERROR path: when building the audit entry in
session.addProviderToChain (inside the ProxyForwarder miscategorized network
error handling) add errorStack: err.stack (or err.stack || null) under
errorDetails.system; delete the duplicate env = getEnvConfig() line and keep
using the previously-scoped env for the ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS
check and subsequent recordFailure/currentProvider handling.

1414-1421: 网络错误检测逻辑可考虑抽取为工具函数

当前的 isNetworkError 检测模式覆盖了常见的 undici/Node.js 网络错误场景。但如果这种检测逻辑在其他地方也需要使用(例如 hedge 路径或其他错误处理器),建议将其抽取为一个独立的工具函数放在 errors.ts 中,便于集中维护和复用。

// 可在 errors.ts 中添加
export function isNetworkTransportError(error: Error): boolean {
  const err = error as Error & { code?: string };
  return (
    error.name === "SocketError" ||
    err.code === "ECONNRESET" ||
    err.code === "ECONNREFUSED" ||
    err.code === "ETIMEDOUT" ||
    err.code === "ENOTFOUND" ||
    error.message?.toLowerCase().includes("fetch failed") ||
    error.message?.toLowerCase().includes("other side closed")
  );
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 1414 - 1421, Extract the
inline network-error detection into a reusable helper: create and export a
function isNetworkTransportError(error: Error) in errors.ts that contains the
existing checks (referencing the same checks now assigned to isNetworkError),
then replace the inline isNetworkError expression in forwarder.ts with a call to
isNetworkTransportError(lastError) and import it; also update any other places
(e.g., hedge path or other error handlers) to use the new
isNetworkTransportError to centralize and reuse the logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/app/v1/_lib/proxy/forwarder.ts`:
- Around line 1410-1508: Remove the redundant getEnvConfig() call inside the
miscategorized-network-error branch (it’s already retrieved earlier) and include
the error stack in the audit payload so it matches the SYSTEM_ERROR path: when
building the audit entry in session.addProviderToChain (inside the
ProxyForwarder miscategorized network error handling) add errorStack: err.stack
(or err.stack || null) under errorDetails.system; delete the duplicate env =
getEnvConfig() line and keep using the previously-scoped env for the
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS check and subsequent
recordFailure/currentProvider handling.
- Around line 1414-1421: Extract the inline network-error detection into a
reusable helper: create and export a function isNetworkTransportError(error:
Error) in errors.ts that contains the existing checks (referencing the same
checks now assigned to isNetworkError), then replace the inline isNetworkError
expression in forwarder.ts with a call to isNetworkTransportError(lastError) and
import it; also update any other places (e.g., hedge path or other error
handlers) to use the new isNetworkTransportError to centralize and reuse the
logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9ba4d10d-7924-4dd8-bc25-469a2e1131b6

📥 Commits

Reviewing files that changed from the base of the PR and between 6bd3472 and 4e3e019.

📒 Files selected for processing (1)
  • src/app/v1/_lib/proxy/forwarder.ts

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
src/app/v1/_lib/proxy/forwarder.ts (3)

1451-1461: errorDetails.system 缺少 errorStack 字段

与 SYSTEM_ERROR 处理器(第 1580 行)相比,此处的 errorDetails.system 对象缺少 errorStack 字段。虽然不影响功能,但添加此字段可以在调试时提供更完整的错误上下文。

建议添加 errorStack 字段
 errorDetails: {
   system: {
     errorType: err.constructor.name,
     errorName: err.name,
     errorMessage: err.message || err.name || "Unknown error",
     errorCode: err.code,
     errorSyscall: err.syscall,
+    errorStack: err.stack?.split("\n").slice(0, 3).join("\n"),
   },
   request: buildRequestDetails(session),
 },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 1451 - 1461, The
errorDetails.system object built where errors are reported is missing the
errorStack field; update the object construction (the block that sets
errorDetails: { system: { ... } }) to include errorStack: err.stack (or
err.stack || null) so it matches the SYSTEM_ERROR handler behavior around the
SYSTEM_ERROR handling code (referenced near line 1580) and provides full stack
context when logging from the forwarder error reporting path.

1489-1489: 重复调用 getEnvConfig() 产生变量遮蔽

第 1489 行的 const env = getEnvConfig(); 遮蔽了第 536 行已声明的外层 env 变量。由于外层 env 在整个 send() 方法作用域内有效,此处无需重新获取。

移除重复声明
-              const env = getEnvConfig();
               failedProviderIds.push(currentProvider.id);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/forwarder.ts` at line 1489, Inside the send() method
remove the redundant local declaration "const env = getEnvConfig();" that
shadows the outer env variable; instead reuse the already-declared outer env
(referenced earlier in send()) and delete the extra call to getEnvConfig() to
avoid duplicate invocation and variable shadowing (search for the getEnvConfig()
call within send() and remove it so subsequent references use the outer env).

1410-1421: 网络错误检测逻辑正确,但建议提取为独立的类型守卫函数

当前的 isNetworkError 检测逻辑正确覆盖了常见的网络/传输错误类型。然而,代码库中已有类似的错误检测函数(如 isClientAbortErrorisHttp2ErrorisSSLCertificateError),建议将此逻辑提取到 errors.ts 作为独立的类型守卫函数,以保持一致性并便于复用。

建议提取到 errors.ts 的示例
// 在 errors.ts 中添加
export function isNetworkError(error: unknown): error is Error & { code?: string } {
  if (!(error instanceof Error)) return false;
  const err = error as Error & { code?: string };
  return (
    err.name === "SocketError" ||
    err.code === "ECONNRESET" ||
    err.code === "ECONNREFUSED" ||
    err.code === "ETIMEDOUT" ||
    err.code === "ENOTFOUND" ||
    err.message?.toLowerCase().includes("fetch failed") ||
    err.message?.toLowerCase().includes("other side closed")
  );
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 1410 - 1421, Extract the
inline network-detection logic into a reusable type-guard exported from
errors.ts (e.g., export function isNetworkError(error: unknown): error is Error
& { code?: string }) that first checks instanceof Error and then tests name,
code values (ECONNRESET, ECONNREFUSED, ETIMEDOUT, ENOTFOUND) and message
substrings ("fetch failed", "other side closed"); replace the inline
isNetworkError computation in forwarder.ts with a call to this new
isNetworkError(lastError) and add the appropriate import from errors.ts so other
modules can reuse the guard.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/app/v1/_lib/proxy/forwarder.ts`:
- Around line 1451-1461: The errorDetails.system object built where errors are
reported is missing the errorStack field; update the object construction (the
block that sets errorDetails: { system: { ... } }) to include errorStack:
err.stack (or err.stack || null) so it matches the SYSTEM_ERROR handler behavior
around the SYSTEM_ERROR handling code (referenced near line 1580) and provides
full stack context when logging from the forwarder error reporting path.
- Line 1489: Inside the send() method remove the redundant local declaration
"const env = getEnvConfig();" that shadows the outer env variable; instead reuse
the already-declared outer env (referenced earlier in send()) and delete the
extra call to getEnvConfig() to avoid duplicate invocation and variable
shadowing (search for the getEnvConfig() call within send() and remove it so
subsequent references use the outer env).
- Around line 1410-1421: Extract the inline network-detection logic into a
reusable type-guard exported from errors.ts (e.g., export function
isNetworkError(error: unknown): error is Error & { code?: string }) that first
checks instanceof Error and then tests name, code values (ECONNRESET,
ECONNREFUSED, ETIMEDOUT, ENOTFOUND) and message substrings ("fetch failed",
"other side closed"); replace the inline isNetworkError computation in
forwarder.ts with a call to this new isNetworkError(lastError) and add the
appropriate import from errors.ts so other modules can reuse the guard.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0ccbe974-0cec-48d8-8f1f-76416cb31560

📥 Commits

Reviewing files that changed from the base of the PR and between 4e3e019 and bfcbda3.

📒 Files selected for processing (1)
  • src/app/v1/_lib/proxy/forwarder.ts

Comment on lines +1414 to +1421
const isNetworkError =
lastError.name === "SocketError" ||
(lastError as Error & { code?: string }).code === "ECONNRESET" ||
(lastError as Error & { code?: string }).code === "ECONNREFUSED" ||
(lastError as Error & { code?: string }).code === "ETIMEDOUT" ||
(lastError as Error & { code?: string }).code === "ENOTFOUND" ||
lastError.message?.toLowerCase().includes("fetch failed") ||
lastError.message?.toLowerCase().includes("other side closed");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete duplicate of existing isTransportError — misses wrapped-cause errors, EAI_AGAIN, and undici-specific codes

errors.ts already has a private isTransportError function (used in categorizeErrorAsync at priority 1.5) that implements the canonical set of transport-error signals. The inline isNetworkError here duplicates that logic but with three important gaps:

  1. error.cause?.code is not checked. Undici frequently wraps transport failures: e.g., a TypeError: fetch failed carries cause.code === "ECONNRESET". isTransportError checks (error as ...).cause?.code but the new code only tests the top-level .code, so wrapped errors escape detection.

  2. "EAI_AGAIN" is missing. Temporary DNS failures (common in Kubernetes/container environments) carry code EAI_AGAIN. isTransportError includes it; the new check does not, so a containerised DNS hiccup is silently treated as a genuine client error.

  3. Undici-internal codes are missing. isTransportError also covers "UND_ERR_SOCKET", "UND_ERR_CONNECT_TIMEOUT", "UND_ERR_HEADERS_TIMEOUT", and "UND_ERR_BODY_TIMEOUT". The new code ignores them entirely.

Because categorizeErrorAsync already runs isTransportError at priority 1.5 — before isNonRetryableClientErrorAsync — a standard SocketError or ECONNRESET should never even reach NON_RETRYABLE_CLIENT_ERROR. This isNetworkError guard can therefore only fire when a network error somehow slips past isTransportError (e.g., an unexpected wrapping). In that edge case the incomplete implementation fails to catch the very same error patterns it was designed for.

Consider exporting isTransportError from errors.ts and reusing it here, or at minimum aligning the detection logic by adding the missing codes and the cause?.code check:

const errCode =
  (lastError as Error & { code?: string }).code ??
  (lastError as Error & { cause?: { code?: string } }).cause?.code;

const TRANSPORT_CODES = new Set([
  "ECONNRESET", "ECONNREFUSED", "ETIMEDOUT", "ENOTFOUND",
  "EAI_AGAIN",
  "UND_ERR_SOCKET", "UND_ERR_CONNECT_TIMEOUT",
  "UND_ERR_HEADERS_TIMEOUT", "UND_ERR_BODY_TIMEOUT",
]);

const isNetworkError =
  lastError.name === "SocketError" ||
  (errCode !== undefined && TRANSPORT_CODES.has(errCode)) ||
  lastError.message?.toLowerCase().includes("fetch failed") ||
  lastError.message?.toLowerCase().includes("other side closed");
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 1414-1421

Comment:
**Incomplete duplicate of existing `isTransportError` — misses wrapped-cause errors, `EAI_AGAIN`, and undici-specific codes**

`errors.ts` already has a private `isTransportError` function (used in `categorizeErrorAsync` at priority 1.5) that implements the canonical set of transport-error signals. The inline `isNetworkError` here duplicates that logic but with three important gaps:

1. **`error.cause?.code` is not checked.** Undici frequently wraps transport failures: e.g., a `TypeError: fetch failed` carries `cause.code === "ECONNRESET"`. `isTransportError` checks `(error as ...).cause?.code` but the new code only tests the top-level `.code`, so wrapped errors escape detection.

2. **`"EAI_AGAIN"` is missing.** Temporary DNS failures (common in Kubernetes/container environments) carry code `EAI_AGAIN`. `isTransportError` includes it; the new check does not, so a containerised DNS hiccup is silently treated as a genuine client error.

3. **Undici-internal codes are missing.** `isTransportError` also covers `"UND_ERR_SOCKET"`, `"UND_ERR_CONNECT_TIMEOUT"`, `"UND_ERR_HEADERS_TIMEOUT"`, and `"UND_ERR_BODY_TIMEOUT"`. The new code ignores them entirely.

Because `categorizeErrorAsync` already runs `isTransportError` at priority 1.5 — before `isNonRetryableClientErrorAsync` — a standard `SocketError` or `ECONNRESET` should never even reach `NON_RETRYABLE_CLIENT_ERROR`. This `isNetworkError` guard can therefore only fire when a network error somehow slips past `isTransportError` (e.g., an unexpected wrapping). In that edge case the incomplete implementation fails to catch the very same error patterns it was designed for.

Consider exporting `isTransportError` from `errors.ts` and reusing it here, or at minimum aligning the detection logic by adding the missing codes and the `cause?.code` check:

```typescript
const errCode =
  (lastError as Error & { code?: string }).code ??
  (lastError as Error & { cause?: { code?: string } }).cause?.code;

const TRANSPORT_CODES = new Set([
  "ECONNRESET", "ECONNREFUSED", "ETIMEDOUT", "ENOTFOUND",
  "EAI_AGAIN",
  "UND_ERR_SOCKET", "UND_ERR_CONNECT_TIMEOUT",
  "UND_ERR_HEADERS_TIMEOUT", "UND_ERR_BODY_TIMEOUT",
]);

const isNetworkError =
  lastError.name === "SocketError" ||
  (errCode !== undefined && TRANSPORT_CODES.has(errCode)) ||
  lastError.message?.toLowerCase().includes("fetch failed") ||
  lastError.message?.toLowerCase().includes("other side closed");
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +1451 to +1461
errorDetails: {
system: {
errorType: err.constructor.name,
errorName: err.name,
errorMessage: err.message || err.name || "Unknown error",
errorCode: err.code,
errorSyscall: err.syscall,
},
request: buildRequestDetails(session),
},
});
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing errorStack in system error details — inconsistent audit trail

Every other system-type errorDetails block in this file includes errorStack: err.stack?.split("\n").slice(0, 3).join("\n") (e.g., the canonical SYSTEM_ERROR handler at line ~1580, and the rectifier paths at lines ~1204 and ~1337). The new block omits it, making it impossible to distinguish call sites during post-incident diagnosis of miscategorised network errors.

Suggested change
errorDetails: {
system: {
errorType: err.constructor.name,
errorName: err.name,
errorMessage: err.message || err.name || "Unknown error",
errorCode: err.code,
errorSyscall: err.syscall,
},
request: buildRequestDetails(session),
},
});
errorDetails: {
system: {
errorType: err.constructor.name,
errorName: err.name,
errorMessage: err.message || err.name || "Unknown error",
errorCode: err.code,
errorSyscall: err.syscall,
errorStack: err.stack?.split("\n").slice(0, 3).join("\n"),
},
request: buildRequestDetails(session),
},
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 1451-1461

Comment:
**Missing `errorStack` in `system` error details — inconsistent audit trail**

Every other `system`-type `errorDetails` block in this file includes `errorStack: err.stack?.split("\n").slice(0, 3).join("\n")` (e.g., the canonical SYSTEM_ERROR handler at line ~1580, and the rectifier paths at lines ~1204 and ~1337). The new block omits it, making it impossible to distinguish call sites during post-incident diagnosis of miscategorised network errors.

```suggestion
                  errorDetails: {
                    system: {
                      errorType: err.constructor.name,
                      errorName: err.name,
                      errorMessage: err.message || err.name || "Unknown error",
                      errorCode: err.code,
                      errorSyscall: err.syscall,
                      errorStack: err.stack?.split("\n").slice(0, 3).join("\n"),
                    },
                    request: buildRequestDetails(session),
                  },
```

How can I resolve this? If you propose a fix, please make it concise.

…OR path

When a SocketError or other transport error is miscategorized as
NON_RETRYABLE_CLIENT_ERROR, the existing instanceof ProxyError guard
(added in dev) routes it to the plain-error else branch, which throws
immediately without retry or failover — truncating streaming responses.

This patch adds inline SYSTEM_ERROR-equivalent handling for detected
network errors in the plain-error branch:

- Transport error detection aligned with isTransportError() in errors.ts,
  including cause?.code unwrapping, EAI_AGAIN, and undici-specific codes
- Endpoint-level circuit breaker recording (recordEndpointFailure)
- Retry with endpoint index advancement when attempts remain
- Provider failover with failedProviderIds tracking when exhausted
- Provider-level circuit breaker recording per config

Fixes ding113#909
@KevinShiCN KevinShiCN force-pushed the fix/socket-error-type-guard branch from bfcbda3 to 01467ea Compare March 13, 2026 17:02
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
src/app/v1/_lib/proxy/forwarder.ts (3)

1419-1424: TRANSPORT_CODES 在循环内重复创建,可考虑提取为模块级常量

每次进入错误处理分支时都会重新创建 Set 对象。虽然性能影响有限,但将其提取到模块级别可以避免重复创建,同时也便于与 errors.ts 中的 isTransportError(如有)保持同步。

建议提取为模块级常量
+/** Network/transport error codes that should trigger retry/failover */
+const TRANSPORT_CODES = new Set([
+  "ECONNRESET", "ECONNREFUSED", "ETIMEDOUT", "ENOTFOUND",
+  "EAI_AGAIN",
+  "UND_ERR_SOCKET", "UND_ERR_CONNECT_TIMEOUT",
+  "UND_ERR_HEADERS_TIMEOUT", "UND_ERR_BODY_TIMEOUT",
+]);
+
 const STANDARD_ENDPOINTS = [
   "/v1/messages",

然后在第 1419-1424 行删除局部定义即可。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 1419 - 1424, Extract the
TRANSPORT_CODES Set out of the error handling block and define it as a
module-level constant so it is created once and reusable; replace the local
definition in forwarder.ts (currently named TRANSPORT_CODES inside the error
handling branch) with a reference to the new module-level constant, and consider
aligning/deferring to the existing isTransportError utility in errors.ts if
present to avoid duplication of transport error logic.

1454-1470: errorDetails.system 缺少 errorStack 字段,与 SYSTEM_ERROR 处理路径不一致

在 SYSTEM_ERROR 处理分支(第 1582-1593 行),errorDetails.system 包含 errorStack 字段用于调试追踪。但此处新增的网络错误重分类路径缺少该字段,会导致决策链审计数据不一致。

建议补充 errorStack 字段
                 errorDetails: {
                   system: {
                     errorType: err.constructor.name,
                     errorName: err.name,
                     errorMessage: err.message || err.name || "Unknown error",
                     errorCode: err.code,
                     errorSyscall: err.syscall,
+                    errorStack: err.stack?.split("\n").slice(0, 3).join("\n"),
                   },
                   request: buildRequestDetails(session),
                 },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 1454 - 1470, The
errorDetails.system object created in the network-error reclassification path
passed to session.addProviderToChain (inside the block building endpointAudit
for currentProvider) is missing the errorStack field; to fix, include
errorStack: err.stack (or err.stack || null) alongside
errorType/name/message/code/syscall so the structure matches the SYSTEM_ERROR
path used later (use the same keys as in the SYSTEM_ERROR handling, e.g.,
errorStack) and ensure you reference currentProvider.id and
getCircuitState(currentProvider.id) as existing context when adding the provider
to the chain.

1498-1498: 变量 env 重复声明(shadowing)

第 536 行已有 const env = getEnvConfig(),此处重新声明会产生 shadowing。由于 getEnvConfig() 返回缓存的单例,功能上没有问题,但可以直接使用外层已有的 env 变量。

建议移除重复声明
-                const env = getEnvConfig();
                 failedProviderIds.push(currentProvider.id);

                 if (env.ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/forwarder.ts` at line 1498, 在 forwarder.ts 中第 1498 行的
`const env = getEnvConfig()` 与第 536 行已定义的同名变量产生了 shadowing;请删除此处重复的声明并直接复用外层的
`env` 变量(保留对 getEnvConfig() 的唯一调用处),以避免重复声明和潜在混淆。确保引用的标识符为 `env` 和
`getEnvConfig()`,并在删除重复声明后运行编译/类型检查确认没有未定义引用。
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/app/v1/_lib/proxy/forwarder.ts`:
- Around line 1419-1424: Extract the TRANSPORT_CODES Set out of the error
handling block and define it as a module-level constant so it is created once
and reusable; replace the local definition in forwarder.ts (currently named
TRANSPORT_CODES inside the error handling branch) with a reference to the new
module-level constant, and consider aligning/deferring to the existing
isTransportError utility in errors.ts if present to avoid duplication of
transport error logic.
- Around line 1454-1470: The errorDetails.system object created in the
network-error reclassification path passed to session.addProviderToChain (inside
the block building endpointAudit for currentProvider) is missing the errorStack
field; to fix, include errorStack: err.stack (or err.stack || null) alongside
errorType/name/message/code/syscall so the structure matches the SYSTEM_ERROR
path used later (use the same keys as in the SYSTEM_ERROR handling, e.g.,
errorStack) and ensure you reference currentProvider.id and
getCircuitState(currentProvider.id) as existing context when adding the provider
to the chain.
- Line 1498: 在 forwarder.ts 中第 1498 行的 `const env = getEnvConfig()` 与第 536
行已定义的同名变量产生了 shadowing;请删除此处重复的声明并直接复用外层的 `env` 变量(保留对 getEnvConfig()
的唯一调用处),以避免重复声明和潜在混淆。确保引用的标识符为 `env` 和
`getEnvConfig()`,并在删除重复声明后运行编译/类型检查确认没有未定义引用。

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 36fb2b44-ff97-474e-928d-6a8bd3f5b35d

📥 Commits

Reviewing files that changed from the base of the PR and between bfcbda3 and 01467ea.

📒 Files selected for processing (1)
  • src/app/v1/_lib/proxy/forwarder.ts

@ding113 ding113 closed this Mar 14, 2026
@github-project-automation github-project-automation bot moved this from Backlog to Done in Claude Code Hub Roadmap Mar 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Error Rule area:provider bug Something isn't working size/XS Extra Small PR (< 50 lines)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

SocketError triggers TypeError: t.getDetailedErrorMessage is not a function — breaks failover and truncates streaming

2 participants