fix: reclassify miscategorized network errors as SYSTEM_ERROR for retry/failover#911
fix: reclassify miscategorized network errors as SYSTEM_ERROR for retry/failover#911KevinShiCN wants to merge 1 commit intoding113:devfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthrough在 ProxyForwarder 的非 ProxyError 分支中新增对网络/传输错误(如 SocketError、ECONNRESET、ECONNREFUSED、ETIMEDOUT、ENOTFOUND、"fetch failed"、"other side closed" 等)的检测;识别后将 errorCategory 重新分类为 SYSTEM_ERROR、记录警告并进入 SYSTEM_ERROR 的重试/切换逻辑,其它客户端错误保持原行为。 Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical flaw in the error handling mechanism where non- Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request correctly adds instanceof ProxyError guards to prevent a TypeError when a non-ProxyError is miscategorized, which is a solid defensive fix. My review includes suggestions to improve logging consistency across the new error handling blocks for better diagnostics. I've also pointed out a medium-severity concern with the current retry implementation using continue, as it bypasses some important logic in the main SYSTEM_ERROR handler. Addressing this in the future would make the retry mechanism more robust.
src/app/v1/_lib/proxy/forwarder.ts
Outdated
| errorCategory = ErrorCategory.SYSTEM_ERROR; | ||
| continue; // Re-enter the category handling loop as SYSTEM_ERROR |
There was a problem hiding this comment.
This correctly handles the case where a non-ProxyError is miscategorized. However, using continue here triggers a retry but bypasses the main SYSTEM_ERROR handler. This means important logic like logging to the provider chain and advancing to the next endpoint (currentEndpointIndex++) is skipped. This could lead to retrying on the same faulty endpoint and reduced observability. A future refactor to allow fall-through to the SYSTEM_ERROR handler would make the retry mechanism more robust.
src/app/v1/_lib/proxy/forwarder.ts
Outdated
| logger.warn("ProxyForwarder: Non-ProxyError in RESOURCE_NOT_FOUND, reclassifying as SYSTEM_ERROR", { | ||
| providerId: currentProvider.id, | ||
| errorType: lastError?.constructor?.name, | ||
| errorMessage: lastError?.message, | ||
| }); |
There was a problem hiding this comment.
For consistency with the other instanceof checks and for better diagnostics, consider enhancing this log message to include providerName and originalCategory.
| logger.warn("ProxyForwarder: Non-ProxyError in RESOURCE_NOT_FOUND, reclassifying as SYSTEM_ERROR", { | |
| providerId: currentProvider.id, | |
| errorType: lastError?.constructor?.name, | |
| errorMessage: lastError?.message, | |
| }); | |
| logger.warn("ProxyForwarder: Non-ProxyError in RESOURCE_NOT_FOUND, reclassifying as SYSTEM_ERROR for retry", { | |
| providerId: currentProvider.id, | |
| providerName: currentProvider.name, | |
| errorType: lastError?.constructor?.name, | |
| errorMessage: lastError?.message, | |
| originalCategory: errorCategory, | |
| }); |
src/app/v1/_lib/proxy/forwarder.ts
Outdated
| logger.warn("ProxyForwarder: Non-ProxyError in default handler, reclassifying as SYSTEM_ERROR", { | ||
| providerId: currentProvider.id, | ||
| errorType: lastError?.constructor?.name, | ||
| errorMessage: lastError?.message, | ||
| }); |
There was a problem hiding this comment.
To maintain consistency with the other new checks and improve logging, it would be beneficial to include providerName and originalCategory here as well. Also, specifying PROVIDER_ERROR in the message instead of 'default handler' would be more precise.
| logger.warn("ProxyForwarder: Non-ProxyError in default handler, reclassifying as SYSTEM_ERROR", { | |
| providerId: currentProvider.id, | |
| errorType: lastError?.constructor?.name, | |
| errorMessage: lastError?.message, | |
| }); | |
| logger.warn("ProxyForwarder: Non-ProxyError in PROVIDER_ERROR handler, reclassifying as SYSTEM_ERROR for retry", { | |
| providerId: currentProvider.id, | |
| providerName: currentProvider.name, | |
| errorType: lastError?.constructor?.name, | |
| errorMessage: lastError?.message, | |
| originalCategory: errorCategory, | |
| }); |
There was a problem hiding this comment.
Code Review Summary
This PR adds defensive type guards at three locations in to prevent TypeError crashes when non-ProxyError errors (like SocketError) are miscategorized as by the error categorizer. The fix correctly reclassifies such errors to and continues the retry/failover path.
PR Size: XS
- Lines changed: 35
- Files changed: 1
Review Coverage
- Logic and correctness - The fix logic is sound: when a non-ProxyError is miscategorized, it's reclassified to SYSTEM_ERROR and continues to the retry path
- Security (OWASP Top 10) - No security issues
- Error handling - The fix improves error handling by adding guards and logging warnings with context (errorType, errorMessage)
- Type safety - Uses proper checks, consistent with existing codebase patterns (lines 1025, 1029, 1171, 1304)
- Documentation accuracy - Comments added accurately explain the bug and its fix
- Test coverage - While no new test was added, this is a defensive bug fix (not a new feature) that follows existing patterns
- Code clarity - The fix is minimal, focused, and easy to understand
Automated review by Claude AI
There was a problem hiding this comment.
Code Review Summary
This PR adds defensive instanceof ProxyError type guards at three locations in forwarder.ts to prevent TypeError crashes when non-ProxyError errors (like SocketError) are miscategorized as NON_RETRYABLE_CLIENT_ERROR by the error categorizer. The fix correctly reclassifies such errors to SYSTEM_ERROR and continues the retry/failover path.
PR Size: XS
- Lines changed: 35
- Files changed: 1
Review Coverage
- Logic and correctness - The fix logic is sound: when a non-ProxyError is miscategorized, it is reclassified to SYSTEM_ERROR and continues to the retry path
- Security (OWASP Top 10) - No security issues
- Error handling - The fix improves error handling by adding guards and logging warnings with context (errorType, errorMessage)
- Type safety - Uses proper instanceof ProxyError checks, consistent with existing codebase patterns
- Documentation accuracy - Comments added accurately explain the bug and its fix
- Test coverage - While no new test was added, this is a defensive bug fix (not a new feature) that follows existing patterns
- Code clarity - The fix is minimal, focused, and easy to understand
Automated review by Claude AI
src/app/v1/_lib/proxy/forwarder.ts
Outdated
| if (!(lastError instanceof ProxyError)) { | ||
| logger.warn("ProxyForwarder: Non-ProxyError miscategorized as NON_RETRYABLE_CLIENT_ERROR, reclassifying as SYSTEM_ERROR for retry", { | ||
| providerId: currentProvider.id, | ||
| providerName: currentProvider.name, | ||
| errorType: lastError?.constructor?.name, | ||
| errorMessage: lastError?.message, | ||
| originalCategory: errorCategory, | ||
| }); | ||
| errorCategory = ErrorCategory.SYSTEM_ERROR; | ||
| continue; // Re-enter the category handling loop as SYSTEM_ERROR | ||
| } |
There was a problem hiding this comment.
continue bypasses SYSTEM_ERROR bookkeeping, breaking provider failover
When continue is executed here (and in the two analogous guards at lines ~1527 and ~1642), the loop jumps directly to the next inner-loop iteration, completely skipping the SYSTEM_ERROR handler (lines ~1432–1522). That handler is responsible for:
- Recording the failed attempt via
session.addProviderToChain(...). - Advancing
currentEndpointIndex++on the first failure. - Calling
failedProviderIds.push(currentProvider.id)onceattemptCount >= maxAttemptsPerProvider. - Triggering
breakso the outer loop can switch providers.
Because none of those steps happen, when all maxAttemptsPerProvider inner-loop iterations exhaust through this guard (the same SocketError keeps matching isNonRetryableClientErrorAsync on every retry), the inner loop exits naturally with failedProviderIds still empty for the current provider. selectAlternative(session, failedProviderIds) then has no reason to exclude the broken provider and may re-select it in the next outer-loop iteration — the very failover the PR description promises.
The comment // Re-enter the category handling loop as SYSTEM_ERROR suggests the intent was to fall through to the SYSTEM_ERROR handler within the same iteration, but continue in JavaScript/TypeScript always starts the next iteration.
A minimal correction would be to add failedProviderIds.push(currentProvider.id) before the continue, and to handle the first-vs-last-attempt distinction (endpoint advancement vs. provider switch) similarly to what the SYSTEM_ERROR handler already does. Alternatively, restructure so the non-ProxyError path falls through to the existing SYSTEM_ERROR handler rather than using continue.
// Example correction for NON_RETRYABLE guard:
if (!(lastError instanceof ProxyError)) {
logger.warn("ProxyForwarder: Non-ProxyError miscategorized as NON_RETRYABLE_CLIENT_ERROR, reclassifying as SYSTEM_ERROR for retry", {
providerId: currentProvider.id,
providerName: currentProvider.name,
errorType: lastError?.constructor?.name,
errorMessage: lastError?.message,
originalCategory: errorCategory,
});
errorCategory = ErrorCategory.SYSTEM_ERROR;
// ⚠️ Do NOT `continue` here — fall through so the SYSTEM_ERROR handler
// below performs endpoint advancement, session chain recording, and
// failedProviderIds tracking.
}The same fix applies to the RESOURCE_NOT_FOUND guard (~line 1527) and the default PROVIDER_ERROR guard (~line 1642).
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 1359-1369
Comment:
**`continue` bypasses SYSTEM_ERROR bookkeeping, breaking provider failover**
When `continue` is executed here (and in the two analogous guards at lines ~1527 and ~1642), the loop jumps directly to the *next inner-loop iteration*, completely skipping the `SYSTEM_ERROR` handler (lines ~1432–1522). That handler is responsible for:
1. Recording the failed attempt via `session.addProviderToChain(...)`.
2. Advancing `currentEndpointIndex++` on the first failure.
3. Calling `failedProviderIds.push(currentProvider.id)` once `attemptCount >= maxAttemptsPerProvider`.
4. Triggering `break` so the outer loop can switch providers.
Because none of those steps happen, when all `maxAttemptsPerProvider` inner-loop iterations exhaust through this guard (the same SocketError keeps matching `isNonRetryableClientErrorAsync` on every retry), the inner loop exits *naturally* with `failedProviderIds` still empty for the current provider. `selectAlternative(session, failedProviderIds)` then has no reason to exclude the broken provider and may re-select it in the next outer-loop iteration — the very failover the PR description promises.
The comment `// Re-enter the category handling loop as SYSTEM_ERROR` suggests the intent was to fall through to the `SYSTEM_ERROR` handler *within the same iteration*, but `continue` in JavaScript/TypeScript always starts the **next** iteration.
A minimal correction would be to add `failedProviderIds.push(currentProvider.id)` before the `continue`, and to handle the first-vs-last-attempt distinction (endpoint advancement vs. provider switch) similarly to what the `SYSTEM_ERROR` handler already does. Alternatively, restructure so the non-ProxyError path falls through to the existing `SYSTEM_ERROR` handler rather than using `continue`.
```typescript
// Example correction for NON_RETRYABLE guard:
if (!(lastError instanceof ProxyError)) {
logger.warn("ProxyForwarder: Non-ProxyError miscategorized as NON_RETRYABLE_CLIENT_ERROR, reclassifying as SYSTEM_ERROR for retry", {
providerId: currentProvider.id,
providerName: currentProvider.name,
errorType: lastError?.constructor?.name,
errorMessage: lastError?.message,
originalCategory: errorCategory,
});
errorCategory = ErrorCategory.SYSTEM_ERROR;
// ⚠️ Do NOT `continue` here — fall through so the SYSTEM_ERROR handler
// below performs endpoint advancement, session chain recording, and
// failedProviderIds tracking.
}
```
The same fix applies to the `RESOURCE_NOT_FOUND` guard (~line 1527) and the default `PROVIDER_ERROR` guard (~line 1642).
How can I resolve this? If you propose a fix, please make it concise.
src/app/v1/_lib/proxy/forwarder.ts
Outdated
| if (!(lastError instanceof ProxyError)) { | ||
| logger.warn("ProxyForwarder: Non-ProxyError in default handler, reclassifying as SYSTEM_ERROR", { | ||
| providerId: currentProvider.id, | ||
| errorType: lastError?.constructor?.name, | ||
| errorMessage: lastError?.message, | ||
| }); | ||
| errorCategory = ErrorCategory.SYSTEM_ERROR; | ||
| continue; | ||
| } |
There was a problem hiding this comment.
Missing session.addProviderToChain call before reclassification
Like the NON_RETRYABLE_CLIENT_ERROR guard above, this guard and the RESOURCE_NOT_FOUND guard (~line 1527) all use continue without ever recording the failed attempt in the provider chain (session.addProviderToChain). When the inner loop exhausts entirely via these guards, the session audit trail for all those attempts will be silently missing, making it impossible to diagnose repeated SocketError failures after the fact.
The SYSTEM_ERROR handler (the intended destination) would have recorded each attempt as reason: "system_error" with structured errorDetails. That telemetry is lost with the current continue-based approach.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 1642-1650
Comment:
**Missing `session.addProviderToChain` call before reclassification**
Like the `NON_RETRYABLE_CLIENT_ERROR` guard above, this guard and the `RESOURCE_NOT_FOUND` guard (~line 1527) all use `continue` without ever recording the failed attempt in the provider chain (`session.addProviderToChain`). When the inner loop exhausts entirely via these guards, the session audit trail for all those attempts will be silently missing, making it impossible to diagnose repeated SocketError failures after the fact.
The `SYSTEM_ERROR` handler (the intended destination) would have recorded each attempt as `reason: "system_error"` with structured `errorDetails`. That telemetry is lost with the current `continue`-based approach.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/app/v1/_lib/proxy/forwarder.ts (1)
1642-1650: 关键修复位置,建议增强日志信息这个 guard 位于
PROVIDER_ERROR的默认处理逻辑中(EmptyResponseError 已在上方处理),是修复 issue#909的关键位置。当 undici 的 SocketError 或其他网络层错误被错误分类为 PROVIDER_ERROR 时,这个 guard 可以正确地将其重新分类为 SYSTEM_ERROR 并触发重试。建议:为保持日志信息的一致性,可以补充
providerName字段(与第一个 guard 保持一致):建议的改进
if (!(lastError instanceof ProxyError)) { logger.warn("ProxyForwarder: Non-ProxyError in default handler, reclassifying as SYSTEM_ERROR", { providerId: currentProvider.id, + providerName: currentProvider.name, errorType: lastError?.constructor?.name, errorMessage: lastError?.message, });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 1642 - 1650, The log in the default PROVIDER_ERROR handler reclassifies non-ProxyError as SYSTEM_ERROR but omits providerName; update the logger.warn call inside the guard (the block that checks "if (!(lastError instanceof ProxyError))" in forwarder.ts/ProxyForwarder) to include providerName alongside providerId, errorType, and errorMessage so it matches the earlier guard's fields; leave the errorCategory reassignment to ErrorCategory.SYSTEM_ERROR and the continue intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/app/v1/_lib/proxy/forwarder.ts`:
- Around line 1642-1650: The log in the default PROVIDER_ERROR handler
reclassifies non-ProxyError as SYSTEM_ERROR but omits providerName; update the
logger.warn call inside the guard (the block that checks "if (!(lastError
instanceof ProxyError))" in forwarder.ts/ProxyForwarder) to include providerName
alongside providerId, errorType, and errorMessage so it matches the earlier
guard's fields; leave the errorCategory reassignment to
ErrorCategory.SYSTEM_ERROR and the continue intact.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2709196c-ce7d-4213-8e23-33f3acfb7b18
📒 Files selected for processing (1)
src/app/v1/_lib/proxy/forwarder.ts
There was a problem hiding this comment.
Code Review Summary
This PR adds defensive instanceof ProxyError type guards at three locations in forwarder.ts to prevent TypeError crashes when non-ProxyError errors (like SocketError) are miscategorized as NON_RETRYABLE_CLIENT_ERROR by the error categorizer. The fix correctly reclassifies such errors to SYSTEM_ERROR and continues the retry/failover path.
PR Size: XS
- Lines changed: 35
- Files changed: 1
Review Coverage
- Logic and correctness - The fix logic is sound
- Security - No security issues
- Error handling - The fix improves error handling
- Type safety - Uses proper instanceof checks
- Documentation accuracy - Comments are accurate
- Test coverage - Defensive bug fix
- Code clarity - Minimal and focused
Automated review by Claude AI
src/app/v1/_lib/proxy/forwarder.ts
Outdated
| // which truncates streaming responses. Reclassify as SYSTEM_ERROR | ||
| // so it enters the retry/failover path instead. | ||
| // See: https://github.com/ding113/claude-code-hub/issues/909 | ||
| if (!(lastError instanceof ProxyError)) { |
There was a problem hiding this comment.
[High] continue bypasses SYSTEM_ERROR bookkeeping, breaking provider failover
Why this is a problem: When continue is executed (lines 1359-1369, 1527-1535, 1642-1650), it jumps to the next iteration of the inner retry loop, completely skipping the SYSTEM_ERROR handler (lines 1416-1506). That handler is responsible for:
- Recording the failed attempt via
session.addProviderToChain(...) - Advancing
currentEndpointIndex++on first failure - Calling
failedProviderIds.push(currentProvider.id)when exhausted - Triggering
breakto switch providers
Because none of these happen, when all maxAttemptsPerProvider iterations exhaust through this guard, the inner loop exits naturally with failedProviderIds still empty. selectAlternative(session, failedProviderIds) then has no reason to exclude the broken provider and may re-select it.
Additionally, the error categorization result is cached (errors.ts:576), so categorizeErrorAsync will return the same NON_RETRYABLE_CLIENT_ERROR on every retry, causing the fix to reclassify and continue without ever executing the SYSTEM_ERROR path.
Suggested fix: Instead of continue, re-throw as a ProxyError that will be re-categorized as SYSTEM_ERROR, or manually add the bookkeeping before continuing.
There was a problem hiding this comment.
Code Review Summary
This PR adds defensive instanceof ProxyError type guards to prevent TypeError crashes when non-ProxyError errors (like SocketError) are miscategorized as NON_RETRYABLE_CLIENT_ERROR. However, there is a High priority issue with the implementation that could break provider failover.
PR Size: XS
- Lines changed: 35
- Files changed: 1
Issues Found
| Category | Critical | High | Medium | Low |
|---|---|---|---|---|
| Logic/Bugs | 0 | 1 | 0 | 0 |
| Error Handling | 0 | 1 | 0 | 0 |
High Priority Issues (Should Fix)
[High] continue bypasses SYSTEM_ERROR bookkeeping, breaking provider failover
The fix uses continue after reclassifying to SYSTEM_ERROR, which jumps to the next iteration and completely skips the SYSTEM_ERROR handler. This handler is responsible for:
- Recording to session chain (
session.addProviderToChain) - Tracking failed providers (
failedProviderIds.push) - Advancing endpoints and triggering provider switch
When all retry attempts exhaust through this guard path, failedProviderIds remains empty, so selectAlternative cannot exclude the broken provider. Additionally, error categorization is cached, so subsequent retries will keep returning the same category, causing an infinite loop through the guard without proper bookkeeping.
Review Coverage
- Logic and correctness - Issue found
- Security (OWASP Top 10) - Clean
- Error handling - Issue found (continue breaks error handling flow)
- Type safety - Clean (uses instanceof correctly)
- Documentation accuracy - Clean
- Test coverage - N/A (defensive fix)
- Code clarity - Clean
Automated review by Claude AI
70b4378 to
6bd3472
Compare
src/app/v1/_lib/proxy/forwarder.ts
Outdated
| errorCategory = ErrorCategory.SYSTEM_ERROR; | ||
| continue; // Re-enter the loop to hit SYSTEM_ERROR handler |
There was a problem hiding this comment.
errorCategory reassignment is a dead write — SYSTEM_ERROR handler is never reached
errorCategory is declared with let inside the catch block at line 1023:
let errorCategory = await categorizeErrorAsync(lastError);Because it is block-scoped to each catch execution, the assignment errorCategory = ErrorCategory.SYSTEM_ERROR here only modifies the current iteration's local copy. When continue jumps to the next loop iteration and the request fails again with the same SocketError, the catch block re-executes and a fresh let errorCategory is assigned from categorizeErrorAsync — which will again return NON_RETRYABLE_CLIENT_ERROR for the same error. The SYSTEM_ERROR handler (lines ~1471–1562) is therefore never reached for this error path.
This means:
- The comment
// Re-enter the loop to hit SYSTEM_ERROR handleris incorrect. - All
maxAttemptsPerProvideriterations are consumed by theisNetworkErrorguard, each timecontinue-ing back into the same path. - The SYSTEM_ERROR handler's bookkeeping (
session.addProviderToChain,failedProviderIds.push,break) is never executed, which is the root cause of the failover not working (as also noted in the existing review thread at line 1359).
To actually route into the SYSTEM_ERROR handler, the reclassification must persist across iterations — for example by hoisting a flag variable outside the catch block and checking it at the start of the catch block before calling categorizeErrorAsync.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 1435-1436
Comment:
**`errorCategory` reassignment is a dead write — SYSTEM_ERROR handler is never reached**
`errorCategory` is declared with `let` *inside* the `catch` block at line 1023:
```typescript
let errorCategory = await categorizeErrorAsync(lastError);
```
Because it is block-scoped to each catch execution, the assignment `errorCategory = ErrorCategory.SYSTEM_ERROR` here only modifies the current iteration's local copy. When `continue` jumps to the next loop iteration and the request fails again with the same `SocketError`, the catch block re-executes and a **fresh** `let errorCategory` is assigned from `categorizeErrorAsync` — which will again return `NON_RETRYABLE_CLIENT_ERROR` for the same error. The SYSTEM_ERROR handler (lines ~1471–1562) is therefore *never reached* for this error path.
This means:
- The comment `// Re-enter the loop to hit SYSTEM_ERROR handler` is incorrect.
- All `maxAttemptsPerProvider` iterations are consumed by the `isNetworkError` guard, each time `continue`-ing back into the same path.
- The SYSTEM_ERROR handler's bookkeeping (`session.addProviderToChain`, `failedProviderIds.push`, `break`) is never executed, which is the root cause of the failover not working (as also noted in the existing review thread at line 1359).
To actually route into the SYSTEM_ERROR handler, the reclassification must persist *across* iterations — for example by hoisting a flag variable outside the catch block and checking it at the start of the catch block before calling `categorizeErrorAsync`.
How can I resolve this? If you propose a fix, please make it concise.| if (isNetworkError) { | ||
| const err = lastError as Error & { | ||
| code?: string; | ||
| errno?: number; | ||
| syscall?: string; | ||
| }; | ||
|
|
||
| logger.warn( | ||
| "ProxyForwarder: Network error miscategorized as NON_RETRYABLE_CLIENT_ERROR, handling as SYSTEM_ERROR for retry/failover", | ||
| { | ||
| providerId: currentProvider.id, | ||
| providerName: currentProvider.name, | ||
| errorType: err.constructor.name, | ||
| errorMessage: err.message, | ||
| errorCode: err.code, | ||
| attemptNumber: attemptCount, | ||
| totalProvidersAttempted, | ||
| willRetry: attemptCount < maxAttemptsPerProvider, | ||
| originalCategory: errorCategory, | ||
| } | ||
| ); | ||
|
|
||
| session.addProviderToChain(currentProvider, { | ||
| ...endpointAudit, | ||
| reason: "system_error", | ||
| circuitState: getCircuitState(currentProvider.id), | ||
| attemptNumber: attemptCount, | ||
| errorMessage: errorMessage, | ||
| errorDetails: { | ||
| system: { | ||
| errorType: err.constructor.name, | ||
| errorName: err.name, | ||
| errorMessage: err.message || err.name || "Unknown error", | ||
| errorCode: err.code, | ||
| errorSyscall: err.syscall, | ||
| }, | ||
| request: buildRequestDetails(session), | ||
| }, | ||
| }); | ||
|
|
||
| if (attemptCount < maxAttemptsPerProvider) { | ||
| currentEndpointIndex++; | ||
| logger.debug("ProxyForwarder: Advancing endpoint index due to miscategorized network error", { | ||
| providerId: currentProvider.id, | ||
| previousEndpointIndex: currentEndpointIndex - 1, | ||
| newEndpointIndex: currentEndpointIndex, | ||
| maxEndpointIndex: endpointCandidates.length - 1, | ||
| }); | ||
|
|
||
| await new Promise((resolve) => setTimeout(resolve, 100)); | ||
| continue; | ||
| } |
There was a problem hiding this comment.
recordEndpointFailure is skipped for reclassified network errors
The endpoint-level circuit breaker call at lines 1037–1041 is gated on errorCategory === ErrorCategory.SYSTEM_ERROR:
if (activeEndpoint.endpointId != null) {
if (isTimeoutError || errorCategory === ErrorCategory.SYSTEM_ERROR) {
await recordEndpointFailure(activeEndpoint.endpointId, lastError);
}
}Because categorizeErrorAsync classifies a SocketError as NON_RETRYABLE_CLIENT_ERROR (that is the root cause this PR is fixing), errorCategory is never SYSTEM_ERROR when this guard runs. As a result, recordEndpointFailure is never called for the network errors handled by this new block.
Genuine SYSTEM_ERROR-classified network errors correctly record a per-endpoint failure. The new reclassified network-error path silently skips it, so degraded endpoints that keep producing SocketError are not penalised in the endpoint circuit breaker and will remain eligible for selection on every subsequent retry—potentially causing more SocketErrors on the same endpoint.
A minimal fix is to call recordEndpointFailure inside the isNetworkError branch before continue/break, mirroring what already happens at line 1039:
if (isNetworkError) {
// ... existing session.addProviderToChain call ...
// Mirror the endpoint-level circuit breaker recording that SYSTEM_ERROR gets at line 1039
if (activeEndpoint.endpointId != null) {
await recordEndpointFailure(activeEndpoint.endpointId, lastError);
}
if (attemptCount < maxAttemptsPerProvider) {
// ...continue path...
}
// ...break path...
}Prompt To Fix With AI
This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 1423-1474
Comment:
**`recordEndpointFailure` is skipped for reclassified network errors**
The endpoint-level circuit breaker call at lines 1037–1041 is gated on `errorCategory === ErrorCategory.SYSTEM_ERROR`:
```typescript
if (activeEndpoint.endpointId != null) {
if (isTimeoutError || errorCategory === ErrorCategory.SYSTEM_ERROR) {
await recordEndpointFailure(activeEndpoint.endpointId, lastError);
}
}
```
Because `categorizeErrorAsync` classifies a `SocketError` as `NON_RETRYABLE_CLIENT_ERROR` (that is the root cause this PR is fixing), `errorCategory` is never `SYSTEM_ERROR` when this guard runs. As a result, `recordEndpointFailure` is **never called** for the network errors handled by this new block.
Genuine `SYSTEM_ERROR`-classified network errors correctly record a per-endpoint failure. The new reclassified network-error path silently skips it, so degraded endpoints that keep producing `SocketError` are not penalised in the endpoint circuit breaker and will remain eligible for selection on every subsequent retry—potentially causing more SocketErrors on the same endpoint.
A minimal fix is to call `recordEndpointFailure` inside the `isNetworkError` branch before `continue`/`break`, mirroring what already happens at line 1039:
```typescript
if (isNetworkError) {
// ... existing session.addProviderToChain call ...
// Mirror the endpoint-level circuit breaker recording that SYSTEM_ERROR gets at line 1039
if (activeEndpoint.endpointId != null) {
await recordEndpointFailure(activeEndpoint.endpointId, lastError);
}
if (attemptCount < maxAttemptsPerProvider) {
// ...continue path...
}
// ...break path...
}
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
🧹 Nitpick comments (2)
src/app/v1/_lib/proxy/forwarder.ts (2)
1410-1508: 核心修复逻辑正确,网络错误现在能正确触发重试/故障转移该修复正确识别了被误分类为
NON_RETRYABLE_CLIENT_ERROR的网络错误,并内联处理为 SYSTEM_ERROR 路径。与之前使用continue跳转导致的死写问题相比,当前实现在continue/break之前完成了所有必要的记账操作(provider chain、endpoint 推进、failedProviderIds)。有两个小的改进建议:
Line 1482 重复声明
env:env已在 Line 536 声明,此处无需重新获取。Line 1451-1460 缺少
errorStack:与 SYSTEM_ERROR handler(Line 1573)相比,审计数据缺少errorStack字段,这会影响调试时的信息完整性。建议的改进
@@ -1448,6 +1448,7 @@ errorDetails: { system: { errorType: err.constructor.name, errorName: err.name, errorMessage: err.message || err.name || "Unknown error", errorCode: err.code, errorSyscall: err.syscall, + errorStack: err.stack?.split("\n").slice(0, 3).join("\n"), }, request: buildRequestDetails(session), },@@ -1479,7 +1479,7 @@ totalProvidersAttempted, }); - const env = getEnvConfig(); + // env already declared at line 536, reuse it failedProviderIds.push(currentProvider.id); if (env.ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 1410 - 1508, Remove the redundant getEnvConfig() call inside the miscategorized-network-error branch (it’s already retrieved earlier) and include the error stack in the audit payload so it matches the SYSTEM_ERROR path: when building the audit entry in session.addProviderToChain (inside the ProxyForwarder miscategorized network error handling) add errorStack: err.stack (or err.stack || null) under errorDetails.system; delete the duplicate env = getEnvConfig() line and keep using the previously-scoped env for the ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS check and subsequent recordFailure/currentProvider handling.
1414-1421: 网络错误检测逻辑可考虑抽取为工具函数当前的
isNetworkError检测模式覆盖了常见的 undici/Node.js 网络错误场景。但如果这种检测逻辑在其他地方也需要使用(例如 hedge 路径或其他错误处理器),建议将其抽取为一个独立的工具函数放在errors.ts中,便于集中维护和复用。// 可在 errors.ts 中添加 export function isNetworkTransportError(error: Error): boolean { const err = error as Error & { code?: string }; return ( error.name === "SocketError" || err.code === "ECONNRESET" || err.code === "ECONNREFUSED" || err.code === "ETIMEDOUT" || err.code === "ENOTFOUND" || error.message?.toLowerCase().includes("fetch failed") || error.message?.toLowerCase().includes("other side closed") ); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 1414 - 1421, Extract the inline network-error detection into a reusable helper: create and export a function isNetworkTransportError(error: Error) in errors.ts that contains the existing checks (referencing the same checks now assigned to isNetworkError), then replace the inline isNetworkError expression in forwarder.ts with a call to isNetworkTransportError(lastError) and import it; also update any other places (e.g., hedge path or other error handlers) to use the new isNetworkTransportError to centralize and reuse the logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/app/v1/_lib/proxy/forwarder.ts`:
- Around line 1410-1508: Remove the redundant getEnvConfig() call inside the
miscategorized-network-error branch (it’s already retrieved earlier) and include
the error stack in the audit payload so it matches the SYSTEM_ERROR path: when
building the audit entry in session.addProviderToChain (inside the
ProxyForwarder miscategorized network error handling) add errorStack: err.stack
(or err.stack || null) under errorDetails.system; delete the duplicate env =
getEnvConfig() line and keep using the previously-scoped env for the
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS check and subsequent
recordFailure/currentProvider handling.
- Around line 1414-1421: Extract the inline network-error detection into a
reusable helper: create and export a function isNetworkTransportError(error:
Error) in errors.ts that contains the existing checks (referencing the same
checks now assigned to isNetworkError), then replace the inline isNetworkError
expression in forwarder.ts with a call to isNetworkTransportError(lastError) and
import it; also update any other places (e.g., hedge path or other error
handlers) to use the new isNetworkTransportError to centralize and reuse the
logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9ba4d10d-7924-4dd8-bc25-469a2e1131b6
📒 Files selected for processing (1)
src/app/v1/_lib/proxy/forwarder.ts
There was a problem hiding this comment.
🧹 Nitpick comments (3)
src/app/v1/_lib/proxy/forwarder.ts (3)
1451-1461:errorDetails.system缺少errorStack字段与 SYSTEM_ERROR 处理器(第 1580 行)相比,此处的
errorDetails.system对象缺少errorStack字段。虽然不影响功能,但添加此字段可以在调试时提供更完整的错误上下文。建议添加 errorStack 字段
errorDetails: { system: { errorType: err.constructor.name, errorName: err.name, errorMessage: err.message || err.name || "Unknown error", errorCode: err.code, errorSyscall: err.syscall, + errorStack: err.stack?.split("\n").slice(0, 3).join("\n"), }, request: buildRequestDetails(session), },🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 1451 - 1461, The errorDetails.system object built where errors are reported is missing the errorStack field; update the object construction (the block that sets errorDetails: { system: { ... } }) to include errorStack: err.stack (or err.stack || null) so it matches the SYSTEM_ERROR handler behavior around the SYSTEM_ERROR handling code (referenced near line 1580) and provides full stack context when logging from the forwarder error reporting path.
1489-1489: 重复调用getEnvConfig()产生变量遮蔽第 1489 行的
const env = getEnvConfig();遮蔽了第 536 行已声明的外层env变量。由于外层env在整个send()方法作用域内有效,此处无需重新获取。移除重复声明
- const env = getEnvConfig(); failedProviderIds.push(currentProvider.id);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/v1/_lib/proxy/forwarder.ts` at line 1489, Inside the send() method remove the redundant local declaration "const env = getEnvConfig();" that shadows the outer env variable; instead reuse the already-declared outer env (referenced earlier in send()) and delete the extra call to getEnvConfig() to avoid duplicate invocation and variable shadowing (search for the getEnvConfig() call within send() and remove it so subsequent references use the outer env).
1410-1421: 网络错误检测逻辑正确,但建议提取为独立的类型守卫函数当前的
isNetworkError检测逻辑正确覆盖了常见的网络/传输错误类型。然而,代码库中已有类似的错误检测函数(如isClientAbortError、isHttp2Error、isSSLCertificateError),建议将此逻辑提取到errors.ts作为独立的类型守卫函数,以保持一致性并便于复用。建议提取到 errors.ts 的示例
// 在 errors.ts 中添加 export function isNetworkError(error: unknown): error is Error & { code?: string } { if (!(error instanceof Error)) return false; const err = error as Error & { code?: string }; return ( err.name === "SocketError" || err.code === "ECONNRESET" || err.code === "ECONNREFUSED" || err.code === "ETIMEDOUT" || err.code === "ENOTFOUND" || err.message?.toLowerCase().includes("fetch failed") || err.message?.toLowerCase().includes("other side closed") ); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 1410 - 1421, Extract the inline network-detection logic into a reusable type-guard exported from errors.ts (e.g., export function isNetworkError(error: unknown): error is Error & { code?: string }) that first checks instanceof Error and then tests name, code values (ECONNRESET, ECONNREFUSED, ETIMEDOUT, ENOTFOUND) and message substrings ("fetch failed", "other side closed"); replace the inline isNetworkError computation in forwarder.ts with a call to this new isNetworkError(lastError) and add the appropriate import from errors.ts so other modules can reuse the guard.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/app/v1/_lib/proxy/forwarder.ts`:
- Around line 1451-1461: The errorDetails.system object built where errors are
reported is missing the errorStack field; update the object construction (the
block that sets errorDetails: { system: { ... } }) to include errorStack:
err.stack (or err.stack || null) so it matches the SYSTEM_ERROR handler behavior
around the SYSTEM_ERROR handling code (referenced near line 1580) and provides
full stack context when logging from the forwarder error reporting path.
- Line 1489: Inside the send() method remove the redundant local declaration
"const env = getEnvConfig();" that shadows the outer env variable; instead reuse
the already-declared outer env (referenced earlier in send()) and delete the
extra call to getEnvConfig() to avoid duplicate invocation and variable
shadowing (search for the getEnvConfig() call within send() and remove it so
subsequent references use the outer env).
- Around line 1410-1421: Extract the inline network-detection logic into a
reusable type-guard exported from errors.ts (e.g., export function
isNetworkError(error: unknown): error is Error & { code?: string }) that first
checks instanceof Error and then tests name, code values (ECONNRESET,
ECONNREFUSED, ETIMEDOUT, ENOTFOUND) and message substrings ("fetch failed",
"other side closed"); replace the inline isNetworkError computation in
forwarder.ts with a call to this new isNetworkError(lastError) and add the
appropriate import from errors.ts so other modules can reuse the guard.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0ccbe974-0cec-48d8-8f1f-76416cb31560
📒 Files selected for processing (1)
src/app/v1/_lib/proxy/forwarder.ts
| const isNetworkError = | ||
| lastError.name === "SocketError" || | ||
| (lastError as Error & { code?: string }).code === "ECONNRESET" || | ||
| (lastError as Error & { code?: string }).code === "ECONNREFUSED" || | ||
| (lastError as Error & { code?: string }).code === "ETIMEDOUT" || | ||
| (lastError as Error & { code?: string }).code === "ENOTFOUND" || | ||
| lastError.message?.toLowerCase().includes("fetch failed") || | ||
| lastError.message?.toLowerCase().includes("other side closed"); |
There was a problem hiding this comment.
Incomplete duplicate of existing isTransportError — misses wrapped-cause errors, EAI_AGAIN, and undici-specific codes
errors.ts already has a private isTransportError function (used in categorizeErrorAsync at priority 1.5) that implements the canonical set of transport-error signals. The inline isNetworkError here duplicates that logic but with three important gaps:
-
error.cause?.codeis not checked. Undici frequently wraps transport failures: e.g., aTypeError: fetch failedcarriescause.code === "ECONNRESET".isTransportErrorchecks(error as ...).cause?.codebut the new code only tests the top-level.code, so wrapped errors escape detection. -
"EAI_AGAIN"is missing. Temporary DNS failures (common in Kubernetes/container environments) carry codeEAI_AGAIN.isTransportErrorincludes it; the new check does not, so a containerised DNS hiccup is silently treated as a genuine client error. -
Undici-internal codes are missing.
isTransportErroralso covers"UND_ERR_SOCKET","UND_ERR_CONNECT_TIMEOUT","UND_ERR_HEADERS_TIMEOUT", and"UND_ERR_BODY_TIMEOUT". The new code ignores them entirely.
Because categorizeErrorAsync already runs isTransportError at priority 1.5 — before isNonRetryableClientErrorAsync — a standard SocketError or ECONNRESET should never even reach NON_RETRYABLE_CLIENT_ERROR. This isNetworkError guard can therefore only fire when a network error somehow slips past isTransportError (e.g., an unexpected wrapping). In that edge case the incomplete implementation fails to catch the very same error patterns it was designed for.
Consider exporting isTransportError from errors.ts and reusing it here, or at minimum aligning the detection logic by adding the missing codes and the cause?.code check:
const errCode =
(lastError as Error & { code?: string }).code ??
(lastError as Error & { cause?: { code?: string } }).cause?.code;
const TRANSPORT_CODES = new Set([
"ECONNRESET", "ECONNREFUSED", "ETIMEDOUT", "ENOTFOUND",
"EAI_AGAIN",
"UND_ERR_SOCKET", "UND_ERR_CONNECT_TIMEOUT",
"UND_ERR_HEADERS_TIMEOUT", "UND_ERR_BODY_TIMEOUT",
]);
const isNetworkError =
lastError.name === "SocketError" ||
(errCode !== undefined && TRANSPORT_CODES.has(errCode)) ||
lastError.message?.toLowerCase().includes("fetch failed") ||
lastError.message?.toLowerCase().includes("other side closed");Prompt To Fix With AI
This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 1414-1421
Comment:
**Incomplete duplicate of existing `isTransportError` — misses wrapped-cause errors, `EAI_AGAIN`, and undici-specific codes**
`errors.ts` already has a private `isTransportError` function (used in `categorizeErrorAsync` at priority 1.5) that implements the canonical set of transport-error signals. The inline `isNetworkError` here duplicates that logic but with three important gaps:
1. **`error.cause?.code` is not checked.** Undici frequently wraps transport failures: e.g., a `TypeError: fetch failed` carries `cause.code === "ECONNRESET"`. `isTransportError` checks `(error as ...).cause?.code` but the new code only tests the top-level `.code`, so wrapped errors escape detection.
2. **`"EAI_AGAIN"` is missing.** Temporary DNS failures (common in Kubernetes/container environments) carry code `EAI_AGAIN`. `isTransportError` includes it; the new check does not, so a containerised DNS hiccup is silently treated as a genuine client error.
3. **Undici-internal codes are missing.** `isTransportError` also covers `"UND_ERR_SOCKET"`, `"UND_ERR_CONNECT_TIMEOUT"`, `"UND_ERR_HEADERS_TIMEOUT"`, and `"UND_ERR_BODY_TIMEOUT"`. The new code ignores them entirely.
Because `categorizeErrorAsync` already runs `isTransportError` at priority 1.5 — before `isNonRetryableClientErrorAsync` — a standard `SocketError` or `ECONNRESET` should never even reach `NON_RETRYABLE_CLIENT_ERROR`. This `isNetworkError` guard can therefore only fire when a network error somehow slips past `isTransportError` (e.g., an unexpected wrapping). In that edge case the incomplete implementation fails to catch the very same error patterns it was designed for.
Consider exporting `isTransportError` from `errors.ts` and reusing it here, or at minimum aligning the detection logic by adding the missing codes and the `cause?.code` check:
```typescript
const errCode =
(lastError as Error & { code?: string }).code ??
(lastError as Error & { cause?: { code?: string } }).cause?.code;
const TRANSPORT_CODES = new Set([
"ECONNRESET", "ECONNREFUSED", "ETIMEDOUT", "ENOTFOUND",
"EAI_AGAIN",
"UND_ERR_SOCKET", "UND_ERR_CONNECT_TIMEOUT",
"UND_ERR_HEADERS_TIMEOUT", "UND_ERR_BODY_TIMEOUT",
]);
const isNetworkError =
lastError.name === "SocketError" ||
(errCode !== undefined && TRANSPORT_CODES.has(errCode)) ||
lastError.message?.toLowerCase().includes("fetch failed") ||
lastError.message?.toLowerCase().includes("other side closed");
```
How can I resolve this? If you propose a fix, please make it concise.| errorDetails: { | ||
| system: { | ||
| errorType: err.constructor.name, | ||
| errorName: err.name, | ||
| errorMessage: err.message || err.name || "Unknown error", | ||
| errorCode: err.code, | ||
| errorSyscall: err.syscall, | ||
| }, | ||
| request: buildRequestDetails(session), | ||
| }, | ||
| }); |
There was a problem hiding this comment.
Missing errorStack in system error details — inconsistent audit trail
Every other system-type errorDetails block in this file includes errorStack: err.stack?.split("\n").slice(0, 3).join("\n") (e.g., the canonical SYSTEM_ERROR handler at line ~1580, and the rectifier paths at lines ~1204 and ~1337). The new block omits it, making it impossible to distinguish call sites during post-incident diagnosis of miscategorised network errors.
| errorDetails: { | |
| system: { | |
| errorType: err.constructor.name, | |
| errorName: err.name, | |
| errorMessage: err.message || err.name || "Unknown error", | |
| errorCode: err.code, | |
| errorSyscall: err.syscall, | |
| }, | |
| request: buildRequestDetails(session), | |
| }, | |
| }); | |
| errorDetails: { | |
| system: { | |
| errorType: err.constructor.name, | |
| errorName: err.name, | |
| errorMessage: err.message || err.name || "Unknown error", | |
| errorCode: err.code, | |
| errorSyscall: err.syscall, | |
| errorStack: err.stack?.split("\n").slice(0, 3).join("\n"), | |
| }, | |
| request: buildRequestDetails(session), | |
| }, |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 1451-1461
Comment:
**Missing `errorStack` in `system` error details — inconsistent audit trail**
Every other `system`-type `errorDetails` block in this file includes `errorStack: err.stack?.split("\n").slice(0, 3).join("\n")` (e.g., the canonical SYSTEM_ERROR handler at line ~1580, and the rectifier paths at lines ~1204 and ~1337). The new block omits it, making it impossible to distinguish call sites during post-incident diagnosis of miscategorised network errors.
```suggestion
errorDetails: {
system: {
errorType: err.constructor.name,
errorName: err.name,
errorMessage: err.message || err.name || "Unknown error",
errorCode: err.code,
errorSyscall: err.syscall,
errorStack: err.stack?.split("\n").slice(0, 3).join("\n"),
},
request: buildRequestDetails(session),
},
```
How can I resolve this? If you propose a fix, please make it concise.…OR path When a SocketError or other transport error is miscategorized as NON_RETRYABLE_CLIENT_ERROR, the existing instanceof ProxyError guard (added in dev) routes it to the plain-error else branch, which throws immediately without retry or failover — truncating streaming responses. This patch adds inline SYSTEM_ERROR-equivalent handling for detected network errors in the plain-error branch: - Transport error detection aligned with isTransportError() in errors.ts, including cause?.code unwrapping, EAI_AGAIN, and undici-specific codes - Endpoint-level circuit breaker recording (recordEndpointFailure) - Retry with endpoint index advancement when attempts remain - Provider failover with failedProviderIds tracking when exhausted - Provider-level circuit breaker recording per config Fixes ding113#909
bfcbda3 to
01467ea
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (3)
src/app/v1/_lib/proxy/forwarder.ts (3)
1419-1424:TRANSPORT_CODES在循环内重复创建,可考虑提取为模块级常量每次进入错误处理分支时都会重新创建
Set对象。虽然性能影响有限,但将其提取到模块级别可以避免重复创建,同时也便于与errors.ts中的isTransportError(如有)保持同步。建议提取为模块级常量
+/** Network/transport error codes that should trigger retry/failover */ +const TRANSPORT_CODES = new Set([ + "ECONNRESET", "ECONNREFUSED", "ETIMEDOUT", "ENOTFOUND", + "EAI_AGAIN", + "UND_ERR_SOCKET", "UND_ERR_CONNECT_TIMEOUT", + "UND_ERR_HEADERS_TIMEOUT", "UND_ERR_BODY_TIMEOUT", +]); + const STANDARD_ENDPOINTS = [ "/v1/messages",然后在第 1419-1424 行删除局部定义即可。
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 1419 - 1424, Extract the TRANSPORT_CODES Set out of the error handling block and define it as a module-level constant so it is created once and reusable; replace the local definition in forwarder.ts (currently named TRANSPORT_CODES inside the error handling branch) with a reference to the new module-level constant, and consider aligning/deferring to the existing isTransportError utility in errors.ts if present to avoid duplication of transport error logic.
1454-1470:errorDetails.system缺少errorStack字段,与 SYSTEM_ERROR 处理路径不一致在 SYSTEM_ERROR 处理分支(第 1582-1593 行),
errorDetails.system包含errorStack字段用于调试追踪。但此处新增的网络错误重分类路径缺少该字段,会导致决策链审计数据不一致。建议补充 errorStack 字段
errorDetails: { system: { errorType: err.constructor.name, errorName: err.name, errorMessage: err.message || err.name || "Unknown error", errorCode: err.code, errorSyscall: err.syscall, + errorStack: err.stack?.split("\n").slice(0, 3).join("\n"), }, request: buildRequestDetails(session), },🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 1454 - 1470, The errorDetails.system object created in the network-error reclassification path passed to session.addProviderToChain (inside the block building endpointAudit for currentProvider) is missing the errorStack field; to fix, include errorStack: err.stack (or err.stack || null) alongside errorType/name/message/code/syscall so the structure matches the SYSTEM_ERROR path used later (use the same keys as in the SYSTEM_ERROR handling, e.g., errorStack) and ensure you reference currentProvider.id and getCircuitState(currentProvider.id) as existing context when adding the provider to the chain.
1498-1498: 变量env重复声明(shadowing)第 536 行已有
const env = getEnvConfig(),此处重新声明会产生 shadowing。由于getEnvConfig()返回缓存的单例,功能上没有问题,但可以直接使用外层已有的env变量。建议移除重复声明
- const env = getEnvConfig(); failedProviderIds.push(currentProvider.id); if (env.ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/v1/_lib/proxy/forwarder.ts` at line 1498, 在 forwarder.ts 中第 1498 行的 `const env = getEnvConfig()` 与第 536 行已定义的同名变量产生了 shadowing;请删除此处重复的声明并直接复用外层的 `env` 变量(保留对 getEnvConfig() 的唯一调用处),以避免重复声明和潜在混淆。确保引用的标识符为 `env` 和 `getEnvConfig()`,并在删除重复声明后运行编译/类型检查确认没有未定义引用。
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/app/v1/_lib/proxy/forwarder.ts`:
- Around line 1419-1424: Extract the TRANSPORT_CODES Set out of the error
handling block and define it as a module-level constant so it is created once
and reusable; replace the local definition in forwarder.ts (currently named
TRANSPORT_CODES inside the error handling branch) with a reference to the new
module-level constant, and consider aligning/deferring to the existing
isTransportError utility in errors.ts if present to avoid duplication of
transport error logic.
- Around line 1454-1470: The errorDetails.system object created in the
network-error reclassification path passed to session.addProviderToChain (inside
the block building endpointAudit for currentProvider) is missing the errorStack
field; to fix, include errorStack: err.stack (or err.stack || null) alongside
errorType/name/message/code/syscall so the structure matches the SYSTEM_ERROR
path used later (use the same keys as in the SYSTEM_ERROR handling, e.g.,
errorStack) and ensure you reference currentProvider.id and
getCircuitState(currentProvider.id) as existing context when adding the provider
to the chain.
- Line 1498: 在 forwarder.ts 中第 1498 行的 `const env = getEnvConfig()` 与第 536
行已定义的同名变量产生了 shadowing;请删除此处重复的声明并直接复用外层的 `env` 变量(保留对 getEnvConfig()
的唯一调用处),以避免重复声明和潜在混淆。确保引用的标识符为 `env` 和
`getEnvConfig()`,并在删除重复声明后运行编译/类型检查确认没有未定义引用。
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 36fb2b44-ff97-474e-928d-6a8bd3f5b35d
📒 Files selected for processing (1)
src/app/v1/_lib/proxy/forwarder.ts
Problem
When an upstream provider connection drops (e.g.
SocketErrorfrom Node.js undici), the error can be miscategorized asNON_RETRYABLE_CLIENT_ERRORbyErrorCategorizer.The
devbranch already addedinstanceof ProxyErrorguards to prevent theTypeError: getDetailedErrorMessage is not a functioncrash. However, non-ProxyError network errors (SocketError, ECONNRESET, etc.) in theelsebranch still get thrown immediately without retry or failover, causing:Fix
In the
elsebranch of theNON_RETRYABLE_CLIENT_ERRORhandler, add network error detection before throwing:fetch failed,other side closed) are detectedSYSTEM_ERRORviacontinue, entering the retry/failover pathThis is consistent with how the existing
SYSTEM_ERRORhandler works — retry current provider once, then failover to next provider.Changes
src/app/v1/_lib/proxy/forwarder.ts: AddisNetworkErrorcheck in theelse(non-ProxyError) branch ofNON_RETRYABLE_CLIENT_ERRORhandler (+30 lines)Testing
Fixes #909
Greptile Summary
This PR adds a defensive network-error reclassification guard inside the
NON_RETRYABLE_CLIENT_ERRORhandler'selsebranch (non-ProxyErrorpath) inProxyForwarder.send(). When a SocketError or other transport-level error escapesisTransportError()classification (e.g. because a user-configured error rule matched its message content before the transport check could short-circuit it), the error previously caused immediatethrow-and-abandon behaviour; the fix correctly reclassifies it asSYSTEM_ERROR, performs all required bookkeeping, and routes it into the retry/failover path.isNetworkErrordetection set (SocketErrorname,TRANSPORT_CODES,fetch failed/other side closedmessage signatures) correctly mirrors the existingisTransportError()implementation inerrors.ts, includingerror.cause?.codewrapping,EAI_AGAIN, and all undici-specific codes.SYSTEM_ERRORhandler performs is explicitly replicated:session.addProviderToChain(reason: "system_error"),recordEndpointFailure,currentEndpointIndex++,failedProviderIds.push, conditionalrecordFailureguarded byENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS.continue/breakplacement correctly bypasses thethrow lastErrorat the end of theNON_RETRYABLE_CLIENT_ERRORblock.TRANSPORT_CODESis anew Set(...)literal allocated inside the catch block on every invocation rather than a module-level or shared constant, creating a silent drift risk iferrors.tsgains new transport codes without updating this copy.Confidence Score: 4/5
isTransportError()function. The only notable issue is a duplicated inlineTRANSPORT_CODESSet that could drift fromerrors.tsover time, which is a maintainability concern rather than a runtime bug.TRANSPORT_CODESconstant should ideally be unified with the one inerrors.ts.Important Files Changed
isNetworkErrorguard inside theNON_RETRYABLE_CLIENT_ERRORelse-branch that correctly replicates theisTransportError()detection logic fromerrors.tsand performs all required bookkeeping (session.addProviderToChain,recordEndpointFailure,failedProviderIds.push) before issuingcontinue/break. The retry/failover semantics are correct. Minor:TRANSPORT_CODESis a duplicatedSetliteral allocated inline rather than referencing the canonical constant fromerrors.ts.Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[Inner retry loop\nattemptCount++] --> B[HTTP request] B -->|success| C[Return response] B -->|error| D[categorizeErrorAsync] D -->|CLIENT_ABORT| E[throw immediately] D -->|SYSTEM_ERROR| F[SYSTEM_ERROR handler\nrecordChain + recordEndpointFailure] D -->|NON_RETRYABLE_CLIENT_ERROR| G{instanceof ProxyError?} D -->|RESOURCE_NOT_FOUND| H[retry / failover] D -->|PROVIDER_ERROR| I[circuit breaker + failover] G -->|yes| J[record chain\nthrow immediately] G -->|no| K{isNetworkError?\nSocketError / ECONNRESET\nECONNREFUSED / etc.} K -->|yes 🆕| L[record chain\nrecordEndpointFailure] K -->|no| M[log + record chain\nthrow immediately] L --> N{attemptCount <\nmaxAttempts?} N -->|yes| O[currentEndpointIndex++\ncontinue → retry] N -->|no| P[failedProviderIds.push\nrecordFailure if enabled\nbreak → provider failover] F --> N style K fill:#d4edda,stroke:#28a745 style L fill:#d4edda,stroke:#28a745 style N fill:#d4edda,stroke:#28a745 style O fill:#d4edda,stroke:#28a745 style P fill:#d4edda,stroke:#28a745Comments Outside Diff (1)
src/app/v1/_lib/proxy/forwarder.ts, line 1566 (link)Incomplete fix — two unsafe
as ProxyErrorcasts remainThe PR description states:
However, the diff only added a guard to the
NON_RETRYABLE_CLIENT_ERRORhandler. This line and the analogous cast at line 1672 still contain unguardedas ProxyErrorcasts:Line 1566 (
RESOURCE_NOT_FOUNDhandler):const proxyError = lastError as ProxyError;— if a non-ProxyErroris categorized here,proxyError.upstreamError?.statusCodeInferredis safe due to optional chaining, butproxyError.messagewould return the plain error message rather than the ProxyError-formatted one, silently producing incorrect telemetry insession.addProviderToChain.Line 1672 (
PROVIDER_ERRORhandler):const proxyError = lastError as ProxyError;followed immediately byconst statusCode = proxyError.statusCode;—statusCodewill beundefinedfor any non-ProxyError. ThestatusCode === 524guard at line 1678 evaluatesundefined === 524 → false(safe), butstatusCode: statusCodestored in the chain audit at line 1752 will beundefined, corrupting the audit trail. AdditionallyproxyError.upstreamError?.body/parsedwill beundefinedinstead of actual upstream body data.Both sites need the same
instanceof ProxyErrorguard that was applied to theNON_RETRYABLE_CLIENT_ERRORhandler.Prompt To Fix With AI
Prompt To Fix All With AI
Last reviewed commit: 01467ea