Skip to content

perf: optimize claude-compat latency and reliability#348

Open
daixudk96-max wants to merge 44 commits intosu-kaka:masterfrom
daixudk96-max:fix/capacity-429-http2-retry
Open

perf: optimize claude-compat latency and reliability#348
daixudk96-max wants to merge 44 commits intosu-kaka:masterfrom
daixudk96-max:fix/capacity-429-http2-retry

Conversation

@daixudk96-max
Copy link

Optimization Plan Execution: optimize-claude-compat-latency

This PR completes the 3-track optimization plan for the Claude-compatible API layer, aimed at substantially reducing latency without compromising quality or backward compatibility.

Changes Included:

  1. Retry Mechanism Cleanup (Task 1): Eliminated duplicate sleep loops between caller and utility layers. Added fail-fast logic for stream endpoints after the first chunk is emitted.
  2. HTTP/2 Connection Pooling (Task 2): Switched to pooled httpx.AsyncClient reuse with HTTP/2 enabled, effectively eliminating TCP/TLS/ALPN handshake latency on hot paths.
  3. Anthropic Converter Fast-Path (Task 3): Bypassed schema sanitization and recursive mapping for simple text-only prompts/responses, avoiding CPU bottleneck in the stream loop.
  4. Preview Credential Scheduler (Task 4): Upgraded from purely random picking to a health-aware score (combining in-flight pressure, 429 cooling status, and recent errors) specifically for the gemini-3-pro-preview model.
  5. Rollout & Verification Gates (Task 5): Added 4 new Feature Flags connected to the Control Panel, and built a CI-friendly perf testing suite (scripts/perf/bench.py, assert_latency.py, rollout_guard.py).

Performance Results (Real-World Test):

Against gemini-3-pro-preview-high:

  • TTFB (First Byte): ~7.5ms (Streaming)
  • First Token P95: 18.14s (55.8% improvement vs 41.05s baseline)
  • Full Latency P95: 18.14s (55.8% improvement)
  • Throughput: ~98% increase in Requests/sec
  • Quality Regression: None (100% success rate on test dataset)

Resolves plan optimize-claude-compat-latency.

CI User and others added 30 commits February 23, 2026 23:57
Fallback to ASCII-escaped JSON encoding when payload contains lone surrogate code units so streaming/non-stream requests no longer crash before dispatch.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
github-actions bot and others added 14 commits February 28, 2026 17:48
…a_for_claude unsupported_keys

Google's Gemini API uses a protobuf-based schema parser that only accepts
a whitelist of JSON Schema fields. When the model name contains 'claude',
tools are cleaned via _clean_schema_for_claude() instead of
_clean_schema_for_gemini(). However, _clean_schema_for_claude() was
missing patternProperties, dependencies, and propertyNames from its
unsupported_keys set, causing Google API to reject requests with:

  Invalid JSON payload received. Unknown name "patternProperties"

This affects any OpenAI-compatible client (e.g. OpenClaw) that includes
patternProperties in tool parameter schemas when routed through the
Claude/Antigravity path.

Ref: https://cloud.google.com/vertex-ai/generative-ai/docs/reference/rest/v1/Schema
…backoff, raise retry interval default

- Enable HTTP/2 in httpx client to match Google cloudcode-pa endpoint expectations
  (fixes intermittent 'All connection attempts failed' errors)
- Detect MODEL_CAPACITY_EXHAUSTED errors in 429 responses and apply exponential
  backoff with jitter (capped at 12s) instead of flat retry interval
- When capacity is exhausted, rotate to next credential instead of hammering the
  same one (prevents thundering herd amplification)
- Raise default RETRY_429_INTERVAL from 0.1s to 1.0s
- Improve exception handler logging with type/repr for better diagnostics
- Fix CredentialManager type hints in utils.py (Any instead of concrete class)
- Normalize multiple project_id response shapes from loadCodeAssist/onboardUser
- Support extracting project id from resource names like projects/*/locations/*
- Extend onboardUser polling window from 10s to 30s for slow activation cases
…ompat latency

- **Preview Scheduler (Task 4)**: Added health-aware credential scoring (in-flight pressure, 429 signal, usage count) via `sqlite_manager.py` and `credential_manager.py`, keeping strict preview model boundaries. Includes assert script `assert_preview_pool.py`.
- **Release Automation (Task 5)**: Wired 4 core feature flags (`ff_retry_policy_v2`, `ff_http2_pool_tuning`, `ff_converter_fast_path`, `ff_preview_credential_scheduler_v2`) into runtime and Control Panel. Created `rollout_guard.py` to compute automated rollout/rollback decisions based on relative latency/throughput/quality thresholds.
- **Verification**: Real-world load against `gemini-3-pro-preview-high` confirms TTFB ~7.5ms (via HTTP/2 pooling), P95 Latency 18.14s (-55.8% vs 41.05s baseline), and 100% success rate with no quality regression. All targeted gates passed.

Fixes #plan-optimize-claude-compat-latency
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants