
# Kompact

Requires Python 3.10+.

Context compression proxy for LLM agents. It sits between your agent and the LLM provider, compresses context on the fly, and cuts your token bill by 40-70% — with zero code changes.

## Save real money

For a team running 1,000 agentic requests/day with ~10K token contexts:

| Model | Without Kompact | With Kompact | Monthly savings |
|---|---|---|---|
| Sonnet ($3/M) | $900/mo | $405/mo | $495/mo |
| Opus ($15/M) | $4,500/mo | $2,025/mo | $2,475/mo |
| GPT-4o ($2.50/M) | $750/mo | $338/mo | $412/mo |

Savings scale linearly. 10K requests/day = 10x the numbers above.
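The table's arithmetic can be reproduced with a quick sketch. It assumes an average of ~55% of input tokens saved (Kompact's measured rate on tool-heavy contexts) and prices per million input tokens:

```python
# Back-of-envelope cost model for the savings table above.
# Assumption: ~55% of input tokens are saved; prices are per million input tokens.

def monthly_cost(requests_per_day: int, tokens_per_request: int,
                 price_per_million: float, savings: float = 0.0) -> float:
    tokens_per_month = requests_per_day * tokens_per_request * 30
    return tokens_per_month / 1_000_000 * price_per_million * (1 - savings)

baseline = monthly_cost(1_000, 10_000, 3.00)              # Sonnet, no compression
compressed = monthly_cost(1_000, 10_000, 3.00, savings=0.55)  # Sonnet via Kompact
print(baseline, round(compressed, 2))  # 900.0 405.0
```

At 10K requests/day, multiply `requests_per_day` by ten and every dollar figure scales with it.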

## Get started in 30 seconds

```bash
pip install kompact   # or: uv add kompact
kompact proxy --port 7878
export ANTHROPIC_BASE_URL=http://localhost:7878
# That's it. Your agent now uses fewer tokens.
```

No SDK changes. No prompt rewriting. Just point your base URL at the proxy.

## Quality stays intact

Evaluated on BFCL (1,431 real API schemas), the standard benchmark for tool-calling agents. Runs end-to-end through Claude and is scored with context-bench.

Quality impact vs no compression (closer to 0% = better):

| Model | Kompact | Headroom | LLMLingua-2 |
|---|---|---|---|
| Haiku | -2.6% | -3.0% | -23.4% |
| Sonnet | -3.9% | -3.5% | -20.6% |
| Opus | -0.5% | -0.5% | -27.3% |

Kompact and Headroom both stay within ~4% of baseline. LLMLingua-2 destroys tool schemas regardless of model (-20.6% to -27.3%).

## Compression across content types

Measured offline on 12,795 examples across 3 datasets:

| Dataset | Examples | Kompact | Headroom | LLMLingua-2 |
|---|---|---|---|---|
| BFCL (tool schemas) | 1,431 | 55.3% | ~0% | 55.4% |
| Glaive (tool calling) | 3,959 | 56.6% | ~0% | ~50% |
| HotpotQA (prose QA) | 7,405 | 17.9% | ~0% | 49.9% |

Headroom's SmartCrusher doesn't compress JSON — it's designed for prose. LLMLingua-2 compresses aggressively but destroys information (see quality table above).

## How it works

Kompact is a transparent HTTP proxy. It intercepts LLM API requests, compresses the context, then forwards to the provider.

```text
        ┌──────────────────────────────────────────────┐
        │           Kompact Proxy (:7878)              │
        │                                              │
Agent ─>│  1. Schema Optimizer    (TF-IDF selection)   │─> LLM Provider
        │  2. Content Compressors (TOON, JSON, code)   │
        │  3. Extractive Compress (TF-IDF sentences)   │
        │  4. Observation Masker  (history mgmt)       │
        │  5. Cache Aligner       (prefix caching)     │
        │                                              │
        └──────────────────────────────────────────────┘
```

Eight transforms, each targeting a different content type. The pipeline adapts automatically — short contexts get light compression, long contexts get aggressive optimization. Sub-millisecond overhead.
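The extractive step can be illustrated with a toy sketch: score each sentence by the rarity-weighted frequency of its terms and keep only the top scorers. This is illustrative only — the function below and its scoring details are not Kompact's actual implementation:

```python
import math
import re
from collections import Counter

def extractive_compress(text: str, keep: int = 2) -> str:
    """Toy extractive compressor: keep the `keep` highest-scoring
    sentences (TF-IDF-style weights), preserving original order."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    # Document frequency: in how many sentences each term appears.
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(sentences)

    def score(toks: list[str]) -> float:
        tf = Counter(toks)
        # Average TF-IDF weight per token, so long sentences aren't favored.
        return sum(c * math.log(1 + n / df[t]) for t, c in tf.items()) / (len(toks) or 1)

    ranked = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(ranked))
```

Kompact applies this kind of selection only when the pipeline decides prose content is long enough to benefit.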

## Per-request control

Use the `X-Kompact-Disable` header to disable transforms for a single request without affecting other clients:

```python
# Anthropic SDK
client.messages.create(..., extra_headers={"X-Kompact-Disable": "toon,code_compressor"})

# OpenAI SDK
client.chat.completions.create(..., extra_headers={"X-Kompact-Disable": "toon,code_compressor"})
```

Comma-separated transform names: `toon`, `json_crusher`, `code_compressor`, `log_compressor`, `content_compressor`, `observation_masker`, `cache_aligner`, `schema_optimizer`.
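On the proxy side, honoring this header amounts to filtering the transform list per request. A minimal sketch — the names `DEFAULT_PIPELINE` and `active_transforms` are hypothetical, not Kompact's internals:

```python
# Hypothetical sketch of per-request transform filtering.
DEFAULT_PIPELINE = [
    "schema_optimizer", "toon", "json_crusher", "code_compressor",
    "log_compressor", "content_compressor", "observation_masker", "cache_aligner",
]

def active_transforms(headers: dict[str, str]) -> list[str]:
    """Return the transform list for one request, honoring X-Kompact-Disable."""
    raw = headers.get("X-Kompact-Disable", "")
    disabled = {name.strip() for name in raw.split(",") if name.strip()}
    return [t for t in DEFAULT_PIPELINE if t not in disabled]
```

Because the filter is computed per request, one client opting out of `toon` never changes what other clients get.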

## Monitoring

Kompact exports OpenTelemetry metrics (on by default, disable with --no-otel). A Prometheus + Grafana stack is included:

```bash
cd monitoring
docker compose up -d
```

The dashboard shows request rate, token savings, compression ratio, pipeline latency percentiles, and per-transform breakdowns.

## Running benchmarks

```bash
# Offline compression (no LLM calls; measures compression + needle preservation)
uv run python benchmarks/run_dataset_eval.py --dataset bfcl

# End-to-end quality (sends through proxy chain, measures LLM answer quality)
# Requires: claude-relay running on :8084, kompact on :7878
uv run python benchmarks/run_e2e_eval.py --dataset bfcl --model haiku --workers 20
```

See benchmarks/README.md for full methodology.

## Development

```bash
uv sync --extra dev
uv run pytest          # 48 tests
uv run ruff check src/ tests/
```

## License

MIT