Skip to content

Commit 64a3674

Browse files
docs(cost): codify prompt/token cost playbook + Max-plan price ramp (#18)
1 parent 1a8d850 commit 64a3674

4 files changed

Lines changed: 400 additions & 5 deletions

File tree

CLAUDE.md

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -60,6 +60,7 @@ Quick reference (agent_docs/):
6060
Full documentation (docs/):
6161
- principles.md - Agent programming paradigm
6262
- architecture.md - Extension model and browser sandbox
63+
- cost-optimization.md - Prompt/token cost playbook + Max-plan price ramp
6364
- SPEC.md - Technical specification
6465
- API_REFERENCE.md - API docs
6566
- DEVELOPMENT.md - Dev guide
@@ -241,6 +242,20 @@ Quality gates scale with tier — don't over-engineer AUTOMATE tasks, don't unde
241242

242243
For AUTOMATE and STANDARD tiers: make only the requested changes. Don't refactor surrounding code, add abstractions for one-time operations, or create helpers that are used once. Three similar lines of code is better than a premature abstraction.
243244

245+
## Cost Optimization
246+
247+
Assume token costs only go up: Max-plan usage ramps from ~80% off to full price over ~3 months (codified in `src/core/models/provider-pricing.ts``MAX_PLAN_DISCOUNT_RAMP` / `effectiveSpendMultiplier()`). The cheapest token is the one you don't send. Full playbook: `docs/cost-optimization.md`.
248+
249+
Codified defaults (cheapest lever first):
250+
- **Route by complexity.** Keep `multiProvider` on — `getOptimalProvider()`/`scoreComplexity()` send simple tasks to cheap models, hard ones to Anthropic. Opus only for CAREFUL/ARCHITECT; Sonnet default; Haiku/cheap providers for AUTOMATE. Output tokens cost 5× input.
251+
- **Tune `effort` before model.** Default `high` for coding; `medium` for cost-sensitive; `max`/`xhigh` only for correctness-critical. Pair with adaptive thinking.
252+
- **Protect the prompt cache.** Keep the prefix byte-stable — no timestamps/UUIDs/IDs in the system prompt, no mid-session tool/model swaps (full rebuild). Cache reads are ~0.1×.
253+
- **Batch non-interactive work** via `AnthropicBatchClient` (50% off).
254+
- **Cap context** via `ContextBudgetManager`; keep the token-optimization hooks on (dedup/prewarm/script-suggest).
255+
- **Close the loop:** `conductor learn --evolve` (GEPA) + `stackmemory optimize traces` shrink prompts permanently.
256+
257+
Guardrails (never trade for cost): the sensitive-content guard must keep forcing Anthropic for secrets/PII; correctness tiers stay on the capable model; never truncate inputs silently — cap deliberately via the budget manager.
258+
244259
## Session Budget
245260

246261
- Max 1 major topic per session — split unrelated work into separate sessions

docs/cost-optimization.md

Lines changed: 181 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,181 @@
1+
# Cost Optimization
2+
3+
Codified practices for keeping StackMemory's prompt and token spend low — and for
4+
staying ahead of rising costs as the Max-plan discount expires.
5+
6+
> **Planning assumption.** Max-plan usage starts at an **80% discount** and ramps
7+
> **linearly to full price over ~3 months** (≈2026-06-06 → 2026-09-06). Every
8+
> token of agent effort gets ~5× more expensive over that window. This is
9+
> codified once in `src/core/models/provider-pricing.ts`
10+
> (`MAX_PLAN_DISCOUNT_RAMP` / `effectiveSpendMultiplier()`); routing, budgets, and
11+
> this playbook all read from it. Treat optimizations below as "nice to have"
12+
> today and "load-bearing" by September.
13+
14+
## Guiding principle
15+
16+
**Be model-agnostic; route by task value, not by reflex.** The expensive
17+
failure mode is defaulting every workload to the most capable (most expensive)
18+
model "to be safe" — that burns budget with zero governance. Instead:
19+
20+
- **Match model to task.** Cheap/high-volume models for inference and simple
21+
transforms; a premium model (Opus) only for agent workflows where reliability
22+
pays for itself; the most expensive tier *only* when the incremental capability
23+
demonstrably justifies a multiple-× token premium.
24+
- **Govern, audit, control.** Spend should be measurable per run, attributable
25+
to a task tier, and bounded by a budget — not an untracked aggregate. Cost
26+
controls below are pointless without the trace-level visibility to enforce them.
27+
28+
This is the same axis as the [Task Delegation Model](../CLAUDE.md) tiers
29+
(AUTOMATE → ARCHITECT): route effort — and spend — by complexity and value.
30+
31+
## The cost model
32+
33+
Two distinct meters run:
34+
35+
1. **API spend** (third-party + Anthropic API) — billed per token via
36+
`MODEL_PRICING`. Cost-aware routing already optimizes this.
37+
2. **Max-plan agent effort** — the conductor spawns Claude Code agents on the
38+
Max plan. Today heavily discounted; ramping to full price. Optimize this by
39+
spending *fewer tokens per task* (tighter prompts, less context, fewer turns)
40+
and *fewer tasks on the expensive tier* (route, cache, batch).
41+
42+
Current list prices (per 1M tokens, sourced 2026-05-26):
43+
44+
| Model | Input | Output | Context |
45+
| ---------------- | ----- | ------ | ------- |
46+
| Opus 4.6/4.7/4.8 | $5 | $25 | 1M |
47+
| Sonnet 4.6 | $3 | $15 | 1M |
48+
| Haiku 4.5 | $1 | $5 | 200K |
49+
50+
Output tokens cost **** input. Cache reads cost **~0.1×** input; batch is
51+
**50% off**. The cheapest token is the one you don't send.
52+
53+
## Where spend happens (inventory)
54+
55+
| Surface | File(s) | Dominant cost | Lever |
56+
| --- | --- | --- | --- |
57+
| Conductor agent runs | `~/.stackmemory/conductor/prompt-template.md` | Output tokens, turn count | Prompt diet, GEPA, effort |
58+
| Context rehydration | `src/core/context/`, `src/core/digest/` | Input tokens | Budget caps, compression |
59+
| Ralph swarm iterations | `src/integrations/ralph/context/context-budget-manager.ts` | Input tokens/iteration | `maxTokens`, compression |
60+
| LLM retrieval | `src/core/retrieval/llm-*.ts` | Input + output | Cheap-model routing |
61+
| Hook overhead | `src/hooks/` (dedup, prewarm, script-suggest) | Duplicate reads, bad tool choice | Already enabled — keep on |
62+
| MCP tool surface | `src/integrations/mcp/tool-definitions.ts` | Input tokens (schemas in context) | Tool search / trim |
63+
64+
## Codified practices (ranked by impact)
65+
66+
### 1. Route by complexity — don't pay Opus for lint fixes
67+
`getOptimalProvider()` / `scoreComplexity()` already route low-complexity tasks
68+
to cheap providers and high-complexity to Anthropic, gated by the `multiProvider`
69+
feature flag. **Keep `multiProvider` enabled.** The sensitive-content guard
70+
(`detectSensitiveContent`) forces Anthropic for secrets/PII — never weaken it to
71+
save money.
72+
73+
- Default the conductor's *simple* tiers (AUTOMATE/STANDARD) to Sonnet, reserve
74+
Opus for CAREFUL/ARCHITECT. Opus↔Sonnet is a 1.7× swing; Opus↔Haiku is 5×.
75+
- Use **subagents on a cheaper model** for fan-out (Explore/grep/read) rather
76+
than switching the main loop's model — switching mid-session breaks the prompt
77+
cache (see #4).
78+
79+
### 2. Tune `effort`, not model, first
80+
On Opus 4.6+/Sonnet 4.6, `output_config: {effort: ...}` is the cheapest quality
81+
dial. `low`/`medium` mean fewer, more-consolidated tool calls and less preamble.
82+
Default to **`high`** for coding, drop to `medium` for cost-sensitive routes, and
83+
reserve `max`/`xhigh` for correctness-critical work. Pair with adaptive thinking
84+
(`thinking: {type: "adaptive"}`) so the model self-limits reasoning.
85+
86+
### 3. Spend fewer output tokens (5× input)
87+
- **Lower-effort, terser agents.** Add a silence-default to the conductor
88+
template: no narration between tool calls, one-or-two-sentence wrap-ups.
89+
- **Don't lowball `max_tokens`** — truncation forces a full re-run. Set a real
90+
ceiling, then let `effort`/`task_budget` moderate actual usage.
91+
- Use **Task Budgets** (`task_budget`, beta) for long agentic loops so the model
92+
sees a countdown and wraps up gracefully instead of being hard-truncated.
93+
94+
### 4. Prompt caching — keep the prefix frozen
95+
Cache reads are ~0.1× input. The entire win depends on a **byte-stable prefix**
96+
(`tools``system``messages`):
97+
- No `Date.now()`, UUIDs, or per-session IDs in the system prompt — inject
98+
volatile context later in `messages`.
99+
- Don't reorder/add tools or switch models mid-session (full cache rebuild).
100+
- Verify with `usage.cache_read_input_tokens`; zero across repeats = a silent
101+
invalidator. See the audit table in the `claude-api` skill (`prompt-caching`).
102+
- Pre-warm only when first-request latency is user-visible and traffic is bursty.
103+
104+
### 5. Cap context aggressively
105+
`ContextBudgetManager` (Ralph) already truncates, compresses, and
106+
priority-weights context with a `DEFAULT_MAX_TOKENS` budget. As prices ramp:
107+
- Lower per-iteration `maxTokens` budgets; keep `compressionEnabled` on.
108+
- Prefer digests/summaries over raw frame dumps for rehydration.
109+
- Trim the MCP tool surface or adopt tool-search so 56 tool schemas aren't all
110+
resident in context.
111+
112+
### 6. Batch the non-interactive work
113+
`AnthropicBatchClient` runs at **50% off**. Anything not latency-sensitive —
114+
backfills, bulk enrichment, digest regeneration, eval sweeps — belongs in a
115+
batch, not a live request.
116+
117+
### 7. Let the hooks do their job
118+
The token-optimization hooks (#14) already save ~22% (324K tokens on the
119+
benchmark): `dedup-reads` (escalates to `[STOP]` at 5+ duplicate reads),
120+
`desire-path-hook` (auto-routes Bash→Glob/Read/Grep), `prewarm-tools`,
121+
`script-suggest`. Don't disable them; extend them when new waste patterns show up
122+
in `scripts/benchmark-hooks.ts`.
123+
124+
### 8. Close the learning loop
125+
`conductor learn --evolve` (GEPA) mutates the prompt template from failure data,
126+
and `stackmemory optimize traces` surfaces repeated, wasteful patterns from
127+
`traces.db`. Run these regularly — a shorter, higher-success prompt is a
128+
permanent per-run discount that compounds as prices rise.
129+
130+
## 3-month phased playbook
131+
132+
The ramp is roughly: **month 0** ≈ 20% of list, **month 1.5** ≈ 60%, **month 3+**
133+
= full price. Escalate effort to match.
134+
135+
**Phase 1 — now (≈80% off): instrument & default-good.**
136+
- Confirm `multiProvider` on; verify cost-aware routing decisions in traces.
137+
- Land the terser conductor template + `effort` defaults.
138+
- Add cost-per-run to trace stats so the ramp is visible. Establish a baseline
139+
tokens/task number to measure against.
140+
141+
**Phase 2 — ~month 1–2 (≈40–70%): squeeze.**
142+
- Tighten `ContextBudgetManager` budgets; expand prompt-caching coverage and
143+
verify hit rates.
144+
- Move all non-interactive workloads to the Batches API.
145+
- Run a GEPA pass; adopt the winning template.
146+
147+
**Phase 3 — ~month 3 (full price): enforce.**
148+
- Treat budgets as hard limits, not hints. Alert when a run exceeds its
149+
`effectiveCost` budget.
150+
- Down-tier aggressively: Opus only for ARCHITECT/CAREFUL; Sonnet default;
151+
Haiku/cheap providers for AUTOMATE.
152+
- Re-baseline `count_tokens` against current models (token counting shifts
153+
between model versions — don't apply a blanket multiplier).
154+
155+
## Guardrails (don't optimize these away)
156+
157+
- **Security routing** — the sensitive-content guard must keep forcing Anthropic
158+
for secrets/PII regardless of cost.
159+
- **Correctness tiers** — CAREFUL/ARCHITECT work stays on the capable model; a
160+
cheap wrong answer that needs a re-run costs more than one right answer.
161+
- **No silent truncation** — cap context deliberately via the budget manager;
162+
never truncate inputs blindly.
163+
164+
## Quick reference
165+
166+
```bash
167+
stackmemory conductor learn --evolve # mutate prompt template from failures
168+
stackmemory optimize traces # find repeated wasteful patterns
169+
node scripts/benchmark-hooks.ts # measure hook token savings
170+
stackmemory conductor trace-stats # aggregate token usage
171+
```
172+
173+
```ts
174+
import {
175+
effectiveSpendMultiplier,
176+
effectiveCost,
177+
} from './core/models/provider-pricing.js';
178+
179+
effectiveSpendMultiplier(); // today's cost factor along the ramp (0.2 → 1.0)
180+
effectiveCost('anthropic', 'claude-opus-4-8', inTok, outTok); // ramp-adjusted cost
181+
```
Lines changed: 99 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,99 @@
1+
import { describe, it, expect } from 'vitest';
2+
import {
3+
MODEL_PRICING,
4+
calculateCost,
5+
formatCost,
6+
effectiveSpendMultiplier,
7+
effectiveCost,
8+
MAX_PLAN_DISCOUNT_RAMP,
9+
type DiscountRamp,
10+
} from '../provider-pricing.js';
11+
12+
describe('provider-pricing table', () => {
13+
it('prices Opus 4.x at $5/$25 per 1M', () => {
14+
for (const id of [
15+
'anthropic/claude-opus-4-8',
16+
'anthropic/claude-opus-4-7',
17+
'anthropic/claude-opus-4-6',
18+
]) {
19+
expect(MODEL_PRICING[id]).toEqual({
20+
inputPer1M: 5.0,
21+
outputPer1M: 25.0,
22+
source: 'platform.claude.com',
23+
});
24+
}
25+
});
26+
27+
it('prices Sonnet 4.6 and Haiku 4.5 at current rates', () => {
28+
expect(MODEL_PRICING['anthropic/claude-sonnet-4-6'].outputPer1M).toBe(15.0);
29+
expect(MODEL_PRICING['anthropic/claude-haiku-4-5-20251001']).toEqual({
30+
inputPer1M: 1.0,
31+
outputPer1M: 5.0,
32+
source: 'platform.claude.com',
33+
});
34+
});
35+
36+
it('calculates cost from token counts', () => {
37+
const c = calculateCost('anthropic', 'claude-opus-4-8', 1_000_000, 1_000_000);
38+
expect(c).not.toBeNull();
39+
expect(c!.totalCost).toBeCloseTo(30.0, 6); // $5 in + $25 out
40+
});
41+
42+
it('returns null for unknown models', () => {
43+
expect(calculateCost('acme', 'gpt-9', 1, 1)).toBeNull();
44+
});
45+
46+
it('formats sub-cent and larger costs distinctly', () => {
47+
expect(formatCost(0.000123)).toBe('$0.000123');
48+
expect(formatCost(1.5)).toBe('$1.5000');
49+
});
50+
});
51+
52+
describe('Max-plan discount ramp', () => {
53+
const ramp: DiscountRamp = {
54+
start: '2026-06-06',
55+
end: '2026-09-06',
56+
startMultiplier: 0.2,
57+
endMultiplier: 1.0,
58+
};
59+
60+
it('is 80% off at (or before) the ramp start', () => {
61+
expect(effectiveSpendMultiplier(new Date('2026-06-06'), ramp)).toBeCloseTo(0.2);
62+
expect(effectiveSpendMultiplier(new Date('2026-01-01'), ramp)).toBeCloseTo(0.2);
63+
});
64+
65+
it('is full price at (or after) the ramp end', () => {
66+
expect(effectiveSpendMultiplier(new Date('2026-09-06'), ramp)).toBeCloseTo(1.0);
67+
expect(effectiveSpendMultiplier(new Date('2027-01-01'), ramp)).toBeCloseTo(1.0);
68+
});
69+
70+
it('interpolates linearly mid-ramp', () => {
71+
// ~halfway through the ~3-month window
72+
const mid = effectiveSpendMultiplier(new Date('2026-07-22'), ramp);
73+
expect(mid).toBeGreaterThan(0.5);
74+
expect(mid).toBeLessThan(0.65);
75+
});
76+
77+
it('falls back to full price on a misconfigured ramp', () => {
78+
const bad: DiscountRamp = { ...ramp, start: '2026-09-06', end: '2026-06-06' };
79+
expect(effectiveSpendMultiplier(new Date('2026-07-01'), bad)).toBe(1.0);
80+
});
81+
82+
it('exposes a default ramp ending at full price', () => {
83+
expect(MAX_PLAN_DISCOUNT_RAMP.endMultiplier).toBe(1.0);
84+
});
85+
86+
it('effectiveCost scales list cost by the ramp multiplier', () => {
87+
const r = effectiveCost(
88+
'anthropic',
89+
'claude-opus-4-8',
90+
1_000_000,
91+
0,
92+
new Date('2026-06-06')
93+
);
94+
expect(r).not.toBeNull();
95+
expect(r!.listCost).toBeCloseTo(5.0, 6);
96+
expect(r!.effectiveCost).toBeCloseTo(1.0, 6); // 20% of list at ramp start
97+
expect(r!.multiplier).toBeCloseTo(0.2, 6);
98+
});
99+
});

0 commit comments

Comments
 (0)