Skip to content

Commit b106553

Browse files
author
StackMemory Bot (CLI)
committed
Merge remote-tracking branch 'origin/main' into feature/webhook-retry-backoff
# Conflicts: # CLAUDE.md
2 parents f907c83 + 64a3674 commit b106553

32 files changed

Lines changed: 4670 additions & 6 deletions

CLAUDE.md

Lines changed: 189 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,37 @@ docker/ # Container configs
2424
prompts/ # Externalized LLM prompt templates
2525
```
2626

27+
## Key Files
28+
29+
- Entry: src/cli/index.ts
30+
- MCP Server: src/integrations/mcp/server.ts
31+
- Frame Manager: src/core/context/frame-manager.ts
32+
- Database: src/core/database/sqlite-adapter.ts
33+
- Snapshot: src/core/worktree/capture.ts
34+
- Preflight: src/core/worktree/preflight.ts
35+
- Conductor: src/cli/commands/orchestrator.ts (core) + orchestrate.ts (CLI)
36+
- Conductor Traces: src/cli/commands/conductor-traces.ts
37+
- Frame Enrichment: src/core/context/frame-enrichment.ts
38+
- Process Utils: src/utils/process-cleanup.ts
39+
- Shared Utils: src/core/utils/{git,text,fs}.ts
40+
41+
## Detailed Guides
42+
43+
Quick reference (agent_docs/):
44+
- linear_integration.md - Linear sync
45+
- mcp_server.md - MCP tools
46+
- database_storage.md - Storage
47+
- claude_hooks.md - Hooks
48+
49+
Full documentation (docs/):
50+
- principles.md - Agent programming paradigm
51+
- architecture.md - Extension model and browser sandbox
52+
- cost-optimization.md - Prompt/token cost playbook + Max-plan price ramp
53+
- SPEC.md - Technical specification
54+
- API_REFERENCE.md - API docs
55+
- DEVELOPMENT.md - Dev guide
56+
- SETUP.md - Installation
57+
2758
## Commands
2859

2960
```bash
@@ -84,4 +115,161 @@ docker-compose up -d # Start local DBs
84115
- `task_query`
85116
- `recover_on_low_signal: true`
86117
- Do not fetch raw `get_conversation` context for worker execution unless full transcript behavior is explicitly required.
87-
- The current assignment is persisted under `.stackmemory/worker-context/current-assignment.json` so wrappers and hooks can auto-fill or enforce `task_query`.
118+
- The current assignment is persisted under `.stackmemory/worker-context/current-assignment.json` so wrappers and hooks can auto-fill or enforce `task_query`.
119+
120+
## Security
121+
122+
NEVER hardcode secrets - use process.env with dotenv/config
123+
124+
```javascript
125+
import 'dotenv/config';
126+
const API_KEY = process.env.LINEAR_API_KEY;
127+
if (!API_KEY) {
128+
console.error('LINEAR_API_KEY not set');
129+
process.exit(1);
130+
}
131+
```
132+
133+
Environment sources (check in order):
134+
1. .env file
135+
2. .env.local
136+
3. ~/.zshrc
137+
4. Process environment
138+
139+
Secret patterns to block: lin_api_* | lin_oauth_* | sk-* | npm_*
140+
141+
## Deploy
142+
143+
```bash
144+
# npm publish (uses NPM_TOKEN from .env, no OTP needed)
145+
git stash -- scripts/gepa/ # stash GEPA state (dirties working tree)
146+
NPM_TOKEN=$(grep '^NPM_TOKEN=' .env | cut -d= -f2) \
147+
npm publish --registry https://registry.npmjs.org/ \
148+
--//registry.npmjs.org/:_authToken="$NPM_TOKEN"
149+
git stash pop # restore GEPA state
150+
151+
# Railway
152+
railway up
153+
154+
# Pre-publish checks require clean git status — stash GEPA files first
155+
```
156+
157+
## Conductor (Autonomous Agent Orchestration)
158+
159+
The conductor manages autonomous coding agents via Linear issues:
160+
161+
**Data files** (all under `~/.stackmemory/conductor/`):
162+
- `prompt-template.md` — Agent prompt template with `{{VARIABLE}}` substitution (auto-created on first `conductor start`)
163+
- `outcomes.jsonl` — JSONL log of agent outcomes (success/failure, phase, tokens, errors)
164+
- `evolution-log.jsonl` — History of `--evolve` mutations applied to the prompt template
165+
- `agents/<issue-id>/status.json` — Per-agent status files
166+
- `agents/<issue-id>/output.log` — Agent stdout/stderr
167+
- `traces.db` — SQLite database with per-turn conversation traces (tool calls, tokens, phases, content previews)
168+
169+
**Intelligence features**:
170+
- Multi-model routing with difficulty prediction (routes simple tasks to cheaper models)
171+
- Smart retry with exponential backoff and prior context injection
172+
- Auto-PR creation on successful agent completion
173+
- Trace-based evidence: per-turn conversation logging (tools, tokens, phases) to traces.db
174+
175+
**Learning loop**:
176+
1. Agents run → outcomes logged to `outcomes.jsonl`, traces to `traces.db`
177+
2. `conductor learn` analyzes patterns (success rate, failure phases, error types)
178+
3. `conductor learn --evolve` calls Claude to mutate `prompt-template.md` based on failure data
179+
4. Next agent run uses the improved template → repeat
180+
181+
**Template variables**: `{{ISSUE_ID}}`, `{{TITLE}}`, `{{DESCRIPTION}}`, `{{LABELS}}`, `{{PRIORITY}}`, `{{ATTEMPT}}`, `{{PRIOR_CONTEXT}}`
182+
183+
## Task Delegation Model
184+
185+
Route effort by task complexity — not all code changes deserve equal scrutiny:
186+
187+
**AUTOMATE** — Execute immediately, lint+test is sufficient:
188+
- CRUD operations, boilerplate, formatting, simple transforms
189+
- Adding a tool handler following existing switch/case pattern
190+
- Config additions (new env var, feature flag)
191+
192+
**STANDARD** — Normal workflow, lint+test+build:
193+
- Feature implementation, bug fixes, refactoring
194+
- New test coverage, documentation updates
195+
- Integration wiring (adding handler to server.ts dispatch)
196+
197+
**CAREFUL** — Review approach before implementation:
198+
- API/schema changes, database migrations, auth flows
199+
- New integration patterns (MCP tools, webhook handlers)
200+
- Changes to frame-manager, sqlite-adapter, or daemon lifecycle
201+
- Anything touching error handling chains
202+
203+
**ARCHITECT** — Plan mode required, explore existing patterns first:
204+
- New service boundaries, system integrations
205+
- Performance-critical paths (FTS5 queries, search scoring)
206+
- Breaking changes to MCP protocol or CLI interface
207+
208+
**HUMAN** — Explicit user approval before any changes:
209+
- Security-critical decisions, secret handling
210+
- Irreversible operations (data migrations, schema drops)
211+
- Publishing (npm publish, Railway deploy)
212+
213+
Quality gates scale with tier — don't over-engineer AUTOMATE tasks, don't under-review CAREFUL ones.
214+
215+
For AUTOMATE and STANDARD tiers: make only the requested changes. Don't refactor surrounding code, add abstractions for one-time operations, or create helpers that are used once. Three similar lines of code is better than a premature abstraction.
216+
217+
## Cost Optimization
218+
219+
Assume token costs only go up: Max-plan usage ramps from ~80% off to full price over ~3 months (codified in `src/core/models/provider-pricing.ts``MAX_PLAN_DISCOUNT_RAMP` / `effectiveSpendMultiplier()`). The cheapest token is the one you don't send. Full playbook: `docs/cost-optimization.md`.
220+
221+
Codified defaults (cheapest lever first):
222+
- **Route by complexity.** Keep `multiProvider` on — `getOptimalProvider()`/`scoreComplexity()` send simple tasks to cheap models, hard ones to Anthropic. Opus only for CAREFUL/ARCHITECT; Sonnet default; Haiku/cheap providers for AUTOMATE. Output tokens cost 5× input.
223+
- **Tune `effort` before model.** Default `high` for coding; `medium` for cost-sensitive; `max`/`xhigh` only for correctness-critical. Pair with adaptive thinking.
224+
- **Protect the prompt cache.** Keep the prefix byte-stable — no timestamps/UUIDs/IDs in the system prompt, no mid-session tool/model swaps (full rebuild). Cache reads are ~0.1×.
225+
- **Batch non-interactive work** via `AnthropicBatchClient` (50% off).
226+
- **Cap context** via `ContextBudgetManager`; keep the token-optimization hooks on (dedup/prewarm/script-suggest).
227+
- **Close the loop:** `conductor learn --evolve` (GEPA) + `stackmemory optimize traces` shrink prompts permanently.
228+
229+
Guardrails (never trade for cost): the sensitive-content guard must keep forcing Anthropic for secrets/PII; correctness tiers stay on the capable model; never truncate inputs silently — cap deliberately via the budget manager.
230+
231+
## Session Budget
232+
233+
- Max 1 major topic per session — split unrelated work into separate sessions
234+
- Run /compact or summarize at ~50% context usage to avoid overflow
235+
- Plan-execute sessions (low interaction, high edits) are most efficient
236+
- Avoid exploratory marathons with topic-switching — burns 30-40% extra tokens
237+
238+
## Context Maintenance
239+
240+
**`/update-docs`** — Run weekly or when context feels stale:
241+
- Audits CLAUDE.md, MEMORY.md, agent_docs/ against git history and codebase
242+
- Detects stale entries, missing patterns, outdated paths
243+
- Trigger: start of week, after major refactors, or when sessions feel slow/confused
244+
245+
**`/recover`** — Run when a session goes off the rails:
246+
- Analyzes traces to find where context drifted from intent
247+
- Maps drift to specific doc fixes (missing guidance, stale memory, ambiguous instruction)
248+
- Trigger: user says "this is wrong", "not what I wanted", "off the rails", repeated corrections
249+
250+
**`/next`** — Run at session start or when asking "what's next":
251+
- Scans git log, TODO files, Linear issues, and memory for actionable items
252+
- Prioritizes: unfinished work > flagged issues > queued tasks > continuations
253+
- Trigger: session start, "what's next", "whats next", between tasks
254+
255+
**`/learn`** — Run at session end to capture learnings:
256+
- Reviews session work, then audits memory, CLAUDE.md, skills, scripts, and wiki
257+
- Proposes creates/updates/deletes with confirmation before applying
258+
- Trigger: end of session, after significant work, "what should I update"
259+
260+
**When to use which:**
261+
- Starting a session or between tasks → `/next` (pick what to work on)
262+
- Session producing wrong results → `/recover` (diagnose + fix now)
263+
- Routine maintenance, nothing broken → `/update-docs` (proactive gardening)
264+
- After publishing a new version → `/update-docs` (catch version/path drift)
265+
- After conductor failures → `/recover last` (learn from agent traces)
266+
- End of session → `/learn` (capture what changed, update artifacts)
267+
268+
## Workflow
269+
270+
- Check .env for API keys before asking
271+
- Run npm run linear:sync after task completion
272+
- Use browser MCP for visual testing
273+
- Review recent commits and stackmemory.json on session start
274+
- Use subagents for multi-step tasks
275+
- Ask 1-3 clarifying questions for complex commands (one at a time)

docs/cost-optimization.md

Lines changed: 181 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,181 @@
1+
# Cost Optimization
2+
3+
Codified practices for keeping StackMemory's prompt and token spend low — and for
4+
staying ahead of rising costs as the Max-plan discount expires.
5+
6+
> **Planning assumption.** Max-plan usage starts at an **80% discount** and ramps
7+
> **linearly to full price over ~3 months** (≈2026-06-06 → 2026-09-06). Every
8+
> token of agent effort gets ~5× more expensive over that window. This is
9+
> codified once in `src/core/models/provider-pricing.ts`
10+
> (`MAX_PLAN_DISCOUNT_RAMP` / `effectiveSpendMultiplier()`); routing, budgets, and
11+
> this playbook all read from it. Treat optimizations below as "nice to have"
12+
> today and "load-bearing" by September.
13+
14+
## Guiding principle
15+
16+
**Be model-agnostic; route by task value, not by reflex.** The expensive
17+
failure mode is defaulting every workload to the most capable (most expensive)
18+
model "to be safe" — that burns budget with zero governance. Instead:
19+
20+
- **Match model to task.** Cheap/high-volume models for inference and simple
21+
transforms; a premium model (Opus) only for agent workflows where reliability
22+
pays for itself; the most expensive tier *only* when the incremental capability
23+
demonstrably justifies a multiple-× token premium.
24+
- **Govern, audit, control.** Spend should be measurable per run, attributable
25+
to a task tier, and bounded by a budget — not an untracked aggregate. Cost
26+
controls below are pointless without the trace-level visibility to enforce them.
27+
28+
This is the same axis as the [Task Delegation Model](../CLAUDE.md) tiers
29+
(AUTOMATE → ARCHITECT): route effort — and spend — by complexity and value.
30+
31+
## The cost model
32+
33+
Two distinct meters run:
34+
35+
1. **API spend** (third-party + Anthropic API) — billed per token via
36+
`MODEL_PRICING`. Cost-aware routing already optimizes this.
37+
2. **Max-plan agent effort** — the conductor spawns Claude Code agents on the
38+
Max plan. Today heavily discounted; ramping to full price. Optimize this by
39+
spending *fewer tokens per task* (tighter prompts, less context, fewer turns)
40+
and *fewer tasks on the expensive tier* (route, cache, batch).
41+
42+
Current list prices (per 1M tokens, sourced 2026-05-26):
43+
44+
| Model | Input | Output | Context |
45+
| ---------------- | ----- | ------ | ------- |
46+
| Opus 4.6/4.7/4.8 | $5 | $25 | 1M |
47+
| Sonnet 4.6 | $3 | $15 | 1M |
48+
| Haiku 4.5 | $1 | $5 | 200K |
49+
50+
Output tokens cost **** input. Cache reads cost **~0.1×** input; batch is
51+
**50% off**. The cheapest token is the one you don't send.
52+
53+
## Where spend happens (inventory)
54+
55+
| Surface | File(s) | Dominant cost | Lever |
56+
| --- | --- | --- | --- |
57+
| Conductor agent runs | `~/.stackmemory/conductor/prompt-template.md` | Output tokens, turn count | Prompt diet, GEPA, effort |
58+
| Context rehydration | `src/core/context/`, `src/core/digest/` | Input tokens | Budget caps, compression |
59+
| Ralph swarm iterations | `src/integrations/ralph/context/context-budget-manager.ts` | Input tokens/iteration | `maxTokens`, compression |
60+
| LLM retrieval | `src/core/retrieval/llm-*.ts` | Input + output | Cheap-model routing |
61+
| Hook overhead | `src/hooks/` (dedup, prewarm, script-suggest) | Duplicate reads, bad tool choice | Already enabled — keep on |
62+
| MCP tool surface | `src/integrations/mcp/tool-definitions.ts` | Input tokens (schemas in context) | Tool search / trim |
63+
64+
## Codified practices (ranked by impact)
65+
66+
### 1. Route by complexity — don't pay Opus for lint fixes
67+
`getOptimalProvider()` / `scoreComplexity()` already route low-complexity tasks
68+
to cheap providers and high-complexity to Anthropic, gated by the `multiProvider`
69+
feature flag. **Keep `multiProvider` enabled.** The sensitive-content guard
70+
(`detectSensitiveContent`) forces Anthropic for secrets/PII — never weaken it to
71+
save money.
72+
73+
- Default the conductor's *simple* tiers (AUTOMATE/STANDARD) to Sonnet, reserve
74+
Opus for CAREFUL/ARCHITECT. Opus↔Sonnet is a 1.7× swing; Opus↔Haiku is 5×.
75+
- Use **subagents on a cheaper model** for fan-out (Explore/grep/read) rather
76+
than switching the main loop's model — switching mid-session breaks the prompt
77+
cache (see #4).
78+
79+
### 2. Tune `effort`, not model, first
80+
On Opus 4.6+/Sonnet 4.6, `output_config: {effort: ...}` is the cheapest quality
81+
dial. `low`/`medium` mean fewer, more-consolidated tool calls and less preamble.
82+
Default to **`high`** for coding, drop to `medium` for cost-sensitive routes, and
83+
reserve `max`/`xhigh` for correctness-critical work. Pair with adaptive thinking
84+
(`thinking: {type: "adaptive"}`) so the model self-limits reasoning.
85+
86+
### 3. Spend fewer output tokens (5× input)
87+
- **Lower-effort, terser agents.** Add a silence-default to the conductor
88+
template: no narration between tool calls, one-or-two-sentence wrap-ups.
89+
- **Don't lowball `max_tokens`** — truncation forces a full re-run. Set a real
90+
ceiling, then let `effort`/`task_budget` moderate actual usage.
91+
- Use **Task Budgets** (`task_budget`, beta) for long agentic loops so the model
92+
sees a countdown and wraps up gracefully instead of being hard-truncated.
93+
94+
### 4. Prompt caching — keep the prefix frozen
95+
Cache reads are ~0.1× input. The entire win depends on a **byte-stable prefix**
96+
(`tools``system``messages`):
97+
- No `Date.now()`, UUIDs, or per-session IDs in the system prompt — inject
98+
volatile context later in `messages`.
99+
- Don't reorder/add tools or switch models mid-session (full cache rebuild).
100+
- Verify with `usage.cache_read_input_tokens`; zero across repeats = a silent
101+
invalidator. See the audit table in the `claude-api` skill (`prompt-caching`).
102+
- Pre-warm only when first-request latency is user-visible and traffic is bursty.
103+
104+
### 5. Cap context aggressively
105+
`ContextBudgetManager` (Ralph) already truncates, compresses, and
106+
priority-weights context with a `DEFAULT_MAX_TOKENS` budget. As prices ramp:
107+
- Lower per-iteration `maxTokens` budgets; keep `compressionEnabled` on.
108+
- Prefer digests/summaries over raw frame dumps for rehydration.
109+
- Trim the MCP tool surface or adopt tool-search so 56 tool schemas aren't all
110+
resident in context.
111+
112+
### 6. Batch the non-interactive work
113+
`AnthropicBatchClient` runs at **50% off**. Anything not latency-sensitive —
114+
backfills, bulk enrichment, digest regeneration, eval sweeps — belongs in a
115+
batch, not a live request.
116+
117+
### 7. Let the hooks do their job
118+
The token-optimization hooks (#14) already save ~22% (324K tokens on the
119+
benchmark): `dedup-reads` (escalates to `[STOP]` at 5+ duplicate reads),
120+
`desire-path-hook` (auto-routes Bash→Glob/Read/Grep), `prewarm-tools`,
121+
`script-suggest`. Don't disable them; extend them when new waste patterns show up
122+
in `scripts/benchmark-hooks.ts`.
123+
124+
### 8. Close the learning loop
125+
`conductor learn --evolve` (GEPA) mutates the prompt template from failure data,
126+
and `stackmemory optimize traces` surfaces repeated, wasteful patterns from
127+
`traces.db`. Run these regularly — a shorter, higher-success prompt is a
128+
permanent per-run discount that compounds as prices rise.
129+
130+
## 3-month phased playbook
131+
132+
The ramp is roughly: **month 0** ≈ 20% of list, **month 1.5** ≈ 60%, **month 3+**
133+
= full price. Escalate effort to match.
134+
135+
**Phase 1 — now (≈80% off): instrument & default-good.**
136+
- Confirm `multiProvider` on; verify cost-aware routing decisions in traces.
137+
- Land the terser conductor template + `effort` defaults.
138+
- Add cost-per-run to trace stats so the ramp is visible. Establish a baseline
139+
tokens/task number to measure against.
140+
141+
**Phase 2 — ~month 1–2 (≈40–70%): squeeze.**
142+
- Tighten `ContextBudgetManager` budgets; expand prompt-caching coverage and
143+
verify hit rates.
144+
- Move all non-interactive workloads to the Batches API.
145+
- Run a GEPA pass; adopt the winning template.
146+
147+
**Phase 3 — ~month 3 (full price): enforce.**
148+
- Treat budgets as hard limits, not hints. Alert when a run exceeds its
149+
`effectiveCost` budget.
150+
- Down-tier aggressively: Opus only for ARCHITECT/CAREFUL; Sonnet default;
151+
Haiku/cheap providers for AUTOMATE.
152+
- Re-baseline `count_tokens` against current models (token counting shifts
153+
between model versions — don't apply a blanket multiplier).
154+
155+
## Guardrails (don't optimize these away)
156+
157+
- **Security routing** — the sensitive-content guard must keep forcing Anthropic
158+
for secrets/PII regardless of cost.
159+
- **Correctness tiers** — CAREFUL/ARCHITECT work stays on the capable model; a
160+
cheap wrong answer that needs a re-run costs more than one right answer.
161+
- **No silent truncation** — cap context deliberately via the budget manager;
162+
never truncate inputs blindly.
163+
164+
## Quick reference
165+
166+
```bash
167+
stackmemory conductor learn --evolve # mutate prompt template from failures
168+
stackmemory optimize traces # find repeated wasteful patterns
169+
node scripts/benchmark-hooks.ts # measure hook token savings
170+
stackmemory conductor trace-stats # aggregate token usage
171+
```
172+
173+
```ts
174+
import {
175+
effectiveSpendMultiplier,
176+
effectiveCost,
177+
} from './core/models/provider-pricing.js';
178+
179+
effectiveSpendMultiplier(); // today's cost factor along the ramp (0.2 → 1.0)
180+
effectiveCost('anthropic', 'claude-opus-4-8', inTok, outTok); // ramp-adjusted cost
181+
```

0 commit comments

Comments
 (0)