stackmemoryai
diff --git a/‎CLAUDE.md‎
Lines changed: 189 additions & 1 deletion b/‎CLAUDE.md‎
Lines changed: 189 additions & 1 deletion
diff --git a/‎docs/cost-optimization.md‎
Lines changed: 181 additions & 0 deletions b/‎docs/cost-optimization.md‎
Lines changed: 181 additions & 0 deletions
@@ -24,6 +24,37 @@ docker/         # Container configs
 prompts/        # Externalized LLM prompt templates
 ```
 
+## Key Files
+
+- Entry: src/cli/index.ts
+- MCP Server: src/integrations/mcp/server.ts
+- Frame Manager: src/core/context/frame-manager.ts
+- Database: src/core/database/sqlite-adapter.ts
+- Snapshot: src/core/worktree/capture.ts
+- Preflight: src/core/worktree/preflight.ts
+- Conductor: src/cli/commands/orchestrator.ts (core) + orchestrate.ts (CLI)
+- Conductor Traces: src/cli/commands/conductor-traces.ts
+- Frame Enrichment: src/core/context/frame-enrichment.ts
+- Process Utils: src/utils/process-cleanup.ts
+- Shared Utils: src/core/utils/{git,text,fs}.ts
+
+## Detailed Guides
+
+Quick reference (agent_docs/):
+- linear_integration.md - Linear sync
+- mcp_server.md - MCP tools
+- database_storage.md - Storage
+- claude_hooks.md - Hooks
+
+Full documentation (docs/):
+- principles.md - Agent programming paradigm
+- architecture.md - Extension model and browser sandbox
+- cost-optimization.md - Prompt/token cost playbook + Max-plan price ramp
+- SPEC.md - Technical specification
+- API_REFERENCE.md - API docs
+- DEVELOPMENT.md - Dev guide
+- SETUP.md - Installation
+
 ## Commands
 
 ```bash
@@ -84,4 +115,161 @@ docker-compose up -d   # Start local DBs
   - `task_query`
   - `recover_on_low_signal: true`
 - Do not fetch raw `get_conversation` context for worker execution unless full transcript behavior is explicitly required.
-- The current assignment is persisted under `.stackmemory/worker-context/current-assignment.json` so wrappers and hooks can auto-fill or enforce `task_query`.
+- The current assignment is persisted under `.stackmemory/worker-context/current-assignment.json` so wrappers and hooks can auto-fill or enforce `task_query`.
+
+## Security
+
+NEVER hardcode secrets - use process.env with dotenv/config
+
+```javascript
+import 'dotenv/config';
+const API_KEY = process.env.LINEAR_API_KEY;
+if (!API_KEY) {
+  console.error('LINEAR_API_KEY not set');
+  process.exit(1);
+}
+```
+
+Environment sources (check in order):
+1. .env file
+2. .env.local
+3. ~/.zshrc
+4. Process environment
+
+Secret patterns to block: lin_api_* | lin_oauth_* | sk-* | npm_*
+
+## Deploy
+
+```bash
+# npm publish (uses NPM_TOKEN from .env, no OTP needed)
+git stash -- scripts/gepa/           # stash GEPA state (dirties working tree)
+NPM_TOKEN=$(grep '^NPM_TOKEN=' .env | cut -d= -f2) \
+  npm publish --registry https://registry.npmjs.org/ \
+  --//registry.npmjs.org/:_authToken="$NPM_TOKEN"
+git stash pop                         # restore GEPA state
+
+# Railway
+railway up
+
+# Pre-publish checks require clean git status — stash GEPA files first
+```
+
+## Conductor (Autonomous Agent Orchestration)
+
+The conductor manages autonomous coding agents via Linear issues:
+
+**Data files** (all under `~/.stackmemory/conductor/`):
+- `prompt-template.md` — Agent prompt template with `{{VARIABLE}}` substitution (auto-created on first `conductor start`)
+- `outcomes.jsonl` — JSONL log of agent outcomes (success/failure, phase, tokens, errors)
+- `evolution-log.jsonl` — History of `--evolve` mutations applied to the prompt template
+- `agents/<issue-id>/status.json` — Per-agent status files
+- `agents/<issue-id>/output.log` — Agent stdout/stderr
+- `traces.db` — SQLite database with per-turn conversation traces (tool calls, tokens, phases, content previews)
+
+**Intelligence features**:
+- Multi-model routing with difficulty prediction (routes simple tasks to cheaper models)
+- Smart retry with exponential backoff and prior context injection
+- Auto-PR creation on successful agent completion
+- Trace-based evidence: per-turn conversation logging (tools, tokens, phases) to traces.db
+
+**Learning loop**:
+1. Agents run → outcomes logged to `outcomes.jsonl`, traces to `traces.db`
+2. `conductor learn` analyzes patterns (success rate, failure phases, error types)
+3. `conductor learn --evolve` calls Claude to mutate `prompt-template.md` based on failure data
+4. Next agent run uses the improved template → repeat
+
+**Template variables**: `{{ISSUE_ID}}`, `{{TITLE}}`, `{{DESCRIPTION}}`, `{{LABELS}}`, `{{PRIORITY}}`, `{{ATTEMPT}}`, `{{PRIOR_CONTEXT}}`
+
+## Task Delegation Model
+
+Route effort by task complexity — not all code changes deserve equal scrutiny:
+
+**AUTOMATE** — Execute immediately, lint+test is sufficient:
+- CRUD operations, boilerplate, formatting, simple transforms
+- Adding a tool handler following existing switch/case pattern
+- Config additions (new env var, feature flag)
+
+**STANDARD** — Normal workflow, lint+test+build:
+- Feature implementation, bug fixes, refactoring
+- New test coverage, documentation updates
+- Integration wiring (adding handler to server.ts dispatch)
+
+**CAREFUL** — Review approach before implementation:
+- API/schema changes, database migrations, auth flows
+- New integration patterns (MCP tools, webhook handlers)
+- Changes to frame-manager, sqlite-adapter, or daemon lifecycle
+- Anything touching error handling chains
+
+**ARCHITECT** — Plan mode required, explore existing patterns first:
+- New service boundaries, system integrations
+- Performance-critical paths (FTS5 queries, search scoring)
+- Breaking changes to MCP protocol or CLI interface
+
+**HUMAN** — Explicit user approval before any changes:
+- Security-critical decisions, secret handling
+- Irreversible operations (data migrations, schema drops)
+- Publishing (npm publish, Railway deploy)
+
+Quality gates scale with tier — don't over-engineer AUTOMATE tasks, don't under-review CAREFUL ones.
+
+For AUTOMATE and STANDARD tiers: make only the requested changes. Don't refactor surrounding code, add abstractions for one-time operations, or create helpers that are used once. Three similar lines of code is better than a premature abstraction.
+
+## Cost Optimization
+
+Assume token costs only go up: Max-plan usage ramps from ~80% off to full price over ~3 months (codified in `src/core/models/provider-pricing.ts` → `MAX_PLAN_DISCOUNT_RAMP` / `effectiveSpendMultiplier()`). The cheapest token is the one you don't send. Full playbook: `docs/cost-optimization.md`.
+
+Codified defaults (cheapest lever first):
+- **Route by complexity.** Keep `multiProvider` on — `getOptimalProvider()`/`scoreComplexity()` send simple tasks to cheap models, hard ones to Anthropic. Opus only for CAREFUL/ARCHITECT; Sonnet default; Haiku/cheap providers for AUTOMATE. Output tokens cost 5× input.
+- **Tune `effort` before model.** Default `high` for coding; `medium` for cost-sensitive; `max`/`xhigh` only for correctness-critical. Pair with adaptive thinking.
+- **Protect the prompt cache.** Keep the prefix byte-stable — no timestamps/UUIDs/IDs in the system prompt, no mid-session tool/model swaps (full rebuild). Cache reads are ~0.1×.
+- **Batch non-interactive work** via `AnthropicBatchClient` (50% off).
+- **Cap context** via `ContextBudgetManager`; keep the token-optimization hooks on (dedup/prewarm/script-suggest).
+- **Close the loop:** `conductor learn --evolve` (GEPA) + `stackmemory optimize traces` shrink prompts permanently.
+
+Guardrails (never trade for cost): the sensitive-content guard must keep forcing Anthropic for secrets/PII; correctness tiers stay on the capable model; never truncate inputs silently — cap deliberately via the budget manager.
+
+## Session Budget
+
+- Max 1 major topic per session — split unrelated work into separate sessions
+- Run /compact or summarize at ~50% context usage to avoid overflow
+- Plan-execute sessions (low interaction, high edits) are most efficient
+- Avoid exploratory marathons with topic-switching — burns 30-40% extra tokens
+
+## Context Maintenance
+
+**`/update-docs`** — Run weekly or when context feels stale:
+- Audits CLAUDE.md, MEMORY.md, agent_docs/ against git history and codebase
+- Detects stale entries, missing patterns, outdated paths
+- Trigger: start of week, after major refactors, or when sessions feel slow/confused
+
+**`/recover`** — Run when a session goes off the rails:
+- Analyzes traces to find where context drifted from intent
+- Maps drift to specific doc fixes (missing guidance, stale memory, ambiguous instruction)
+- Trigger: user says "this is wrong", "not what I wanted", "off the rails", repeated corrections
+
+**`/next`** — Run at session start or when asking "what's next":
+- Scans git log, TODO files, Linear issues, and memory for actionable items
+- Prioritizes: unfinished work > flagged issues > queued tasks > continuations
+- Trigger: session start, "what's next", "whats next", between tasks
+
+**`/learn`** — Run at session end to capture learnings:
+- Reviews session work, then audits memory, CLAUDE.md, skills, scripts, and wiki
+- Proposes creates/updates/deletes with confirmation before applying
+- Trigger: end of session, after significant work, "what should I update"
+
+**When to use which:**
+- Starting a session or between tasks → `/next` (pick what to work on)
+- Session producing wrong results → `/recover` (diagnose + fix now)
+- Routine maintenance, nothing broken → `/update-docs` (proactive gardening)
+- After publishing a new version → `/update-docs` (catch version/path drift)
+- After conductor failures → `/recover last` (learn from agent traces)
+- End of session → `/learn` (capture what changed, update artifacts)
+
+## Workflow
+
+- Check .env for API keys before asking
+- Run npm run linear:sync after task completion
+- Use browser MCP for visual testing
+- Review recent commits and stackmemory.json on session start
+- Use subagents for multi-step tasks
+- Ask 1-3 clarifying questions for complex commands (one at a time)
@@ -0,0 +1,181 @@
+# Cost Optimization
+
+Codified practices for keeping StackMemory's prompt and token spend low — and for
+staying ahead of rising costs as the Max-plan discount expires.
+
+> **Planning assumption.** Max-plan usage starts at an **80% discount** and ramps
+> **linearly to full price over ~3 months** (≈2026-06-06 → 2026-09-06). Every
+> token of agent effort gets ~5× more expensive over that window. This is
+> codified once in `src/core/models/provider-pricing.ts`
+> (`MAX_PLAN_DISCOUNT_RAMP` / `effectiveSpendMultiplier()`); routing, budgets, and
+> this playbook all read from it. Treat optimizations below as "nice to have"
+> today and "load-bearing" by September.
+
+## Guiding principle
+
+**Be model-agnostic; route by task value, not by reflex.** The expensive
+failure mode is defaulting every workload to the most capable (most expensive)
+model "to be safe" — that burns budget with zero governance. Instead:
+
+- **Match model to task.** Cheap/high-volume models for inference and simple
+  transforms; a premium model (Opus) only for agent workflows where reliability
+  pays for itself; the most expensive tier *only* when the incremental capability
+  demonstrably justifies a multiple-× token premium.
+- **Govern, audit, control.** Spend should be measurable per run, attributable
+  to a task tier, and bounded by a budget — not an untracked aggregate. Cost
+  controls below are pointless without the trace-level visibility to enforce them.
+
+This is the same axis as the [Task Delegation Model](../CLAUDE.md) tiers
+(AUTOMATE → ARCHITECT): route effort — and spend — by complexity and value.
+
+## The cost model
+
+Two distinct meters run:
+
+1. **API spend** (third-party + Anthropic API) — billed per token via
+   `MODEL_PRICING`. Cost-aware routing already optimizes this.
+2. **Max-plan agent effort** — the conductor spawns Claude Code agents on the
+   Max plan. Today heavily discounted; ramping to full price. Optimize this by
+   spending *fewer tokens per task* (tighter prompts, less context, fewer turns)
+   and *fewer tasks on the expensive tier* (route, cache, batch).
+
+Current list prices (per 1M tokens, sourced 2026-05-26):
+
+| Model            | Input | Output | Context |
+| ---------------- | ----- | ------ | ------- |
+| Opus 4.6/4.7/4.8 | $5    | $25    | 1M      |
+| Sonnet 4.6       | $3    | $15    | 1M      |
+| Haiku 4.5        | $1    | $5     | 200K    |
+
+Output tokens cost **5×** input. Cache reads cost **~0.1×** input; batch is
+**50% off**. The cheapest token is the one you don't send.
+
+## Where spend happens (inventory)
+
+| Surface | File(s) | Dominant cost | Lever |
+| --- | --- | --- | --- |
+| Conductor agent runs | `~/.stackmemory/conductor/prompt-template.md` | Output tokens, turn count | Prompt diet, GEPA, effort |
+| Context rehydration | `src/core/context/`, `src/core/digest/` | Input tokens | Budget caps, compression |
+| Ralph swarm iterations | `src/integrations/ralph/context/context-budget-manager.ts` | Input tokens/iteration | `maxTokens`, compression |
+| LLM retrieval | `src/core/retrieval/llm-*.ts` | Input + output | Cheap-model routing |
+| Hook overhead | `src/hooks/` (dedup, prewarm, script-suggest) | Duplicate reads, bad tool choice | Already enabled — keep on |
+| MCP tool surface | `src/integrations/mcp/tool-definitions.ts` | Input tokens (schemas in context) | Tool search / trim |
+
+## Codified practices (ranked by impact)
+
+### 1. Route by complexity — don't pay Opus for lint fixes
+`getOptimalProvider()` / `scoreComplexity()` already route low-complexity tasks
+to cheap providers and high-complexity to Anthropic, gated by the `multiProvider`
+feature flag. **Keep `multiProvider` enabled.** The sensitive-content guard
+(`detectSensitiveContent`) forces Anthropic for secrets/PII — never weaken it to
+save money.
+
+- Default the conductor's *simple* tiers (AUTOMATE/STANDARD) to Sonnet, reserve
+  Opus for CAREFUL/ARCHITECT. Opus↔Sonnet is a 1.7× swing; Opus↔Haiku is 5×.
+- Use **subagents on a cheaper model** for fan-out (Explore/grep/read) rather
+  than switching the main loop's model — switching mid-session breaks the prompt
+  cache (see #4).
+
+### 2. Tune `effort`, not model, first
+On Opus 4.6+/Sonnet 4.6, `output_config: {effort: ...}` is the cheapest quality
+dial. `low`/`medium` mean fewer, more-consolidated tool calls and less preamble.
+Default to **`high`** for coding, drop to `medium` for cost-sensitive routes, and
+reserve `max`/`xhigh` for correctness-critical work. Pair with adaptive thinking
+(`thinking: {type: "adaptive"}`) so the model self-limits reasoning.
+
+### 3. Spend fewer output tokens (5× input)
+- **Lower-effort, terser agents.** Add a silence-default to the conductor
+  template: no narration between tool calls, one-or-two-sentence wrap-ups.
+- **Don't lowball `max_tokens`** — truncation forces a full re-run. Set a real
+  ceiling, then let `effort`/`task_budget` moderate actual usage.
+- Use **Task Budgets** (`task_budget`, beta) for long agentic loops so the model
+  sees a countdown and wraps up gracefully instead of being hard-truncated.
+
+### 4. Prompt caching — keep the prefix frozen
+Cache reads are ~0.1× input. The entire win depends on a **byte-stable prefix**
+(`tools` → `system` → `messages`):
+- No `Date.now()`, UUIDs, or per-session IDs in the system prompt — inject
+  volatile context later in `messages`.
+- Don't reorder/add tools or switch models mid-session (full cache rebuild).
+- Verify with `usage.cache_read_input_tokens`; zero across repeats = a silent
+  invalidator. See the audit table in the `claude-api` skill (`prompt-caching`).
+- Pre-warm only when first-request latency is user-visible and traffic is bursty.
+
+### 5. Cap context aggressively
+`ContextBudgetManager` (Ralph) already truncates, compresses, and
+priority-weights context with a `DEFAULT_MAX_TOKENS` budget. As prices ramp:
+- Lower per-iteration `maxTokens` budgets; keep `compressionEnabled` on.
+- Prefer digests/summaries over raw frame dumps for rehydration.
+- Trim the MCP tool surface or adopt tool-search so 56 tool schemas aren't all
+  resident in context.
+
+### 6. Batch the non-interactive work
+`AnthropicBatchClient` runs at **50% off**. Anything not latency-sensitive —
+backfills, bulk enrichment, digest regeneration, eval sweeps — belongs in a
+batch, not a live request.
+
+### 7. Let the hooks do their job
+The token-optimization hooks (#14) already save ~22% (324K tokens on the
+benchmark): `dedup-reads` (escalates to `[STOP]` at 5+ duplicate reads),
+`desire-path-hook` (auto-routes Bash→Glob/Read/Grep), `prewarm-tools`,
+`script-suggest`. Don't disable them; extend them when new waste patterns show up
+in `scripts/benchmark-hooks.ts`.
+
+### 8. Close the learning loop
+`conductor learn --evolve` (GEPA) mutates the prompt template from failure data,
+and `stackmemory optimize traces` surfaces repeated, wasteful patterns from
+`traces.db`. Run these regularly — a shorter, higher-success prompt is a
+permanent per-run discount that compounds as prices rise.
+
+## 3-month phased playbook
+
+The ramp is roughly: **month 0** ≈ 20% of list, **month 1.5** ≈ 60%, **month 3+**
+= full price. Escalate effort to match.
+
+**Phase 1 — now (≈80% off): instrument & default-good.**
+- Confirm `multiProvider` on; verify cost-aware routing decisions in traces.
+- Land the terser conductor template + `effort` defaults.
+- Add cost-per-run to trace stats so the ramp is visible. Establish a baseline
+  tokens/task number to measure against.
+
+**Phase 2 — ~month 1–2 (≈40–70%): squeeze.**
+- Tighten `ContextBudgetManager` budgets; expand prompt-caching coverage and
+  verify hit rates.
+- Move all non-interactive workloads to the Batches API.
+- Run a GEPA pass; adopt the winning template.
+
+**Phase 3 — ~month 3 (full price): enforce.**
+- Treat budgets as hard limits, not hints. Alert when a run exceeds its
+  `effectiveCost` budget.
+- Down-tier aggressively: Opus only for ARCHITECT/CAREFUL; Sonnet default;
+  Haiku/cheap providers for AUTOMATE.
+- Re-baseline `count_tokens` against current models (token counting shifts
+  between model versions — don't apply a blanket multiplier).
+
+## Guardrails (don't optimize these away)
+
+- **Security routing** — the sensitive-content guard must keep forcing Anthropic
+  for secrets/PII regardless of cost.
+- **Correctness tiers** — CAREFUL/ARCHITECT work stays on the capable model; a
+  cheap wrong answer that needs a re-run costs more than one right answer.
+- **No silent truncation** — cap context deliberately via the budget manager;
+  never truncate inputs blindly.
+
+## Quick reference
+
+```bash
+stackmemory conductor learn --evolve     # mutate prompt template from failures
+stackmemory optimize traces              # find repeated wasteful patterns
+node scripts/benchmark-hooks.ts          # measure hook token savings
+stackmemory conductor trace-stats        # aggregate token usage
+```
+
+```ts
+import {
+  effectiveSpendMultiplier,
+  effectiveCost,
+} from './core/models/provider-pricing.js';
+
+effectiveSpendMultiplier();              // today's cost factor along the ramp (0.2 → 1.0)
+effectiveCost('anthropic', 'claude-opus-4-8', inTok, outTok); // ramp-adjusted cost
+```