Point at any LLM agent codebase. Harness Evolver will autonomously improve it — prompts, routing, tools, architecture — using multi-agent evolution with LangSmith as the evaluation backend.
```
/plugin marketplace add raphaelchristi/harness-evolver-marketplace
/plugin install harness-evolver
```

Or via the npx installer:

```
npx harness-evolver@latest
```

Works with Claude Code, Cursor, Codex, and Windsurf.
```shell
cd my-llm-project
export LANGSMITH_API_KEY="lsv2_pt_..."
claude
```

```
/harness:setup    # explores project, configures LangSmith
/harness:health   # check dataset quality (auto-corrects issues)
/harness:evolve   # runs the optimization loop
/harness:status   # check progress (rich ASCII chart)
/harness:deploy   # tag, push, finalize
```

Tested on a RAG agent (Agno framework, Gemini 3.1 Flash Lite, light mode):
```mermaid
xychart-beta
    title "agno-deepknowledge: 0.575 → 1.000 (+74%)"
    x-axis ["base", "v001", "v002", "v003", "v004", "v005", "v006", "v007"]
    y-axis "Correctness" 0 --> 1
    line [0.575, 0.575, 0.950, 0.950, 0.950, 0.950, 0.950, 1.0]
    bar [0.575, 0.333, 0.950, 0.720, 0.875, 0.680, 0.880, 1.0]
```
| Iter | Score | Merged? | What the proposer did |
|---|---|---|---|
| baseline | 0.575 | — | Original agent — hallucinations, broken tool calls, no retry logic |
| v001 | 0.333 | Yes | Anti-hallucination prompt (100% correct when API responded, but 60% hit rate limits) |
| v002 | 0.950 | Yes | Breakthrough: inlined 17-line KB into prompt, eliminated vector search entirely. 5.7x faster, zero rate limits |
| v003 | 0.720 | No | Attempted hybrid retrieval — regressed, rejected by constraint gate |
| v004 | 0.875 | No | Response completeness fix — improved one case but regressed others |
| v005 | 0.680 | No | Reduced tool calls — broke edge cases, rejected |
| v006 | 0.880 | Yes | Evolution memory insight: combined v001's anti-hallucination with one-shot example from archive |
| v007 | 1.000 | Yes | One-shot example injection + rubric-aligned responses — perfect on held-out |
The line shows best score (only goes up — regressions aren't merged). The bars show each candidate's raw score. 4 merged, 3 rejected by gate checks. Not every iteration improves — that's the point.
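The monotone best-score line falls out of the selection rule: a candidate's held-out score only replaces the champion when it strictly improves on it; everything else is archived. Here is a simplified sketch keyed purely on score (the real selector also weighs constraint and efficiency gates, which is how v001 could merge despite its rate-limit-depressed raw score; names below are illustrative, not the plugin's API):

```python
# Sketch of the merge gate: regressions are archived, never merged, so the
# best-score line is monotone. Illustrative names, not harness-evolver's API.
from dataclasses import dataclass, field

@dataclass
class EvolutionState:
    best_score: float                      # current champion's held-out score
    history: list = field(default_factory=list)

    def consider(self, version: str, score: float) -> bool:
        merged = score > self.best_score   # strict improvement required
        if merged:
            self.best_score = score
        self.history.append((version, score, merged))
        return merged

state = EvolutionState(best_score=0.575)   # baseline from the table above
for v, s in [("v001", 0.333), ("v002", 0.950), ("v003", 0.720), ("v007", 1.0)]:
    state.consider(v, s)

print(state.best_score)  # 1.0 — the line only moves up
```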
| Feature | Details |
|---|---|
| LangSmith-Native | No custom scripts. Uses LangSmith Datasets, Experiments, and LLM-as-judge. Everything visible in the LangSmith UI. |
| Real Code Evolution | Proposers modify actual code in isolated git worktrees. Winners merge automatically. |
| Self-Organizing Proposers | Two-wave spawning, dynamic lenses from failure data, archive branching from losing candidates. Self-abstention when redundant. |
| Rubric-Based Evaluation | LLM-as-judge with justification-before-score, rubrics, few-shot calibration, pairwise comparison. |
| Smart Gating | Constraint gates, efficiency gate (cost/latency pre-merge), regression guards, Pareto selection, holdout enforcement, rate-limit early abort, stagnation detection. |
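The "Pareto selection" in Smart Gating keeps every candidate that isn't dominated on the quality/cost trade-off, rather than ranking by score alone. A minimal sketch with hypothetical numbers (not the plugin's actual selector):

```python
# Sketch of Pareto selection over (quality, cost): a candidate survives only
# if no other candidate is at least as good on quality AND at least as cheap,
# with a strict win on one axis. Candidate tuples are hypothetical.
def pareto_front(candidates):
    # candidates: list of (name, quality, cost); higher quality and lower
    # cost are better.
    def dominated(a, b):
        return (b[1] >= a[1] and b[2] <= a[2]) and (b[1] > a[1] or b[2] < a[2])
    return [a for a in candidates
            if not any(dominated(a, b) for b in candidates if b is not a)]

cands = [("v003", 0.72, 1.0), ("v004", 0.875, 2.0),
         ("v005", 0.68, 0.5), ("v003b", 0.70, 1.5)]
print([c[0] for c in pareto_front(cands)])  # ['v003', 'v004', 'v005']
```

v003b loses: v003 is both better and cheaper, so it's dominated; the other three each win on at least one axis and survive for the downstream efficiency and constraint gates.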
```
/harness:evolve
|
+- 1. Preflight (validate state + dataset health + baseline scoring)
+- 2. Analyze (trace insights + failure clusters + strategy synthesis)
+- 3. Propose (spawn N proposers in git worktrees, two-wave)
+- 4. Evaluate (canary → run target → auto-spawn LLM-as-judge → rate-limit abort)
+- 5. Select (held-out comparison → Pareto front → efficiency gate → constraint gate → merge)
+- 6. Learn (archive candidates + regression guards + evolution memory)
+- 7. Gate (plateau → target check → critic/architect → continue or stop)
```
Detailed loop with all sub-steps
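The seven stages above can be sketched as a plain loop skeleton. The stage functions here are stubs standing in for agent invocations (proposers in git worktrees, LLM-as-judge, etc.); only the control flow mirrors the diagram:

```python
# Skeleton of the 7-stage /harness:evolve loop. Every stage function is a
# stub with hypothetical return values, not the plugin's internals.
def preflight(): return 0.575              # 1. validate state, score baseline
def analyze(best): return {"best": best}   # 2. trace insights, failure clusters
def propose(insights): return ["cand"]     # 3. spawn proposers, two-wave
def evaluate(cands): return [(c, 0.95) for c in cands]  # 4. canary + judge
def select(best, scored):                  # 5. gates decide merge vs archive
    return max([best] + [s for _, s in scored])
def learn(scored): pass                    # 6. archive, evolution memory
def gate(best, target): return best >= target  # 7. plateau / target check

def evolve(max_iters=8, target=1.0):
    best = preflight()
    for _ in range(max_iters):
        insights = analyze(best)
        candidates = propose(insights)
        scored = evaluate(candidates)
        best = select(best, scored)
        learn(scored)
        if gate(best, target):
            break
    return best

print(evolve(target=0.9))  # 0.95
```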
| Agent | Role |
|---|---|
| Proposer | Self-organizing — investigates a data-driven lens, decides own approach, may abstain |
| Evaluator | LLM-as-judge — rubric-aware scoring via langsmith-cli, few-shot calibration |
| Architect | ULTRAPLAN mode — deep topology analysis with Opus model |
| Critic | Active — detects evaluator gaming, implements stricter evaluators |
| Consolidator | Cross-iteration memory — anchored summarization, garbage collection |
| TestGen | Generates test inputs with rubrics + adversarial injection |
- LangSmith account + `LANGSMITH_API_KEY`
- Python 3.10+ · Git · Claude Code (or Cursor/Codex/Windsurf)
Dependencies installed automatically by the plugin hook or npx installer.
LangSmith traces any AI framework: LangChain/LangGraph (auto), OpenAI/Anthropic SDK (wrap_*, 2 lines), CrewAI/AutoGen (OpenTelemetry), any Python (@traceable).
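For plain Python, the `@traceable` decorator from the langsmith SDK is the lightest integration. A minimal sketch (the `try/except` no-op fallback is only there so the snippet runs without langsmith installed; the agent logic is a placeholder):

```python
# Sketch: tracing an arbitrary Python function with LangSmith's @traceable.
# In a real project: `pip install langsmith` and set LANGSMITH_API_KEY.
try:
    from langsmith import traceable
except ImportError:
    def traceable(**kwargs):               # no-op stand-in for the decorator
        def deco(fn):
            return fn
        return deco

@traceable(run_type="chain")               # each call becomes a LangSmith run
def answer(question: str) -> str:
    return f"stub answer to: {question}"   # placeholder for real agent logic

print(answer("What does the harness evolve?"))
```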
For full observability into what each proposer does during evolution (every file read, edit, and commit), install the LangSmith tracing plugin:
```
/plugin marketplace add langchain-ai/langsmith-claude-code-plugins
/plugin install langsmith-tracing@langsmith-claude-code-plugins
```
With both plugins installed, the evolution loop traces to LangSmith as a hierarchy: iteration → proposers → tool calls.
- Meta-Harness: End-to-End Optimization of Model Harnesses — Lee et al., 2026
- Self-Organizing LLM Agents Outperform Designed Structures — Dochkina, 2026
- Hermes Agent Self-Evolution — NousResearch
- Agent Skills for Context Engineering — Koylan
- A-Evolve: Automated Agent Evolution — Amazon (5-stage evolution loop, git-tagged mutations)
- Meta Context Engineering via Agentic Skill Evolution — Ye et al., Peking University, 2026
- EvoAgentX: Evolving Agentic Workflows — Wang et al., 2026
- Darwin Godel Machine — Sakana AI
- AlphaEvolve — DeepMind
- LangSmith Evaluation — LangChain
- Harnessing Claude's Intelligence — Martin, Anthropic, 2026
- Traces Start the Agent Improvement Loop — LangChain
MIT
