Prompt evolution grounded in real code execution — not just keyword matching.
Last Evolution Cycle: 2026-05-28T21:29:06+00:00 UTC
Generations: 203
Best Score: 39.0
Population Size: 50
Benchmarks: 7
Test Quality: 4–5 real assertions per cycle
Grounded Evolution is an autonomous prompt optimization platform built on a simple premise:
A prompt is only as good as the code it actually produces.
Most prompt optimizers score prompts by keyword matching — checking whether certain words appear in the prompt text. This measures what the prompt says, not what it generates.
Grounded Evolution does both — and the execution-grounded signal is what makes this fundamentally different.
┌────────────────────────────────────────────────────────────┐
│ GROUNDED EVOLUTION │
│ │
│ ┌──────────────────────┐ ┌──────────────────────────┐ │
│ │ LOOP 1: Lexical │ │ LOOP 2: Execution │ │
│ │ │ │ │ │
│ │ evaluate.py │ │ generator.py │ │
│ │ 400+ keyword checks │ │ LLM → actual project │ │
│ │ Scores prompt TEXT │ │ │ │
│ │ │ │ runtime_evaluator.py │ │
│ │ mutate.py │ │ AST + pytest + flake8 │ │
│ │ 5 genetic strategies│ │ Scores GENERATED CODE │ │
│ │ │ │ │ │
│ │ auto_evolve.py │ │ infinite_research_loop │ │
│ │ Meta-evolution │ │ Continuous evolution │ │
│ └──────────────────────┘ └──────────────────────────┘ │
└────────────────────────────────────────────────────────────┘
The original autoresearch-ai-agent-skeleton system (and most prompt optimizers) works like this:
# evaluate.py: checks if prompt TEXT contains keywords
if "kubernetes" in prompt_text:
    score += 2
if "pytest" in prompt_text:
    score += 2

This measures signal coverage: does the prompt mention the right things? But it cannot answer:
- Does the prompt actually produce working code?
- Does the generated project compile?
- Do the tests pass?
- Is the code well-structured?
The grounded loop actually generates code from the prompt, then validates it:
Prompt text
│
▼
generator.py ────► LLM (Mistral/OpenAI) ────► generated_project/
│ │
│ ┌─────────┴──────────┐
│ │ runtime_evaluator │
│ │ ├── AST parse │
│ │ ├── pytest │
│ │ ├── flake8 │
│ │ └── structure │
│ └─────────┬──────────┘
│ │
└──────────── Execution score feeds back ─────────┘
| Capability | Lexical-Only (original) | Grounded (this repo) |
|---|---|---|
| Evaluates prompt text | ✅ 400+ keyword signals | ✅ Same 400+ signals |
| Generates code from prompt | ❌ | ✅ LLM generates real project |
| Compiles generated code | ❌ | ✅ AST syntax validation |
| Runs generated tests | ❌ | ✅ pytest execution |
| Lints generated code | ❌ | ✅ flake8 checks |
| Scores execution quality | ❌ | ✅ 10 structural metrics |
| Continuous evolution | ❌ (finite generations) | ✅ Infinite research loop |
| Auto-commits improvements | ❌ | ✅ git auto-commit |
| Meta-evolution | ✅ injects new signals | ✅ injects new signals |
| Reflection/analysis | ✅ reflect.py | ✅ reflect.py |
| LLM Provider | N/A | Mistral by default (configurable) |
Total Fitness = Lexical Score + Execution Score
- Lexical Score: what the prompt mentions
- Execution Score: what the generated code actually does
The lexical score (0–1000) estimates what a prompt might produce. The execution score (0–100) validates what it actually produces. Together, they prevent the system from optimizing for keyword coverage at the expense of real code quality.
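A minimal sketch of the combined fitness, assuming both scores are plain floats. The `normalize` flag and the `LEXICAL_MAX` / `EXECUTION_MAX` constants are hypothetical names introduced here, not part of the repository; the rescaled variant only illustrates one way to keep the larger lexical range from drowning out the execution signal.

```python
LEXICAL_MAX = 1000.0    # upper bound of the lexical score quoted above
EXECUTION_MAX = 100.0   # upper bound of the execution score quoted above


def total_fitness(lexical: float, execution: float, normalize: bool = False) -> float:
    """Combine the two signals into one fitness value.

    With normalize=False this is the plain sum from the formula above.
    With normalize=True (a hypothetical variant) each score is first
    rescaled to the 0-1 range so both signals carry equal weight.
    """
    if normalize:
        return lexical / LEXICAL_MAX + execution / EXECUTION_MAX
    return lexical + execution


print(total_fitness(862, 39.0))                  # plain sum
print(total_fitness(862, 39.0, normalize=True))  # rescaled sum
```

With the plain sum, a prompt can gain far more from extra keywords than from better generated code, which is why the rescaled variant may be worth comparing.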
graph TB
subgraph Lexical["📝 Lexical Loop (evaluate.py)"]
L1["Population<br/>(150 prompts)"] --> L2["mutate.py<br/>5 genetic strategies"]
L2 --> L3["evaluate.py<br/>400+ keyword signals"]
L3 --> L4["reflect.py<br/>rank + persist"]
L4 --> L1
L4 -->|"meta-evolve"| L5["auto_evolve.py<br/>inject new signals"]
L5 --> L3
end
subgraph Grounded["⚡ Execution Grounding Loop"]
G1["select_best(population)"] --> G2["mutation_engine.py<br/>mutate or crossover"]
G2 --> G3["generator.py<br/>LLM → project files"]
G3 --> G4["runtime_evaluator.py<br/>AST + pytest + flake8"]
G4 --> G5["Score execution<br/>& update population"]
G5 --> G1
end
Lexical -.->|champion prompt| Grounded
Grounded -.->|execution score| Lexical
graph LR
P["🧬 150 prompts"] --> M["✂️ mutate.py"]
M --> E["📊 evaluate.py"]
E --> R["📝 reflect.py"]
R --> S["🏆 Select best"]
S --> P
S -->|meta| A["auto_evolve.py"]
A --> E
graph LR
PROMPT["Best prompt"] --> GEN["generator.py"]
GEN --> RUN["runtime_evaluator.py"]
RUN --> SCORE["Execution metrics"]
SCORE --> POP["Update population"]
POP --> PROMPT
xychart-beta
title "Lexical Score Evolution (Generations 1–150)"
x-axis ["Gen 1", "Gen 15", "Gen 30", "Gen 50", "Gen 75", "Gen 100", "Gen 125", "Gen 150"]
y-axis "Score" 0 --> 1000
bar [35, 196, 330, 422, 632, 740, 840, 862]
line [35, 196, 330, 422, 632, 740, 840, 862]
400+ keyword-based quality signals across 19 categories. Each signal is a simple presence check:
if "kubernetes" in content or "k8s" in content:
    score += 2

| Category | Signals | Rewards |
|---|---|---|
| Tech Stack | 14 | Ollama, LangGraph, Pydantic, httpx, Rich |
| Quality | 30+ | type hints, error handling, async, streaming |
| Security & Auth | 12 | encryption, RBAC, MFA, OAuth, JWT |
| Performance | 8 | caching, pooling, circuit breaker |
| Testing | 14 | integration, e2e, property-based, mutation |
| Deployment & Ops | 16 | Docker, K8s, Terraform, Helm, ArgoCD |
| ML/AI | 12 | RAG, embeddings, chain-of-thought, fine-tuning |
| Data & Storage | 14 | SQL, NoSQL, Redis, migrations, ETL |
| Cloud & IaC | 30+ | AWS, GCP, Azure, Terraform, Pulumi |
| Compliance | 15+ | HIPAA, GDPR, SOC2, PCI DSS, ISO 27001 |
| Plus 9 more categories | 200+ | Mobile, messaging, databases, design patterns... |
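The presence checks above can also be expressed as data rather than a chain of `if` statements. The sketch below is illustrative only: the `SIGNALS` table, the `lexical_score` name, and the case-insensitive matching are assumptions, not the layout of the real evaluate.py.

```python
# Hypothetical signal table: (keyword alternatives, points).
# The real evaluate.py is described as holding 400+ such checks.
SIGNALS = [
    (("kubernetes", "k8s"), 2),
    (("pytest",), 2),
    (("docker",), 2),
]


def lexical_score(content: str) -> int:
    """Add a signal's points once if any of its keywords appears in the text."""
    text = content.lower()  # assumption: matching ignores case
    score = 0
    for keywords, points in SIGNALS:
        if any(keyword in text for keyword in keywords):
            score += points
    return score


print(lexical_score("Deploy on K8s and test with pytest."))  # 4
```

Keeping signals in a table makes it straightforward for a meta-evolution step to append new entries without editing control flow.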
The execution-grounded validator. This is what makes the system grounded: it doesn't just check keywords; it runs the generated code:
| Check | Max Score | What It Validates |
|---|---|---|
| AST parse | 20 | Every .py file is syntactically valid Python |
| Function count | 5 | At least 1 function defined |
| Class count | 5 | At least 1 class defined |
| pytest | +25 / -5 | Tests pass; failures penalize |
| flake8 | 10 | PEP 8 compliance (select rules) |
| Runtime import | 15 | main.py imports without errors |
| Has tests | 5 | Test files present |
| Has README | 2 | Documentation exists |
| Has requirements | 3 | Dependencies declared |
| Multi-file | 5 | 3+ files rewarded |
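A rough sketch of the static rows of the table (AST parse, function count, class count, multi-file), using only the standard `ast` module. The function names, the all-or-nothing handling of the 20 AST points, and counting only Python sources toward the multi-file reward are assumptions; the real runtime_evaluator.py may award partial credit or count files differently.

```python
import ast


def structural_metrics(sources: dict[str, str]) -> dict[str, int]:
    """Parse each file with ast and count functions and classes.

    `sources` maps a file name to its Python source text.
    """
    metrics = {"files": len(sources), "syntax_errors": 0, "functions": 0, "classes": 0}
    for source in sources.values():
        try:
            tree = ast.parse(source)
        except SyntaxError:
            metrics["syntax_errors"] += 1
            continue
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                metrics["functions"] += 1
            elif isinstance(node, ast.ClassDef):
                metrics["classes"] += 1
    return metrics


def structural_score(metrics: dict[str, int]) -> int:
    """Apply the AST, function, class and multi-file rows of the table above."""
    score = 0
    if metrics["files"] > 0 and metrics["syntax_errors"] == 0:
        score += 20  # AST parse: every file is valid Python (assumed all-or-nothing)
    if metrics["functions"] >= 1:
        score += 5
    if metrics["classes"] >= 1:
        score += 5
    if metrics["files"] >= 3:
        score += 5  # multi-file reward
    return score


project = {
    "main.py": "def run():\n    return 1\n",
    "models.py": "class Item:\n    pass\n",
    "config.py": "DEBUG = False\n",
}
print(structural_score(structural_metrics(project)))
```

The pytest, flake8 and import checks need a subprocess and a project on disk, so they are left out of this sketch.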
Connects to an LLM (Mistral by default) to generate real project files from prompts:
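# generator.py
import os

from openai import OpenAI

# Client setup is not shown in the original excerpt; this assumes an
# OpenAI-compatible endpoint configured through the LLM_* environment variables.
client = OpenAI(
    api_key=os.environ["LLM_API_KEY"],
    base_url=os.environ.get("LLM_BASE_URL"),
)


def generate_code(prompt):
    """Send the prompt to the configured model and return the raw completion text."""
    response = client.chat.completions.create(
        model=os.environ["LLM_MODEL"],
        messages=[
            {"role": "system", "content": "You are an autonomous software architect..."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

generator.py also provides write_project_files(), which parses multi-file code blocks and writes them to disk for evaluation.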
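The exact format that write_project_files() expects is not documented here. As an illustration only, the sketch below assumes each file arrives as a fenced code block preceded by a `# file: <relative path>` line; the helper names `parse_files` and `write_files` are hypothetical and are not the repository's API.

```python
import re
import tempfile
from pathlib import Path

FENCE = "`" * 3  # built indirectly so this example can itself sit in a fenced block

# Assumed format: a "# file: <relative path>" line directly above each fenced block.
BLOCK_RE = re.compile(
    r"^# file: (?P<path>\S+)\n" + FENCE + r"\w*\n(?P<body>.*?)^" + FENCE,
    re.DOTALL | re.MULTILINE,
)


def parse_files(llm_output: str) -> dict[str, str]:
    """Return {relative path: file contents} for every labelled code block."""
    return {m["path"]: m["body"] for m in BLOCK_RE.finditer(llm_output)}


def write_files(files: dict[str, str], root: Path) -> list[Path]:
    """Write the parsed files under root, refusing paths that escape it."""
    root = root.resolve()
    written = []
    for rel_path, body in files.items():
        target = (root / rel_path).resolve()
        if not target.is_relative_to(root):
            raise ValueError(f"unsafe path: {rel_path}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(body, encoding="utf-8")
        written.append(target)
    return written


sample = "\n".join(["# file: main.py", FENCE + "python", "print('hi')", FENCE, ""])
with tempfile.TemporaryDirectory() as tmp:
    paths = write_files(parse_files(sample), Path(tmp))
    print([p.name for p in paths])
```

The path check matters because the file names come from model output: without it, a block labelled `../something.py` would be written outside the project directory.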
The continuous evolution loop that ties everything together:
Each cycle:
1. Load population from population.json
2. Select best prompt
3. Mutate it (30% crossover chance)
4. Pick a random benchmark task
5. Generate code via LLM
6. Write project files to disk
7. Evaluate execution (AST, pytest, flake8)
8. Score = base execution score + content bonus
9. Update population
10. Auto-update README status
11. If score improved → git commit & push
12. Wait 10 seconds, repeat
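The core of one cycle (steps 2 to 9) can be sketched as a single function with the LLM call and the evaluator passed in as callables, so it runs without network access. Everything here is an assumption made for illustration: the `run_cycle` name, the dictionary layout of an individual, and using the second-ranked prompt as the crossover partner. Disk writes, the README update, the git commit and the 10-second wait are omitted.

```python
import random
from typing import Callable


def run_cycle(
    population: list[dict],
    tasks: list[str],
    mutate: Callable[[str], str],
    crossover: Callable[[str, str], str],
    generate: Callable[[str, str], str],
    evaluate: Callable[[str], float],
    rng: random.Random,
    crossover_rate: float = 0.3,
    cap: int = 50,
) -> bool:
    """Run one cycle in place; return True if the best score improved."""
    ranked = sorted(population, key=lambda ind: ind["score"], reverse=True)
    best = ranked[0]
    # 30% of the time recombine with another prompt, otherwise mutate the best one.
    if len(ranked) > 1 and rng.random() < crossover_rate:
        child_prompt = crossover(best["prompt"], ranked[1]["prompt"])
    else:
        child_prompt = mutate(best["prompt"])
    task = rng.choice(tasks)                         # random benchmark task
    score = evaluate(generate(child_prompt, task))   # generate, then score the output
    population.append({"prompt": child_prompt, "score": score})
    population.sort(key=lambda ind: ind["score"], reverse=True)
    del population[cap:]                             # keep only the top `cap` individuals
    return score > best["score"]


# Stub callables stand in for the LLM and the runtime evaluator.
pop = [{"prompt": "Build a CLI tool.", "score": 17.0}]
improved = run_cycle(
    pop,
    ["todo app"],
    mutate=lambda p: p + " Require tests.",
    crossover=lambda a, b: a + " " + b,
    generate=lambda prompt, task: f"{task}: {prompt}",
    evaluate=lambda code: float(len(code)),
    rng=random.Random(0),
)
print(improved, len(pop))
```

Injecting the callables also makes the loop testable: a deterministic stub evaluator is enough to check the selection and capping logic.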
39 mutation operations that transform prompts, with self-tuning weights:
MUTATIONS = [
    {"desc": "Add stronger modularity requirements", "weight": 1.0},
    {"desc": "Require async support", "weight": 1.0},
    {"desc": "Require comprehensive tests with pytest", "weight": 1.0},
    # ... 36 more, weights auto-adjusted by success rate
]

Mutations that consistently produce negative score deltas have their probability reduced (down to 0.1x). Successful mutations maintain or increase their weight. Weights persist to memory/mutation_weights.json.
Also provides crossover_prompts() for genetic recombination between two prompts.
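One way the self-tuning weights could work is sketched below. The additive step of 0.1, the `MAX_WEIGHT` ceiling and both function names are assumptions made for illustration; only the 0.1 floor comes from the description above.

```python
import random

MIN_WEIGHT = 0.1  # floor quoted above
MAX_WEIGHT = 3.0  # hypothetical ceiling, not taken from the repository


def update_weight(weight: float, score_delta: float, step: float = 0.1) -> float:
    """Raise the weight after a gain, lower it after a loss, and clamp the result."""
    if score_delta > 0:
        weight += step
    elif score_delta < 0:
        weight -= step
    return min(MAX_WEIGHT, max(MIN_WEIGHT, weight))


def pick_mutation(mutations: list[dict], rng: random.Random) -> dict:
    """Sample one mutation with probability proportional to its weight."""
    weights = [m["weight"] for m in mutations]
    return rng.choices(mutations, weights=weights, k=1)[0]


mutations = [
    {"desc": "Require async support", "weight": 1.0},
    {"desc": "Require comprehensive tests with pytest", "weight": 1.0},
]
chosen = pick_mutation(mutations, random.Random(0))
chosen["weight"] = update_weight(chosen["weight"], score_delta=-2.0)
print(chosen)
```

Because the floor is above zero, a down-weighted mutation is still sampled occasionally, which leaves some room for it to recover if conditions change.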
JSON-based population persistence with:
- Tournament selection — random subset, pick best
- Elitism — top-k selection
- Capped population — keeps top 50 individuals
- Generation tracking — each individual tagged with generation number
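The selection rules in this list can be sketched in a few lines. The function names and the dictionary layout of an individual are hypothetical; the real population_manager.py may store additional fields.

```python
import random


def tournament_select(population: list[dict], k: int, rng: random.Random) -> dict:
    """Draw k distinct individuals at random and return the highest-scoring one."""
    contenders = rng.sample(population, min(k, len(population)))
    return max(contenders, key=lambda ind: ind["score"])


def elites(population: list[dict], top_k: int) -> list[dict]:
    """Return the top_k individuals by score, best first."""
    return sorted(population, key=lambda ind: ind["score"], reverse=True)[:top_k]


def add_individual(
    population: list[dict], prompt: str, score: float, generation: int, cap: int = 50
) -> list[dict]:
    """Return a new population with the individual added and the cap enforced."""
    candidate = {"prompt": prompt, "score": score, "generation": generation}
    return elites(population + [candidate], cap)


pop = [{"prompt": f"p{i}", "score": float(i), "generation": 0} for i in range(5)]
pop = add_individual(pop, "new prompt", 10.0, generation=1, cap=5)
print([ind["score"] for ind in pop])
print(tournament_select(pop, k=3, rng=random.Random(0))["prompt"])
```

A small tournament size keeps selection pressure low, while `k` equal to the population size reduces to always picking the best individual.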
When prompts saturate the current scoring ceiling, these scripts inject entirely new scoring signals into evaluate.py:
- auto_evolve.py: 10 signal pools (CI/CD, containers, databases, testing, etc.)
- evolve_forever.py: 40+ signal pools (400+ signals across cloud, mobile, compliance, design patterns, etc.)
After each generation, reflect.py ranks all prompts and records:
- Best/worst/average scores
- Spread between 1st and 2nd place
- Pattern observations (e.g., "Auth/security is differentiating top prompts")
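The numeric part of that record can be sketched as follows; the function name and the returned keys are assumptions, and the free-text pattern observations are not modelled.

```python
from statistics import mean


def summarize_generation(scores: list[float]) -> dict[str, float]:
    """Compute best, worst, average and the gap between first and second place."""
    ranked = sorted(scores, reverse=True)
    return {
        "best": ranked[0],
        "worst": ranked[-1],
        "average": mean(ranked),
        # With a single prompt there is no runner-up, so the spread is zero.
        "spread": ranked[0] - ranked[1] if len(ranked) > 1 else 0.0,
    }


print(summarize_generation([39.0, 17.0, 31.0]))
```

A spread that stays near zero across generations is one simple indicator that the top prompts have become nearly interchangeable.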
- Python 3.12+
- LLM API key — Mistral, OpenAI, or any OpenAI-compatible provider
git clone git@github.com:NullLabTests/grounded_evolution.git
cd grounded_evolution
python -m venv .venv && source .venv/bin/activate
pip install openai pytest flake8 black rich gitpython psutil
# Set your LLM provider (Mistral by default)
export LLM_API_KEY='your_key_here'
export LLM_MODEL="mistral-large-latest" # or "gpt-4o", etc.
export LLM_BASE_URL="https://api.mistral.ai/v1" # or OpenAI's URL
# Or use OpenAI directly:
# export OPENAI_API_KEY='your_key_here'

python eval.py          # Score the seed prompt
python auto_evolve.py   # 25 generations of lexical evolution

# Continuous evolution with real code validation
python infinite_research_loop.py

This starts the infinite loop:
- Takes the best prompt from population
- Generates a real Python project via LLM
- Validates by compiling, testing, and linting
- Updates scores with execution results
- Auto-commits improvements to git
# Run the full experiment grid (4 ablations × 3 benchmarks)
python run_experiment.py
# Quick smoke test (10 cycles per condition)
python run_experiment.py --quick
# Preview what will run
python run_experiment.py --list
# Resume incomplete runs
python run_experiment.py --resume

Results are logged to experiments/run_log.jsonl and per-condition files in experiments/ablation_runs/.
# Plot from main experiment log
python analysis/plot_convergence.py
# Plot from ablation runs with rolling average
python analysis/plot_convergence.py --ablation --rolling=5

Charts saved to analysis/charts/. Requires matplotlib.
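For readers unfamiliar with the smoothing that a `--rolling=5` style option implies, here is a minimal trailing rolling mean. How plot_convergence.py actually computes it is not shown in this README, so the warm-up behaviour (averaging over fewer points at the start) and the function name are assumptions.

```python
def rolling_mean(values: list[float], window: int) -> list[float]:
    """Trailing mean over the last `window` values; shorter at the start of the series."""
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


print(rolling_mean([17.0, 25.0, 31.0, 39.0, 39.0], window=2))
```

Smoothing of this kind makes a plateau easier to see, at the cost of delaying any visible jump by up to `window` cycles.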
# Edit the seed prompt
$EDITOR prompt.txt
# Score it
python eval.py
# Keep or revert
git add prompt.txt && git commit -m "Score improved"
# or
git checkout prompt.txt

| Metric | Value |
|---|---|
| Cycles | 203 |
| Best Execution Score | 39.0 / 80 |
| Score Range | 17.0 → 39.0 |
| Average Score | 30.9 |
| Test Quality | 4–5 real assertions/cycle (after prompt fix) |
| Hidden Tests Passed | 0 / 203 |
| Total LLM Tokens | ~1,000,000 |
| Mutation Operators | 127 uses |
| Crossover Operators | 76 uses |
| Process Stability | 100% (no crashes) |
1. Score plateau at 39/80 — Despite 203 cycles, the execution score never exceeded 39. The system converged to a fitness plateau. Prompts evolved to produce projects that pass basic structural checks (syntax, imports, file count) but consistently failed benchmark-specific behavioral tests. This suggests the mutation operators explore prompt text similarity space, not functional correctness space — and these are not the same.
2. Test quality is directly controllable via prompt engineering — The single most impactful change was improving the LLM system prompt from "Generate clean code" to "Generate real tests with assertions, no placeholders". This moved test quality from 0 real assertions to 4–5 per cycle instantly. The generator system prompt is a critical lever.
3. Self-tuning mutation weights work but converge — The weight adjustment system successfully downweighted 5 consistently harmful mutations to 0.1× probability. However, this also reduced exploration diversity — the same few mutations were repeatedly selected, narrowing the search space.
4. Hidden behavioral tests remain unsolved — Across 203 cycles, not a single generated project passed benchmark-specific hidden tests. The generated code produces correct structure (right function signatures, right file layout) but not correct behavior (the functions don't actually work as specified). This is the fundamental open problem.
5. LLM generation is reliable — The Mistral API completed all 203 generations without a single failure. Average generation time was stable at ~60s/cycle.
6. Population converges without diversity preservation — The greedy elitist selection (keep top 50) caused the population to converge to near-identical prompts producing 3-file projects. A diversity-preserving mechanism (e.g., novelty search or fitness sharing) is needed.
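To make the diversity suggestion in finding 6 concrete, the sketch below shows a simplified form of fitness sharing: each prompt's score is divided by the number of near-duplicate prompts in the population. This is not implemented in the repository; the word-level Jaccard similarity, the 0.8 threshold and both function names are assumptions chosen for illustration.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Word-set overlap between two prompts, from 0.0 (disjoint) to 1.0 (identical sets)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def shared_fitness(population: list[dict], threshold: float = 0.8) -> list[float]:
    """Divide each raw score by the size of its niche (prompts at least `threshold` similar)."""
    shared = []
    for individual in population:
        niche_size = sum(
            1
            for other in population
            if jaccard_similarity(individual["prompt"], other["prompt"]) >= threshold
        )
        shared.append(individual["score"] / niche_size)  # niche_size >= 1 (self)
    return shared


pop = [
    {"prompt": "build a cli tool", "score": 30.0},
    {"prompt": "build a cli tool", "score": 30.0},
    {"prompt": "write an async web server", "score": 30.0},
]
print(shared_fitness(pop))
```

Ranking by shared fitness instead of raw score would penalise the two identical prompts, giving the distinct one a better chance of surviving the top-50 cut.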
| Challenge | Impact | Hypothesis |
|---|---|---|
| Hidden test failure | Blocks scores above 40 | Need behavioral validation mid-generation, not post-hoc |
| Population convergence | Stagnation after ~50 cycles | Add novelty search or multi-objective optimization |
| Token cost | ~5,000 tokens/cycle | Cache similar prompts, use cheaper models for pre-filtering |
| Mutation granularity | Most mutations are neutral | Add structured mutations that modify specific prompt sections |
The original autoresearch-ai-agent-skeleton pioneered the idea of evolving prompts with genetic algorithms and lexical scoring. It proved that prompts could be optimized algorithmically.
Grounded Evolution extends that work by adding an entirely new dimension — execution-grounded validation. The prompt isn't just scored on what it says; it's scored on what the code it generates actually does.
| Aspect | Lexical-Only | Grounded |
|---|---|---|
| Prompt scoring | Keyword presence in text | Keyword presence + real execution |
| Code generation | None | LLM generates projects from prompts |
| Validation | None | AST parse + pytest + flake8 |
| Evolution loop | Finite generations | Continuous (infinite) |
| Fitness signal | Text coverage | Text coverage + code quality |
| Meta-evolution | Injects new keywords | Injects new keywords + auto-commit |
| Research question | "What keywords make a good prompt?" | "What prompts produce code that actually works?" |
# evaluate.py
if "your-keyword" in content:
    score += 2

Or add to SIGNAL_POOLS in auto_evolve.py for automated meta-injection.
# evaluator/runtime_evaluator.py: extend evaluate_project()
if some_condition:
    score += N
    metrics["my_check"] = result

# mutate.py, line 130-133
strategy = random.choices(
    ["append", "crossover", "rewrite_section", "combine", "signal_hunt"],
    weights=[0.2, 0.2, 0.15, 0.15, 0.3],
)[0]

// benchmarks/tasks.json
[
  {
    "name": "my_benchmark",
    "prompt": "Create a ..."
  }
]

export LLM_API_KEY='sk-...'
export LLM_MODEL="gpt-4o"
export LLM_BASE_URL="https://api.openai.com/v1"

Or use local models via Ollama:
export LLM_BASE_URL="http://localhost:11434/v1"
export LLM_MODEL="qwen2.5:7b"
export LLM_API_KEY="ollama" # Ollama ignores the key

grounded_evolution/
├── README.md # This file
├── CHANGELOG.md # Release history
├── CONTRIBUTING.md # Contribution guide
├── SECURITY.md # Security policy
├── LICENSE # MIT license
├── pyproject.toml # Project metadata
├── program.md # Agent instructions
├── prompt.txt # Seed prompt
│
├── evaluate.py # Lexical scoring (400+ signals)
├── eval.py # Quick eval (30 signals)
├── mutate.py # Genetic mutation (5 strategies)
├── reflect.py # Generation analysis
├── auto_evolve.py # Meta-evolution (10 pools)
├── evolve_forever.py # Aggressive meta-evolution (40+ pools)
│
├── generator.py # LLM code generation
├── mutation_engine.py # Prompt mutation operators
├── population_manager.py # Population persistence
├── infinite_research_loop.py # Continuous grounded evolution
├── beautify_readme.py # README status updater
├── run_evolution.sh # Bash automation
│
├── evaluator/
│ └── runtime_evaluator.py # Execution validation
│
├── benchmarks/
│ └── tasks.json # Benchmark definitions
│
├── population/ # 150 evolved prompts
├── generated_projects/ # Generated code outputs
├── memory/ # Evolution state
├── reports/ # Generated reports
├── runtime_logs/ # Execution logs
├── reflection.md # Full evolution history
│
├── docs/
│ ├── COMPARISON.md # Lexical vs Grounded
│ └── ARCHITECTURE.md # Architecture docs
│
└── .github/
├── workflows/ # CI pipeline
└── ISSUE_TEMPLATE/ # Issue templates
Grounded Evolution is framed within evolutionary software optimization research:
- Evaluator-grounded prompt evolution — Fitness functions grounded in both lexical coverage and execution-based validation
- Autonomous experimentation infrastructure — Continuous, unattended evolution cycles with meta-level adaptation
- Recursive benchmark optimization — The evaluator evolves alongside the prompts, preventing fitness stagnation
- Execution-grounded fitness — The core innovation: prompts are not just scored on what they say, but on what the code they generate actually does
This is not:
- A claim of AGI or sentience
- A self-conscious or self-aware system
- Runaway recursive self-improvement
It is a well-scoped experimental system for studying how genetic algorithms can optimize prompts for code generation quality — with real execution validation.
MIT — see LICENSE.
Inspired by Andrej Karpathy's autoresearch. The original lexical evolution framework was developed as autoresearch-ai-agent-skeleton. Grounded Evolution adds execution-grounded validation and continuous autonomous experimentation.