🧬 Grounded Evolution

Prompt evolution grounded in real code execution — not just keyword matching.

Overview • Why Grounded? • Architecture • Quick Start • Results • Project Structure

Current Status

Last Evolution Cycle: 2026-05-28T21:29:06+00:00
Generations: 203
Best Score: 39.0
Population Size: 50
Benchmarks: 7
Test Quality: 4–5 real assertions per cycle

Overview

Grounded Evolution is an autonomous prompt optimization platform built on a simple premise:

A prompt is only as good as the code it actually produces.

Most prompt optimizers score prompts by keyword matching: they check whether certain words appear in the prompt text. That measures what the prompt says, not what it generates.

Grounded Evolution measures both. It keeps the lexical score and adds an execution-grounded score obtained by generating code from the prompt and running it, and that second signal is what sets this system apart.

Two Evaluation Loops in One System

┌────────────────────────────────────────────────────────────┐
│                   GROUNDED EVOLUTION                       │
│                                                            │
│  ┌──────────────────────┐    ┌──────────────────────────┐  │
│  │  LOOP 1: Lexical     │    │  LOOP 2: Execution       │  │
│  │                      │    │                          │  │
│  │  evaluate.py         │    │  generator.py            │  │
│  │  400+ keyword checks │    │  LLM → actual project    │  │
│  │  Scores prompt TEXT  │    │                          │  │
│  │                      │    │  runtime_evaluator.py    │  │
│  │  mutate.py           │    │  AST + pytest + flake8   │  │
│  │  5 genetic strategies│    │  Scores GENERATED CODE   │  │
│  │                      │    │                          │  │
│  │  auto_evolve.py      │    │  infinite_research_loop  │  │
│  │  Meta-evolution      │    │  Continuous evolution    │  │
│  └──────────────────────┘    └──────────────────────────┘  │
└────────────────────────────────────────────────────────────┘

Why Grounded?

The Problem with Pure Lexical Scoring

The original autoresearch-ai-agent-skeleton system (and most prompt optimizers) works like this:

# evaluate.py — checks if prompt TEXT contains keywords
if "kubernetes" in prompt_text:
    score += 2
if "pytest" in prompt_text:
    score += 2

This measures signal coverage — does the prompt mention the right things? But it cannot answer:

Does the prompt actually produce working code?
Does the generated project compile?
Do the tests pass?
Is the code well-structured?
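To make the limitation concrete, here is a minimal sketch of presence-based scoring. The `SIGNALS` table and the `lexical_score` helper are illustrative names, not the actual contents of evaluate.py; only the "+2 per keyword" pattern is taken from the snippet above.

```python
# Illustrative signal table: keyword -> points (the real evaluate.py has 400+ checks).
SIGNALS = {"kubernetes": 2, "pytest": 2}


def lexical_score(prompt_text: str) -> int:
    """Sum the points of every keyword that appears in the prompt text."""
    text = prompt_text.lower()
    return sum(points for keyword, points in SIGNALS.items() if keyword in text)


useful = "Build a CLI todo app with pytest tests and deploy it on Kubernetes."
stuffed = "kubernetes pytest"  # no task at all, just keywords

# Both prompts receive the same score, although only one describes a project.
print(lexical_score(useful), lexical_score(stuffed))
```

A presence check cannot tell these two prompts apart, which is why the questions above need a different kind of evaluator.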

Grounded Evolution Answers Those Questions

The grounded loop actually generates code from the prompt, then validates it:

Prompt text
    │
    ▼
generator.py ────► LLM (Mistral/OpenAI) ────► generated_project/
    │                                                │
    │                                      ┌─────────┴──────────┐
    │                                      │  runtime_evaluator  │
    │                                      │  ├── AST parse      │
    │                                      │  ├── pytest         │
    │                                      │  ├── flake8         │
    │                                      │  └── structure      │
    │                                      └─────────┬──────────┘
    │                                                │
    └──────────── Execution score feeds back ─────────┘

Side-by-Side Comparison

Capability	Lexical-Only (original)	Grounded (this repo)
Evaluates prompt text	✅ 400+ keyword signals	✅ Same 400+ signals
Generates code from prompt	❌	✅ LLM generates real project
Compiles generated code	❌	✅ AST syntax validation
Runs generated tests	❌	✅ pytest execution
Lints generated code	❌	✅ flake8 checks
Scores execution quality	❌	✅ 10 structural metrics
Continuous evolution	❌ (finite generations)	✅ Infinite research loop
Auto-commits improvements	❌	✅ git auto-commit
Meta-evolution	✅ injects new signals	✅ injects new signals
Reflection/analysis	✅ reflect.py	✅ reflect.py
LLM Provider	N/A	Mistral by default (configurable)

Dual Evaluation: Two Fitness Signals

Total Fitness = Lexical Score + Execution Score
                     ↑                ↑
              What the prompt     What the generated
              mentions            code actually does

The lexical score (0–1000) estimates what a prompt might produce. The execution score (0–100) validates what it actually produces. Together, they prevent the system from optimizing for keyword coverage at the expense of real code quality.
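As a rough sketch, the combination can be written as below. `total_fitness` follows the formula above; `normalized_fitness` is a hypothetical variant, not something this repo is stated to implement, included only to show how the two ranges (0–1000 and 0–100) could be put on the same footing. The sample values 862 and 39.0 are the final lexical score from the trajectory chart and the best execution score from the status block.

```python
def total_fitness(lexical_score: float, execution_score: float) -> float:
    """Plain sum, as in the formula above."""
    return lexical_score + execution_score


def normalized_fitness(
    lexical_score: float,
    execution_score: float,
    lexical_max: float = 1000.0,    # upper bound of the lexical range stated above
    execution_max: float = 100.0,   # upper bound of the execution range stated above
    execution_weight: float = 0.5,  # hypothetical weighting, not from the repo
) -> float:
    """Hypothetical variant: scale each score to [0, 1] before weighting."""
    lexical_part = lexical_score / lexical_max
    execution_part = execution_score / execution_max
    return (1 - execution_weight) * lexical_part + execution_weight * execution_part


print(total_fitness(862, 39.0))
print(normalized_fitness(862, 39.0))
```

With a plain sum, the lexical term can be an order of magnitude larger than the execution term, so a weighting like the second function is one way to keep the execution signal from being drowned out.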

Architecture

System Diagram

graph TB
    subgraph Lexical["📝 Lexical Loop (evaluate.py)"]
        L1["Population<br/>(150 prompts)"] --> L2["mutate.py<br/>5 genetic strategies"]
        L2 --> L3["evaluate.py<br/>400+ keyword signals"]
        L3 --> L4["reflect.py<br/>rank + persist"]
        L4 --> L1
        L4 -->|"meta-evolve"| L5["auto_evolve.py<br/>inject new signals"]
        L5 --> L3
    end

    subgraph Grounded["⚡ Execution Grounding Loop"]
        G1["select_best(population)"] --> G2["mutation_engine.py<br/>mutate or crossover"]
        G2 --> G3["generator.py<br/>LLM → project files"]
        G3 --> G4["runtime_evaluator.py<br/>AST + pytest + flake8"]
        G4 --> G5["Score execution<br/>& update population"]
        G5 --> G1
    end

    Lexical -.->|champion prompt| Grounded
    Grounded -.->|execution score| Lexical

Evolution Cycle (Lexical)

graph LR
    P["🧬 150 prompts"] --> M["✂️ mutate.py"]
    M --> E["📊 evaluate.py"]
    E --> R["📝 reflect.py"]
    R --> S["🏆 Select best"]
    S --> P
    S -->|meta| A["auto_evolve.py"]
    A --> E

Execution Cycle (Grounded)

graph LR
    PROMPT["Best prompt"] --> GEN["generator.py"]
    GEN --> RUN["runtime_evaluator.py"]
    RUN --> SCORE["Execution metrics"]
    SCORE --> POP["Update population"]
    POP --> PROMPT

Score Trajectory

xychart-beta
    title "Lexical Score Evolution (Generations 1–150)"
    x-axis ["Gen 1", "Gen 15", "Gen 30", "Gen 50", "Gen 75", "Gen 100", "Gen 125", "Gen 150"]
    y-axis "Score" 0 --> 1000
    bar [35, 196, 330, 422, 632, 740, 840, 862]
    line [35, 196, 330, 422, 632, 740, 840, 862]

Module Deep-Dive

Lexical Layer: `evaluate.py` (61 KB)

400+ keyword-based quality signals across 19 categories. Each signal is a simple presence check:

if "kubernetes" in content or "k8s" in content:
    score += 2

Category	Signals	Rewards
Tech Stack	14	Ollama, LangGraph, Pydantic, httpx, Rich
Quality	30+	type hints, error handling, async, streaming
Security & Auth	12	encryption, RBAC, MFA, OAuth, JWT
Performance	8	caching, pooling, circuit breaker
Testing	14	integration, e2e, property-based, mutation
Deployment & Ops	16	Docker, K8s, Terraform, Helm, ArgoCD
ML/AI	12	RAG, embeddings, chain-of-thought, fine-tuning
Data & Storage	14	SQL, NoSQL, Redis, migrations, ETL
Cloud & IaC	30+	AWS, GCP, Azure, Terraform, Pulumi
Compliance	15+	HIPAA, GDPR, SOC2, PCI DSS, ISO 27001
Plus 9 more categories	200+	Mobile, messaging, databases, design patterns...

Grounded Layer: `runtime_evaluator.py` (184 lines)

The execution-grounded validator. This is what makes the system grounded — it doesn't just check keywords, it runs the code:

Check	Max Score	What It Validates
AST parse	20	Every `.py` file is syntactically valid Python
Function count	5	At least 1 function defined
Class count	5	At least 1 class defined
pytest	+25 / -5	+25 when the generated tests pass; -5 when they fail
flake8	10	PEP 8 compliance (select rules)
Runtime import	15	`main.py` imports without errors
Has tests	5	Test files present
Has README	2	Documentation exists
Has requirements	3	Dependencies declared
Multi-file	5	3+ files rewarded
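The first three rows of the table can be sketched with the standard library alone. This is an illustration, not the code in runtime_evaluator.py: the helper names `ast_metrics` and `structural_score` are made up here, and awarding the 20 AST points only when every file parses is an assumption.

```python
import ast
from pathlib import Path


def ast_metrics(project_dir: str) -> dict:
    """Parse every .py file and count function and class definitions."""
    metrics = {"files": 0, "syntax_errors": 0, "functions": 0, "classes": 0}
    for path in sorted(Path(project_dir).rglob("*.py")):
        metrics["files"] += 1
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError:
            metrics["syntax_errors"] += 1
            continue
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                metrics["functions"] += 1
            elif isinstance(node, ast.ClassDef):
                metrics["classes"] += 1
    return metrics


def structural_score(metrics: dict) -> int:
    """Apply the table's maxima; the all-or-nothing AST rule is an assumption."""
    score = 0
    if metrics["files"] > 0 and metrics["syntax_errors"] == 0:
        score += 20  # AST parse
    if metrics["functions"] >= 1:
        score += 5   # Function count
    if metrics["classes"] >= 1:
        score += 5   # Class count
    return score
```

Calling `structural_score(ast_metrics("generated_projects/some_run"))` (a hypothetical path) would return at most 30, the combined maximum of these three rows.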

Generator: `generator.py` (235 lines)

Connects to an LLM (Mistral by default) to generate real project files from prompts:

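# generator.py
import os

from openai import OpenAI

# Assumption: the client is built from the LLM_* variables set in Quick Start.
client = OpenAI(api_key=os.environ["LLM_API_KEY"], base_url=os.environ.get("LLM_BASE_URL"))


def generate_code(prompt):
    """Send the evolved prompt to the configured model and return its raw reply."""
    response = client.chat.completions.create(
        model=os.environ["LLM_MODEL"],
        messages=[
            {"role": "system", "content": "You are an autonomous software architect..."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content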

generator.py also provides write_project_files(), which parses the multi-file code blocks in the LLM reply and writes them to disk for evaluation.
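A parser of this kind might look like the sketch below. The reply layout is an assumption: each file is taken to be announced by a `# file: <relative/path>` line directly above a fenced block. The signature shown here is also hypothetical, so the real write_project_files() may differ on both counts.

```python
import re
from pathlib import Path

FENCE = "`" * 3  # built programmatically so this example can itself sit inside a fence

# Assumed reply layout: a "# file: <relative/path>" line directly above each fenced block.
FILE_BLOCK = re.compile(
    r"^# file: (?P<path>\S+)\n" + FENCE + r"[\w+-]*\n(?P<body>.*?)\n" + FENCE,
    re.DOTALL | re.MULTILINE,
)


def write_project_files(llm_output: str, out_dir: str) -> list[str]:
    """Write each labelled block under out_dir and return the relative paths written."""
    root = Path(out_dir).resolve()
    written = []
    for match in FILE_BLOCK.finditer(llm_output):
        target = (root / match["path"]).resolve()
        if not target.is_relative_to(root):
            continue  # skip paths that try to escape the output directory
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(match["body"] + "\n", encoding="utf-8")
        written.append(match["path"])
    return written
```

Rejecting paths that resolve outside `out_dir` matters here because the file names come from model output, which should be treated as untrusted input.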

Infinite Research Loop: `infinite_research_loop.py` (187 lines)

The continuous evolution loop that ties everything together:

Each cycle:
1. Load population from population.json
2. Select best prompt
3. Mutate it (30% crossover chance)
4. Pick a random benchmark task
5. Generate code via LLM
6. Write project files to disk
7. Evaluate execution (AST, pytest, flake8)
8. Score = base execution score + content bonus
9. Update population
10. Auto-update README status
11. If score improved → git commit & push
12. Wait 10 seconds, repeat
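The core of one cycle (steps 2 to 9 and the improvement test of step 11) can be sketched as a pure function. The I/O steps are left out: loading population.json, updating the README, committing to git, and sleeping. `run_cycle` and its callable parameters are hypothetical stand-ins for the real mutation engine, generator, and runtime evaluator, and the individual layout (`prompt`, `score`, `generation`) is assumed.

```python
import random
from typing import Callable


def run_cycle(
    population: list[dict],
    tasks: list[dict],
    mutate: Callable[[str], str],
    generate: Callable[[str, dict], str],
    evaluate: Callable[[str], float],
    rng: random.Random,
    cap: int = 50,
) -> tuple[list[dict], bool]:
    """Run one cycle and return (new population, whether the best score improved)."""
    best = max(population, key=lambda ind: ind["score"])  # step 2: select best prompt
    child_prompt = mutate(best["prompt"])                 # step 3: mutate or crossover
    task = rng.choice(tasks)                              # step 4: random benchmark
    project = generate(child_prompt, task)                # steps 5-6: generate and write
    score = evaluate(project)                             # steps 7-8: execution score
    child = {
        "prompt": child_prompt,
        "score": score,
        "generation": best["generation"] + 1,
    }
    ranked = sorted(population + [child], key=lambda ind: ind["score"], reverse=True)
    return ranked[:cap], score > best["score"]            # step 9, and the test for step 11


# Usage with toy stand-ins instead of an LLM and a real evaluator.
seed = [{"prompt": "Write a CLI app.", "score": 10.0, "generation": 0}]
new_population, improved = run_cycle(
    seed,
    tasks=[{"name": "demo"}],
    mutate=lambda prompt: prompt + " Add tests.",
    generate=lambda prompt, task: prompt,
    evaluate=lambda project: 20.0,
    rng=random.Random(0),
)
print(improved, len(new_population))
```

Passing the collaborators in as callables keeps the loop testable without network access, which is the main reason for this shape.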

Mutation Engine: `mutation_engine.py` (136 lines)

39 mutation operations that transform prompts, with self-tuning weights:

MUTATIONS = [
    {"desc": "Add stronger modularity requirements", "weight": 1.0},
    {"desc": "Require async support", "weight": 1.0},
    {"desc": "Require comprehensive tests with pytest", "weight": 1.0},
    # ... 36 more, weights auto-adjusted by success rate
]

Mutations that consistently produce negative score deltas have their selection weight reduced, down to a floor of 0.1×. Successful mutations maintain or increase their weight. Weights persist to memory/mutation_weights.json.
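A minimal sketch of weighted selection with this kind of self-tuning follows. The 0.1 floor comes from the text above; the decay factor, boost factor, and upper cap are assumptions chosen for illustration, as are the helper names.

```python
import random

MIN_WEIGHT = 0.1  # floor from the text: weights drop to 0.1x at most


def pick_mutation(mutations: list[dict], rng: random.Random) -> dict:
    """Sample one mutation with probability proportional to its weight."""
    return rng.choices(mutations, weights=[m["weight"] for m in mutations], k=1)[0]


def update_weight(
    mutation: dict,
    score_delta: float,
    decay: float = 0.8,       # assumed shrink factor after a negative delta
    boost: float = 1.1,       # assumed growth factor after a positive delta
    max_weight: float = 2.0,  # assumed cap so one operator cannot dominate
) -> None:
    """Shrink the weight after a negative delta, grow it after a positive one."""
    if score_delta < 0:
        mutation["weight"] = max(MIN_WEIGHT, mutation["weight"] * decay)
    elif score_delta > 0:
        mutation["weight"] = min(max_weight, mutation["weight"] * boost)


mutations = [
    {"desc": "Add stronger modularity requirements", "weight": 1.0},
    {"desc": "Require async support", "weight": 1.0},
]
chosen = pick_mutation(mutations, random.Random(0))
update_weight(chosen, score_delta=-3.0)
print(chosen["desc"], chosen["weight"])
```

A floor above zero keeps every operator selectable, so a mutation that was harmful early on can still be retried later.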

mutation_engine.py also provides crossover_prompts() for genetic recombination between two prompts.
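How the repo recombines prompts is not described here, so the following is only one plausible reading: a one-point crossover at sentence boundaries. The signature with an explicit `rng` argument and the `split_sentences` helper are assumptions made for the sketch.

```python
import random
import re


def split_sentences(prompt: str) -> list[str]:
    """Split on whitespace that follows '.', '!' or '?'; drop empty pieces."""
    return [s for s in re.split(r"(?<=[.!?])\s+", prompt.strip()) if s]


def crossover_prompts(parent_a: str, parent_b: str, rng: random.Random) -> str:
    """Join the leading sentences of parent_a with the trailing sentences of parent_b."""
    sentences_a = split_sentences(parent_a)
    sentences_b = split_sentences(parent_b)
    cut_a = rng.randint(0, len(sentences_a))  # how many sentences to keep from parent_a
    cut_b = rng.randint(0, len(sentences_b))  # where the tail of parent_b starts
    child = sentences_a[:cut_a] + sentences_b[cut_b:]
    # Fall back to parent_a when both cuts leave nothing.
    return " ".join(child) if child else parent_a


a = "Use type hints. Write pytest tests. Keep modules small."
b = "Support async IO! Add a README."
print(crossover_prompts(a, b, random.Random(1)))
```

Cutting at sentence boundaries keeps each requirement intact, whereas a character-level cut would often produce broken instructions.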

Population Manager: `population_manager.py` (50 lines)

JSON-based population persistence with:

Tournament selection — random subset, pick best
Elitism — top-k selection
Capped population — keeps top 50 individuals
Generation tracking — each individual tagged with generation number
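The four behaviours listed above fit in a few lines. This is a sketch under assumptions rather than the contents of population_manager.py: the function names and the individual layout (`prompt`, `score`, `generation`) are illustrative.

```python
import random


def tournament_select(population: list[dict], k: int, rng: random.Random) -> dict:
    """Draw k individuals at random and return the highest-scoring one."""
    contenders = rng.sample(population, min(k, len(population)))
    return max(contenders, key=lambda ind: ind["score"])


def elites(population: list[dict], top_k: int) -> list[dict]:
    """Return the top_k individuals by score, best first."""
    return sorted(population, key=lambda ind: ind["score"], reverse=True)[:top_k]


def add_individual(
    population: list[dict], prompt: str, score: float, generation: int, cap: int = 50
) -> list[dict]:
    """Tag the newcomer with its generation, then keep only the best `cap` individuals."""
    newcomer = {"prompt": prompt, "score": score, "generation": generation}
    return elites(population + [newcomer], cap)


population = [{"prompt": f"p{i}", "score": float(i), "generation": 0} for i in range(60)]
population = add_individual(population, "new prompt", 100.0, generation=1)
print(len(population), population[0]["prompt"])
```

Tournament size controls selection pressure: with `k` equal to the population size the call always returns the current best, while small `k` gives weaker prompts a chance.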

Meta-Evolution: `auto_evolve.py` / `evolve_forever.py`

When prompts saturate the current scoring ceiling, these scripts inject entirely new scoring signals into evaluate.py:

auto_evolve.py — 10 signal pools (CI/CD, containers, databases, testing, etc.)
evolve_forever.py — 40+ signal pools (400+ signals across cloud, mobile, compliance, design patterns, etc.)

Reflection: `reflect.py` (270 lines)

After each generation, ranks all prompts and records:

Best/worst/average scores
Spread between 1st and 2nd place
Pattern observations (e.g., "Auth/security is differentiating top prompts")
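The numeric part of such a report is simple to sketch. The `summarize_generation` helper and its field names are hypothetical, and the sample scores are made up for illustration; pattern observations are not covered.

```python
from statistics import mean


def summarize_generation(scores: list[float]) -> dict:
    """Compute best, worst, average, and the gap between 1st and 2nd place."""
    ranked = sorted(scores, reverse=True)
    return {
        "best": ranked[0],
        "worst": ranked[-1],
        "average": mean(ranked),
        "spread_top2": ranked[0] - ranked[1] if len(ranked) > 1 else 0.0,
    }


print(summarize_generation([10.0, 7.0, 4.0]))
```

A small gap between first and second place is a cheap indicator that the top prompts have become near-duplicates.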

Quick Start

Prerequisites

Python 3.12+
LLM API key — Mistral, OpenAI, or any OpenAI-compatible provider

Setup

git clone git@github.com:NullLabTests/grounded_evolution.git
cd grounded_evolution

python -m venv .venv && source .venv/bin/activate
pip install openai pytest flake8 black rich gitpython psutil

# Set your LLM provider (Mistral by default)
export LLM_API_KEY='your_key_here'
export LLM_MODEL="mistral-large-latest"        # or "gpt-4o", etc.
export LLM_BASE_URL="https://api.mistral.ai/v1" # or OpenAI's URL

# Or use OpenAI directly:
# export OPENAI_API_KEY='your_key_here'

Quick Lexical Evaluation

python eval.py          # Score the seed prompt
python auto_evolve.py   # 25 generations of lexical evolution

Execution-Grounded Evolution

# Continuous evolution with real code validation
python infinite_research_loop.py

This starts the infinite loop:

Takes the best prompt from population
Generates a real Python project via LLM
Validates by compiling, testing, and linting
Updates scores with execution results
Auto-commits improvements to git

Ablation Experiments

# Run the full experiment grid (4 ablations × 3 benchmarks)
python run_experiment.py

# Quick smoke test (10 cycles per condition)
python run_experiment.py --quick

# Preview what will run
python run_experiment.py --list

# Resume incomplete runs
python run_experiment.py --resume

Results are logged to experiments/run_log.jsonl and per-condition files in experiments/ablation_runs/.

Visualize Convergence

# Plot from main experiment log
python analysis/plot_convergence.py

# Plot from ablation runs with rolling average
python analysis/plot_convergence.py --ablation --rolling=5

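Charts are saved to analysis/charts/. Plotting requires matplotlib, which is not part of the pip install line in Setup, so install it separately (pip install matplotlib).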

Manual Evolution

# Edit the seed prompt
$EDITOR prompt.txt

# Score it
python eval.py

# Keep or revert
git add prompt.txt && git commit -m "Score improved"
# or
git checkout prompt.txt

Results

Grounded Evolution Results (203 Cycles)

Metric	Value
Cycles	203
Best Execution Score	39.0 / 80
Score Range	17.0 → 39.0
Average Score	30.9
Test Quality	4–5 real assertions/cycle (after prompt fix)
Hidden Tests Passed	0 / 203
Total LLM Tokens	~1,000,000
Mutation Operators	127 uses
Crossover Operators	76 uses
Process Stability	100% (no crashes)

What We Learned

1. Score plateau at 39/80 — Despite 203 cycles, the execution score never exceeded 39. The system converged to a fitness plateau. Prompts evolved to produce projects that pass basic structural checks (syntax, imports, file count) but consistently failed benchmark-specific behavioral tests. This suggests the mutation operators explore prompt text similarity space, not functional correctness space — and these are not the same.

2. Test quality is directly controllable via prompt engineering — The single most impactful change was improving the LLM system prompt from "Generate clean code" to "Generate real tests with assertions, no placeholders". This moved test quality from 0 real assertions to 4–5 per cycle instantly. The generator system prompt is a critical lever.

3. Self-tuning mutation weights work but converge — The weight adjustment system successfully downweighted 5 consistently harmful mutations to 0.1× probability. However, this also reduced exploration diversity — the same few mutations were repeatedly selected, narrowing the search space.

4. Hidden behavioral tests remain unsolved — Across 203 cycles, not a single generated project passed benchmark-specific hidden tests. The generated code produces correct structure (right function signatures, right file layout) but not correct behavior (the functions don't actually work as specified). This is the fundamental open problem.

5. LLM generation is reliable — The Mistral API completed all 203 generations without a single failure. Average generation time was stable at ~60s/cycle.

6. Population converges without diversity preservation — The greedy elitist selection (keep top 50) caused the population to converge to near-identical prompts producing 3-file projects. A diversity-preserving mechanism (e.g., novelty search or fitness sharing) is needed.
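Point 6 names fitness sharing as one possible remedy. The sketch below shows the idea under stated assumptions and is not part of the repo: similarity is measured as Jaccard overlap of whitespace tokens, the sharing radius of 0.5 is arbitrary, and the sample scores are made up.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two prompts, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def shared_fitness(population: list[dict], radius: float = 0.5) -> list[float]:
    """Divide each raw score by its niche count (classic fitness sharing)."""
    shared = []
    for ind in population:
        niche = 0.0
        for other in population:
            distance = 1.0 - jaccard(ind["prompt"], other["prompt"])
            if distance < radius:
                niche += 1.0 - distance / radius  # triangular sharing kernel
        # niche >= 1 because every prompt is at distance 0 from itself
        shared.append(ind["score"] / niche)
    return shared


population = [
    {"prompt": "build a cli app", "score": 30.0},
    {"prompt": "build a cli app", "score": 30.0},
    {"prompt": "write async parser", "score": 25.0},
]
print(shared_fitness(population))
```

The two identical prompts split their score, so the lower-scoring but distinct prompt ranks first after sharing. That is the pressure toward diversity that plain top-50 elitism lacks.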

Open Challenges

Challenge	Impact	Hypothesis
Hidden test failure	Blocks scores above 40	Need behavioral validation mid-generation, not post-hoc
Population convergence	Stagnation after ~50 cycles	Add novelty search or multi-objective optimization
Token cost	~5,000 tokens/cycle	Cache similar prompts, use cheaper models for pre-filtering
Mutation granularity	Most mutations are neutral	Add structured mutations that modify specific prompt sections

Comparison: Why This Repo Exists

The original autoresearch-ai-agent-skeleton pioneered the idea of evolving prompts with genetic algorithms and lexical scoring. It proved that prompts could be optimized algorithmically.

Grounded Evolution extends that work by adding an entirely new dimension — execution-grounded validation. The prompt isn't just scored on what it says; it's scored on what the code it generates actually does.

Aspect	Lexical-Only	Grounded
Prompt scoring	Keyword presence in text	Keyword presence + real execution
Code generation	None	LLM generates projects from prompts
Validation	None	AST parse + pytest + flake8
Evolution loop	Finite generations	Continuous (infinite)
Fitness signal	Text coverage	Text coverage + code quality
Meta-evolution	Injects new keywords	Injects new keywords + auto-commit
Research question	"What keywords make a good prompt?"	"What prompts produce code that actually works?"

Customization

Adding Lexical Signals

# evaluate.py
if "your-keyword" in content:
    score += 2

Or add to SIGNAL_POOLS in auto_evolve.py for automated meta-injection.

Adding Execution Checks

# evaluator/runtime_evaluator.py — extend evaluate_project()
if some_condition:
    score += N
    metrics["my_check"] = result

Tuning Mutation Weights

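# mutate.py, lines 130-133
import random  # required by the excerpt below

# Pick one of the five strategies; the weights sum to 1.0.
strategy = random.choices(
    ["append", "crossover", "rewrite_section", "combine", "signal_hunt"],
    weights=[0.2, 0.2, 0.15, 0.15, 0.3],
)[0]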

Adding Benchmark Tasks

In benchmarks/tasks.json (plain JSON, so the file itself must not contain a comment line):
[
  {
    "name": "my_benchmark",
    "prompt": "Create a ..."
  }
]
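A loader for this file could validate entries before the loop picks one at random. The helper names `load_tasks` and `pick_task` are hypothetical, and treating `name` and `prompt` as the only required keys is an assumption based on the example above.

```python
import json
import random
from pathlib import Path

REQUIRED_KEYS = {"name", "prompt"}  # assumed from the example entry above


def load_tasks(path: str = "benchmarks/tasks.json") -> list[dict]:
    """Read the task list and fail early on entries with missing keys."""
    tasks = json.loads(Path(path).read_text(encoding="utf-8"))
    for index, task in enumerate(tasks):
        missing = REQUIRED_KEYS - set(task)
        if missing:
            raise ValueError(f"task {index} is missing keys: {sorted(missing)}")
    return tasks


def pick_task(tasks: list[dict], rng: random.Random) -> dict:
    """Choose one benchmark task uniformly at random."""
    return rng.choice(tasks)
```

Failing at load time is cheaper than discovering a malformed task after an LLM call has already been paid for.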

Changing the LLM Provider

export LLM_API_KEY='sk-...'
export LLM_MODEL="gpt-4o"
export LLM_BASE_URL="https://api.openai.com/v1"

Or use local models via Ollama:

export LLM_BASE_URL="http://localhost:11434/v1"
export LLM_MODEL="qwen2.5:7b"
export LLM_API_KEY="ollama"  # Ollama ignores the key

Project Structure

grounded_evolution/
├── README.md                       # This file
├── CHANGELOG.md                    # Release history
├── CONTRIBUTING.md                 # Contribution guide
├── SECURITY.md                     # Security policy
├── LICENSE                         # MIT license
├── pyproject.toml                  # Project metadata
├── program.md                      # Agent instructions
├── prompt.txt                      # Seed prompt
│
├── evaluate.py                     # Lexical scoring (400+ signals)
├── eval.py                         # Quick eval (30 signals)
├── mutate.py                       # Genetic mutation (5 strategies)
├── reflect.py                      # Generation analysis
├── auto_evolve.py                  # Meta-evolution (10 pools)
├── evolve_forever.py               # Aggressive meta-evolution (40+ pools)
│
├── generator.py                    # LLM code generation
├── mutation_engine.py              # Prompt mutation operators
├── population_manager.py           # Population persistence
├── infinite_research_loop.py       # Continuous grounded evolution
├── beautify_readme.py              # README status updater
├── run_evolution.sh                # Bash automation
│
├── evaluator/
│   └── runtime_evaluator.py        # Execution validation
│
├── benchmarks/
│   └── tasks.json                  # Benchmark definitions
│
├── population/                     # 150 evolved prompts
├── generated_projects/             # Generated code outputs
├── memory/                         # Evolution state
├── reports/                        # Generated reports
├── runtime_logs/                   # Execution logs
├── reflection.md                   # Full evolution history
│
├── docs/
│   ├── COMPARISON.md               # Lexical vs Grounded
│   └── ARCHITECTURE.md             # Architecture docs
│
└── .github/
    ├── workflows/                  # CI pipeline
    └── ISSUE_TEMPLATE/             # Issue templates

Research Context

Grounded Evolution is framed within evolutionary software optimization research:

Evaluator-grounded prompt evolution: fitness functions grounded in both lexical coverage and execution-based validation
Autonomous experimentation infrastructure: continuous, unattended evolution cycles with meta-level adaptation
Recursive benchmark optimization: the lexical evaluator evolves alongside the prompts, so prompts that saturate the current scoring ceiling face new signals
Execution-grounded fitness: the core innovation; prompts are scored not just on what they say, but on what the code they generate actually does

This is not:

A claim of AGI or sentience
A self-conscious or self-aware system
Runaway recursive self-improvement

It is a well-scoped experimental system for studying how genetic algorithms can optimize prompts for code generation quality — with real execution validation.

License

MIT — see LICENSE.

Credits

Inspired by Andrej Karpathy's autoresearch. The original lexical evolution framework was developed as autoresearch-ai-agent-skeleton. Grounded Evolution adds execution-grounded validation and continuous autonomous experimentation.
