Agentic-first Programming Language and Compiler Infra for Recursive Self-Improvement
Specification · Architecture · Benchmarks · Agent Protocol · Roadmap · Examples

agentic-eval's four agentic axes (0–1) + composite, across the profiled
languages. MAGE ranks #1 of implemented languages; only the unreachable
ideal design-target ceiling scores higher.
| Language | token | determinism | reliability | safety | fitness | SWE-wt¹ |
|---|---|---|---|---|---|---|
| MAGE | 0.80 | 0.97 | 0.94 | 0.95 | 0.915 | 0.930 |
| Rust | 0.55 | 0.90 | 0.95 | 0.80 | 0.800 | 0.845 |
| Go | 0.60 | 0.85 | 0.70 | 0.55 | 0.675 | 0.700 |
| Java | 0.40 | 0.75 | 0.70 | 0.60 | 0.612 | 0.650 |
| TypeScript | 0.65 | 0.55 | 0.70 | 0.40 | 0.575 | 0.588 |
| Python | 0.85 | 0.45 | 0.45 | 0.35 | 0.525 | 0.490 |
| ideal (ceiling) | 0.85 | 0.97 | 0.95 | 0.96 | 0.932 | 0.943 |

¹ SWE-weighted = reliability·0.35 + determinism·0.30 + safety·0.20 + token·0.15. The four axes are agentic-eval's curated, bias-audited judgments (`--example swe_languages`).
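The footnote's weighting can be checked by hand. The sketch below recomputes the SWE-weighted column from the four axis scores in the table. The `fitness` helper is an inference, not a stated formula: the table's fitness values are consistent with an unweighted mean of the four axes, so that is what it assumes. All function and variable names are illustrative.

```python
# Weights copied from the footnote; axis scores copied from the scorecard table.
SWE_WEIGHTS = {"reliability": 0.35, "determinism": 0.30, "safety": 0.20, "token": 0.15}


def swe_weighted(scores: dict) -> float:
    """Weighted sum of the four axes using the footnote's weights."""
    return sum(SWE_WEIGHTS[axis] * scores[axis] for axis in SWE_WEIGHTS)


def fitness(scores: dict) -> float:
    """Unweighted mean of the four axes (assumed; inferred from the table values)."""
    return sum(scores[axis] for axis in SWE_WEIGHTS) / len(SWE_WEIGHTS)


mage = {"token": 0.80, "determinism": 0.97, "reliability": 0.94, "safety": 0.95}
rust = {"token": 0.55, "determinism": 0.90, "reliability": 0.95, "safety": 0.80}

for name, scores in (("MAGE", mage), ("Rust", rust)):
    print(f"{name}: SWE-wt={swe_weighted(scores):.3f} fitness={fitness(scores):.3f}")
```

For the MAGE and Rust rows this reproduces the table's 0.930 / 0.915 and 0.845 / 0.800.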
Measured anchor (compile+run, real BPE, not curated): MAGE executes
5/5 of the cross-language task suite and is the tersest runnable language
(173 cl100k tokens vs Rust 275, Java 297); eval_bench computes 73/73
programs to exact results. Full measured tables in Benchmarks.
MAGE is an agentic-first language: the same logic spans human-readable
prose, agent-dense sigils, a declarative neural-net DSL, and the byte-level
binary IR they lower to. Each is a view of the one artifact, and each is measured. The
prototype lexes, type-checks, and executes general `.mg` programs, and lowers
neural networks to a compact binary IR (Agentic Binary Language) run on a
CPU/CUDA backend.
1 · Human-first: a typed surface that reads like a modern language (verified `--check`; 61 cl100k tokens):

    pub fn sum_even_squares(xs: [i32]~) -> i32 {
        var total: i32 = 0;
        for x in xs {
            if x % 2 == 0 {
                total = total + x * x;
            }
        }
        total
    }

2 · Agentic-first: the same program in sigils (`+f` = `pub fn`) plus the standard vocabulary (`map`/`filter`/`fold`, each a single BPE token). −25 % real tokens, identical result (verified: `--check` + executes → 56; 46 cl100k tokens):

    +f sum_even_squares(xs) = fold(map(filter(xs, fn(x) => x % 2 == 0), fn(x) => x * x), 0, fn(a, b) => a + b)

3 · High-level declarative form: neural networks declared, not hand-wired (still source; it lowers to the binary IR in form 4):

    net MLP {
        layer fc1: Linear(8, 16);
        layer act1: ReLU;
        layer fc2: Linear(16, 4);
        layer act2: Sigmoid;
        forward { fc1 }
    }

4 · Binary IR: what an agent actually ships. The net above is lowered to the Agentic Binary Language container: 92 bytes, 71.9 % smaller than its text, and it round-trips back to source (measured):

    ABL1 02 00 01 00 … 4d 4c 50 3f …   ← "ABL1" magic + the MLP module
    327 B .mg text → 92 B binary → decompiles to the exact net above

An agent writes intent in form 2 (fewest tokens), the compiler verifies it against form 1's types, and ships form 4 (fewest bytes); no human-first language gives you all of that in one artifact. Every figure above is reproduced by the commands in Benchmarks.
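To make the "identical result" claim for forms 1 and 2 concrete, here is a minimal Python analogue of both shapes: an explicit loop and a filter/map/fold pipeline. It is a sketch of the equivalence, not MAGE code. The input list is an assumption: the README reports the result 56 but not the input, and `[1..6]` is one input that yields it.

```python
from functools import reduce


def sum_even_squares_loop(xs):
    """Form 1 analogue: explicit loop with a mutable accumulator."""
    total = 0
    for x in xs:
        if x % 2 == 0:
            total = total + x * x
    return total


def sum_even_squares_fold(xs):
    """Form 2 analogue: filter, then map, then fold with an initial value of 0."""
    evens = filter(lambda x: x % 2 == 0, xs)
    squares = map(lambda x: x * x, evens)
    return reduce(lambda a, b: a + b, squares, 0)


xs = [1, 2, 3, 4, 5, 6]  # hypothetical input: 2*2 + 4*4 + 6*6 = 56
print(sum_even_squares_loop(xs), sum_even_squares_fold(xs))  # 56 56
```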
The net DSL is more than a flat layer list. A handful of orthogonal composition
operators combine reusable blocks, so an agent expresses a deep architecture
in a few tokens instead of hand-wiring every layer:

    // A reusable block: published once to a shared, content-addressed registry
    // (forge publish), then referenced by name; its body lives off-context.
    block TransformerBlock(d, h, ff) {
        wrap LayerNorm {                                        // pre/post sandwich
            residual { layer attn: MultiHeadAttention(d, h); }  // x + f(x)
            residual { layer ff1: Linear(d, ff); layer act: GELU; layer ff2: Linear(ff, d); }
        }
    }
    net GPT {
        layer embed: Embedding(50000, 256);
        stack 12 { TransformerBlock(256, 8, 1024) }             // repeat: O(1) in depth
        forward { embed }
    }

- Operators: `stack N { … }` (repeat), `residual { … }` (x + f(x)), `branch { … } { … }` (parallel paths), `wrap Op { … }` (Op >> body >> Op) lower to the RMIL primitives `REPEAT`/`RES_ADD`/`PAR` and execute on the CPU backend.
- Registry handles: `forge publish` stores a block under the SHA-256 of its source (deduplicated); any project references it by name while `forge check`/`build` resolve the definition off-context.
- Typed-composition gate: a shape-mismatched composition (e.g. a `residual` whose body changes dimension) is rejected at `--check` with an actionable diagnostic, before any compute runs.
- O(1) artifact: `stack 12` ships as one block + a count (a `REPEAT` fold), so the binary is flat in depth.
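The operator semantics listed above can be sketched as plain function combinators. This is an illustrative Python model, not the RMIL lowering: layers are functions over lists of floats, and the dimension check runs when the layer is applied, whereas MAGE rejects the mismatch statically at `--check`. `branch` is omitted because the README does not say how the parallel paths are merged. All names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def stack(n: int, block: Layer) -> Layer:
    """stack N { block }: apply the same block N times in sequence."""
    def run(x: Vector) -> Vector:
        for _ in range(n):
            x = block(x)
        return x
    return run


def residual(body: Layer) -> Layer:
    """residual { body }: compute x + f(x); reject a body that changes dimension."""
    def run(x: Vector) -> Vector:
        y = body(x)
        if len(y) != len(x):
            raise ValueError(f"residual body changes dimension: {len(x)} -> {len(y)}")
        return [a + b for a, b in zip(x, y)]
    return run


def wrap(op: Layer, body: Layer) -> Layer:
    """wrap Op { body }: Op >> body >> Op, i.e. op, then body, then op again."""
    return lambda x: op(body(op(x)))


# Hypothetical stand-in layers, chosen only to keep the arithmetic easy to follow.
double: Layer = lambda x: [2.0 * v for v in x]
negate: Layer = lambda x: [-v for v in x]

print(residual(double)([1.0, 2.0]))       # [3.0, 6.0]
print(stack(3, residual(double))([1.0]))  # [27.0]
print(wrap(negate, double)([1.0]))        # [2.0]
```

Note that `stack` holds one block and a count rather than N copies, which mirrors the "one block + a count" shape the last bullet describes.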
One reproducible command threads the whole story, `benchmarks/capstone/run.sh`:
`forge publish` → ~41-token GPT → `forge check` (resolve + gate) → `forge build`
(REPEAT-folded binary, ~1.1× for 12 blocks) → the full GPT runs
(`dispatched=97`, `unsupported=[]`).
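The registry step in that pipeline relies on content addressing: a block is stored under the SHA-256 of its source and referenced by name. A minimal in-memory sketch of that idea follows. The class and method names are hypothetical and say nothing about how `forge` stores blocks on disk.

```python
import hashlib


class BlockRegistry:
    """Toy content-addressed store: source text keyed by its SHA-256 digest."""

    def __init__(self):
        self._by_hash = {}  # digest -> source
        self._names = {}    # name handle -> digest

    def publish(self, name: str, source: str) -> str:
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        self._by_hash.setdefault(digest, source)  # identical source is stored once
        self._names[name] = digest
        return digest

    def resolve(self, name: str) -> str:
        """Look a block up by name and return its source."""
        return self._by_hash[self._names[name]]

    def stored_count(self) -> int:
        return len(self._by_hash)


registry = BlockRegistry()
src = "block Id(d) { layer fc: Linear(d, d); }"  # hypothetical block source
first = registry.publish("Id", src)
second = registry.publish("IdAlias", src)        # same bytes, different name
print(first == second, registry.stored_count())  # True 1
```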
Every number below is produced by actually compiling and running code and
comparing real output, or by counting real cl100k BPE tokens of the exact
executed files, not by curated judgment. Reproduce with
`benchmarks/cross_lang/run.sh` and the
agentic-eval `tokens_of` / `swe_executability` examples.

Cross-language executability + terseness: the same 5 tasks
(fact · sumto · fib · distinct · collatz, known integer outputs) written idiomatically
in each language, compiled+run on the host toolchain, stdout compared to the
expected value (measured 2026-06-11):

| Language | Executes (5 tasks) | Real cl100k tokens | Source bytes |
|---|---|---|---|
| MAGE | 5 / 5 | 173 | 401 |
| JavaScript | 5 / 5 | 199 | 513 |
| TypeScript | 5 / 5 | 220 | 593 |
| Go | 5 / 5 | 271 | 727 |
| Rust | 5 / 5 | 275 | 769 |
| Java | 5 / 5 | 297 | 1033 |
| Python | runtime absent on host; excluded, not estimated | – | – |

- Executability is a gate the agentic edit→build→test→debug loop depends on (`test` must run the program and check output). Every runnable language clears it, including MAGE, via its tree-walking evaluator (`mage-parse --eval`). This records a threshold crossed, not a lead on a graded axis.
- On this identical task set MAGE is the tersest by real tokens (173, 1.00×) and by bytes (401). A second real-BPE set (`swe_token_benchmark`, 6 languages × 3 different tasks) agrees: MAGE 85 cl100k vs Python 89, Go 93, Java 98, TS 102, Rust 113.
- MAGE surface coverage: its `eval_bench` correctness harness asserts 73 / 73 general-purpose programs each compute an exact result, exercising every reachable expression/statement form, all pattern kinds (tuple/slice/struct/option), and the standard vocabulary over lists/strings/maps. Reproduce: `cargo test --release eval_bench -- --ignored` (in `prototype/`).
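The executability gate described in the first bullet amounts to "run the program, compare stdout to a known value". A minimal sketch of such a check is below. It is not the project's `benchmarks/cross_lang/run.sh` harness. The stand-in task runs the current Python interpreter so the example is self-contained; a real harness would invoke each language's toolchain instead.

```python
import subprocess
import sys


def executes_correctly(command, expected_stdout, timeout_s=10.0):
    """Return True only if the command exits 0 and prints the expected value."""
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return False  # missing toolchain or a hang both count as "does not execute"
    return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()


# Hypothetical stand-in for a "fact" task with a known integer output.
fact_task = [sys.executable, "-c", "import math; print(math.factorial(5))"]
print(executes_correctly(fact_task, "120"))  # True
```

The result is a pass/fail flag per task, which matches the README's framing of executability as a threshold rather than a graded score.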
Agentic-first toolchain: measured improvement. The same lens applied to the
Forge project toolchain (`forge`): a human-text baseline vs. the agentic-first
surface (self-describing manifest, `--json`, effect classes). Every figure is
measured (forge runs + real BPE + node `JSON.parse` + 5× sha256; reproduce:
`agentic-eval --example swe_forge_agentic`):

| Axis (same toolchain, 8 commands) | text-only baseline | agentic-first |
|---|---|---|
| Result machine-parseable (reliability) | 0.00 | 1.00 (8/8 structured JSON) |
| Effect-gated before exec (safety) | 0.00 | 1.00 (8/8 effect-classed) |
| Output reproducible (determinism) | – | 1.00 (5/5 byte-identical) |
| Discovery cost (real cl100k tokens) | 547 (prose) | 232 (forge manifest) |

Self-describing + machine-readable output lifts reliability and safety 0 → 1.00, keeps output deterministic, and makes discovery 2.36× cheaper in real tokens and parseable. The one measured cost is +3 tokens (12 %) per structured result (reported, not hidden).
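The two checks behind the reliability and determinism rows (does the output parse as JSON, and are repeated runs byte-identical under SHA-256) are simple to express. The sketch below uses Python's `json` and `hashlib` rather than the node `JSON.parse` step the README names, and a hypothetical `fake_manifest` producer in place of real `forge` output.

```python
import hashlib
import json


def is_machine_parseable(output: str) -> bool:
    """Reliability check: the result must parse as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def is_byte_identical(produce, runs: int = 5) -> bool:
    """Determinism check: hash the output of several runs and expect one digest."""
    digests = {hashlib.sha256(produce()).hexdigest() for _ in range(runs)}
    return len(digests) == 1


def fake_manifest() -> bytes:
    # Hypothetical payload; sort_keys keeps the serialization stable across runs.
    payload = {"commands": ["check", "build", "run"], "effects": {"run": "exec"}}
    return json.dumps(payload, sort_keys=True).encode("utf-8")


print(is_machine_parseable(fake_manifest().decode("utf-8")))  # True
print(is_byte_identical(fake_manifest))                       # True
print(round(547 / 232, 2))                                    # 2.36, the discovery-cost ratio
```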
Neural architecture DSL vs. PyTorch. The same architecture declared in
MAGE's net DSL vs. an equivalent PyTorch nn.Module (MAGE declares the
layers; PyTorch must also spell out the imperative forward). Token counts real
cl100k BPE; binary sizes measured live (reproduce:
benchmarks/constructs/run.sh):

| Architecture | MAGE (tokens) | PyTorch (tokens) | fewer tokens | MAGE text → binary IR |
|---|---|---|---|---|
| MLP | 50 | 78 | 36 % | 139 B → 92 B (−34 %) |
| Transformer | 73 | 142 | 49 % | 235 B → 137 B (−42 %) |

On these two architectures the saving grows with complexity (the more
forward-wiring the DSL subsumes, the bigger the win), and the declaration then
lowers to a binary IR a further ~34–42 % under its own text. The full pipeline
(registry block → `--check` shape-gate → REPEAT-folded binary → execution) runs in
`benchmarks/capstone/run.sh` (a 12-deep GPT in ~41
tokens, binary 1.09× for 12 blocks vs. 1, `dispatched=97 unsupported=[]`).
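The percentages in the DSL-vs-PyTorch table follow directly from the raw counts. A short worked check, using only numbers from that table (the helper name is illustrative):

```python
def percent_saved(baseline: float, reduced: float) -> float:
    """Relative reduction of `reduced` against `baseline`, in percent."""
    return 100.0 * (baseline - reduced) / baseline


# (PyTorch tokens, MAGE tokens, MAGE text bytes, MAGE binary bytes) from the table
rows = {
    "MLP": (78, 50, 139, 92),
    "Transformer": (142, 73, 235, 137),
}

for name, (pt_tokens, mage_tokens, text_bytes, binary_bytes) in rows.items():
    token_saving = round(percent_saved(pt_tokens, mage_tokens))
    byte_saving = round(percent_saved(text_bytes, binary_bytes))
    print(f"{name}: {token_saving} % fewer tokens, binary {byte_saving} % under text")
```

This prints 36 % / 34 % for the MLP and 49 % / 42 % for the Transformer, matching the table.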
The full agentic-SWE scorecard (agentic-eval's four 0–1 axes + composite
across all profiled languages) is the table at the top of this
README; falsifiable guards hold there: token (0.80) ≤
Python (0.85), reliability (0.94) ≤ Rust (0.95), no axis ≥ 0.98 (it is a
prototype). Reproduce with `agentic-eval --example swe_languages`.
Honesty. Two kinds of number appear in this README, kept distinct. Measured (compile+run, real BPE, sha256, JSON-parse): the executability/terseness, eval-bench, and agentic-toolchain tables. Curated (`agentic-eval`'s 0–1 language axes): the four-axis scorecard at the top, which records encoded judgments, bias-audited (scores were corrected down on evidence; this is the project's own language). Executability is a gate, not a parity claim: the runtime is a young tree-walker (no JIT; `await` is run-to-completion) on curated tasks, not an application corpus.

Honest framing: MAGE's value for agents lives in two places: (1) a structurally reliable, executable text surface (LL(1) grammar, tracked effects, machine-readable diagnostics, a self-describing ontology, and an evaluator that runs it) and (2) the Agentic Binary Language binary IR, where a full neural-network module fits in ~300 bytes (~83 % smaller than the equivalent text).

The text surface itself is roughly byte-tied with idiomatic Rust on the 100-task benchmark corpus, not the "~50 % reduction" earlier versions of this README claimed. See `benchmarks/FINDINGS.md` for measurement and `AGENT_PROTOCOL.md` for how agents should target the IR directly.

The capabilities below are marked ✅ working in the prototype today or 🎯 design goal (specified/partially built, not yet in the prototype). See ROADMAP.md for status.

- ✅ Executes End to End: A tree-walking evaluator (`mage-parse --eval`) runs general-purpose programs across the full surface: every expression/statement form, all pattern kinds, and the standard vocabulary over lists/strings/maps. The `eval_bench` suite computes 73 / 73 programs to exact results.
- ✅ Zero-Ambiguity Syntax: Deterministic LL(1) grammar eliminates parsing failures for both humans and AI agents. No backtracking, no ambiguity.
- ✅ Binary IR for Agents (Agentic Binary Language): A transformer block encodes to 47 bytes of Agentic Binary Language, a 5-item module to ~300 bytes (vs ~1.8 KB of text). Agents target the IR directly via `--target=abl-bytes`; the text surface is a human-readable view via the round-trip decompiler.
- ✅ Neural architecture DSL + composition algebra: declarative `net`s composed from a few orthogonal operators (`stack`/`residual`/`branch`/`wrap`) over reusable `block`s, shared across projects via a content-addressed registry (`forge publish` + name handles). Shape-mismatched compositions are rejected at `--check`; repeated depth folds to an `O(1)` binary (`REPEAT`); and the operators execute on the CPU backend. See Composing architectures.
- ✅ Sigil-Based Text Surface: Canonical forms (`+f` = pub fn, `v`/`val` = immutable binding, `m`/`var` = mutable binding, `?` = match, `@` = for) keep the human view compact. (`let` is not a keyword: bindings are always `val`/`var`, and the compiler rejects a stray `let` with a fix hint.) On the benchmark corpus the text is ~tied with idiomatic Rust on raw bytes (declaration-heavy code wins 4–14 %, expression-heavy code loses 8–15 %). The structural reliability matters more than the byte delta.
- ✅ Algebraic Effects: A tracked effect system (`/ io`, `/ net`, `/ io + net`) makes side effects explicit in function signatures, enabling composition without monadic boilerplate.
- 🎯 Formal Contracts: Built-in `@req`, `@ens`, and `@inv` annotations enable spec-first development. The compiler verifies contracts and uses them for synthesis.
- 🎯 Safety Knowledge Base: 9,157 safety rules across ownership, borrowing, lifetimes, type safety, concurrency, and FFI, queryable at compile time via SKB-QL, removing surface-syntax noise (no lifetime annotations in source).
- 🎯 Cost Oracle: Every construct exposes predicted cost (cycles, memory, latency, energy) per target architecture before code generation. Agents make informed optimization decisions.
- 🎯 Self-Healing Compiler: Errors produce ranked repair candidates with confidence scores. The compiler proposes fixes, applies them, and re-checks automatically.
- 🎯 Swarm-Native: First-class multi-agent coordination primitives: leases, consensus protocols, capability-based sandboxing, CRDT-based merging, and a message bus.
- 🎯 Hot Reload: Function-level live patching with <1 ms swap time. Rollback on regression, versioned function slots, zero-downtime iteration.
- 🎯 Hardware-Agnostic Compilation: MLIR-native dialect with lowering passes for LLVM, SPIR-V, WASM, and RISC-V. Autotuning selects optimal strategies per target.
- ✅ Built-in AI Framework (RecursiveMachineIntelligence): The `RecursiveMachineIntelligence/rmi` crate ships inside the project: Agentic Binary Language binary neurosymbolic IR, compute backends (CPU + CUDA via IronAccelerator: tensor-core F16/BF16, calibrated INT8/INT4 quantization), a self-describing ontology with a token-compact `manifest()`/`describe()` front door, machine-parseable error diagnostics, and effect-mapped safety. The compiler's `--target=abl-*` modes lower straight onto it.
- ✅ Complete Ontologies, End to End: Every layer self-describes for agents: the language/compiler (`mage-parse --emit-ontology`, `--manifest`), the framework (`rmi::core::manifest`, `FrameworkOntology`), and the CLI (effect-classed mode index). Output is deterministic everywhere, so agents can cache, diff, and gate without prose docs.

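The effect annotations in the list above (`/ io`, `/ net`, `/ io + net`) can be pictured as sets attached to functions: composing two functions unions their effects, and a caller can refuse to run anything whose effects exceed what it allows. The sketch below models only that tracking-and-gating idea at run time. It is not an algebraic-effects handler system and does not reflect how the MAGE checker is implemented. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Effectful:
    """A callable paired with the effects it declares, e.g. {"io"} or {"io", "net"}."""
    name: str
    effects: FrozenSet[str]
    run: Callable[[], object]


def compose(name: str, first: Effectful, second: Effectful) -> Effectful:
    """Run `first` then `second`; the composite declares the union of both effect sets."""
    return Effectful(name, first.effects | second.effects,
                     lambda: (first.run(), second.run())[1])


def run_gated(fn: Effectful, allowed: FrozenSet[str]) -> object:
    """Refuse to execute when the function declares an effect the caller did not allow."""
    missing = fn.effects - allowed
    if missing:
        raise PermissionError(f"{fn.name} requires disallowed effects: {sorted(missing)}")
    return fn.run()


read_file = Effectful("read_file", frozenset({"io"}), lambda: "data")
fetch = Effectful("fetch", frozenset({"net"}), lambda: "response")
both = compose("read_then_fetch", read_file, fetch)

print(sorted(both.effects))                        # ['io', 'net']
print(run_gated(both, frozenset({"io", "net"})))   # response
```

Calling `run_gated(both, frozenset({"io"}))` raises `PermissionError` before either function body runs, which is the "gate before exec" property the toolchain table measures.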
MAGE is a prototype. The working entry point today is the
`mage-parse` binary; build it from `prototype/` and run a program:

    # Build the prototype compiler/evaluator
    cargo build --release --manifest-path prototype/Cargo.toml

    # Write and run a program
    echo 'f main(){ sum(map(range(5), fn(x) => x * x)) }' > squares.mg
    ./prototype/target/release/mage-parse --eval squares.mg main   # → 30

    # Parse + typecheck + report on any .mg file
    ./prototype/target/release/mage-parse squares.mg

Or use the Forge project toolchain (`forge/`) for a manifest-driven project:

    cargo build --release --manifest-path forge/Cargo.toml   # builds the `forge` binary
    forge new my-project     # scaffold Forge.toml + src/main.mg
    cd my-project
    forge check              # parse + typecheck the entry point (+ shape gate)
    forge build              # check, then lower through the binary IR
    forge run                # execute `main` → 120
    forge publish block.mg   # publish a block to the shared registry (SHA-256)
    forge block              # list referenceable blocks (local + registry)
    forge manifest           # token-compact, effect-classed command index (--json)

`forge` drives the same `mage-parse` compiler (auto-located, or set
`FORGE_MG`). It is agentic-first: `forge manifest`/`forge describe <cmd>`
self-describe the toolchain and every command takes `--json`. The lower-level
targets below are the compiler's own interface.

Still planned. A `mg` short alias for `forge` and the Rust transpilers are on the roadmap and not yet built.

The prototype compiler in `prototype/` ships a single binary with the
following targets. Most of them came from the MAGE–RMI unification
work; see UNIFICATION.md for the full phase log.

    mage-parse <file.mg>                     # parse + check + report
    mage-parse --target=abl <file>           # lowering summary (sizes, hashes)
    mage-parse --target=abl-bytes <file> [out.abl]
                                             # emit binary Agentic Binary Language container
    mage-parse --from=abl-bytes <file.abl>
                                             # decode bytes → human-readable .mg
    mage-parse --target=abl-compute <file>   # dispatch nets to CpuBackend
    mage-parse --target=abl-train <file>     # SGD/Adam training loop
    mage-parse --target=abl-infer <file>     # load checkpoint, predict
    mage-parse --target=abl-generate <file>  # autoregressive decode
    mage-parse --eval <file.mg> <fn> [int args]
                                             # execute a function in the tree-walking
                                             # evaluator and print its result

The `--eval` evaluator runs general-purpose `.mg` programs end to end
(lex → parse → evaluate). Its correctness harness (`eval_bench`) executes
73 programs to exact results, covering every expression/statement form, all
pattern kinds (tuple/slice/struct/option), and the standard vocabulary over
lists, strings, and maps; see Benchmarks.

    # e.g. given `f fib(n){ if n < 2 { n } else { fib(n-1) + fib(n-2) } }` in fib.mg
    mage-parse --eval fib.mg fib 25   # → 75025

    # Run the token-efficiency benchmark across all 100 corpus tasks
    cargo run --bin token-bench --manifest-path prototype/Cargo.toml
    # → writes benchmarks/TOKEN_REPORT.md, exits non-zero on regressions

| MAGE | Rust equivalent | MAGE | Rust equivalent |
|---|---|---|---|
| `f` | `fn` | `v`/`val` | `let` (immutable) |
| `+f` | `pub fn` | `m`/`var` | `let mut` (mutable) |
| `af` | `async fn` | `?` | `match` |
| `uf` | `unsafe fn` | `?:` | `if` |
| `+S` | `pub struct` | `:?` | `else if` |
| `+E` | `pub enum` | `:` | `else` |
| `+T` | `pub trait` | `@` | `for .. in` |
| `I` | `impl` | `@@` | `loop` |
| `u` | `use` | `@w` | `while` |
| `.` | `::` (path) | `!` | `break` |
| `@d()` | `#[derive()]` | `>>` | `continue` |
| `p""` | `println!()` | `1b` | `true` |
| `[T]~` | `Vec<T>` | `0b` | `false` |
| `{K:V}` | `HashMap<K,V>` | `?T` | `Option<T>` |
| `{K}` | `HashSet<K>` | `R[T,E]` | `Result<T,E>` |
| `/ io` | effect annotation | `@req` | precondition contract |
| `@ens` | postcondition contract | `@inv` | invariant contract |


    prototype/                      The working compiler + evaluator (this is MAGE today):
                                    lexer, LL(1) parser, type inference, tree-walking evaluator,
                                    Agentic Binary Language lowering, and the RAP agent server
                                    (1,184 tests)
    RecursiveMachineIntelligence/   Built-in agentic-first AI framework (`rmi` crate):
                                    Agentic Binary Language binary IR, compute backends
                                    (CPU + CUDA via IronAccelerator, F32→F16/BF16→INT8/4),
                                    self-describing ontology + token-compact manifest
    framework/                      Framewerx, a neurosymbolic layer over `rmi`
    forge/                          Project toolchain + content-addressed block registry
                                    (`forge new/check/build/run/publish/block`, `Forge.toml`)
    stdlib/                         Standard library (`.mg` source)
    skb/                            Safety Knowledge Base (9,157 rules, 6 categories)
    benchmarks/                     Evaluation corpus + cross-language executability harness
    examples/                       Self-contained example projects (`Forge.toml` + `src/main.mg`)
    editors/                        Editor support: tree-sitter grammar, Helix, Neovim
    agent-guide/                    AI-agent integration guide (prompts, RAP methods)
    cookbook/                       Practical recipes (I/O, HTTP, agents, CLI)
    quick-start/                    Install → hello-world → syntax → build/run/test tutorials
    internals/                      Compiler-internals documentation
    migration-guide/                Rust → MAGE migration guide
    community/                      Contributing, governance, issue templates
    training/                       Training data (100 samples, JSONL)

An earlier branch carried a forked-rustc native compiler (`compiler/`, MLIR + LLVM backends). It was a separate, dormant experiment and has been removed; code generation today runs through the Agentic Binary Language IR onto the `rmi` backends. Native text-language codegen is a roadmap item, not a current claim.

Twelve self-contained projects in examples/, each with a
Forge.toml manifest and src/main.mg entry point:
| Project | Focus |
|---|---|
| ✅ hello-world | Bindings, f-strings, vocabulary; `forge run`-able |
| ✅ data-structures | Structs, pattern matching, map/sum/min; `forge run`-able |
| http-client | Async HTTP, effects, JSON |
| cli-tool | File I/O, iterators, arguments |
| agent-swarm | Multi-agent coordination |
| effects-showcase | Effect declarations and handlers |
| autonomous-pipeline | Task decomposition and pipelines |
| swarm-code-review | Scatter/gather consensus review |
| safe-plugin-host | Capability sandbox with auditing |
| live-compiler | Hot reload and self-healing |
| multilang-bindings | FFI bridge (C, Python, WASM) |
| cost-aware-optimizer | Cost-model strategy selection |

The ✅ examples pass `forge check` and `forge run` today (`cd examples/hello-world && forge run`). The rest exercise the full intended surface (async HTTP, FFI, swarm coordination, hot reload), which the prototype evaluator does not execute yet; they are scaffolds for the roadmap, not runnable demos. `forge new` also scaffolds a project that checks and runs out of the box. More `.mg` programs that run today live in `prototype/examples/` (e.g. `agent_rpn.mg`, the `net` examples).

| Document | Description |
|---|---|
| MAGE_SPEC.md | Formal language specification |
| ARCHITECTURE.md | Compiler and system architecture |
| AB_INITIO_DESIGN.md | Ab-initio language design (standard vocabulary §8) |
| AGENT_PROTOCOL.md | How agents target the binary IR directly |
| MEASUREMENTS.md | Measured results and methodology |
| ROADMAP.md | Roadmap and status |
| MAGE_ECOSYSTEM.md | Ecosystem architecture (Forge, RAP, migration) |
| MAGE_PROPOSAL.md | Design philosophy and design principles |
| prototype/docs/BOOK.md | User guide (12 chapters) |
| prototype/docs/INTERNALS.md | Compiler architecture (36 modules) |
| agent-guide/ | AI agent SDK (prompts, patterns, RAP methods) |
| cookbook/ | Practical recipes (I/O, HTTP, agents, concurrency) |
| migration-guide/ | Rust → MAGE migration |
| skb/ | Safety Knowledge Base (9,157 rules, 6 categories) |
| training/ | Training data and evaluation (100 samples) |
| benchmarks/ | 100-task evaluation corpus with metrics |

The prototype builds with a stable Rust toolchain; no LLVM or external build system is required for the core compiler/evaluator:

    # Prerequisites: Rust (stable, edition 2024)
    git clone https://github.com/nervosys/MachineGenetics.git
    cd MachineGenetics

    # Build and test the prototype
    cargo build --release --manifest-path prototype/Cargo.toml
    cargo test --release --manifest-path prototype/Cargo.toml

(The optional CUDA backend is feature-gated and loads the driver at runtime via
`dlopen`; the build succeeds with or without a GPU present.)

Issues and pull requests are welcome; see community/CONTRIBUTING.md and community/GOVERNANCE.md. For compiler internals and module layout, see prototype/docs/INTERNALS.md; for how agents consume the language, see AGENT_PROTOCOL.md.

MAGE is licensed under the Apache License, Version 2.0. See LICENSE for the full text.