Add reproducibility runbooks for five multi-reasoner templates #57

Open

cafzal wants to merge 21 commits into telco-network-recovery-template from worktree-runbooks-and-base-ontologies

Conversation


@cafzal cafzal commented May 5, 2026

Why

These templates already work end-to-end as scripts, but a user starting from scratch with an RAI agent has to reverse-engineer the chain — which skill to load, which question to ask, what to expect back — to recreate the workflow on their own data. This PR adds a one-screen runbook per template that turns each into a ready-to-paste agent recipe: skill + prompt + expected response per step, in chain order, starting from raw demo data. Goal: drop the time-to-first-success for someone starting with an agent + RAI skills against a multi-reasoner workflow.

Templates covered

| Template | Workflow |
| --- | --- |
| energy_grid_planning | build ontology → examine ontology → discover reasoner questions → forecast substation load → find structural bottlenecks → screen DC requests → approve DCs and fund upgrades → read the frontier → persist into ontology |
| supply_chain_resilience | build ontology → examine ontology → discover reasoner questions → map upstream supplier exposure → rank network hubs → classify supplier reliability → solve risk-adjusted flow → quantify disruption scenarios → persist into ontology |
| machine_maintenance | build ontology → examine ontology → discover reasoner questions → diagnose plant operations → find scheduling bottlenecks → classify machine risk → schedule maintenance → stress-test concentration → persist into ontology |
| portfolio_balancing | build ontology → examine ontology → discover reasoner questions → compliance scan → cluster correlated bets → solve mean-variance frontier → read the frontier → stress under crisis → persist into ontology |
| telco_network_recovery (lands with #56) | build ontology → examine ontology → discover reasoner questions → diagnose WEST → flag critical-restore towers → score subscriber blast radius → forecast regional demand → optimize tier selection → interpret the plan → persist into ontology |

Per-step format

Each step is a `### N. <topic>` header followed by two bullets:

  • Prompt: an inline skill invocation the user can copy and paste — /rai-skill <natural-language question>
  • Response: what the agent produces (key numbers, properties written back to the ontology)
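As an illustration, a single step in this format might look like the following (hypothetical step, numbers, and property names, not copied from any shipped runbook):

```markdown
### 4. Forecast substation load

  • Prompt: `/rai-predictive-modeling Which substations will exceed 90% of rated capacity over the next 24 months?`
  • Response: Per-substation load forecasts written back to the ontology (e.g. a projected peak-load property per Substation), plus a short list of the substations that cross the threshold and when.
```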

The chain bookends are:

  • `/rai-build-starter-ontology` (Step 1) builds against the bundled CSVs in `data/`.
  • `/rai-querying` (Step 2) examines the result.
  • `/rai-discovery` (Step 3) scopes sub-questions to reasoner families.
  • The chain stages in between do the work.
  • `/rai-ontology-design` (final step) promotes per-stage enrichments to first-class ontology state and adds new Concepts where a stage produced new entities (SelectedUpgrade, InvestmentPortfolio, SupplyPlan, MaintenancePlan + CrossTrainingRecommendation, FrontierPoint).

Skills used (all public): rai-build-starter-ontology, rai-querying, rai-discovery, rai-rules-authoring, rai-graph-analysis, rai-predictive-modeling + rai-predictive-training, rai-prescriptive-problem-formulation, rai-prescriptive-solver-management, rai-prescriptive-results-interpretation, rai-ontology-design.

Source-of-truth alignment

Numbers cross-referenced against each template's README and main script. Prompts describe the user's question (no solver names, no Σ formulas, no agent-implementation scaffolding). Telco workflow ordering and prompts mirror the summit-demo runbook so the two stay in sync as a single chain recipe.

Test plan

  • Workflow ordering and stage names match each template's main script
  • Concept.property names referenced in Response bullets all exist in the script
  • Headline numbers match each template's README "Expected output" section
  • No internal references (Snowflake account names, internal repo paths, eval CSV filenames)
  • End-to-end script execution against bundled data/ CSVs (all 5 templates):
    • energy_grid_planning ✓ — knee $300M / 5 DCs / 1,500 MW / $264.35M, marginal $995K → $400K/$M, 3 Louvain communities
    • machine_maintenance ✓ — OEE 79.8 / 68.2 / 61.4, 1 Critical (M013) / 1 Elevated (M016) / 28 Standard, OPTIMAL $605,240.61, T006 cross-train $3,200 / 5 weeks
    • supply_chain_resilience ✓ — centrality 0.5016 / 0.3895 / 0.3688 (max-normalized to 1.000 / 0.776 / 0.735 in runbook), 2 components, B017 avoid + B003 watch, baseline OPTIMAL $1,865 / 8 flows / 0 unmet, +88.5% S004-offline / +0.0% watch→avoid
    • portfolio_balancing ✓ — 4 holdings + 2 sectors flagged, 5 reps PFE/GOOGL/JPM/PG/XOM, frontier 32.43→40.28 / 1160→1742 with knee at eps_1, crisis vol gap +28.4% min_risk → +29.8% peak (eps_1, eps_2) → +25.2% eps_5. (One correction applied: frontier is 6 points per scenario, not 7 — runbook updated.)
    • telco_network_recovery ✓ (run with a predictive-enabled venv) — WEST multiplier 0.9998× (-0.0002 raw), other regions +0.45% (SOUTH) → +0.91% (NORTH), 15 critical_restore towers, TWR-0014 weighted_impact 0.0502 / 61 subs, OPTIMAL $4,956,843 / 164 install-weeks / 122 Gbps, 12 GOLD / 2 SILVER / 1 BRONZE, all 15 covered (TWR-0009 BRONZE, TWR-0005 + TWR-0006 SILVER, rest GOLD)

Each runbook is an agent prompt sequence to recreate the template's
multi-reasoner pipeline using the bundled CSVs in ../data/, mapping
each stage to the template's actual concepts, properties, and outputs:

- telco_network_recovery (5-stage: descriptive -> rules -> graph ->
  predictive -> prescriptive; mirrors PR #56's existing structure)
- energy_grid_planning (4-stage: predictive -> graph -> rules ->
  prescriptive with InvestmentLevel scenarios)
- supply_chain_resilience (4-stage: blast-radius -> graph -> rules ->
  min-cost flow + scenarios)
- machine_maintenance (5-stage: querying -> graph -> rules ->
  prescriptive maintenance schedule -> resilience cross-training)

Reproducible against the bundled template CSVs; one-line notes on
swapping to a Snowflake schema for users wiring to their own data.

Apply the dual-audience plan in dev_temp/pr57_runbook_hybrid_plan.md
to all five runbooks. Each runbook now serves both the stakeholder
(narrative + ASCII visualizations) and the practitioner (explicit
skill + prompt to recreate the stage):

- "How to read this runbook" preface explaining the dual purpose
- Step 0 discovery section using rai-discovery to scope sub-questions
  to reasoner families before any chain stage runs
- Skill / Prompt boxed callout immediately under every Stage heading
  (rai-querying, rai-rules-authoring, rai-graph-analysis, rai-prescriptive-*)
- "Adapting this recipe to a new domain" closing section

Adds portfolio_balancing/references/runbook.md (4 stages: rules ->
graph clustering -> bi-objective Markowitz frontier -> crisis-regime
stress test, all aligned to the template's actual 8-stock dataset
and epsilon-rate frontier sweep).
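The epsilon-constraint sweep behind a bi-objective Markowitz frontier can be sketched in plain Python: minimize variance subject to a return floor, then raise the floor step by step to trace the Pareto frontier. This is a two-asset toy with made-up means, volatilities, and correlation, not the template's 8-stock dataset or its solver:

```python
# Epsilon-constraint sketch of a bi-objective Markowitz frontier:
# minimize variance subject to a return floor, raising the floor in
# steps to trace the Pareto frontier. Toy numbers throughout.

mu = (0.08, 0.14)     # expected returns (illustrative)
sigma = (0.12, 0.25)  # volatilities (illustrative)
rho = 0.3             # correlation (illustrative)

def portfolio(w):
    """(expected return, variance) for weight w in asset 0, 1-w in asset 1."""
    r = w * mu[0] + (1 - w) * mu[1]
    v = ((w * sigma[0]) ** 2 + ((1 - w) * sigma[1]) ** 2
         + 2 * w * (1 - w) * rho * sigma[0] * sigma[1])
    return r, v

grid = [i / 1000 for i in range(1001)]  # long-only weight grid

def min_variance_point(return_floor, tol=1e-9):
    """Lowest-variance (variance, return, weight) meeting the return floor."""
    feasible = []
    for w in grid:
        r, v = portfolio(w)
        if r >= return_floor - tol:  # tolerance absorbs float round-off
            feasible.append((v, r, w))
    return min(feasible) if feasible else None

# Anchor at the unconstrained min-variance portfolio, then sweep 5 epsilon
# steps toward the max attainable return: 6 frontier points in total.
w_min = min(grid, key=lambda w: portfolio(w)[1])
anchor_return = portfolio(w_min)[0]
max_return = max(mu)
floors = [anchor_return + k * (max_return - anchor_return) / 5 for k in range(6)]
frontier = [min_variance_point(f) for f in floors]
```

In the runbooks this work is delegated to the prescriptive skills; the sketch only shows the anchor-plus-sweep shape of the frontier the prompts describe.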

Numbers cross-referenced against each template's README and main
script; reflects the templates as shipped (not the larger demos
they were sourced from). Prompts use domain-natural language with
no Concept.property syntax inside the prompt strings.

@cafzal cafzal changed the title Add reproducibility runbooks for four multi-reasoner templates Add hybrid runbooks (narrative + agent recipe) for five multi-reasoner templates May 6, 2026

github-actions Bot commented May 6, 2026

The docs preview for this pull request has been deployed to Vercel!

✅ Preview: https://relationalai-docs-590zl5pex-relationalai.vercel.app/build/templates
🔍 Inspect: https://vercel.com/relationalai/relationalai-docs/9j5GvDmvZ7B84nmwJ9WdfqSGyf75

Replace the prior verbose, mechanical prompts with short natural
questions a user would actually type, modeled on the reasoner-eval
QA catalog. Where an eval QA exists for the same skill+pattern, the
runbook prompt mirrors that question directly.

Also fix the predictive skill names: rai-predictive-modeling +
rai-predictive-training (the public skills are available; the
"no public skill yet" placeholder was stale).

Each prompt is now:
- 1-2 sentences asking what the user wants to know
- Domain-natural language
- Aligned to a real eval-style question

Affects all five runbooks: telco_network_recovery,
energy_grid_planning, supply_chain_resilience,
machine_maintenance, portfolio_balancing.

Replace the two-line `**Skill:** ... · **Prompt:** "..."` callout with
a single-line `> /rai-skill "question"` form that mirrors how a user
actually invokes a skill in chat. Multi-skill stages stay readable
as `> /rai-A + /rai-B "question"`.

One change per stage callout, all five runbooks. No content rewrite.

Each runbook is now ~50 lines: 1-paragraph intro, the TL;DR chain
ASCII, a workflow table (skill + prompt + expected output per step),
and a brief data footer. All per-stage narrative subsections,
"how to read" / "adapting" / "why the chain matters" sections, and
duplicate enrichment diagrams are gone.

Total: 5 files, ~2000 lines removed, ~260 retained.

Bullet format gives each prompt its own line so users can triple-click
to select and copy. Expected output renders as a paragraph under each
bullet. No content changes, just structural.

Each step is now `### N. <topic>` followed by two bullets:
- Prompt: <skill> <question> (in code formatting, no quotes — easy
  to triple-click and copy as a single agent invocation)
- Response: <expected output>

Same content, clearer structure for skim + copy.

- telco: reorder to summit-demo workflow (descriptive -> rules ->
  graph -> predictive -> prescriptive -> interpret); fix WEST
  multiplier 0.993x -> 0.9998x; fix other-region growth range to
  +0.45-0.91%/day; clarify projected_demand_growth is written to
  all 250 towers via region join, not just 15
- energy: drop fabricated 36-month forecast horizon -> 24-month;
  rewrite Stage 3 low-carbon prompt to describe the actual rule
  (per-DC requirement vs zero-emission share) instead of a
  fabricated 25%/100% threshold
- machine_maintenance: drop unsupported "Turbines need on-site
  qualified technician" hard constraint -- script penalizes
  travel cost, doesn't enforce co-location; add parts_cost
  factor to failure cost formula
- portfolio: tighten crisis vol-gap range to actual 25-30%
  (peak +29.8% at eps_1, low +25.2% at eps_5)
- supply_chain: no factual changes (verified clean)
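The telco region-join clarification above follows a common broadcast pattern: a per-region multiplier is computed once, then written to every tower in that region via the join, not only the critical-restore subset. A minimal pure-Python sketch (tower IDs and the 12-tower roster are hypothetical; the multipliers echo the WEST/NORTH/SOUTH figures quoted in this PR):

```python
# Broadcast per-region demand multipliers to every tower via a region join.
# Tower roster is illustrative; multipliers echo the PR's quoted figures.
region_multiplier = {
    "WEST": 0.9998,   # essentially flat post-outage demand
    "NORTH": 1.0091,  # +0.91%/day growth
    "SOUTH": 1.0045,  # +0.45%/day growth
}

towers = [
    {"tower_id": f"TWR-{i:04d}", "region": region}
    for i, region in enumerate(["WEST", "NORTH", "SOUTH"] * 4, start=1)
]

# The join: every tower row picks up its region's multiplier, so the
# projected_demand_growth property lands on all towers, not just 15.
for tower in towers:
    tower["projected_demand_growth"] = region_multiplier[tower["region"]]
```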

Prompts should describe what the user wants, not how the agent
should do it. The skill (with the agent) handles solver choice,
formula construction, and implementation details.

- telco Stage 5: drop the explicit Σ formula
- portfolio Stage 2: 'force the rest to zero' -> 'only invest in those'
- portfolio Stage 3: drop 'anchor / sweep / forced to zero' agent
  scaffolding; show 7 frontier points instead
- supply chain Stage 3: rephrase as 'find the minimum-cost shipping
  plan' (no 'Solve a ... LP'), 'don't ship from avoid suppliers',
  'prefer non-bottleneck sites'
- machine_maintenance Stage 5: drop 'Solve with HiGHS' (mechanical)
- machine_maintenance Stage 4: simplify cost-formula language

@cafzal cafzal changed the title Add hybrid runbooks (narrative + agent recipe) for five multi-reasoner templates Add reproducibility runbooks for five multi-reasoner templates May 6, 2026

Step 1 is now /rai-build-starter-ontology against the bundled CSVs.
Discovery, the chain stages, and interpretation shift to 2..N.
Reflects that users start with the demo data and need the ontology
materialized before any reasoner skill can run.

5-runbook audit against template scripts and READMEs.

Telco: fix concept list to the 9 the script defines (drop
Contract/BillingEvent/etc. that are not in the script, add
RegionMetric and TemporalEdge); sharpen Steps 2,3,4,6,7,8 prompts
and responses; correct TWR-0009 BRONZE->GOLD delta to +5 Gbps
(BRONZE=3, GOLD=8).

Energy: sharpen Stage 4 graph prompt to ask for WCC + Louvain +
centrality (script computes all three); reword "structurally
constrained bottleneck" to clarify DFW is the binding capacity
bottleneck specifically.

Supply chain: fix concept list to actual 7 (drop StockKeepingUnit
/Inventory/BillOfMaterial that aren't Concepts, rename to SKU);
sharpen Step 2 discovery prompt + response to enumerate the 5
chained reasoning steps; name the 6 SUPPLIER-typed upstream nodes
in Step 3.

Machine maintenance: fix concept list (drop TrainingOption — used
as DataFrame, not Concept; add CertificationExpiry); correct
x_assigned binary count from ~250 to 384 (96 qualified pairs x 4
periods).

Portfolio: drop fictitious StockPair Concept (script uses binary
property Stock.covar(Stock,Stock) instead); add Regime to the
Stage 5 Concept callout; enumerate the 6 actual constraint
families.

Step rename: '### 2. Discovery' -> '### 2. Discover reasoner
questions' across all five runbooks for clearer step labelling.

Runbooks live alongside the template script, README, and data/
directory now (was under references/). Updated relative paths
inside each runbook from ../data/ -> data/ and ../<template>.py ->
<template>.py.

Each chain now ends with /rai-ontology-design promoting the
per-stage enrichments into first-class ontology state and adding
new Concepts where a stage produced new entities (SelectedUpgrade,
InvestmentPortfolio, SupplyPlan, MaintenancePlan +
CrossTrainingRecommendation, FrontierPoint). The chain output
persists as queryable ontology rather than stage-local Python
state, which is what enables a downstream analyst to keep working
without re-running the chain.

Step 2 is now /rai-querying showing the concept-relationship diagram
and row counts so a user can confirm the ontology came out the way
they expected before any reasoner skill runs against it. Discovery
shifts to step 3, downstream chain steps shift by 1.

End-to-end run shows the script outputs 6 frontier points per
scenario (min-risk anchor + 5 epsilon sweep points). Max-return
is computed as a separate anchor for setting the rate range but
isn't included in the frontier table. Fix runbook accordingly:

- Step 6 prompt: drop 'Show 7 points', describe the actual sweep
- Step 6 response: '6-point frontier per scenario; 7 solves per
  scenario x 6 scenarios = 42 LOCALLY_SOLVED'
- Step 7 prompt: 'six-point Pareto frontier'
- Chain ASCII: '6-point frontier'
- Closing step response: FrontierPoint count = 36, not 42

All other portfolio numbers verified against the actual run.
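The 42-vs-36 accounting in this fix reduces to simple multiplication; a sketch for clarity (the 3-budgets-by-2-regimes scenario breakdown is assumed from the portfolio template's setup elsewhere in this PR):

```python
# Per scenario: one max-return anchor solve (used only to set the rate
# range, not a frontier point) plus 6 frontier points (min-risk anchor
# + 5 epsilon sweep points).
scenarios = 3 * 2                          # 3 budgets x 2 regimes (assumed)
frontier_points_per_scenario = 1 + 5       # min-risk anchor + epsilon sweep
solves_per_scenario = 1 + frontier_points_per_scenario  # + max-return anchor

total_solves = scenarios * solves_per_scenario               # LOCALLY_SOLVED
total_frontier_rows = scenarios * frontier_points_per_scenario  # FrontierPoint

print(total_solves, total_frontier_rows)  # 42 36
```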

The chain already writes per-stage enrichments back to the ontology
via model.define() in each reasoner stage, so promoting them is
redundant. The real gap is the prescriptive aggregates and post-solve
metadata that currently live only in pandas / stdout.

Per-template, the closing step now adds the specific Concepts that
materialize what the chain doesn't:

- telco: RestorePlan (singleton plan summary) + SelectedUpgrade
  (view-concept over the 15 chosen tower-tier rows)
- energy: InvestmentPortfolio(InvestmentLevel) holding per-budget
  totals + marginal_per_m + knee flag (5 rows)
- supply_chain: RoutingScenario (3 rows: Baseline, S004-offline,
  Watch-Avoid) with status, total_cost, cost_delta_pct, blocked
  businesses
- machine_maintenance: MaintenancePlan, TypeConcentration(machine_type)
  per-type concentration analysis, and CrossTrainingRecommendation
  with ranked candidates
- portfolio: FrontierPoint(Scenario, eps_label) — 36 rows holding
  return, risk, vol_base, vol_crisis, vol_gap_pct, is_knee

Also strips inner backticks from the prompt code span (which were
breaking the outer markdown code rendering) and restores the blank
line before the Data section.

Strip the meta-framing ('the chain already writes X, what's still
only in pandas...') from each /rai-ontology-design prompt. A user
wouldn't talk to an agent that way — they'd just say what they
want added to the ontology. The agent (with the ontology-design
skill loaded) figures out the gap.

Also drop 'support temporal GNN message passing downstream' tail
from the telco build prompt — replaced with the user-facing
reason ('we'll want to forecast region-level trends later').

Strip implementation details a user wouldn't type — those belong
to the agent + loaded skill, not the user's question:

- telco diagnose: drop DAILY_REVENUE_USD column reference
- telco rules: drop 'first derive averages from NetworkPerformance,
  via NetworkEquipment -> EquipmentHealth' join paths
- telco graph: drop 'rank by total PageRank influence' algorithm name
- telco predictive: drop 'GNN', 'TemporalEdge', 'message passing',
  'lag features (prev-day, prev-week, 7-day mean)' feature
  engineering, and 'Mean each region's Dec predictions, convert
  to 1+x multiplier, bind via region' implementation steps
- telco prescriptive: drop 'MIP scoped to options where X.for_tower(Y)
  AND Z.is_critical_restore()' join syntax, decision-variable typing
  ('binary, keyed by tower_id+tier'), and explicit 'sum(...)'
  objective formula
- energy graph: drop 'WCC, Louvain, betweenness/degree/eigenvector'
  algorithm enumeration
- supply chain rules: drop downstream-coupling explanation ('avoid
  hard-blocked downstream', 'watch surcharged') from the rule
  prompt — that's the optimizer's concern
- machine maintenance graph: 'Compute centrality' -> 'Score by how
  central in the qualification network'
- portfolio frontier: drop 'Anchor at min-risk and max-return,
  then sweep 5 epsilon points' agent-implementation; user just
  asks for '6 frontier points per scenario from min-risk through
  high-return'
- portfolio stress: drop 'shrink correlations toward all-ones with
  weight 0.7 on base covariance + 0.3 on outer-product' formula
  — user just says 'pushes correlations 70% of the way toward
  all-ones'

cafzal added 2 commits May 6, 2026 12:31

Subagents audited each prompt against (a) the named skill's
SKILL.md and (b) the template script to verify the agent + skill
+ ontology have enough business signal to land on the script's
behavior, without re-adding mechanics.

- portfolio Step 6: spell out the 3 budgets (500, 1000, 2000) and
  2 regimes (base, crisis), and call out the fully-invested
  constraint so the solver doesn't drop budget equality
- portfolio Step 8: fix a real numerical inversion — alpha=0.7
  means 30% shrinkage toward all-ones, not 70% (the prompt was
  saying the opposite of the script). Also add 're-solve the
  same frontier under crisis covariance' so the agent re-runs
  rather than just re-evaluating risk
- energy Step 7: add 'across all five levels in a single solve'
  + clarify Stage 6 compliance flags are informational, not a
  hard pre-filter (otherwise risked producing a degenerate
  2-DC frontier from filtering to only the compliant pair)
- supply_chain Step 6: ask the rules step to also flag
  HIGH-priority demand as escalated (was in the Response but
  missing from the Prompt)
- machine_maintenance Step 7: replace 'Schedule maintenance for
  all 30 machines' with 'maintained or left exposed' framing,
  and name the 5-jobs-per-period parts/bay cap and per-tech
  hours capacity that the script enforces — a literal 'must
  maintain all 30' read otherwise conflicts with the cap
- telco: no edits; subagent flagged 2 minor risks (TemporalEdge
  comes from inline-derived edges, not bare CSVs; Step 5 only
  names 2 of 4 derived health metrics) but both are
  non-load-bearing for the chain
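The alpha=0.7 inversion fixed above is easy to get backwards: alpha weights the base covariance, so alpha=0.7 keeps 70% of the base matrix and moves only 30% of the way toward the all-ones correlation target. A pure-Python sketch on a 2x2 toy matrix (illustrative volatilities and base correlation, not the template's data):

```python
# Crisis covariance = alpha * base_cov + (1 - alpha) * outer(sigma, sigma).
# outer(sigma, sigma) is the covariance a correlation matrix of all ones
# would produce, so (1 - alpha) is the shrinkage weight toward all-ones.
alpha = 0.7
sigma = [0.12, 0.25]            # per-stock volatilities (made up)
base_cov = [[0.0144, 0.0090],   # diagonal = sigma_i ** 2
            [0.0090, 0.0625]]   # off-diagonal implies correlation 0.3

all_ones_cov = [[sigma[i] * sigma[j] for j in range(2)] for i in range(2)]

crisis_cov = [[alpha * base_cov[i][j] + (1 - alpha) * all_ones_cov[i][j]
               for j in range(2)] for i in range(2)]

# Implied crisis correlation: moved 30% of the way from 0.3 toward 1.0.
crisis_corr = crisis_cov[0][1] / (sigma[0] * sigma[1])
print(round(crisis_corr, 2))  # 0.51
```

Note the diagonal is unchanged (variances shrink toward themselves); only the correlations move, which is why the prompt fix talks purely in correlation terms.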

5 subagents simulated Prompt -> agent (with skill loaded) -> output
for each step and compared to the canonical script.

Sharpens applied (runbook only):
- machine_maintenance Step 5: drop 'betweenness (24.0 raw, 1.0
  normalized)' from response — algorithm name + raw centrality
  numeric leak agent-level mechanics into a user-facing
  description. Now says 'top centrality (normalized to 1.0)'.
- machine_maintenance Step 7: drop the leaking decision-variable
  counts/typing ('120 x_maintain + 120 x_vulnerable + 384
  x_assigned binaries') and the 'failure cost = x_vulnerable x
  predicted_fp x parts_cost x criticality x (1 + 2 x betweenness)'
  formula from the response — those are implementation
  scaffolding. Kept the 5-constraint-family enumeration and the
  Stage 2 deadline handoff which are business-level facts.
- portfolio Step 5: tighten to 'Cluster stocks where absolute
  return correlation is at least 0.3 ... pick representative by
  highest Sharpe ratio and flag the rest as non-representatives'
  — drops the over-mechanical 'derive per-stock volatility and
  pairwise correlation from the covariance property' framing.
- portfolio Step 8: re-route the 'crisis-regime covariance
  derivation' work from /rai-prescriptive-solver-management to
  /rai-pyrel-coding (it's a derived ontology property, not
  solver lifecycle); /rai-prescriptive-results-interpretation
  still does the comparison.

No sharpens (prompts already sufficient):
- telco, energy, supply_chain — all Matches across the chain
  stages; left untouched.

Alignment gaps surfaced (script and runbook diverge — NOT fixed
per the no-script-edit rule, flagged for separate decision):

* All 5 templates: closing /rai-ontology-design step claims an
  ontology Concept that the script never materializes
  (RestorePlan, InvestmentPortfolio, RoutingScenario,
  MaintenancePlan/TypeConcentration/CrossTrainingRecommendation,
  FrontierPoint). The closing step is aspirational — what the
  agent would do AFTER the script's chain runs — but the
  Response's 'Ontology now carries X' phrasing implies the
  script did it.

* telco Step 9 (interpret): response narrates a sensitivity
  outcome ('flexing budget to $6M would promote TWR-0009
  BRONZE->GOLD') the script doesn't actually compute (single
  solve only).

* telco Step 1: response lists a TemporalEdge concept the
  prompt doesn't ask for; an agent following only the prompt
  + rai-build-starter-ontology would not produce it (script
  computes it via pandas elsewhere).

* energy Step 4: prompt invokes /rai-predictive-modeling +
  /rai-predictive-training, script does a CSV lookup with
  gnn.load() stub (no actual training).

* energy Step 5: prompt + skill imply single-algorithm
  centrality (skill explicitly forbids composite), script
  computes a composite-rank of betweenness + degree +
  eigenvector.

* supply_chain Step 5: centrality persisted via pandas
  round-trip rather than the canonical
  graph.Node.X = algorithm() shorthand.

* supply_chain Step 6: the 'avoid' tier is computed Python-side
  as a set intersection rather than as a RAI Relationship.