From 28991376f2ff52ba35ec361dbe5243fc0cf5fee3 Mon Sep 17 00:00:00 2001 From: Alan Jowett Date: Thu, 9 Apr 2026 09:27:18 -0700 Subject: [PATCH 1/5] Add evaluate-prompt-portability template, protocol, and format MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds three new PromptKit components for cross-LLM prompt portability evaluation (addresses #127 Phase 1): - Protocol: prompt-portability-evaluation — 7-phase claim-level consensus analysis methodology (output collection, claim extraction, semantic matching, consensus classification, divergence analysis, scoring, and hardening recommendations) - Format: portability-report — 9-section structured report covering evaluation context, per-model summaries, consensus core, majority claims, divergent claims (singular + contradictory), scorecard, hardening recommendations, and model notes - Template: evaluate-prompt-portability — interactive template that orchestrates fan-out execution across multiple LLM models, collects outputs, decomposes them into atomic semantic claims, performs cross-model consensus analysis, and produces a portability report Key design: comparison is semantic (claim-level), not textual. Two models producing the same assertions in different words score as Consensus. Contradictory claims (mutually exclusive assertions) are the highest-priority signal, traced to specific ambiguous prompt language with concrete rewrite recommendations. Complements the existing lint-prompt template — lint statically first, then evaluate empirically. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- formats/portability-report.md | 195 ++++++++++++++++++ manifest.yaml | 28 +++ .../prompt-portability-evaluation.md | 181 ++++++++++++++++ templates/evaluate-prompt-portability.md | 180 ++++++++++++++++ 4 files changed, 584 insertions(+) create mode 100644 formats/portability-report.md create mode 100644 protocols/reasoning/prompt-portability-evaluation.md create mode 100644 templates/evaluate-prompt-portability.md diff --git a/formats/portability-report.md b/formats/portability-report.md new file mode 100644 index 0000000..8f6da7b --- /dev/null +++ b/formats/portability-report.md @@ -0,0 +1,195 @@ + + + +--- +name: portability-report +type: format +description: > + Output format for prompt portability evaluation reports. Structures + cross-model comparison results as claim-level consensus analysis + with portability scoring and hardening recommendations. +produces: portability-report +--- + +# Format: Portability Report + +The output MUST be a structured portability report documenting how a +prompt performs across multiple LLM models. The unit of analysis is the +**claim** — an atomic assertion each model's output makes. The report +compares claims semantically, not textually. + +Do not omit any section. If a section has no content, state +"None identified." + +## Document Structure + +```markdown +# <Prompt Name> — Portability Report + +## 1. Evaluation Context + +| Field | Value | +|-------|-------| +| **Prompt** | Name or description of the evaluated prompt | +| **Source Template** | PromptKit template used to assemble the prompt (if applicable) | +| **Golden Input** | Description of the test input provided | +| **Models Evaluated** | Comma-separated list of model identifiers | +| **Arbiter Model** | Model used for claim extraction and matching | +| **Evaluation Date** | When the evaluation was performed | +| +## 2.
Portability Summary + +**Overall Score**: <score> / 1.00 — **<rating>** + +| Metric | Value | +|--------|-------| +| **Total Unique Claims** | <n> | +| **Consensus Claims** | <n> (<pct>%) | +| **Majority Claims** | <n> (<pct>%) | +| **Singular Claims** | <n> (<pct>%) | +| **Contradictory Claims** | <n> (<pct>%) | + +<2–4 sentence interpretation of the overall portability posture.> + +## 3. Per-Model Output Summaries + +For each model, provide a brief characterization of its output: + +### Model: <model-id> + +- **Output Length**: <approximate length> +- **Sections Produced**: <list of section headings> +- **Claims Extracted**: <n> +- **Notable Characteristics**: <1–2 sentences on style, depth, or + approach differences> + +## 4. Consensus Core + +Claims that ALL models produced. These represent the semantically stable +output of this prompt. + +### Claim Cluster CC-<NNN>: <short claim title> + +| Field | Value | +|-------|-------| +| **Claim** | <the normalized claim statement> | +| **Type** | finding · recommendation · observation · caveat · classification | +| **Classification** | Consensus | +| **Models** | All (<n>) | + + + +## 5. Majority Claims + +Claims that most (but not all) models produced. + +### Claim Cluster MC-<NNN>: <short claim title> + +| Field | Value | +|-------|-------| +| **Claim** | <the normalized claim statement> | +| **Type** | finding · recommendation · observation · caveat · classification | +| **Classification** | Majority | +| **Models Present** | <model list> | +| **Models Absent** | <model list> | +| **Likely Cause** | <brief hypothesis> | + +## 6. Divergent Claims + +Claims produced by only one model, or where models contradicted each +other. This is the highest-signal section of the report. + +### 6a. Singular Claims + +### Claim Cluster SC-<NNN>: <short claim title> + +| Field | Value | +|-------|-------| +| **Claim** | <the normalized claim statement> | +| **Type** | finding · recommendation · observation · caveat · classification | +| **Classification** | Singular | +| **Source Model** | <model-id> | +| **Divergence Cause** | ambiguous-instruction · underspecified-scope · model-capability-gap · hallucination · depth-variation · format-interpretation | +| **Prompt Region** | <exact quote from the prompt> | +| **Analysis** | <why only this model produced it> | + +### 6b.
Contradictory Claims + +### Claim Cluster XC-<NNN>: <subject of contradiction> + +| Field | Value | +|-------|-------| +| **Subject** | <what the models disagree about> | +| **Classification** | Contradictory | + +| Model | Assertion | +|-------|-----------| +| <model-id> | <assertion> | +| <model-id> | <conflicting assertion> | + +- **Divergence Cause**: <cause classification> +- **Prompt Region**: <exact quote> +- **Analysis**: <why the prompt admits both readings> + +## 7. Portability Scorecard + +### Per-Claim-Type Breakdown + +| Claim Type | Total | Consensus | Majority | Singular | Contradictory | Type Score | +|------------|-------|-----------|----------|----------|----------------|------------| +| finding | | | | | | | +| recommendation | | | | | | | +| classification | | | | | | | +| observation | | | | | | | +| caveat | | | | | | | + +### Per-Model Agreement Rate + +| Model | Claims Produced | In Consensus | In Majority | Singular | In Contradiction | Agreement Rate | +|-------|-----------------|--------------|-------------|----------|------------------|----------------| +| <model-id> | | | | | | | + +## 8. Hardening Recommendations + +Specific prompt rewrites to improve portability. Ordered by impact +(Contradictory fixes first, then Singular with high-weight claim types). + +### HR-<NNN>: <short title> + +- **Target Claim(s)**: <claim cluster IDs> +- **Current Prompt Language**: + > <exact quote from the prompt> +- **Problem**: <why this language causes divergence> +- **Recommended Rewrite**: + > <replacement text> +- **Expected Effect**: <which models change behavior and why> +- **Confidence**: High · Medium · Low + +## 9. Model Notes + +Observations about model-specific behavior that no prompt rewrite can +address. These should be recorded in the template's `model_notes` +frontmatter if the divergence is persistent. + +| Model | Observation | Recommended `model_notes` Entry | +|-------|-------------|---------------------------------| +| <model-id> | <observation> | <suggested entry> | +``` + +## Formatting Rules + +1. **Claim IDs** use prefixes: `CC-` (Consensus), `MC-` (Majority), + `SC-` (Singular), `XC-` (Contradictory), followed by a + zero-padded three-digit number. +2. **Hardening Recommendation IDs** use `HR-` prefix. +3. **Score precision**: Report the portability score to two decimal + places. +4.
**Ordering**: Within each section, order claims by type priority + (`finding` > `recommendation` > `classification` > `observation` > + `caveat`), then by claim cluster number. +5. **Brevity in summaries**: Per-model output summaries should be + concise (3–5 lines each). The full raw outputs are not included in + the report — only claim extractions. +6. **Quote fidelity**: Prompt regions cited in divergence analysis + must be exact quotes, not paraphrases. diff --git a/manifest.yaml b/manifest.yaml index fc37ab6..e096815 100644 --- a/manifest.yaml +++ b/manifest.yaml @@ -631,6 +631,14 @@ protocols: copying from external sources, screens for confidential or internal-only content, and verifies license compliance. + - name: prompt-portability-evaluation + path: protocols/reasoning/prompt-portability-evaluation.md + description: > + Systematic methodology for evaluating prompt portability across + LLM models. Decomposes outputs into atomic claims, performs + cross-model semantic matching, and classifies consensus levels + to identify fragile prompt language. + formats: - name: requirements-doc path: formats/requirements-doc.md @@ -845,6 +853,14 @@ formats: notes, a presentation timeline, and an optional demo plan. All artifacts form a cohesive presentation kit. + - name: portability-report + path: formats/portability-report.md + produces: portability-report + description: > + Output format for prompt portability evaluation reports. Structures + cross-model comparison results as claim-level consensus analysis + with portability scoring and hardening recommendations. + taxonomies: - name: stack-lifetime-hazards path: taxonomies/stack-lifetime-hazards.md @@ -1195,6 +1211,18 @@ templates: protocols: [anti-hallucination, self-verification, prompt-determinism-analysis] format: investigation-report + - name: evaluate-prompt-portability + path: templates/evaluate-prompt-portability.md + description: > + Evaluate a PromptKit-assembled prompt's portability across LLM + models. 
Runs the prompt against multiple models with a golden + input, decomposes outputs into semantic claims, performs + cross-model consensus analysis, and produces a portability + report with hardening recommendations. + persona: specification-analyst + protocols: [anti-hallucination, self-verification, prompt-portability-evaluation] + format: portability-report + standards: - name: extract-rfc-requirements path: templates/extract-rfc-requirements.md diff --git a/protocols/reasoning/prompt-portability-evaluation.md b/protocols/reasoning/prompt-portability-evaluation.md new file mode 100644 index 0000000..2043f93 --- /dev/null +++ b/protocols/reasoning/prompt-portability-evaluation.md @@ -0,0 +1,181 @@ + + + +--- +name: prompt-portability-evaluation +type: reasoning +description: > + Systematic methodology for evaluating prompt portability across LLM + models. Decomposes model outputs into atomic claims, performs + cross-model semantic matching, and classifies consensus levels to + identify fragile prompt language. +applicable_to: + - evaluate-prompt-portability +--- + +# Protocol: Prompt Portability Evaluation + +Apply this protocol when evaluating whether a prompt produces +semantically equivalent outputs across different LLM models. The unit +of comparison is the **claim** — an atomic assertion the output makes — +not the raw text. Two outputs that say the same thing in different words +are equivalent; two outputs with identical structure but different +conclusions are divergent. + +## Phase 1: Output Collection + +1. **Validate inputs.** Confirm the assembled prompt is non-empty and + the golden input is non-empty. If either is missing, stop and report + the error. +2. **Enumerate target models.** List the models to evaluate. If the + user did not provide a list, use the default set: + `claude-sonnet-4.5`, `gpt-4.1`, `claude-haiku-4.5`. +3. **Execute the prompt against each model.** For each model: + - Provide the identical assembled prompt and golden input. 
+ - Use the same system context and tool availability where possible. + - Record the model identifier and the complete raw output. +4. **Launch evaluations in parallel** when the execution environment + supports it (e.g., parallel sub-agents with model overrides). Do NOT + run models sequentially if parallel execution is available. + +## Phase 2: Claim Extraction + +For each model's raw output, extract a set of **atomic claims**. A claim +is a single, self-contained assertion that the output makes. Apply the +same extraction procedure identically to every output. + +1. **Read the output end-to-end.** Identify every discrete assertion, + finding, recommendation, observation, or caveat. +2. **Normalize each claim** into a structured record: + + | Field | Description | + |-------|-------------| + | `claim_id` | Sequential ID within this model's output (e.g., `M1-C001`) | + | `claim_text` | The normalized assertion in declarative form | + | `section` | Which output section contains this claim | + | `type` | `finding` · `recommendation` · `observation` · `caveat` · `classification` | + | `specificity` | `concrete` (cites evidence/location) · `general` (abstract statement) | + +3. **Granularity rule.** Each claim must be atomic — it asserts exactly + one thing. If a sentence contains two assertions ("X is true and Y + is also true"), split into two claims. +4. **Extraction exclusions.** Do NOT extract: + - Boilerplate preambles ("I'll analyze this code for…") + - Section headings or structural markers + - Restatements of the input prompt or golden input + - Meta-commentary about the model's own process + +## Phase 3: Cross-Model Claim Matching + +Compare claim sets across all models to identify semantic equivalences. + +1. **Build a claim universe.** Collect all claims from all models into + a single pool. +2. 
**Pairwise semantic matching.** For each claim from model A, check + every claim from every other model: + - **Match**: The two claims assert the same thing, even if worded + differently. Evidence: they would have the same truth value in all + plausible interpretations of the golden input. + - **Partial match**: The claims overlap but one is more specific or + covers a subset of the other's assertion. + - **No match**: The claims assert different things. + - **Contradiction**: The claims assert mutually exclusive things + about the same subject. +3. **Cluster matched claims.** Group semantically equivalent claims + into **claim clusters**. Each cluster represents one unique + assertion that one or more models made. +4. **Record match confidence.** For each match decision, note the + confidence: `definite` (identical meaning), `likely` (same meaning, + different framing), `uncertain` (possibly the same, possibly + different). + +## Phase 4: Consensus Classification + +Classify each claim cluster by how many models produced it. + +| Classification | Criterion | Interpretation | +|----------------|-----------|----------------| +| **Consensus** | All models produced this claim | Semantically stable — the prompt reliably elicits this assertion | +| **Majority** | ≥50% of models (but not all) produced this claim | Likely valid but not universally elicited — prompt may be ambiguous | +| **Singular** | Exactly one model produced this claim | Possible hallucination, unique insight, or model-specific interpretation | +| **Contradictory** | Two or more models assert mutually exclusive things | The prompt is ambiguous on this point — different models resolve the ambiguity differently | + +Rules: +- A claim cluster with `uncertain` matches should be flagged for manual + review rather than auto-classified. +- Singular claims from high-capability models are not automatically + hallucinations — they may represent deeper analysis that other models + missed. 
Note the model capability tier. +- Contradictory claims are the highest-priority signal. Always trace + these to specific prompt language in Phase 5. + +## Phase 5: Divergence Root Cause Analysis + +For each Singular or Contradictory claim cluster: + +1. **Identify the prompt region** that the claim responds to. Which + instruction, protocol phase, or format requirement produced this + claim? +2. **Analyze the prompt language.** What about the instruction is + ambiguous, underspecified, or open to interpretation? + - Vague quantifiers ("several", "a few") + - Subjective adjectives ("important", "significant") + - Missing scope bounds ("analyze the code" — which code?) + - Implicit assumptions (domain knowledge the prompt assumes) + - Competing instructions (two directives that could conflict) +3. **Classify the divergence cause:** + + | Cause | Description | + |-------|-------------| + | `ambiguous-instruction` | The prompt instruction has multiple valid interpretations | + | `underspecified-scope` | The prompt does not bound what to examine or how deeply | + | `model-capability-gap` | One model lacks the capability to follow the instruction | + | `hallucination` | One model fabricated a claim with no basis in the input | + | `depth-variation` | Models analyzed to different depths — all are correct, but some are more thorough | + | `format-interpretation` | Models interpreted the output format requirements differently | + +## Phase 6: Portability Scoring + +Compute a portability score for the evaluated prompt. + +1. **Claim-level scoring.** For each claim cluster: + - Consensus = 1.0 + - Majority = 0.5 + - Singular = 0.0 + - Contradictory = −1.0 + +2. **Aggregate score.** The portability score is the weighted mean of + all claim cluster scores, weighted by claim type priority: + - `finding` weight = 3 (findings that differ are high-impact) + - `recommendation` weight = 2 + - `classification` weight = 2 + - `observation` weight = 1 + - `caveat` weight = 1 + +3. 
**Interpret the score:** + + | Score Range | Rating | Interpretation | + |-------------|--------|----------------| + | ≥ 0.85 | **Portable** | Prompt produces semantically equivalent output across models | + | 0.60–0.84 | **Mostly Portable** | Core findings are stable; peripheral claims vary | + | 0.35–0.59 | **Fragile** | Significant divergence — prompt hardening needed | + | < 0.35 | **Model-Dependent** | Output varies substantially — prompt needs major revision | + +## Phase 7: Hardening Recommendations + +For each Singular or Contradictory claim cluster, propose a specific +prompt rewrite that would move the claim toward Consensus. + +1. **State the current prompt language** (exact quote). +2. **Explain why it produces divergence** (from Phase 5 analysis). +3. **Propose a rewrite** that eliminates the ambiguity. The rewrite + must be concrete — not "make this more specific" but the actual + replacement text. +4. **Predict the effect** — which models would change behavior and why. + +Rules: +- Rewrites must not change the prompt's intent — only its precision. +- Rewrites must follow PromptKit conventions (imperative mood, numbered + phases, explicit scope bounds). +- If a divergence is caused by `model-capability-gap`, note that no + prompt rewrite can fix it. Instead recommend a `model_notes` entry. diff --git a/templates/evaluate-prompt-portability.md b/templates/evaluate-prompt-portability.md new file mode 100644 index 0000000..5d694a6 --- /dev/null +++ b/templates/evaluate-prompt-portability.md @@ -0,0 +1,180 @@ + + + +--- +name: evaluate-prompt-portability +mode: interactive +description: > + Evaluate a PromptKit-assembled prompt's portability across LLM + models. Runs the prompt against multiple models with a golden + input, decomposes outputs into semantic claims, performs + cross-model consensus analysis, and produces a portability report + with hardening recommendations. 
+persona: specification-analyst +protocols: + - guardrails/anti-hallucination + - guardrails/self-verification + - reasoning/prompt-portability-evaluation +format: portability-report +params: + assembled_prompt: "The complete assembled prompt to evaluate (full text or file path)" + golden_input: "The deterministic test input to provide to each model along with the prompt" + models: "Comma-separated list of model identifiers to evaluate. Default: claude-sonnet-4.5, gpt-4.1, claude-haiku-4.5" + arbiter_model: "Model to use for claim extraction and semantic matching. Default: use the current session model" +input_contract: null +output_contract: + type: portability-report + description: > + A structured portability report documenting claim-level consensus + across models, divergence analysis, portability scoring, and + hardening recommendations. +--- + +# Task: Evaluate Prompt Portability + +You are tasked with evaluating whether a prompt produces semantically +equivalent outputs across different LLM models. You do NOT compare raw +text — you compare the **semantic claims** each model's output makes. + +## Inputs + +**Assembled Prompt**: +{{assembled_prompt}} + +**Golden Input**: +{{golden_input}} + +**Target Models**: {{models}} (if blank, use: `claude-sonnet-4.5`, +`gpt-4.1`, `claude-haiku-4.5`) + +**Arbiter Model**: {{arbiter_model}} (if blank, you are the arbiter) + +## Instructions + +### Step 1: Input Validation + +1. Confirm the assembled prompt is non-empty. If it is a file path, + read the file. If the content is empty, stop and report the error. +2. Confirm the golden input is non-empty. +3. Parse the model list. If any model identifier is not recognized by + the execution environment, warn the user and ask whether to skip + it or substitute. + +### Step 2: Fan-Out Execution + +Execute the assembled prompt against each target model using the +golden input. The goal is to collect raw outputs from every model +for the same prompt + input combination. + +1. 
For each model in the target list, launch a parallel execution: + - Provide the full assembled prompt as the system/task instruction. + - Provide the golden input as the user message or task context. + - Record the complete raw output. +2. **Execution mechanism**: Use the execution environment's parallel + agent or subprocess capabilities. For example, in environments + with sub-agent support, launch one agent per model with the model + override parameter. In environments without parallel execution, + run sequentially. +3. Wait for all executions to complete. If any model fails (timeout, + API error, content filter), record the failure and proceed with + the remaining models. A minimum of 2 successful model outputs is + required to produce a meaningful comparison. + +### Step 3: Claim Extraction + +Apply Phase 2 of the prompt-portability-evaluation protocol to each +model's raw output. + +1. For each model output, extract every atomic claim into the + normalized claim record structure: `claim_id`, `claim_text`, + `section`, `type`, `specificity`. +2. Use the same extraction approach for all outputs — the arbiter + (you, or the designated arbiter model) processes each output + identically. +3. Present the claim counts per model to the user before proceeding. + If any model produced zero claims, flag it as a potential failure. + +### Step 4: Cross-Model Semantic Matching + +Apply Phase 3 of the prompt-portability-evaluation protocol. + +1. Build the claim universe from all model outputs. +2. Perform pairwise semantic matching across all claims. +3. Cluster matched claims and record match confidence. +4. Flag any `uncertain` matches for the user's attention. + +### Step 5: Consensus Classification + +Apply Phase 4 of the prompt-portability-evaluation protocol. + +1. Classify each claim cluster: Consensus, Majority, Singular, or + Contradictory. +2. 
Present the classification summary to the user: + - How many Consensus, Majority, Singular, and Contradictory + clusters were found. + - Highlight any Contradictory clusters immediately — these are the + highest-priority signal. + +### Step 6: Divergence Analysis + +Apply Phase 5 of the prompt-portability-evaluation protocol. + +For each Singular and Contradictory claim cluster: +1. Identify the prompt region that produced the divergence. +2. Classify the divergence cause. +3. Assess whether a prompt rewrite could address it. + +### Step 7: Scoring and Reporting + +Apply Phase 6 and Phase 7 of the prompt-portability-evaluation protocol. + +1. Compute the portability score. +2. Generate hardening recommendations for each fixable divergence. +3. Produce the full portability report in the portability-report + format. + +### Step 8: Interactive Review + +After producing the report: + +1. Present the portability score and the top 3 most impactful + findings to the user. +2. Ask if they want to: + - Drill into specific divergent claims + - Apply the hardening recommendations to the original prompt + - Re-run the evaluation with additional models + - Export the report + +## Complementary Templates + +This template evaluates a prompt's portability *empirically* — by +running it and comparing outputs. For *static* analysis of prompt +language precision, use the `lint-prompt` template with the +`prompt-determinism-analysis` protocol. The recommended workflow is: + +1. **Lint first** (`lint-prompt`) — identify and fix determinism issues + in the prompt language statically. +2. **Evaluate second** (`evaluate-prompt-portability`) — verify the + fixes improved cross-model consistency empirically. + +## Non-Goals + +- This template does NOT measure output *quality* — only cross-model + *consistency*. A prompt that produces consistently wrong output + across all models will score as "Portable." +- This template does NOT modify the evaluated prompt. It produces + recommendations. 
The user applies them. +- This template does NOT benchmark model performance or rank models. + It evaluates the prompt's sensitivity to model choice. + +## Quality Checklist + +Before finalizing the report, verify: + +- [ ] All target models were executed (or failures documented) +- [ ] Claim extraction used the same procedure for all outputs +- [ ] Every claim cluster has a classification with justification +- [ ] Contradictory claims cite the exact prompt language causing divergence +- [ ] Hardening recommendations are concrete rewrites, not vague advice +- [ ] The portability score computation is shown (not just the result) +- [ ] Model Notes section is populated for capability-gap divergences From c32aa5c946fc3b46d393fc7b946477ea64487cd6 Mon Sep 17 00:00:00 2001 From: Alan Jowett Date: Thu, 9 Apr 2026 09:58:18 -0700 Subject: [PATCH 2/5] Address PR review feedback: scoring, thresholds, error handling - Fix Majority threshold from >=50% to >50% to avoid ties with even model counts - Normalize portability score from [-1,1] to [0,1] via (raw_weighted_mean + 1.0) / 2.0 so contradictory claims cannot produce negative scores - Define explicit Manual Review bucket for uncertain claim matches: excluded from scoring, reported in new Uncertain / Needs Review section in the portability-report format - Add fail-stop behavior when <2 models succeed: produce abbreviated report documenting failures instead of misleading partial analysis - Add UR- claim ID prefix to formatting rules Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- formats/portability-report.md | 22 ++++++++++++++-- .../prompt-portability-evaluation.md | 26 +++++++++++++++---- templates/evaluate-prompt-portability.md | 16 ++++++++++-- 3 files changed, 55 insertions(+), 9 deletions(-) diff --git a/formats/portability-report.md b/formats/portability-report.md index 8f6da7b..cbd26d3 100644 --- a/formats/portability-report.md +++ b/formats/portability-report.md @@ -132,6 +132,24 @@ other. 
This is the highest-signal section of the report. - **Prompt Region**: <exact quote> - **Analysis**: <why the prompt admits both readings> +## 6c. Uncertain / Needs Review + +Claim clusters where semantic matching confidence was `uncertain`. +These are excluded from the portability score until a human reviewer +resolves the match and assigns a standard classification. + +### Claim Cluster UR-<NNN>: <short description> + +| Field | Value | +|-------|-------| +| **Claim (Model A)** | <claim text> | +| **Claim (Model B)** | <claim text> | +| **Match Confidence** | uncertain | +| **Reason** | <why the match could not be resolved> | +| **Reviewer Action Needed** | Classify as: Consensus match / Majority match / Distinct claims (Singular) / Contradictory | + +If no uncertain clusters exist, state "None identified." + ## 7. Portability Scorecard ### Per-Claim-Type Breakdown @@ -180,8 +198,8 @@ frontmatter if the divergence is persistent. ## Formatting Rules 1. **Claim IDs** use prefixes: `CC-` (Consensus), `MC-` (Majority), - `SC-` (Singular), `XC-` (Contradictory), followed by a - zero-padded three-digit number. + `SC-` (Singular), `XC-` (Contradictory), `UR-` (Uncertain / Needs + Review), followed by a zero-padded three-digit number. 2. **Hardening Recommendation IDs** use `HR-` prefix. 3. **Score precision**: Report the portability score to two decimal places. diff --git a/protocols/reasoning/prompt-portability-evaluation.md b/protocols/reasoning/prompt-portability-evaluation.md index 2043f93..eacb8bd 100644 --- a/protocols/reasoning/prompt-portability-evaluation.md +++ b/protocols/reasoning/prompt-portability-evaluation.md @@ -96,13 +96,20 @@ Classify each claim cluster by how many models produced it.
| Classification | Criterion | Interpretation | |----------------|-----------|----------------| | **Consensus** | All models produced this claim | Semantically stable — the prompt reliably elicits this assertion | -| **Majority** | ≥50% of models (but not all) produced this claim | Likely valid but not universally elicited — prompt may be ambiguous | +| **Majority** | >50% of models (but not all) produced this claim | Likely valid but not universally elicited — prompt may be ambiguous | | **Singular** | Exactly one model produced this claim | Possible hallucination, unique insight, or model-specific interpretation | | **Contradictory** | Two or more models assert mutually exclusive things | The prompt is ambiguous on this point — different models resolve the ambiguity differently | Rules: -- A claim cluster with `uncertain` matches should be flagged for manual - review rather than auto-classified. +- A claim cluster with any `uncertain` match must be placed in a + **Manual Review** bucket rather than auto-classified as Consensus, + Majority, Singular, or Contradictory. +- Manual Review clusters are **excluded from the portability score** + until a human reviewer resolves the uncertain match and assigns the + cluster to a standard classification. +- Report Manual Review clusters in a separate portability report + section named **Uncertain / Needs Review**. Do not include them under + the standard classification counts until they are resolved. - Singular claims from high-capability models are not automatically hallucinations — they may represent deeper analysis that other models missed. Note the model capability tier. @@ -144,14 +151,23 @@ Compute a portability score for the evaluated prompt. - Singular = 0.0 - Contradictory = −1.0 -2. **Aggregate score.** The portability score is the weighted mean of - all claim cluster scores, weighted by claim type priority: +2. 
**Aggregate score.** First compute the weighted mean of all claim + cluster scores, weighted by claim type priority: - `finding` weight = 3 (findings that differ are high-impact) - `recommendation` weight = 2 - `classification` weight = 2 - `observation` weight = 1 - `caveat` weight = 1 + Then normalize the weighted mean from the `[-1.0, 1.0]` range into + the final portability score in the `[0.0, 1.0]` range using: + + `portability_score = (raw_weighted_mean + 1.0) / 2.0` + + This preserves the stronger penalty for contradictory claims while + ensuring the reported portability score cannot be negative or exceed + `1.0`. + 3. **Interpret the score:** | Score Range | Rating | Interpretation | diff --git a/templates/evaluate-prompt-portability.md b/templates/evaluate-prompt-portability.md index 5d694a6..a7bbce8 100644 --- a/templates/evaluate-prompt-portability.md +++ b/templates/evaluate-prompt-portability.md @@ -77,8 +77,20 @@ for the same prompt + input combination. run sequentially. 3. Wait for all executions to complete. If any model fails (timeout, API error, content filter), record the failure and proceed with - the remaining models. A minimum of 2 successful model outputs is - required to produce a meaningful comparison. + the remaining models. Count the number of successful model + outputs after all executions finish. +4. A minimum of 2 successful model outputs is required to produce a + meaningful comparison. If fewer than 2 models succeed, do **not** + continue to claim extraction, semantic matching, consensus + analysis, portability scoring, or hardening recommendations. + Instead, stop and produce an abbreviated report that includes: + - the full target model list, + - which models succeeded, + - which models failed, + - the failure reason for each failed model if known, and + - a clear statement that the evaluation ended early because there + were insufficient successful outputs for cross-model comparison. +5. 
Only proceed to Step 3 if at least 2 model executions succeeded. ### Step 3: Claim Extraction From 177536e25b2729be2454787fc63055ece29054b9 Mon Sep 17 00:00:00 2001 From: Alan Jowett Date: Thu, 9 Apr 2026 10:04:59 -0700 Subject: [PATCH 3/5] Add reference model sufficiency analysis for model selection Extends the portability evaluation with a new mode: when a reference model is designated, its claims become the ground-truth baseline and each cheaper model is scored on how well it reproduces that baseline. Protocol (Phase 8 - Model Sufficiency Analysis): - Per-model sufficiency rate = reproduced / total baseline claims - Missing claims classified as critical miss vs minor miss - Extra claims classified as valid addition / hallucination / noise - Three-tier sufficiency status: sufficient, conditionally sufficient, insufficient (based on threshold, critical misses, contradictions) - Identifies the minimum sufficient model (cheapest that meets threshold with zero critical misses) Format (Section 10 - Model Sufficiency Matrix): - Reference model + threshold display - Per-model sufficiency table with tier, rates, and status - Missing and extra claim detail tables - Cost-efficiency recommendation Template: - New params: reference_model (optional), sufficiency_threshold (default 90%) - Input validation ensures reference model is in the model list - Step 7 conditionally applies Phase 8 when reference model is set Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- formats/portability-report.md | 46 ++++++++++++++ .../prompt-portability-evaluation.md | 61 +++++++++++++++++++ templates/evaluate-prompt-portability.md | 20 +++++- 3 files changed, 125 insertions(+), 2 deletions(-) diff --git a/formats/portability-report.md b/formats/portability-report.md index cbd26d3..9293917 100644 --- a/formats/portability-report.md +++ b/formats/portability-report.md @@ -193,6 +193,52 @@ frontmatter if the divergence is persistent. 
| Model | Observation | Recommended `model_notes` Entry | |-------|-------------|---------------------------------| | | | | + +## 10. Model Sufficiency Matrix + +*Include this section only when a reference model is designated. +If no reference model was specified, state "No reference model +designated — sufficiency analysis not performed."* + +**Reference Model**: <model identifier> +**Sufficiency Threshold**: <N>% + +### Sufficiency Summary + +| Model | Tier | Reproduced | Missing | Extra | Contradicted | Sufficiency Rate | Critical Misses | Status | +|-------|------|:----------:|:-------:|:-----:|:------------:|:----------------:|:---------------:|--------| +| | reference | — | — | — | — | baseline | — | baseline | +| | | | | | | | | Sufficient · Conditionally Sufficient · Insufficient | + +**Minimum Sufficient Model**: <model identifier> + +### Missing Claim Details + +For each model with missing claims, list what was missed: + +#### <model identifier> + +| Baseline Claim | Type | Impact | Analysis | +|----------------|------|--------|----------| +| | finding · recommendation · observation · caveat | Critical miss · Minor miss | | + +### Extra Claim Details + +For each model with extra claims not in the baseline: + +#### <model identifier> + +| Extra Claim | Type | Classification | Analysis | +|-------------|------|----------------|----------| +| | finding · recommendation · observation · caveat | Valid addition · Hallucination · Noise | | + +### Cost-Efficiency Recommendation + +<1–3 sentences recommending which model to use for this prompt, +balancing sufficiency rate, cost, and speed.
If the minimum sufficient +model matches the reference model, state that no cheaper alternative +produced equivalent output and recommend prompt hardening to enable +cheaper models.> ``` ## Formatting Rules diff --git a/protocols/reasoning/prompt-portability-evaluation.md b/protocols/reasoning/prompt-portability-evaluation.md index eacb8bd..5e99851 100644 --- a/protocols/reasoning/prompt-portability-evaluation.md +++ b/protocols/reasoning/prompt-portability-evaluation.md @@ -195,3 +195,64 @@ Rules: phases, explicit scope bounds). - If a divergence is caused by `model-capability-gap`, note that no prompt rewrite can fix it. Instead recommend a `model_notes` entry. + +## Phase 8: Model Sufficiency Analysis (Reference Model Mode) + +This phase executes only when a **reference model** is designated. Skip +this phase entirely if no reference model is specified. + +When a reference model is designated, its claim set becomes the +**baseline** — the ground truth against which all other models are +measured. This shifts the analysis from "do models agree?" (consensus) +to "does model X reproduce the reference model's output?" (sufficiency). + +1. **Designate the baseline claim set.** The reference model's + extracted claims from Phase 2 become the baseline. Every claim in + the baseline is a **required claim**. + +2. **Per-model sufficiency scoring.** For each non-reference model, + compute: + + | Metric | Definition | + |--------|------------| + | **Reproduced** | Claims from the baseline that this model also produced (matched in Phase 3) | + | **Missing** | Claims from the baseline that this model did NOT produce | + | **Extra** | Claims this model produced that are NOT in the baseline | + | **Contradicted** | Claims where this model asserts the opposite of the baseline | + | **Sufficiency Rate** | `reproduced / total_baseline_claims × 100%` | + +3. 
**Classify missing claims by impact.** For each missing claim, + assess its importance: + - **Critical miss**: A finding or recommendation that, if absent, + would leave a significant gap (e.g., missing a security + vulnerability the reference model found) + - **Minor miss**: An observation or caveat whose absence does not + meaningfully degrade the output + +4. **Classify extra claims.** For each claim the model produced that + the reference did not: + - **Valid addition**: A correct claim the reference model missed + (depth-variation in the model's favor) + - **Hallucination**: A claim with no basis in the input + - **Noise**: A low-value observation that adds length without + insight + +5. **Determine sufficiency.** A model is **sufficient** for this + prompt if: + - Sufficiency rate meets or exceeds the user-specified threshold + (default: 90%) + - Zero critical misses + - Zero contradicted claims + + A model is **conditionally sufficient** if: + - Sufficiency rate meets the threshold + - Has critical misses, but all are traceable to a + `model-capability-gap` divergence cause (the model cannot be + fixed by prompt hardening) + + A model is **insufficient** if it qualifies for neither tier. + +6. **Produce the model sufficiency matrix.** Rank models by cost tier + (if known) and sufficiency rate. The **minimum sufficient model** + is the cheapest model that meets the sufficiency threshold with + zero critical misses and zero contradictions. diff --git a/templates/evaluate-prompt-portability.md b/templates/evaluate-prompt-portability.md index a7bbce8..12f16dd 100644 --- a/templates/evaluate-prompt-portability.md +++ b/templates/evaluate-prompt-portability.md @@ -21,6 +21,8 @@ params: golden_input: "The deterministic test input to provide to each model along with the prompt" models: "Comma-separated list of model identifiers to evaluate. Default: claude-sonnet-4.5, gpt-4.1, claude-haiku-4.5" arbiter_model: "Model to use for claim extraction and semantic matching.
Default: use the current session model" + reference_model: "Optional — designate one model as the ground-truth baseline for sufficiency analysis. When set, the report includes a model sufficiency matrix showing which cheaper models reproduce the reference model's output. Omit for pure consensus analysis." + sufficiency_threshold: "Minimum percentage of reference model claims a cheaper model must reproduce to be considered sufficient. Default: 90" input_contract: null output_contract: type: portability-report @@ -49,6 +51,11 @@ text — you compare the **semantic claims** each model's output makes. **Arbiter Model**: {{arbiter_model}} (if blank, you are the arbiter) +**Reference Model**: {{reference_model}} (if blank, skip sufficiency +analysis — perform consensus analysis only) + +**Sufficiency Threshold**: {{sufficiency_threshold}}% (if blank, use 90%) + ## Instructions ### Step 1: Input Validation @@ -59,6 +66,8 @@ text — you compare the **semantic claims** each model's output makes. 3. Parse the model list. If any model identifier is not recognized by the execution environment, warn the user and ask whether to skip it or substitute. +4. If a reference model is specified, confirm it is included in the + model list. If not, add it automatically and inform the user. ### Step 2: Fan-Out Execution @@ -142,8 +151,15 @@ Apply Phase 6 and Phase 7 of the prompt-portability-evaluation protocol. 1. Compute the portability score. 2. Generate hardening recommendations for each fixable divergence. -3. Produce the full portability report in the portability-report - format. +3. If a reference model is specified, apply Phase 8 (Model + Sufficiency Analysis): + - Compute per-model sufficiency rates against the reference. + - Classify missing and extra claims. + - Determine sufficiency status for each model. + - Identify the minimum sufficient model. +4. 
Produce the full portability report in the portability-report + format, including the Model Sufficiency Matrix (section 10) if + a reference model was specified. ### Step 8: Interactive Review From 36106bc0218436c40ff19a6c2212e2c4c81ccd5b Mon Sep 17 00:00:00 2001 From: Alan Jowett Date: Thu, 9 Apr 2026 10:16:54 -0700 Subject: [PATCH 4/5] Fix template Steps 4-5 for Manual Review bucket consistency Steps 4 and 5 were inconsistent with the protocol's Manual Review bucket rules for uncertain matches. Step 4 only flagged uncertain matches instead of placing them in a Manual Review bucket. Step 5 classified all clusters without excluding Manual Review clusters from scoring. Fixed: - Step 4: uncertain matches now placed in Manual Review bucket, excluded from scored classification, reported under Uncertain / Needs Review section - Step 5: classifies only non-Manual-Review clusters, includes Manual Review count in the summary presented to user Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- templates/evaluate-prompt-portability.md | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/templates/evaluate-prompt-portability.md b/templates/evaluate-prompt-portability.md index 12f16dd..8c09370 100644 --- a/templates/evaluate-prompt-portability.md +++ b/templates/evaluate-prompt-portability.md @@ -122,17 +122,25 @@ Apply Phase 3 of the prompt-portability-evaluation protocol. 1. Build the claim universe from all model outputs. 2. Perform pairwise semantic matching across all claims. 3. Cluster matched claims and record match confidence. -4. Flag any `uncertain` matches for the user's attention. +4. If any claim cluster contains an `uncertain` semantic match, + place that cluster into a Manual Review bucket rather than a + scored classification bucket. +5. Exclude all Manual Review clusters from consensus/scoring + calculations in subsequent steps. +6. 
Record all Manual Review clusters for reporting under the + format's `Uncertain / Needs Review` section, and explicitly call + them out to the user before proceeding. ### Step 5: Consensus Classification Apply Phase 4 of the prompt-portability-evaluation protocol. -1. Classify each claim cluster: Consensus, Majority, Singular, or - Contradictory. +1. Classify each non-Manual-Review claim cluster: Consensus, + Majority, Singular, or Contradictory. 2. Present the classification summary to the user: - How many Consensus, Majority, Singular, and Contradictory clusters were found. + - How many clusters are in Manual Review (excluded from scoring). - Highlight any Contradictory clusters immediately — these are the highest-priority signal. From 98add2cf68af52df0dafaf4be14ac1970e3ec6a1 Mon Sep 17 00:00:00 2001 From: Alan Jowett Date: Thu, 9 Apr 2026 11:03:17 -0700 Subject: [PATCH 5/5] Fix heading nesting, section omission rule, and cluster type ambiguity - Fix 6c heading from ## to ### to match 6a/6b nesting under section 6 (Divergent Claims) - Section 10 (Model Sufficiency Matrix) now always included per the format's 'do not omit any section' rule, with a placeholder when no reference model is designated - Add canonical cluster type rule for scoring: when models assign different types to semantically matched claims, use the highest-weight type, break ties by majority, then arbiter decides. Original per-model types preserved for transparency. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- formats/portability-report.md | 9 +++++---- protocols/reasoning/prompt-portability-evaluation.md | 9 +++++++++ 2 files changed, 14 insertions(+), 4 deletions(-) diff --git a/formats/portability-report.md b/formats/portability-report.md index 9293917..dbb806b 100644 --- a/formats/portability-report.md +++ b/formats/portability-report.md @@ -132,7 +132,7 @@ other. This is the highest-signal section of the report. - **Prompt Region**: - **Analysis**: -## 6c. 
Uncertain / Needs Review +### 6c. Uncertain / Needs Review Claim clusters where semantic matching confidence was `uncertain`. These are excluded from the portability score until a human reviewer @@ -196,9 +196,10 @@ frontmatter if the divergence is persistent. ## 10. Model Sufficiency Matrix -*Include this section only when a reference model is designated. -If no reference model was specified, state "No reference model -designated — sufficiency analysis not performed."* +*Always include this section. If a reference model is designated, +include the full sufficiency analysis below. If no reference model was +specified, include only: "No reference model designated — sufficiency +analysis not performed."* **Reference Model**: **Sufficiency Threshold**: % diff --git a/protocols/reasoning/prompt-portability-evaluation.md b/protocols/reasoning/prompt-portability-evaluation.md index 5e99851..3b5e701 100644 --- a/protocols/reasoning/prompt-portability-evaluation.md +++ b/protocols/reasoning/prompt-portability-evaluation.md @@ -151,6 +151,15 @@ Compute a portability score for the evaluated prompt. - Singular = 0.0 - Contradictory = −1.0 + **Canonical cluster type rule.** When claims in a cluster were + assigned different types across models (e.g., one model calls it a + `finding`, another calls it a `recommendation`), assign the cluster + the type with the **highest weight** among its member claims. If + two types share the same weight, prefer the type that appears in + the majority of member claims. If still tied, the arbiter assigns + the canonical type. Record the original per-model types in the + claim cluster details for transparency. + 2. **Aggregate score.** First compute the weighted mean of all claim cluster scores, weighted by claim type priority: - `finding` weight = 3 (findings that differ are high-impact)
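Taken together, the scoring rules this patch series adds (claim-type weights, the canonical cluster type rule, Manual Review exclusion, and the `[-1, 1]` to `[0, 1]` normalization) can be sketched in Python. This is an illustrative sketch only, not part of the patches: the names `Cluster`, `canonical_type`, and `portability_score` are invented here, and the Consensus = 1.0 and Majority = 0.5 scores are assumptions, since the excerpt above shows only the Singular (0.0) and Contradictory (−1.0) values.

```python
from collections import Counter
from dataclasses import dataclass

# Claim-type weights from the protocol's scoring phase.
WEIGHTS = {"finding": 3, "recommendation": 2, "classification": 2,
           "observation": 1, "caveat": 1}

# Per-classification scores. Consensus and Majority values are
# assumptions; the patch excerpt shows only Singular and Contradictory.
SCORES = {"consensus": 1.0, "majority": 0.5, "singular": 0.0,
          "contradictory": -1.0}


@dataclass
class Cluster:
    member_types: list           # per-model claim types in the cluster
    classification: str          # consensus | majority | singular | contradictory
    manual_review: bool = False  # cluster contains an `uncertain` match


def canonical_type(member_types):
    """Canonical cluster type rule: highest weight wins, ties broken by
    majority among member claims (any remaining tie goes to the arbiter)."""
    counts = Counter(member_types)
    return max(counts, key=lambda t: (WEIGHTS[t], counts[t]))


def portability_score(clusters):
    """Weighted mean of cluster scores, normalized from [-1, 1] to [0, 1].
    Manual Review clusters are excluded from scoring entirely."""
    scored = [c for c in clusters if not c.manual_review]
    weights = [WEIGHTS[canonical_type(c.member_types)] for c in scored]
    raw = sum(w * SCORES[c.classification]
              for w, c in zip(weights, scored)) / sum(weights)
    return (raw + 1.0) / 2.0
```

For example, one consensus `finding` cluster, one singular `observation`, and one contradictory cluster whose members split between `finding` and `recommendation` (canonical type: `finding`, the higher weight) yield a raw weighted mean of 0.0 and a normalized score of 0.5, with any Manual Review clusters leaving the result unchanged.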