diff --git a/formats/portability-report.md b/formats/portability-report.md
new file mode 100644
index 0000000..dbb806b
--- /dev/null
+++ b/formats/portability-report.md
@@ -0,0 +1,260 @@
+
+
+
+---
+name: portability-report
+type: format
+description: >
+  Output format for prompt portability evaluation reports. Structures
+  cross-model comparison results as claim-level consensus analysis
+  with portability scoring and hardening recommendations.
+produces: portability-report
+---
+
+# Format: Portability Report
+
+The output MUST be a structured portability report documenting how a
+prompt performs across multiple LLM models. The unit of analysis is the
+**claim** — an atomic assertion each model's output makes. The report
+compares claims semantically, not textually.
+
+Do not omit any section. If a section has no content, state
+"None identified."
+
+## Document Structure
+
+```markdown
+# <prompt name> — Portability Report
+
+## 1. Evaluation Context
+
+| Field | Value |
+|-------|-------|
+| **Prompt** | Name or description of the evaluated prompt |
+| **Source Template** | PromptKit template used to assemble the prompt (if applicable) |
+| **Golden Input** | Description of the test input provided |
+| **Models Evaluated** | Comma-separated list of model identifiers |
+| **Arbiter Model** | Model used for claim extraction and matching |
+| **Evaluation Date** | When the evaluation was performed |
+
+## 2. Portability Summary
+
+**Overall Score**: <score> / 1.00 — **<rating>**
+
+| Metric | Value |
+|--------|-------|
+| **Total Unique Claims** | <N> |
+| **Consensus Claims** | <N> (<percent>%) |
+| **Majority Claims** | <N> (<percent>%) |
+| **Singular Claims** | <N> (<percent>%) |
+| **Contradictory Claims** | <N> (<percent>%) |
+
+<2–4 sentence interpretation of the overall portability posture.>
+
+## 3. Per-Model Output Summaries
+
+For each model, provide a brief characterization of its output:
+
+### Model: <model-id>
+
+- **Output Length**: <approximate length>
+- **Sections Produced**: <section list>
+- **Claims Extracted**: <count>
+- **Notable Characteristics**: <1–2 sentences on style, depth, or
+  approach differences>
+
+## 4. Consensus Core
+
+Claims that ALL models produced. These represent the semantically stable
+output of this prompt.
+
+### Claim Cluster CC-<NNN>: <claim summary>
+
+| Field | Value |
+|-------|-------|
+| **Claim** | <normalized claim text> |
+| **Type** | finding · recommendation · observation · caveat · classification |
+| **Classification** | Consensus |
+| **Models** | All (<N>) |
+
+<Repeat this cluster block for each consensus claim.>
+
+## 5. Majority Claims
+
+Claims that most (but not all) models produced.
+
+### Claim Cluster MC-<NNN>: <claim summary>
+
+| Field | Value |
+|-------|-------|
+| **Claim** | <normalized claim text> |
+| **Type** | finding · recommendation · observation · caveat · classification |
+| **Classification** | Majority |
+| **Models Present** | <model list> |
+| **Models Absent** | <model list> |
+| **Likely Cause** | <brief hypothesis> |
+
+## 6. Divergent Claims
+
+Claims produced by only one model, or where models contradicted each
+other. This is the highest-signal section of the report.
+
+### 6a. Singular Claims
+
+### Claim Cluster SC-<NNN>: <claim summary>
+
+| Field | Value |
+|-------|-------|
+| **Claim** | <normalized claim text> |
+| **Type** | finding · recommendation · observation · caveat · classification |
+| **Classification** | Singular |
+| **Source Model** | <model-id> |
+| **Divergence Cause** | ambiguous-instruction · underspecified-scope · model-capability-gap · hallucination · depth-variation · format-interpretation |
+| **Prompt Region** | <exact quote from the prompt> |
+| **Analysis** | <why this model alone produced the claim> |
+
+### 6b. Contradictory Claims
+
+### Claim Cluster XC-<NNN>: <subject>
+
+| Field | Value |
+|-------|-------|
+| **Subject** | <what the models disagree about> |
+| **Classification** | Contradictory |
+
+| Model | Assertion |
+|-------|-----------|
+| <model-id> | <assertion> |
+| <model-id> | <assertion> |
+
+- **Divergence Cause**: <cause>
+- **Prompt Region**: <exact quote from the prompt>
+- **Analysis**: <how the ambiguity produces the contradiction>
+
+### 6c. Uncertain / Needs Review
+
+Claim clusters where semantic matching confidence was `uncertain`.
+These are excluded from the portability score until a human reviewer
+resolves the match and assigns a standard classification.
+
+### Claim Cluster UR-<NNN>: <subject>
+
+| Field | Value |
+|-------|-------|
+| **Claim (Model A)** | <claim text> |
+| **Claim (Model B)** | <claim text> |
+| **Match Confidence** | uncertain |
+| **Reason** | <why the match is uncertain> |
+| **Reviewer Action Needed** | Classify as: Consensus match / Majority match / Distinct claims (Singular) / Contradictory |
+
+If no uncertain clusters exist, state "None identified."
+
+## 7. Portability Scorecard
+
+### Per-Claim-Type Breakdown
+
+| Claim Type | Total | Consensus | Majority | Singular | Contradictory | Type Score |
+|------------|-------|-----------|----------|----------|---------------|------------|
+| finding | <N> | <N> | <N> | <N> | <N> | <score> |
+| recommendation | <N> | <N> | <N> | <N> | <N> | <score> |
+| classification | <N> | <N> | <N> | <N> | <N> | <score> |
+| observation | <N> | <N> | <N> | <N> | <N> | <score> |
+| caveat | <N> | <N> | <N> | <N> | <N> | <score> |
+
+### Per-Model Agreement Rate
+
+| Model | Claims Produced | In Consensus | In Majority | Singular | In Contradiction | Agreement Rate |
+|-------|-----------------|--------------|-------------|----------|------------------|----------------|
+| <model-id> | <N> | <N> | <N> | <N> | <N> | <percent>% |
+
+## 8. Hardening Recommendations
+
+Specific prompt rewrites to improve portability. Ordered by impact
+(Contradictory fixes first, then Singular with high-weight claim types).
+
+### HR-<NNN>: <short title>
+
+- **Target Claim(s)**: <claim cluster IDs>
+- **Current Prompt Language**:
+  > <exact quote>
+- **Problem**: <why this language produces divergence>
+- **Recommended Rewrite**:
+  > <replacement text>
+- **Expected Effect**: <which models change behavior and why>
+- **Confidence**: High · Medium · Low
+
+## 9. Model Notes
+
+Observations about model-specific behavior that no prompt rewrite can
+address. These should be recorded in the template's `model_notes`
+frontmatter if the divergence is persistent.
+
+| Model | Observation | Recommended `model_notes` Entry |
+|-------|-------------|---------------------------------|
+| <model-id> | <observation> | <suggested entry> |
+
+## 10. Model Sufficiency Matrix
+
+*Always include this section. If a reference model is designated,
+include the full sufficiency analysis below. If no reference model was
+specified, include only: "No reference model designated — sufficiency
+analysis not performed."*
+
+**Reference Model**: <model-id>
+**Sufficiency Threshold**: <N>%
+
+### Sufficiency Summary
+
+| Model | Tier | Reproduced | Missing | Extra | Contradicted | Sufficiency Rate | Critical Misses | Status |
+|-------|------|:----------:|:-------:|:-----:|:------------:|:----------------:|:---------------:|--------|
+| <model-id> | reference | — | — | — | — | baseline | — | baseline |
+| <model-id> | <tier> | <N> | <N> | <N> | <N> | <percent>% | <N> | Sufficient · Conditionally Sufficient · Insufficient |
+
+**Minimum Sufficient Model**: <model-id>
+
+### Missing Claim Details
+
+For each model with missing claims, list what was missed:
+
+#### <model-id>
+
+| Baseline Claim | Type | Impact | Analysis |
+|----------------|------|--------|----------|
+| <claim text> | finding · recommendation · observation · caveat | Critical miss · Minor miss | <analysis> |
+
+### Extra Claim Details
+
+For each model with extra claims not in the baseline:
+
+#### <model-id>
+
+| Extra Claim | Type | Classification | Analysis |
+|-------------|------|----------------|----------|
+| <claim text> | finding · recommendation · observation · caveat | Valid addition · Hallucination · Noise | <analysis> |
+
+### Cost-Efficiency Recommendation
+
+<1–3 sentences recommending which model to use for this prompt,
+balancing sufficiency rate, cost, and speed. If the minimum sufficient
+model matches the reference model, state that no cheaper alternative
+produced equivalent output and recommend prompt hardening to enable
+cheaper models.>
+```
+
+## Formatting Rules
+
+1. **Claim IDs** use prefixes: `CC-` (Consensus), `MC-` (Majority),
+   `SC-` (Singular), `XC-` (Contradictory), `UR-` (Uncertain / Needs
+   Review), followed by a zero-padded three-digit number.
+2. **Hardening Recommendation IDs** use the `HR-` prefix.
+3. **Score precision**: Report the portability score to two decimal
+   places.
+4. 
**Ordering**: Within each section, order claims by type priority + (`finding` > `recommendation` > `classification` > `observation` > + `caveat`), then by claim cluster number. +5. **Brevity in summaries**: Per-model output summaries should be + concise (3–5 lines each). The full raw outputs are not included in + the report — only claim extractions. +6. **Quote fidelity**: Prompt regions cited in divergence analysis + must be exact quotes, not paraphrases. diff --git a/manifest.yaml b/manifest.yaml index fc37ab6..e096815 100644 --- a/manifest.yaml +++ b/manifest.yaml @@ -631,6 +631,14 @@ protocols: copying from external sources, screens for confidential or internal-only content, and verifies license compliance. + - name: prompt-portability-evaluation + path: protocols/reasoning/prompt-portability-evaluation.md + description: > + Systematic methodology for evaluating prompt portability across + LLM models. Decomposes outputs into atomic claims, performs + cross-model semantic matching, and classifies consensus levels + to identify fragile prompt language. + formats: - name: requirements-doc path: formats/requirements-doc.md @@ -845,6 +853,14 @@ formats: notes, a presentation timeline, and an optional demo plan. All artifacts form a cohesive presentation kit. + - name: portability-report + path: formats/portability-report.md + produces: portability-report + description: > + Output format for prompt portability evaluation reports. Structures + cross-model comparison results as claim-level consensus analysis + with portability scoring and hardening recommendations. + taxonomies: - name: stack-lifetime-hazards path: taxonomies/stack-lifetime-hazards.md @@ -1195,6 +1211,18 @@ templates: protocols: [anti-hallucination, self-verification, prompt-determinism-analysis] format: investigation-report + - name: evaluate-prompt-portability + path: templates/evaluate-prompt-portability.md + description: > + Evaluate a PromptKit-assembled prompt's portability across LLM + models. 
Runs the prompt against multiple models with a golden + input, decomposes outputs into semantic claims, performs + cross-model consensus analysis, and produces a portability + report with hardening recommendations. + persona: specification-analyst + protocols: [anti-hallucination, self-verification, prompt-portability-evaluation] + format: portability-report + standards: - name: extract-rfc-requirements path: templates/extract-rfc-requirements.md diff --git a/protocols/reasoning/prompt-portability-evaluation.md b/protocols/reasoning/prompt-portability-evaluation.md new file mode 100644 index 0000000..3b5e701 --- /dev/null +++ b/protocols/reasoning/prompt-portability-evaluation.md @@ -0,0 +1,267 @@ + + + +--- +name: prompt-portability-evaluation +type: reasoning +description: > + Systematic methodology for evaluating prompt portability across LLM + models. Decomposes model outputs into atomic claims, performs + cross-model semantic matching, and classifies consensus levels to + identify fragile prompt language. +applicable_to: + - evaluate-prompt-portability +--- + +# Protocol: Prompt Portability Evaluation + +Apply this protocol when evaluating whether a prompt produces +semantically equivalent outputs across different LLM models. The unit +of comparison is the **claim** — an atomic assertion the output makes — +not the raw text. Two outputs that say the same thing in different words +are equivalent; two outputs with identical structure but different +conclusions are divergent. + +## Phase 1: Output Collection + +1. **Validate inputs.** Confirm the assembled prompt is non-empty and + the golden input is non-empty. If either is missing, stop and report + the error. +2. **Enumerate target models.** List the models to evaluate. If the + user did not provide a list, use the default set: + `claude-sonnet-4.5`, `gpt-4.1`, `claude-haiku-4.5`. +3. **Execute the prompt against each model.** For each model: + - Provide the identical assembled prompt and golden input. 
+ - Use the same system context and tool availability where possible. + - Record the model identifier and the complete raw output. +4. **Launch evaluations in parallel** when the execution environment + supports it (e.g., parallel sub-agents with model overrides). Do NOT + run models sequentially if parallel execution is available. + +## Phase 2: Claim Extraction + +For each model's raw output, extract a set of **atomic claims**. A claim +is a single, self-contained assertion that the output makes. Apply the +same extraction procedure identically to every output. + +1. **Read the output end-to-end.** Identify every discrete assertion, + finding, recommendation, observation, or caveat. +2. **Normalize each claim** into a structured record: + + | Field | Description | + |-------|-------------| + | `claim_id` | Sequential ID within this model's output (e.g., `M1-C001`) | + | `claim_text` | The normalized assertion in declarative form | + | `section` | Which output section contains this claim | + | `type` | `finding` · `recommendation` · `observation` · `caveat` · `classification` | + | `specificity` | `concrete` (cites evidence/location) · `general` (abstract statement) | + +3. **Granularity rule.** Each claim must be atomic — it asserts exactly + one thing. If a sentence contains two assertions ("X is true and Y + is also true"), split into two claims. +4. **Extraction exclusions.** Do NOT extract: + - Boilerplate preambles ("I'll analyze this code for…") + - Section headings or structural markers + - Restatements of the input prompt or golden input + - Meta-commentary about the model's own process + +## Phase 3: Cross-Model Claim Matching + +Compare claim sets across all models to identify semantic equivalences. + +1. **Build a claim universe.** Collect all claims from all models into + a single pool. +2. 
**Pairwise semantic matching.** For each claim from model A, check + every claim from every other model: + - **Match**: The two claims assert the same thing, even if worded + differently. Evidence: they would have the same truth value in all + plausible interpretations of the golden input. + - **Partial match**: The claims overlap but one is more specific or + covers a subset of the other's assertion. + - **No match**: The claims assert different things. + - **Contradiction**: The claims assert mutually exclusive things + about the same subject. +3. **Cluster matched claims.** Group semantically equivalent claims + into **claim clusters**. Each cluster represents one unique + assertion that one or more models made. +4. **Record match confidence.** For each match decision, note the + confidence: `definite` (identical meaning), `likely` (same meaning, + different framing), `uncertain` (possibly the same, possibly + different). + +## Phase 4: Consensus Classification + +Classify each claim cluster by how many models produced it. + +| Classification | Criterion | Interpretation | +|----------------|-----------|----------------| +| **Consensus** | All models produced this claim | Semantically stable — the prompt reliably elicits this assertion | +| **Majority** | >50% of models (but not all) produced this claim | Likely valid but not universally elicited — prompt may be ambiguous | +| **Singular** | Exactly one model produced this claim | Possible hallucination, unique insight, or model-specific interpretation | +| **Contradictory** | Two or more models assert mutually exclusive things | The prompt is ambiguous on this point — different models resolve the ambiguity differently | + +Rules: +- A claim cluster with any `uncertain` match must be placed in a + **Manual Review** bucket rather than auto-classified as Consensus, + Majority, Singular, or Contradictory. 
+- Manual Review clusters are **excluded from the portability score** + until a human reviewer resolves the uncertain match and assigns the + cluster to a standard classification. +- Report Manual Review clusters in a separate portability report + section named **Uncertain / Needs Review**. Do not include them under + the standard classification counts until they are resolved. +- Singular claims from high-capability models are not automatically + hallucinations — they may represent deeper analysis that other models + missed. Note the model capability tier. +- Contradictory claims are the highest-priority signal. Always trace + these to specific prompt language in Phase 5. + +## Phase 5: Divergence Root Cause Analysis + +For each Singular or Contradictory claim cluster: + +1. **Identify the prompt region** that the claim responds to. Which + instruction, protocol phase, or format requirement produced this + claim? +2. **Analyze the prompt language.** What about the instruction is + ambiguous, underspecified, or open to interpretation? + - Vague quantifiers ("several", "a few") + - Subjective adjectives ("important", "significant") + - Missing scope bounds ("analyze the code" — which code?) + - Implicit assumptions (domain knowledge the prompt assumes) + - Competing instructions (two directives that could conflict) +3. 
**Classify the divergence cause:** + + | Cause | Description | + |-------|-------------| + | `ambiguous-instruction` | The prompt instruction has multiple valid interpretations | + | `underspecified-scope` | The prompt does not bound what to examine or how deeply | + | `model-capability-gap` | One model lacks the capability to follow the instruction | + | `hallucination` | One model fabricated a claim with no basis in the input | + | `depth-variation` | Models analyzed to different depths — all are correct, but some are more thorough | + | `format-interpretation` | Models interpreted the output format requirements differently | + +## Phase 6: Portability Scoring + +Compute a portability score for the evaluated prompt. + +1. **Claim-level scoring.** For each claim cluster: + - Consensus = 1.0 + - Majority = 0.5 + - Singular = 0.0 + - Contradictory = −1.0 + + **Canonical cluster type rule.** When claims in a cluster were + assigned different types across models (e.g., one model calls it a + `finding`, another calls it a `recommendation`), assign the cluster + the type with the **highest weight** among its member claims. If + two types share the same weight, prefer the type that appears in + the majority of member claims. If still tied, the arbiter assigns + the canonical type. Record the original per-model types in the + claim cluster details for transparency. + +2. 
**Aggregate score.** First compute the weighted mean of all claim + cluster scores, weighted by claim type priority: + - `finding` weight = 3 (findings that differ are high-impact) + - `recommendation` weight = 2 + - `classification` weight = 2 + - `observation` weight = 1 + - `caveat` weight = 1 + + Then normalize the weighted mean from the `[-1.0, 1.0]` range into + the final portability score in the `[0.0, 1.0]` range using: + + `portability_score = (raw_weighted_mean + 1.0) / 2.0` + + This preserves the stronger penalty for contradictory claims while + ensuring the reported portability score cannot be negative or exceed + `1.0`. + +3. **Interpret the score:** + + | Score Range | Rating | Interpretation | + |-------------|--------|----------------| + | ≥ 0.85 | **Portable** | Prompt produces semantically equivalent output across models | + | 0.60–0.84 | **Mostly Portable** | Core findings are stable; peripheral claims vary | + | 0.35–0.59 | **Fragile** | Significant divergence — prompt hardening needed | + | < 0.35 | **Model-Dependent** | Output varies substantially — prompt needs major revision | + +## Phase 7: Hardening Recommendations + +For each Singular or Contradictory claim cluster, propose a specific +prompt rewrite that would move the claim toward Consensus. + +1. **State the current prompt language** (exact quote). +2. **Explain why it produces divergence** (from Phase 5 analysis). +3. **Propose a rewrite** that eliminates the ambiguity. The rewrite + must be concrete — not "make this more specific" but the actual + replacement text. +4. **Predict the effect** — which models would change behavior and why. + +Rules: +- Rewrites must not change the prompt's intent — only its precision. +- Rewrites must follow PromptKit conventions (imperative mood, numbered + phases, explicit scope bounds). +- If a divergence is caused by `model-capability-gap`, note that no + prompt rewrite can fix it. Instead recommend a `model_notes` entry. 
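The Phase 6 scoring rules above reduce to a short calculation. A minimal sketch in Python (the function and variable names are illustrative, not part of the protocol; Manual Review clusters are assumed to be already excluded, per Phase 4):

```python
# Cluster scores (Phase 6, step 1) and claim-type weights (Phase 6, step 2).
CLUSTER_SCORES = {"Consensus": 1.0, "Majority": 0.5,
                  "Singular": 0.0, "Contradictory": -1.0}
TYPE_WEIGHTS = {"finding": 3, "recommendation": 2, "classification": 2,
                "observation": 1, "caveat": 1}

def portability_score(clusters):
    """clusters: (classification, canonical_type) pairs; Manual Review
    clusters must already be excluded before calling this."""
    total_weight = sum(TYPE_WEIGHTS[t] for _, t in clusters)
    raw = sum(TYPE_WEIGHTS[t] * CLUSTER_SCORES[c] for c, t in clusters) / total_weight
    # Normalize the weighted mean from [-1.0, 1.0] into [0.0, 1.0],
    # reported to two decimal places.
    return round((raw + 1.0) / 2.0, 2)

clusters = [
    ("Consensus", "finding"),        # contributes 3 * 1.0
    ("Majority", "recommendation"),  # contributes 2 * 0.5
    ("Contradictory", "finding"),    # contributes 3 * -1.0
    ("Singular", "observation"),     # contributes 1 * 0.0
]
print(portability_score(clusters))  # (3.0 + 1.0 - 3.0 + 0.0) / 9 -> 0.56
```

Against the interpretation table, 0.56 lands in the Fragile band — the single contradictory finding, carrying both the −1.0 score and the weight of 3, pulls the score down sharply, which is the intended penalty.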
+ +## Phase 8: Model Sufficiency Analysis (Reference Model Mode) + +This phase executes only when a **reference model** is designated. Skip +this phase entirely if no reference model is specified. + +When a reference model is designated, its claim set becomes the +**baseline** — the ground truth against which all other models are +measured. This shifts the analysis from "do models agree?" (consensus) +to "does model X reproduce the reference model's output?" (sufficiency). + +1. **Designate the baseline claim set.** The reference model's + extracted claims from Phase 2 become the baseline. Every claim in + the baseline is a **required claim**. + +2. **Per-model sufficiency scoring.** For each non-reference model, + compute: + + | Metric | Definition | + |--------|------------| + | **Reproduced** | Claims from the baseline that this model also produced (matched in Phase 3) | + | **Missing** | Claims from the baseline that this model did NOT produce | + | **Extra** | Claims this model produced that are NOT in the baseline | + | **Contradicted** | Claims where this model asserts the opposite of the baseline | + | **Sufficiency Rate** | `reproduced / total_baseline_claims × 100%` | + +3. **Classify missing claims by impact.** For each missing claim, + assess its importance: + - **Critical miss**: A finding or recommendation that, if absent, + would leave a significant gap (e.g., missing a security + vulnerability the reference model found) + - **Minor miss**: An observation or caveat whose absence does not + meaningfully degrade the output + +4. **Classify extra claims.** For each claim the model produced that + the reference did not: + - **Valid addition**: A correct claim the reference model missed + (depth-variation in the model's favor) + - **Hallucination**: A claim with no basis in the input + - **Noise**: A low-value observation that adds length without + insight + +5. 
**Determine sufficiency.** A model is **sufficient** for this
   prompt if:
   - Sufficiency rate meets or exceeds the user-specified threshold
     (default: 90%)
   - Zero critical misses
   - Zero contradicted claims

   A model is **conditionally sufficient** if:
   - Sufficiency rate meets the threshold
   - Has critical misses, but all are traceable to a
     `model-capability-gap` divergence cause (the divergence cannot
     be fixed by prompt hardening)

   A model is **insufficient** if it meets neither definition.

6. **Produce the model sufficiency matrix.** Rank models by cost tier
   (if known) and sufficiency rate. The **minimum sufficient model**
   is the cheapest model that meets the sufficiency threshold with
   zero critical misses and zero contradictions.
diff --git a/templates/evaluate-prompt-portability.md b/templates/evaluate-prompt-portability.md
new file mode 100644
index 0000000..8c09370
--- /dev/null
+++ b/templates/evaluate-prompt-portability.md
@@ -0,0 +1,216 @@
+
+
+
+---
+name: evaluate-prompt-portability
+mode: interactive
+description: >
+  Evaluate a PromptKit-assembled prompt's portability across LLM
+  models. Runs the prompt against multiple models with a golden
+  input, decomposes outputs into semantic claims, performs
+  cross-model consensus analysis, and produces a portability report
+  with hardening recommendations.
+persona: specification-analyst
+protocols:
+  - guardrails/anti-hallucination
+  - guardrails/self-verification
+  - reasoning/prompt-portability-evaluation
+format: portability-report
+params:
+  assembled_prompt: "The complete assembled prompt to evaluate (full text or file path)"
+  golden_input: "The deterministic test input to provide to each model along with the prompt"
+  models: "Comma-separated list of model identifiers to evaluate. Default: claude-sonnet-4.5, gpt-4.1, claude-haiku-4.5"
+  arbiter_model: "Model to use for claim extraction and semantic matching. 
Default: use the current session model" + reference_model: "Optional — designate one model as the ground-truth baseline for sufficiency analysis. When set, the report includes a model sufficiency matrix showing which cheaper models reproduce the reference model's output. Omit for pure consensus analysis." + sufficiency_threshold: "Minimum percentage of reference model claims a cheaper model must reproduce to be considered sufficient. Default: 90" +input_contract: null +output_contract: + type: portability-report + description: > + A structured portability report documenting claim-level consensus + across models, divergence analysis, portability scoring, and + hardening recommendations. +--- + +# Task: Evaluate Prompt Portability + +You are tasked with evaluating whether a prompt produces semantically +equivalent outputs across different LLM models. You do NOT compare raw +text — you compare the **semantic claims** each model's output makes. + +## Inputs + +**Assembled Prompt**: +{{assembled_prompt}} + +**Golden Input**: +{{golden_input}} + +**Target Models**: {{models}} (if blank, use: `claude-sonnet-4.5`, +`gpt-4.1`, `claude-haiku-4.5`) + +**Arbiter Model**: {{arbiter_model}} (if blank, you are the arbiter) + +**Reference Model**: {{reference_model}} (if blank, skip sufficiency +analysis — perform consensus analysis only) + +**Sufficiency Threshold**: {{sufficiency_threshold}}% (if blank, use 90%) + +## Instructions + +### Step 1: Input Validation + +1. Confirm the assembled prompt is non-empty. If it is a file path, + read the file. If the content is empty, stop and report the error. +2. Confirm the golden input is non-empty. +3. Parse the model list. If any model identifier is not recognized by + the execution environment, warn the user and ask whether to skip + it or substitute. +4. If a reference model is specified, confirm it is included in the + model list. If not, add it automatically and inform the user. 
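The model-list handling in Step 1 can be sketched as follows (a minimal sketch; the function name and list representation are assumptions, and a real implementation would also surface the warnings and confirmations described above):

```python
# Default model set from the template parameters.
DEFAULT_MODELS = ["claude-sonnet-4.5", "gpt-4.1", "claude-haiku-4.5"]

def resolve_models(models_param, reference_model=None):
    """Parse the comma-separated `models` parameter, falling back to the
    default list and auto-including the reference model (Step 1.4)."""
    if models_param and models_param.strip():
        models = [m.strip() for m in models_param.split(",") if m.strip()]
    else:
        models = list(DEFAULT_MODELS)
    if reference_model and reference_model not in models:
        models.append(reference_model)  # added automatically; inform the user
    return models

print(resolve_models("", reference_model="claude-haiku-4.5"))
# reference model already present in the defaults, so nothing is appended
```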
+ +### Step 2: Fan-Out Execution + +Execute the assembled prompt against each target model using the +golden input. The goal is to collect raw outputs from every model +for the same prompt + input combination. + +1. For each model in the target list, launch a parallel execution: + - Provide the full assembled prompt as the system/task instruction. + - Provide the golden input as the user message or task context. + - Record the complete raw output. +2. **Execution mechanism**: Use the execution environment's parallel + agent or subprocess capabilities. For example, in environments + with sub-agent support, launch one agent per model with the model + override parameter. In environments without parallel execution, + run sequentially. +3. Wait for all executions to complete. If any model fails (timeout, + API error, content filter), record the failure and proceed with + the remaining models. Count the number of successful model + outputs after all executions finish. +4. A minimum of 2 successful model outputs is required to produce a + meaningful comparison. If fewer than 2 models succeed, do **not** + continue to claim extraction, semantic matching, consensus + analysis, portability scoring, or hardening recommendations. + Instead, stop and produce an abbreviated report that includes: + - the full target model list, + - which models succeeded, + - which models failed, + - the failure reason for each failed model if known, and + - a clear statement that the evaluation ended early because there + were insufficient successful outputs for cross-model comparison. +5. Only proceed to Step 3 if at least 2 model executions succeeded. + +### Step 3: Claim Extraction + +Apply Phase 2 of the prompt-portability-evaluation protocol to each +model's raw output. + +1. For each model output, extract every atomic claim into the + normalized claim record structure: `claim_id`, `claim_text`, + `section`, `type`, `specificity`. +2. 
Use the same extraction approach for all outputs — the arbiter + (you, or the designated arbiter model) processes each output + identically. +3. Present the claim counts per model to the user before proceeding. + If any model produced zero claims, flag it as a potential failure. + +### Step 4: Cross-Model Semantic Matching + +Apply Phase 3 of the prompt-portability-evaluation protocol. + +1. Build the claim universe from all model outputs. +2. Perform pairwise semantic matching across all claims. +3. Cluster matched claims and record match confidence. +4. If any claim cluster contains an `uncertain` semantic match, + place that cluster into a Manual Review bucket rather than a + scored classification bucket. +5. Exclude all Manual Review clusters from consensus/scoring + calculations in subsequent steps. +6. Record all Manual Review clusters for reporting under the + format's `Uncertain / Needs Review` section, and explicitly call + them out to the user before proceeding. + +### Step 5: Consensus Classification + +Apply Phase 4 of the prompt-portability-evaluation protocol. + +1. Classify each non-Manual-Review claim cluster: Consensus, + Majority, Singular, or Contradictory. +2. Present the classification summary to the user: + - How many Consensus, Majority, Singular, and Contradictory + clusters were found. + - How many clusters are in Manual Review (excluded from scoring). + - Highlight any Contradictory clusters immediately — these are the + highest-priority signal. + +### Step 6: Divergence Analysis + +Apply Phase 5 of the prompt-portability-evaluation protocol. + +For each Singular and Contradictory claim cluster: +1. Identify the prompt region that produced the divergence. +2. Classify the divergence cause. +3. Assess whether a prompt rewrite could address it. + +### Step 7: Scoring and Reporting + +Apply Phase 6 and Phase 7 of the prompt-portability-evaluation protocol. + +1. Compute the portability score. +2. 
Generate hardening recommendations for each fixable divergence. +3. If a reference model is specified, apply Phase 8 (Model + Sufficiency Analysis): + - Compute per-model sufficiency rates against the reference. + - Classify missing and extra claims. + - Determine sufficiency status for each model. + - Identify the minimum sufficient model. +4. Produce the full portability report in the portability-report + format, including the Model Sufficiency Matrix (section 10) if + a reference model was specified. + +### Step 8: Interactive Review + +After producing the report: + +1. Present the portability score and the top 3 most impactful + findings to the user. +2. Ask if they want to: + - Drill into specific divergent claims + - Apply the hardening recommendations to the original prompt + - Re-run the evaluation with additional models + - Export the report + +## Complementary Templates + +This template evaluates a prompt's portability *empirically* — by +running it and comparing outputs. For *static* analysis of prompt +language precision, use the `lint-prompt` template with the +`prompt-determinism-analysis` protocol. The recommended workflow is: + +1. **Lint first** (`lint-prompt`) — identify and fix determinism issues + in the prompt language statically. +2. **Evaluate second** (`evaluate-prompt-portability`) — verify the + fixes improved cross-model consistency empirically. + +## Non-Goals + +- This template does NOT measure output *quality* — only cross-model + *consistency*. A prompt that produces consistently wrong output + across all models will score as "Portable." +- This template does NOT modify the evaluated prompt. It produces + recommendations. The user applies them. +- This template does NOT benchmark model performance or rank models. + It evaluates the prompt's sensitivity to model choice. 
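The sufficiency determination applied in Step 7 (protocol Phase 8, step 5) can be sketched as a small decision function (names are illustrative; the protocol states the zero-contradiction requirement only for the fully sufficient tier, and this sketch follows that wording literally):

```python
def sufficiency_status(reproduced, baseline_total, critical_misses,
                       contradicted, all_misses_capability_gap=False,
                       threshold=90.0):
    """Classify one non-reference model against the baseline claim set."""
    rate = 100.0 * reproduced / baseline_total
    if rate >= threshold and critical_misses == 0 and contradicted == 0:
        return rate, "Sufficient"
    if rate >= threshold and critical_misses > 0 and all_misses_capability_gap:
        return rate, "Conditionally Sufficient"
    return rate, "Insufficient"

print(sufficiency_status(reproduced=18, baseline_total=20,
                         critical_misses=0, contradicted=0))
# 18/20 = 90.0%, no critical misses, no contradictions -> Sufficient
```

If the conditional tier should also exclude contradicted claims, add `contradicted == 0` to the second branch.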
+ +## Quality Checklist + +Before finalizing the report, verify: + +- [ ] All target models were executed (or failures documented) +- [ ] Claim extraction used the same procedure for all outputs +- [ ] Every claim cluster has a classification with justification +- [ ] Contradictory claims cite the exact prompt language causing divergence +- [ ] Hardening recommendations are concrete rewrites, not vague advice +- [ ] The portability score computation is shown (not just the result) +- [ ] Model Notes section is populated for capability-gap divergences