diff --git a/formats/portability-report.md b/formats/portability-report.md
new file mode 100644
index 0000000..dbb806b
--- /dev/null
+++ b/formats/portability-report.md
@@ -0,0 +1,260 @@
+
+
+
+---
+name: portability-report
+type: format
+description: >
+  Output format for prompt portability evaluation reports. Structures
+  cross-model comparison results as claim-level consensus analysis
+  with portability scoring and hardening recommendations.
+produces: portability-report
+---
+
+# Format: Portability Report
+
+The output MUST be a structured portability report documenting how a
+prompt performs across multiple LLM models. The unit of analysis is the
+**claim** — an atomic assertion each model's output makes. The report
+compares claims semantically, not textually.
+
+Do not omit any section. If a section has no content, state
+"None identified."
+
+## Document Structure
+
+```markdown
+# <prompt name> — Portability Report
+
+## 1. Evaluation Context
+
+| Field | Value |
+|-------|-------|
+| **Prompt** | Name or description of the evaluated prompt |
+| **Source Template** | PromptKit template used to assemble the prompt (if applicable) |
+| **Golden Input** | Description of the test input provided |
+| **Models Evaluated** | Comma-separated list of model identifiers |
+| **Arbiter Model** | Model used for claim extraction and matching |
+| **Evaluation Date** | When the evaluation was performed |
+
+## 2. Portability Summary
+
+**Overall Score**: <score> / 1.00 — **<rating>**
+
+| Metric | Value |
+|--------|-------|
+| **Total Unique Claims** | <N> |
+| **Consensus Claims** | <N> (<percent>%) |
+| **Majority Claims** | <N> (<percent>%) |
+| **Singular Claims** | <N> (<percent>%) |
+| **Contradictory Claims** | <N> (<percent>%) |
+
+<2–4 sentence interpretation of the overall portability posture.>
+
+## 3. Per-Model Output Summaries
+
+For each model, provide a brief characterization of its output:
+
+### Model: <model-id>
+
+- **Output Length**: <approximate length>
+- **Sections Produced**: <section list>
+- **Claims Extracted**: <count>
+- **Notable Characteristics**: <1–2 sentences on style, depth, or
+  approach differences>
+
+## 4. Consensus Core
+
+Claims that ALL models produced. These represent the semantically stable
+output of this prompt.
+
+### Claim Cluster CC-<NNN>: <claim summary>
+
+| Field | Value |
+|-------|-------|
+| **Claim** | <normalized claim text> |
+| **Type** | finding · recommendation · observation · caveat · classification |
+| **Classification** | Consensus |
+| **Models** | All (<N>) |
+
+<Repeat this cluster block for each consensus claim.>
+
+## 5. Majority Claims
+
+Claims that most (but not all) models produced.
+
+### Claim Cluster MC-<NNN>: <claim summary>
+
+| Field | Value |
+|-------|-------|
+| **Claim** | <normalized claim text> |
+| **Type** | finding · recommendation · observation · caveat · classification |
+| **Classification** | Majority |
+| **Models Present** | <model list> |
+| **Models Absent** | <model list> |
+| **Likely Cause** | <brief hypothesis> |
+
+## 6. Divergent Claims
+
+Claims produced by only one model, or where models contradicted each
+other. This is the highest-signal section of the report.
+
+### 6a. Singular Claims
+
+### Claim Cluster SC-<NNN>: <claim summary>
+
+| Field | Value |
+|-------|-------|
+| **Claim** | <normalized claim text> |
+| **Type** | finding · recommendation · observation · caveat · classification |
+| **Classification** | Singular |
+| **Source Model** | <model-id> |
+| **Divergence Cause** | ambiguous-instruction · underspecified-scope · model-capability-gap · hallucination · depth-variation · format-interpretation |
+| **Prompt Region** | <exact quote from the prompt> |
+| **Analysis** | <why this model alone produced the claim> |
+
+### 6b. Contradictory Claims
+
+### Claim Cluster XC-<NNN>: <subject>
+
+| Field | Value |
+|-------|-------|
+| **Subject** | <what the models disagree about> |
+| **Classification** | Contradictory |
+
+| Model | Assertion |
+|-------|-----------|
+| <model-id> | <assertion> |
+| <model-id> | <assertion> |
+
+- **Divergence Cause**: <cause>
+- **Prompt Region**: <exact quote from the prompt>
+- **Analysis**: <how the ambiguity produces the contradiction>
+
+### 6c. Uncertain / Needs Review
+
+Claim clusters where semantic matching confidence was `uncertain`.
+These are excluded from the portability score until a human reviewer
+resolves the match and assigns a standard classification.
+
+### Claim Cluster UR-<NNN>: <subject>
+
+| Field | Value |
+|-------|-------|
+| **Claim (Model A)** | <claim text> |
+| **Claim (Model B)** | <claim text> |
+| **Match Confidence** | uncertain |
+| **Reason** | <why the match is uncertain> |
+| **Reviewer Action Needed** | Classify as: Consensus match / Majority match / Distinct claims (Singular) / Contradictory |
+
+If no uncertain clusters exist, state "None identified."
+
+## 7. Portability Scorecard
+
+### Per-Claim-Type Breakdown
+
+| Claim Type | Total | Consensus | Majority | Singular | Contradictory | Type Score |
+|------------|-------|-----------|----------|----------|---------------|------------|
+| finding | <N> | <N> | <N> | <N> | <N> | <score> |
+| recommendation | <N> | <N> | <N> | <N> | <N> | <score> |
+| classification | <N> | <N> | <N> | <N> | <N> | <score> |
+| observation | <N> | <N> | <N> | <N> | <N> | <score> |
+| caveat | <N> | <N> | <N> | <N> | <N> | <score> |
+
+### Per-Model Agreement Rate
+
+| Model | Claims Produced | In Consensus | In Majority | Singular | In Contradiction | Agreement Rate |
+|-------|-----------------|--------------|-------------|----------|------------------|----------------|
+| <model-id> | <N> | <N> | <N> | <N> | <N> | <percent>% |
+
+## 8. Hardening Recommendations
+
+Specific prompt rewrites to improve portability. Ordered by impact
+(Contradictory fixes first, then Singular with high-weight claim types).
+
+### HR-<NNN>: <short title>
+
+- **Target Claim(s)**: <claim cluster IDs>
+- **Current Prompt Language**:
+  > <exact quote>
+- **Problem**: <why this language produces divergence>
+- **Recommended Rewrite**:
+  > <replacement text>
+- **Expected Effect**: <which models change behavior and why>
+- **Confidence**: High · Medium · Low
+
+## 9. Model Notes
+
+Observations about model-specific behavior that no prompt rewrite can
+address. These should be recorded in the template's `model_notes`
+frontmatter if the divergence is persistent.
+
+| Model | Observation | Recommended `model_notes` Entry |
+|-------|-------------|---------------------------------|
+| <model-id> | <observation> | <suggested entry> |
+
+## 10. Model Sufficiency Matrix
+
+*Always include this section. If a reference model is designated,
+include the full sufficiency analysis below. If no reference model was
+specified, include only: "No reference model designated — sufficiency
+analysis not performed."*
+
+**Reference Model**: <model-id>
+**Sufficiency Threshold**: <N>%
+
+### Sufficiency Summary
+
+| Model | Tier | Reproduced | Missing | Extra | Contradicted | Sufficiency Rate | Critical Misses | Status |
+|-------|------|:----------:|:-------:|:-----:|:------------:|:----------------:|:---------------:|--------|
+| <model-id> | reference | — | — | — | — | baseline | — | baseline |
+| <model-id> | <tier> | <N> | <N> | <N> | <N> | <percent>% | <N> | Sufficient · Conditionally Sufficient · Insufficient |
+
+**Minimum Sufficient Model**: <model-id>
+
+### Missing Claim Details
+
+For each model with missing claims, list what was missed:
+
+#### <model-id>
+
+| Baseline Claim | Type | Impact | Analysis |
+|----------------|------|--------|----------|
+| <claim text> | finding · recommendation · observation · caveat | Critical miss · Minor miss | <analysis> |
+
+### Extra Claim Details
+
+For each model with extra claims not in the baseline:
+
+#### <model-id>
+
+| Extra Claim | Type | Classification | Analysis |
+|-------------|------|----------------|----------|
+| <claim text> | finding · recommendation · observation · caveat | Valid addition · Hallucination · Noise | <analysis> |
+
+### Cost-Efficiency Recommendation
+
+<1–3 sentences recommending which model to use for this prompt,
+balancing sufficiency rate, cost, and speed. If the minimum sufficient
+model matches the reference model, state that no cheaper alternative
+produced equivalent output and recommend prompt hardening to enable
+cheaper models.>
+```
+
+## Formatting Rules
+
+1. **Claim IDs** use prefixes: `CC-` (Consensus), `MC-` (Majority),
+   `SC-` (Singular), `XC-` (Contradictory), `UR-` (Uncertain / Needs
+   Review), followed by a zero-padded three-digit number.
+2. **Hardening Recommendation IDs** use the `HR-` prefix.
+3. **Score precision**: Report the portability score to two decimal
+   places.
+4. 
**Ordering**: Within each section, order claims by type priority + (`finding` > `recommendation` > `classification` > `observation` > + `caveat`), then by claim cluster number. +5. **Brevity in summaries**: Per-model output summaries should be + concise (3–5 lines each). The full raw outputs are not included in + the report — only claim extractions. +6. **Quote fidelity**: Prompt regions cited in divergence analysis + must be exact quotes, not paraphrases. diff --git a/manifest.yaml b/manifest.yaml index fc37ab6..e096815 100644 --- a/manifest.yaml +++ b/manifest.yaml @@ -631,6 +631,14 @@ protocols: copying from external sources, screens for confidential or internal-only content, and verifies license compliance. + - name: prompt-portability-evaluation + path: protocols/reasoning/prompt-portability-evaluation.md + description: > + Systematic methodology for evaluating prompt portability across + LLM models. Decomposes outputs into atomic claims, performs + cross-model semantic matching, and classifies consensus levels + to identify fragile prompt language. + formats: - name: requirements-doc path: formats/requirements-doc.md @@ -845,6 +853,14 @@ formats: notes, a presentation timeline, and an optional demo plan. All artifacts form a cohesive presentation kit. + - name: portability-report + path: formats/portability-report.md + produces: portability-report + description: > + Output format for prompt portability evaluation reports. Structures + cross-model comparison results as claim-level consensus analysis + with portability scoring and hardening recommendations. + taxonomies: - name: stack-lifetime-hazards path: taxonomies/stack-lifetime-hazards.md @@ -1195,6 +1211,18 @@ templates: protocols: [anti-hallucination, self-verification, prompt-determinism-analysis] format: investigation-report + - name: evaluate-prompt-portability + path: templates/evaluate-prompt-portability.md + description: > + Evaluate a PromptKit-assembled prompt's portability across LLM + models. 
Runs the prompt against multiple models with a golden + input, decomposes outputs into semantic claims, performs + cross-model consensus analysis, and produces a portability + report with hardening recommendations. + persona: specification-analyst + protocols: [anti-hallucination, self-verification, prompt-portability-evaluation] + format: portability-report + standards: - name: extract-rfc-requirements path: templates/extract-rfc-requirements.md diff --git a/protocols/reasoning/prompt-portability-evaluation.md b/protocols/reasoning/prompt-portability-evaluation.md new file mode 100644 index 0000000..3b5e701 --- /dev/null +++ b/protocols/reasoning/prompt-portability-evaluation.md @@ -0,0 +1,267 @@ + + + +--- +name: prompt-portability-evaluation +type: reasoning +description: > + Systematic methodology for evaluating prompt portability across LLM + models. Decomposes model outputs into atomic claims, performs + cross-model semantic matching, and classifies consensus levels to + identify fragile prompt language. +applicable_to: + - evaluate-prompt-portability +--- + +# Protocol: Prompt Portability Evaluation + +Apply this protocol when evaluating whether a prompt produces +semantically equivalent outputs across different LLM models. The unit +of comparison is the **claim** — an atomic assertion the output makes — +not the raw text. Two outputs that say the same thing in different words +are equivalent; two outputs with identical structure but different +conclusions are divergent. + +## Phase 1: Output Collection + +1. **Validate inputs.** Confirm the assembled prompt is non-empty and + the golden input is non-empty. If either is missing, stop and report + the error. +2. **Enumerate target models.** List the models to evaluate. If the + user did not provide a list, use the default set: + `claude-sonnet-4.5`, `gpt-4.1`, `claude-haiku-4.5`. +3. **Execute the prompt against each model.** For each model: + - Provide the identical assembled prompt and golden input. 
+ - Use the same system context and tool availability where possible. + - Record the model identifier and the complete raw output. +4. **Launch evaluations in parallel** when the execution environment + supports it (e.g., parallel sub-agents with model overrides). Do NOT + run models sequentially if parallel execution is available. + +## Phase 2: Claim Extraction + +For each model's raw output, extract a set of **atomic claims**. A claim +is a single, self-contained assertion that the output makes. Apply the +same extraction procedure identically to every output. + +1. **Read the output end-to-end.** Identify every discrete assertion, + finding, recommendation, observation, or caveat. +2. **Normalize each claim** into a structured record: + + | Field | Description | + |-------|-------------| + | `claim_id` | Sequential ID within this model's output (e.g., `M1-C001`) | + | `claim_text` | The normalized assertion in declarative form | + | `section` | Which output section contains this claim | + | `type` | `finding` · `recommendation` · `observation` · `caveat` · `classification` | + | `specificity` | `concrete` (cites evidence/location) · `general` (abstract statement) | + +3. **Granularity rule.** Each claim must be atomic — it asserts exactly + one thing. If a sentence contains two assertions ("X is true and Y + is also true"), split into two claims. +4. **Extraction exclusions.** Do NOT extract: + - Boilerplate preambles ("I'll analyze this code for…") + - Section headings or structural markers + - Restatements of the input prompt or golden input + - Meta-commentary about the model's own process + +## Phase 3: Cross-Model Claim Matching + +Compare claim sets across all models to identify semantic equivalences. + +1. **Build a claim universe.** Collect all claims from all models into + a single pool. +2. 
**Pairwise semantic matching.** For each claim from model A, check + every claim from every other model: + - **Match**: The two claims assert the same thing, even if worded + differently. Evidence: they would have the same truth value in all + plausible interpretations of the golden input. + - **Partial match**: The claims overlap but one is more specific or + covers a subset of the other's assertion. + - **No match**: The claims assert different things. + - **Contradiction**: The claims assert mutually exclusive things + about the same subject. +3. **Cluster matched claims.** Group semantically equivalent claims + into **claim clusters**. Each cluster represents one unique + assertion that one or more models made. +4. **Record match confidence.** For each match decision, note the + confidence: `definite` (identical meaning), `likely` (same meaning, + different framing), `uncertain` (possibly the same, possibly + different). + +## Phase 4: Consensus Classification + +Classify each claim cluster by how many models produced it. + +| Classification | Criterion | Interpretation | +|----------------|-----------|----------------| +| **Consensus** | All models produced this claim | Semantically stable — the prompt reliably elicits this assertion | +| **Majority** | >50% of models (but not all) produced this claim | Likely valid but not universally elicited — prompt may be ambiguous | +| **Singular** | Exactly one model produced this claim | Possible hallucination, unique insight, or model-specific interpretation | +| **Contradictory** | Two or more models assert mutually exclusive things | The prompt is ambiguous on this point — different models resolve the ambiguity differently | + +Rules: +- A claim cluster with any `uncertain` match must be placed in a + **Manual Review** bucket rather than auto-classified as Consensus, + Majority, Singular, or Contradictory. 
+- Manual Review clusters are **excluded from the portability score** + until a human reviewer resolves the uncertain match and assigns the + cluster to a standard classification. +- Report Manual Review clusters in a separate portability report + section named **Uncertain / Needs Review**. Do not include them under + the standard classification counts until they are resolved. +- Singular claims from high-capability models are not automatically + hallucinations — they may represent deeper analysis that other models + missed. Note the model capability tier. +- Contradictory claims are the highest-priority signal. Always trace + these to specific prompt language in Phase 5. + +## Phase 5: Divergence Root Cause Analysis + +For each Singular or Contradictory claim cluster: + +1. **Identify the prompt region** that the claim responds to. Which + instruction, protocol phase, or format requirement produced this + claim? +2. **Analyze the prompt language.** What about the instruction is + ambiguous, underspecified, or open to interpretation? + - Vague quantifiers ("several", "a few") + - Subjective adjectives ("important", "significant") + - Missing scope bounds ("analyze the code" — which code?) + - Implicit assumptions (domain knowledge the prompt assumes) + - Competing instructions (two directives that could conflict) +3. 
**Classify the divergence cause:** + + | Cause | Description | + |-------|-------------| + | `ambiguous-instruction` | The prompt instruction has multiple valid interpretations | + | `underspecified-scope` | The prompt does not bound what to examine or how deeply | + | `model-capability-gap` | One model lacks the capability to follow the instruction | + | `hallucination` | One model fabricated a claim with no basis in the input | + | `depth-variation` | Models analyzed to different depths — all are correct, but some are more thorough | + | `format-interpretation` | Models interpreted the output format requirements differently | + +## Phase 6: Portability Scoring + +Compute a portability score for the evaluated prompt. + +1. **Claim-level scoring.** For each claim cluster: + - Consensus = 1.0 + - Majority = 0.5 + - Singular = 0.0 + - Contradictory = −1.0 + + **Canonical cluster type rule.** When claims in a cluster were + assigned different types across models (e.g., one model calls it a + `finding`, another calls it a `recommendation`), assign the cluster + the type with the **highest weight** among its member claims. If + two types share the same weight, prefer the type that appears in + the majority of member claims. If still tied, the arbiter assigns + the canonical type. Record the original per-model types in the + claim cluster details for transparency. + +2. 
**Aggregate score.** First compute the weighted mean of all claim + cluster scores, weighted by claim type priority: + - `finding` weight = 3 (findings that differ are high-impact) + - `recommendation` weight = 2 + - `classification` weight = 2 + - `observation` weight = 1 + - `caveat` weight = 1 + + Then normalize the weighted mean from the `[-1.0, 1.0]` range into + the final portability score in the `[0.0, 1.0]` range using: + + `portability_score = (raw_weighted_mean + 1.0) / 2.0` + + This preserves the stronger penalty for contradictory claims while + ensuring the reported portability score cannot be negative or exceed + `1.0`. + +3. **Interpret the score:** + + | Score Range | Rating | Interpretation | + |-------------|--------|----------------| + | ≥ 0.85 | **Portable** | Prompt produces semantically equivalent output across models | + | 0.60–0.84 | **Mostly Portable** | Core findings are stable; peripheral claims vary | + | 0.35–0.59 | **Fragile** | Significant divergence — prompt hardening needed | + | < 0.35 | **Model-Dependent** | Output varies substantially — prompt needs major revision | + +## Phase 7: Hardening Recommendations + +For each Singular or Contradictory claim cluster, propose a specific +prompt rewrite that would move the claim toward Consensus. + +1. **State the current prompt language** (exact quote). +2. **Explain why it produces divergence** (from Phase 5 analysis). +3. **Propose a rewrite** that eliminates the ambiguity. The rewrite + must be concrete — not "make this more specific" but the actual + replacement text. +4. **Predict the effect** — which models would change behavior and why. + +Rules: +- Rewrites must not change the prompt's intent — only its precision. +- Rewrites must follow PromptKit conventions (imperative mood, numbered + phases, explicit scope bounds). +- If a divergence is caused by `model-capability-gap`, note that no + prompt rewrite can fix it. Instead recommend a `model_notes` entry. 
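The Phase 6 scoring rules above reduce to a short calculation. A minimal sketch in Python (the function and variable names are illustrative, not part of the protocol; Manual Review clusters are assumed to be already excluded, per Phase 4):

```python
# Cluster scores (Phase 6, step 1) and claim-type weights (Phase 6, step 2).
CLUSTER_SCORES = {"Consensus": 1.0, "Majority": 0.5,
                  "Singular": 0.0, "Contradictory": -1.0}
TYPE_WEIGHTS = {"finding": 3, "recommendation": 2, "classification": 2,
                "observation": 1, "caveat": 1}

def portability_score(clusters):
    """clusters: (classification, canonical_type) pairs; Manual Review
    clusters must already be excluded before calling this."""
    total_weight = sum(TYPE_WEIGHTS[t] for _, t in clusters)
    raw = sum(TYPE_WEIGHTS[t] * CLUSTER_SCORES[c] for c, t in clusters) / total_weight
    # Normalize the weighted mean from [-1.0, 1.0] into [0.0, 1.0],
    # reported to two decimal places.
    return round((raw + 1.0) / 2.0, 2)

clusters = [
    ("Consensus", "finding"),        # contributes 3 * 1.0
    ("Majority", "recommendation"),  # contributes 2 * 0.5
    ("Contradictory", "finding"),    # contributes 3 * -1.0
    ("Singular", "observation"),     # contributes 1 * 0.0
]
print(portability_score(clusters))  # (3.0 + 1.0 - 3.0 + 0.0) / 9 -> 0.56
```

Against the interpretation table, 0.56 lands in the Fragile band — the single contradictory finding, carrying both the −1.0 score and the weight of 3, pulls the score down sharply, which is the intended penalty.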
+ +## Phase 8: Model Sufficiency Analysis (Reference Model Mode) + +This phase executes only when a **reference model** is designated. Skip +this phase entirely if no reference model is specified. + +When a reference model is designated, its claim set becomes the +**baseline** — the ground truth against which all other models are +measured. This shifts the analysis from "do models agree?" (consensus) +to "does model X reproduce the reference model's output?" (sufficiency). + +1. **Designate the baseline claim set.** The reference model's + extracted claims from Phase 2 become the baseline. Every claim in + the baseline is a **required claim**. + +2. **Per-model sufficiency scoring.** For each non-reference model, + compute: + + | Metric | Definition | + |--------|------------| + | **Reproduced** | Claims from the baseline that this model also produced (matched in Phase 3) | + | **Missing** | Claims from the baseline that this model did NOT produce | + | **Extra** | Claims this model produced that are NOT in the baseline | + | **Contradicted** | Claims where this model asserts the opposite of the baseline | + | **Sufficiency Rate** | `reproduced / total_baseline_claims × 100%` | + +3. **Classify missing claims by impact.** For each missing claim, + assess its importance: + - **Critical miss**: A finding or recommendation that, if absent, + would leave a significant gap (e.g., missing a security + vulnerability the reference model found) + - **Minor miss**: An observation or caveat whose absence does not + meaningfully degrade the output + +4. **Classify extra claims.** For each claim the model produced that + the reference did not: + - **Valid addition**: A correct claim the reference model missed + (depth-variation in the model's favor) + - **Hallucination**: A claim with no basis in the input + - **Noise**: A low-value observation that adds length without + insight + +5. 
**Determine sufficiency.** A model is **sufficient** for this
   prompt if:
   - Sufficiency rate meets or exceeds the user-specified threshold
     (default: 90%)
   - Zero critical misses
   - Zero contradicted claims

   A model is **conditionally sufficient** if:
   - Sufficiency rate meets the threshold
   - Has critical misses, but all are traceable to a
     `model-capability-gap` divergence cause (the divergence cannot
     be fixed by prompt hardening)

   A model is **insufficient** if it meets neither definition.

6. **Produce the model sufficiency matrix.** Rank models by cost tier
   (if known) and sufficiency rate. The **minimum sufficient model**
   is the cheapest model that meets the sufficiency threshold with
   zero critical misses and zero contradictions.
diff --git a/templates/evaluate-prompt-portability.md b/templates/evaluate-prompt-portability.md
new file mode 100644
index 0000000..8c09370
--- /dev/null
+++ b/templates/evaluate-prompt-portability.md
@@ -0,0 +1,216 @@
+
+
+
+---
+name: evaluate-prompt-portability
+mode: interactive
+description: >
+  Evaluate a PromptKit-assembled prompt's portability across LLM
+  models. Runs the prompt against multiple models with a golden
+  input, decomposes outputs into semantic claims, performs
+  cross-model consensus analysis, and produces a portability report
+  with hardening recommendations.
+persona: specification-analyst
+protocols:
+  - guardrails/anti-hallucination
+  - guardrails/self-verification
+  - reasoning/prompt-portability-evaluation
+format: portability-report
+params:
+  assembled_prompt: "The complete assembled prompt to evaluate (full text or file path)"
+  golden_input: "The deterministic test input to provide to each model along with the prompt"
+  models: "Comma-separated list of model identifiers to evaluate. Default: claude-sonnet-4.5, gpt-4.1, claude-haiku-4.5"
+  arbiter_model: "Model to use for claim extraction and semantic matching. 
Default: use the current session model" + reference_model: "Optional — designate one model as the ground-truth baseline for sufficiency analysis. When set, the report includes a model sufficiency matrix showing which cheaper models reproduce the reference model's output. Omit for pure consensus analysis." + sufficiency_threshold: "Minimum percentage of reference model claims a cheaper model must reproduce to be considered sufficient. Default: 90" +input_contract: null +output_contract: + type: portability-report + description: > + A structured portability report documenting claim-level consensus + across models, divergence analysis, portability scoring, and + hardening recommendations. +--- + +# Task: Evaluate Prompt Portability + +You are tasked with evaluating whether a prompt produces semantically +equivalent outputs across different LLM models. You do NOT compare raw +text — you compare the **semantic claims** each model's output makes. + +## Inputs + +**Assembled Prompt**: +{{assembled_prompt}} + +**Golden Input**: +{{golden_input}} + +**Target Models**: {{models}} (if blank, use: `claude-sonnet-4.5`, +`gpt-4.1`, `claude-haiku-4.5`) + +**Arbiter Model**: {{arbiter_model}} (if blank, you are the arbiter) + +**Reference Model**: {{reference_model}} (if blank, skip sufficiency +analysis — perform consensus analysis only) + +**Sufficiency Threshold**: {{sufficiency_threshold}}% (if blank, use 90%) + +## Instructions + +### Step 1: Input Validation + +1. Confirm the assembled prompt is non-empty. If it is a file path, + read the file. If the content is empty, stop and report the error. +2. Confirm the golden input is non-empty. +3. Parse the model list. If any model identifier is not recognized by + the execution environment, warn the user and ask whether to skip + it or substitute. +4. If a reference model is specified, confirm it is included in the + model list. If not, add it automatically and inform the user. 
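The model-list handling in Step 1 can be sketched as follows (a minimal sketch; the function name and list representation are assumptions, and a real implementation would also surface the warnings and confirmations described above):

```python
# Default model set from the template parameters.
DEFAULT_MODELS = ["claude-sonnet-4.5", "gpt-4.1", "claude-haiku-4.5"]

def resolve_models(models_param, reference_model=None):
    """Parse the comma-separated `models` parameter, falling back to the
    default list and auto-including the reference model (Step 1.4)."""
    if models_param and models_param.strip():
        models = [m.strip() for m in models_param.split(",") if m.strip()]
    else:
        models = list(DEFAULT_MODELS)
    if reference_model and reference_model not in models:
        models.append(reference_model)  # added automatically; inform the user
    return models

print(resolve_models("", reference_model="claude-haiku-4.5"))
# reference model already present in the defaults, so nothing is appended
```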
+ +### Step 2: Fan-Out Execution + +Execute the assembled prompt against each target model using the +golden input. The goal is to collect raw outputs from every model +for the same prompt + input combination. + +1. For each model in the target list, launch a parallel execution: + - Provide the full assembled prompt as the system/task instruction. + - Provide the golden input as the user message or task context. + - Record the complete raw output. +2. **Execution mechanism**: Use the execution environment's parallel + agent or subprocess capabilities. For example, in environments + with sub-agent support, launch one agent per model with the model + override parameter. In environments without parallel execution, + run sequentially. +3. Wait for all executions to complete. If any model fails (timeout, + API error, content filter), record the failure and proceed with + the remaining models. Count the number of successful model + outputs after all executions finish. +4. A minimum of 2 successful model outputs is required to produce a + meaningful comparison. If fewer than 2 models succeed, do **not** + continue to claim extraction, semantic matching, consensus + analysis, portability scoring, or hardening recommendations. + Instead, stop and produce an abbreviated report that includes: + - the full target model list, + - which models succeeded, + - which models failed, + - the failure reason for each failed model if known, and + - a clear statement that the evaluation ended early because there + were insufficient successful outputs for cross-model comparison. +5. Only proceed to Step 3 if at least 2 model executions succeeded. + +### Step 3: Claim Extraction + +Apply Phase 2 of the prompt-portability-evaluation protocol to each +model's raw output. + +1. For each model output, extract every atomic claim into the + normalized claim record structure: `claim_id`, `claim_text`, + `section`, `type`, `specificity`. +2. 
Use the same extraction approach for all outputs — the arbiter + (you, or the designated arbiter model) processes each output + identically. +3. Present the claim counts per model to the user before proceeding. + If any model produced zero claims, flag it as a potential failure. + +### Step 4: Cross-Model Semantic Matching + +Apply Phase 3 of the prompt-portability-evaluation protocol. + +1. Build the claim universe from all model outputs. +2. Perform pairwise semantic matching across all claims. +3. Cluster matched claims and record match confidence. +4. If any claim cluster contains an `uncertain` semantic match, + place that cluster into a Manual Review bucket rather than a + scored classification bucket. +5. Exclude all Manual Review clusters from consensus/scoring + calculations in subsequent steps. +6. Record all Manual Review clusters for reporting under the + format's `Uncertain / Needs Review` section, and explicitly call + them out to the user before proceeding. + +### Step 5: Consensus Classification + +Apply Phase 4 of the prompt-portability-evaluation protocol. + +1. Classify each non-Manual-Review claim cluster: Consensus, + Majority, Singular, or Contradictory. +2. Present the classification summary to the user: + - How many Consensus, Majority, Singular, and Contradictory + clusters were found. + - How many clusters are in Manual Review (excluded from scoring). + - Highlight any Contradictory clusters immediately — these are the + highest-priority signal. + +### Step 6: Divergence Analysis + +Apply Phase 5 of the prompt-portability-evaluation protocol. + +For each Singular and Contradictory claim cluster: +1. Identify the prompt region that produced the divergence. +2. Classify the divergence cause. +3. Assess whether a prompt rewrite could address it. + +### Step 7: Scoring and Reporting + +Apply Phase 6 and Phase 7 of the prompt-portability-evaluation protocol. + +1. Compute the portability score. +2. 
Generate hardening recommendations for each fixable divergence. +3. If a reference model is specified, apply Phase 8 (Model + Sufficiency Analysis): + - Compute per-model sufficiency rates against the reference. + - Classify missing and extra claims. + - Determine sufficiency status for each model. + - Identify the minimum sufficient model. +4. Produce the full portability report in the portability-report + format, including the Model Sufficiency Matrix (section 10) if + a reference model was specified. + +### Step 8: Interactive Review + +After producing the report: + +1. Present the portability score and the top 3 most impactful + findings to the user. +2. Ask if they want to: + - Drill into specific divergent claims + - Apply the hardening recommendations to the original prompt + - Re-run the evaluation with additional models + - Export the report + +## Complementary Templates + +This template evaluates a prompt's portability *empirically* — by +running it and comparing outputs. For *static* analysis of prompt +language precision, use the `lint-prompt` template with the +`prompt-determinism-analysis` protocol. The recommended workflow is: + +1. **Lint first** (`lint-prompt`) — identify and fix determinism issues + in the prompt language statically. +2. **Evaluate second** (`evaluate-prompt-portability`) — verify the + fixes improved cross-model consistency empirically. + +## Non-Goals + +- This template does NOT measure output *quality* — only cross-model + *consistency*. A prompt that produces consistently wrong output + across all models will score as "Portable." +- This template does NOT modify the evaluated prompt. It produces + recommendations. The user applies them. +- This template does NOT benchmark model performance or rank models. + It evaluates the prompt's sensitivity to model choice. 
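The sufficiency determination applied in Step 7 (protocol Phase 8, step 5) can be sketched as a small decision function (names are illustrative; the protocol states the zero-contradiction requirement only for the fully sufficient tier, and this sketch follows that wording literally):

```python
def sufficiency_status(reproduced, baseline_total, critical_misses,
                       contradicted, all_misses_capability_gap=False,
                       threshold=90.0):
    """Classify one non-reference model against the baseline claim set."""
    rate = 100.0 * reproduced / baseline_total
    if rate >= threshold and critical_misses == 0 and contradicted == 0:
        return rate, "Sufficient"
    if rate >= threshold and critical_misses > 0 and all_misses_capability_gap:
        return rate, "Conditionally Sufficient"
    return rate, "Insufficient"

print(sufficiency_status(reproduced=18, baseline_total=20,
                         critical_misses=0, contradicted=0))
# 18/20 = 90.0%, no critical misses, no contradictions -> Sufficient
```

If the conditional tier should also exclude contradicted claims, add `contradicted == 0` to the second branch.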
+ +## Quality Checklist + +Before finalizing the report, verify: + +- [ ] All target models were executed (or failures documented) +- [ ] Claim extraction used the same procedure for all outputs +- [ ] Every claim cluster has a classification with justification +- [ ] Contradictory claims cite the exact prompt language causing divergence +- [ ] Hardening recommendations are concrete rewrites, not vague advice +- [ ] The portability score computation is shown (not just the result) +- [ ] Model Notes section is populated for capability-gap divergences