
Commit bb7babe

Alan Jowett and Copilot authored
Add evaluate-prompt-portability template, protocol, and format (#239)
* Add evaluate-prompt-portability template, protocol, and format

  Adds three new PromptKit components for cross-LLM prompt portability
  evaluation (addresses #127 Phase 1):

  - Protocol: prompt-portability-evaluation — 7-phase claim-level
    consensus analysis methodology (output collection, claim extraction,
    semantic matching, consensus classification, divergence analysis,
    scoring, and hardening recommendations)
  - Format: portability-report — 9-section structured report covering
    evaluation context, per-model summaries, consensus core, majority
    claims, divergent claims (singular + contradictory), scorecard,
    hardening recommendations, and model notes
  - Template: evaluate-prompt-portability — interactive template that
    orchestrates fan-out execution across multiple LLM models, collects
    outputs, decomposes them into atomic semantic claims, performs
    cross-model consensus analysis, and produces a portability report

  Key design: comparison is semantic (claim-level), not textual. Two
  models producing the same assertions in different words score as
  Consensus. Contradictory claims (mutually exclusive assertions) are
  the highest-priority signal, traced to specific ambiguous prompt
  language with concrete rewrite recommendations.

  Complements the existing lint-prompt template — lint statically first,
  then evaluate empirically.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR review feedback: scoring, thresholds, error handling

  - Fix Majority threshold from >=50% to >50% to avoid ties with even
    model counts
  - Normalize portability score from [-1, 1] to [0, 1] via
    (raw_weighted_mean + 1.0) / 2.0 so contradictory claims cannot
    produce negative scores
  - Define explicit Manual Review bucket for uncertain claim matches:
    excluded from scoring, reported in new Uncertain / Needs Review
    section in the portability-report format
  - Add fail-stop behavior when <2 models succeed: produce abbreviated
    report documenting failures instead of misleading partial analysis
  - Add UR- claim ID prefix to formatting rules

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add reference model sufficiency analysis for model selection

  Extends the portability evaluation with a new mode: when a reference
  model is designated, its claims become the ground-truth baseline and
  each cheaper model is scored on how well it reproduces that baseline.

  Protocol (Phase 8 - Model Sufficiency Analysis):
  - Per-model sufficiency rate = reproduced / total baseline claims
  - Missing claims classified as critical miss vs minor miss
  - Extra claims classified as valid addition / hallucination / noise
  - Three-tier sufficiency status: sufficient, conditionally sufficient,
    insufficient (based on threshold, critical misses, contradictions)
  - Identifies the minimum sufficient model (cheapest that meets
    threshold with zero critical misses)

  Format (Section 10 - Model Sufficiency Matrix):
  - Reference model + threshold display
  - Per-model sufficiency table with tier, rates, and status
  - Missing and extra claim detail tables
  - Cost-efficiency recommendation

  Template:
  - New params: reference_model (optional), sufficiency_threshold
    (default 90%)
  - Input validation ensures reference model is in the model list
  - Step 7 conditionally applies Phase 8 when reference model is set

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix template Steps 4-5 for Manual Review bucket consistency

  Steps 4 and 5 were inconsistent with the protocol's Manual Review
  bucket rules for uncertain matches. Step 4 only flagged uncertain
  matches instead of placing them in a Manual Review bucket. Step 5
  classified all clusters without excluding Manual Review clusters from
  scoring.

  Fixed:
  - Step 4: uncertain matches now placed in Manual Review bucket,
    excluded from scored classification, reported under Uncertain /
    Needs Review section
  - Step 5: classifies only non-Manual-Review clusters, includes Manual
    Review count in the summary presented to user

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix heading nesting, section omission rule, and cluster type ambiguity

  - Fix 6c heading from ## to ### to match 6a/6b nesting under
    section 6 (Divergent Claims)
  - Section 10 (Model Sufficiency Matrix) now always included per the
    format's "do not omit any section" rule, with a placeholder when no
    reference model is designated
  - Add canonical cluster type rule for scoring: when models assign
    different types to semantically matched claims, use the
    highest-weight type, break ties by majority, then arbiter decides.
    Original per-model types preserved for transparency.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Alan Jowett <alan.jowett@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
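The scoring rules described in the commit message are compact enough to sketch in code. The Python fragment below is illustrative only: the weight values, function names, and the assumption that Contradictory clusters are detected separately by the arbiter are mine, not part of this commit. It shows the strict >50% Majority threshold, the Manual Review exclusion, and the [-1, 1] to [0, 1] normalization.

```python
from statistics import mean

# Placeholder classification weights on [-1, 1]; the real values are
# defined in the prompt-portability-evaluation protocol.
WEIGHTS = {"Consensus": 1.0, "Majority": 0.5,
           "Singular": -0.25, "Contradictory": -1.0}

def classify(models_with_claim: int, total_models: int) -> str:
    """Classify a claim cluster by how many models produced it.
    Contradictory clusters are identified separately by the arbiter."""
    if models_with_claim == total_models:
        return "Consensus"
    if models_with_claim / total_models > 0.5:  # strictly >50%, so even
        return "Majority"                       # model counts cannot tie
    return "Singular"

def portability_score(classifications: list[str]) -> float:
    """Weighted mean over scored clusters, normalized to [0, 1].
    Manual Review clusters are excluded from scoring per the protocol."""
    scored = [c for c in classifications if c != "Manual Review"]
    if not scored:
        raise ValueError("no scorable claim clusters")
    raw_weighted_mean = mean(WEIGHTS[c] for c in scored)
    return (raw_weighted_mean + 1.0) / 2.0  # [-1, 1] -> [0, 1]
```

With these placeholder weights, a report with three Consensus clusters and one Contradictory cluster would have a raw weighted mean of (1 + 1 + 1 - 1) / 4 = 0.5 and a normalized score of (0.5 + 1) / 2 = 0.75.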
1 parent 8363a14 commit bb7babe

4 files changed

Lines changed: 771 additions & 0 deletions


formats/portability-report.md

Lines changed: 260 additions & 0 deletions
@@ -0,0 +1,260 @@
<!-- SPDX-License-Identifier: MIT -->
<!-- Copyright (c) PromptKit Contributors -->

---
name: portability-report
type: format
description: >
  Output format for prompt portability evaluation reports. Structures
  cross-model comparison results as claim-level consensus analysis
  with portability scoring and hardening recommendations.
produces: portability-report
---

# Format: Portability Report

The output MUST be a structured portability report documenting how a
prompt performs across multiple LLM models. The unit of analysis is the
**claim** — an atomic assertion each model's output makes. The report
compares claims semantically, not textually.

Do not omit any section. If a section has no content, state
"None identified."

## Document Structure

```markdown
# <Prompt Name> — Portability Report

## 1. Evaluation Context

| Field | Value |
|-------|-------|
| **Prompt** | Name or description of the evaluated prompt |
| **Source Template** | PromptKit template used to assemble the prompt (if applicable) |
| **Golden Input** | Description of the test input provided |
| **Models Evaluated** | Comma-separated list of model identifiers |
| **Arbiter Model** | Model used for claim extraction and matching |
| **Evaluation Date** | When the evaluation was performed |

## 2. Portability Summary

**Overall Score**: <score> / 1.00 — **<Rating>**

| Metric | Value |
|--------|-------|
| **Total Unique Claims** | <count of claim clusters> |
| **Consensus Claims** | <count> (<percentage>%) |
| **Majority Claims** | <count> (<percentage>%) |
| **Singular Claims** | <count> (<percentage>%) |
| **Contradictory Claims** | <count> (<percentage>%) |

<2–4 sentence interpretation of the overall portability posture.>

## 3. Per-Model Output Summaries

For each model, provide a brief characterization of its output:

### Model: <model-identifier>

- **Output Length**: <word count or approximate size>
- **Sections Produced**: <list of output sections the model generated>
- **Claims Extracted**: <count>
- **Notable Characteristics**: <1–2 sentences on style, depth, or
  approach differences>

## 4. Consensus Core

Claims that ALL models produced. These represent the semantically stable
output of this prompt.

### Claim Cluster CC-<NNN>: <Short Claim Title>

| Field | Value |
|-------|-------|
| **Claim** | <normalized claim text> |
| **Type** | finding · recommendation · observation · caveat · classification |
| **Classification** | Consensus |
| **Models** | All (<list>) |

<If models expressed this claim with notably different emphasis or
framing, note the variation briefly.>

## 5. Majority Claims

Claims that most (but not all) models produced.

### Claim Cluster MC-<NNN>: <Short Claim Title>

| Field | Value |
|-------|-------|
| **Claim** | <normalized claim text> |
| **Type** | finding · recommendation · observation · caveat · classification |
| **Classification** | Majority |
| **Models Present** | <list of models that produced this claim> |
| **Models Absent** | <list of models that did NOT produce this claim> |
| **Likely Cause** | <why absent models omitted this — depth variation, scope interpretation, etc.> |

## 6. Divergent Claims

Claims produced by only one model, or where models contradicted each
other. This is the highest-signal section of the report.

### 6a. Singular Claims

### Claim Cluster SC-<NNN>: <Short Claim Title>

| Field | Value |
|-------|-------|
| **Claim** | <normalized claim text> |
| **Type** | finding · recommendation · observation · caveat · classification |
| **Classification** | Singular |
| **Source Model** | <the one model that produced this claim> |
| **Divergence Cause** | ambiguous-instruction · underspecified-scope · model-capability-gap · hallucination · depth-variation · format-interpretation |
| **Prompt Region** | <exact quote of the prompt language that produced this claim> |
| **Analysis** | <why this model produced the claim and others didn't> |

### 6b. Contradictory Claims

### Claim Cluster XC-<NNN>: <Short Claim Title>

| Field | Value |
|-------|-------|
| **Subject** | <what the contradiction is about> |
| **Classification** | Contradictory |

| Model | Assertion |
|-------|-----------|
| <model-A> | <what model A asserts> |
| <model-B> | <what model B asserts (contradicts A)> |

- **Divergence Cause**: <cause classification>
- **Prompt Region**: <exact quote of the ambiguous prompt language>
- **Analysis**: <why the prompt produces contradictory interpretations>

### 6c. Uncertain / Needs Review

Claim clusters where semantic matching confidence was `uncertain`.
These are excluded from the portability score until a human reviewer
resolves the match and assigns a standard classification.

### Claim Cluster UR-<NNN>: <Short Claim Title>

| Field | Value |
|-------|-------|
| **Claim (Model A)** | <claim text from one model> |
| **Claim (Model B)** | <claim text from another model> |
| **Match Confidence** | uncertain |
| **Reason** | <why the match is uncertain — similar but not clearly equivalent> |
| **Reviewer Action Needed** | Classify as: Consensus match / Majority match / Distinct claims (Singular) / Contradictory |

If no uncertain clusters exist, state "None identified."

## 7. Portability Scorecard

### Per-Claim-Type Breakdown

| Claim Type | Total | Consensus | Majority | Singular | Contradictory | Type Score |
|------------|-------|-----------|----------|----------|---------------|------------|
| finding | | | | | | |
| recommendation | | | | | | |
| classification | | | | | | |
| observation | | | | | | |
| caveat | | | | | | |

### Per-Model Agreement Rate

| Model | Claims Produced | In Consensus | In Majority | Singular | In Contradiction | Agreement Rate |
|-------|-----------------|--------------|-------------|----------|------------------|----------------|
| <model> | | | | | | |

## 8. Hardening Recommendations

Specific prompt rewrites to improve portability. Ordered by impact
(Contradictory fixes first, then Singular with high-weight claim types).

### HR-<NNN>: <Short Description>

- **Target Claim(s)**: <claim cluster IDs affected>
- **Current Prompt Language**:
  > <exact quote from the prompt>
- **Problem**: <why this language produces divergence>
- **Recommended Rewrite**:
  > <concrete replacement text>
- **Expected Effect**: <which models would change behavior and how>
- **Confidence**: High · Medium · Low

## 9. Model Notes

Observations about model-specific behavior that no prompt rewrite can
address. These should be recorded in the template's `model_notes`
frontmatter if the divergence is persistent.

| Model | Observation | Recommended `model_notes` Entry |
|-------|-------------|---------------------------------|
| <model> | <behavior> | <suggested YAML> |

## 10. Model Sufficiency Matrix

*Always include this section. If a reference model is designated,
include the full sufficiency analysis below. If no reference model was
specified, include only: "No reference model designated — sufficiency
analysis not performed."*

**Reference Model**: <model identifier>
**Sufficiency Threshold**: <threshold>%

### Sufficiency Summary

| Model | Tier | Reproduced | Missing | Extra | Contradicted | Sufficiency Rate | Critical Misses | Status |
|-------|------|:----------:|:-------:|:-----:|:------------:|:----------------:|:---------------:|--------|
| <reference> | reference | | | | | baseline | | baseline |
| <model> | <tier> | | | | | | | Sufficient · Conditionally Sufficient · Insufficient |

**Minimum Sufficient Model**: <model identifier> — <brief justification>

### Missing Claim Details

For each model with missing claims, list what was missed:

#### <model-identifier>

| Baseline Claim | Type | Impact | Analysis |
|----------------|------|--------|----------|
| <claim text from reference> | finding · recommendation · observation · caveat | Critical miss · Minor miss | <why this model missed it> |

### Extra Claim Details

For each model with extra claims not in the baseline:

#### <model-identifier>

| Extra Claim | Type | Classification | Analysis |
|-------------|------|----------------|----------|
| <claim text> | finding · recommendation · observation · caveat | Valid addition · Hallucination · Noise | <assessment> |

### Cost-Efficiency Recommendation

<1–3 sentences recommending which model to use for this prompt,
balancing sufficiency rate, cost, and speed. If the minimum sufficient
model matches the reference model, state that no cheaper alternative
produced equivalent output and recommend prompt hardening to enable
cheaper models.>
```

## Formatting Rules

1. **Claim IDs** use prefixes: `CC-` (Consensus), `MC-` (Majority),
   `SC-` (Singular), `XC-` (Contradictory), `UR-` (Uncertain / Needs
   Review), followed by a zero-padded three-digit number.
2. **Hardening Recommendation IDs** use `HR-` prefix.
3. **Score precision**: Report the portability score to two decimal
   places.
4. **Ordering**: Within each section, order claims by type priority
   (`finding` > `recommendation` > `classification` > `observation` >
   `caveat`), then by claim cluster number.
5. **Brevity in summaries**: Per-model output summaries should be
   concise (3–5 lines each). The full raw outputs are not included in
   the report — only claim extractions.
6. **Quote fidelity**: Prompt regions cited in divergence analysis
   must be exact quotes, not paraphrases.
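The sufficiency rules behind Section 10 can likewise be sketched. The Python fragment below is a hypothetical illustration: the names (ModelResult, relative_cost) are invented, and the mapping of contradictions onto the three-tier status is one plausible reading, since the commit defines the tiers only as "based on threshold, critical misses, contradictions".

```python
from dataclasses import dataclass

@dataclass
class ModelResult:
    name: str
    reproduced: int        # baseline claims this model also produced
    critical_misses: int   # missing baseline claims judged critical
    contradictions: int    # claims that contradict the baseline
    relative_cost: float   # illustrative cost metric for ranking

def sufficiency_status(r: ModelResult, baseline_total: int,
                       threshold: float = 0.90) -> tuple[float, str]:
    """Sufficiency rate = reproduced / total baseline claims, plus a
    three-tier status. Contradiction handling is an assumed mapping,
    not the protocol's exact rule."""
    rate = r.reproduced / baseline_total
    if rate >= threshold and r.critical_misses == 0:
        if r.contradictions == 0:
            return rate, "Sufficient"
        return rate, "Conditionally Sufficient"
    return rate, "Insufficient"

def minimum_sufficient_model(results: list[ModelResult],
                             baseline_total: int,
                             threshold: float = 0.90) -> ModelResult | None:
    """Cheapest model that meets the threshold with zero critical
    misses, per the commit's definition of minimum sufficient model."""
    ok = [r for r in results
          if r.reproduced / baseline_total >= threshold
          and r.critical_misses == 0]
    return min(ok, key=lambda r: r.relative_cost, default=None)
```

The default threshold of 0.90 mirrors the template's documented sufficiency_threshold default of 90%.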

manifest.yaml

Lines changed: 28 additions & 0 deletions
@@ -631,6 +631,14 @@ protocols:
       copying from external sources, screens for confidential or
       internal-only content, and verifies license compliance.
 
+  - name: prompt-portability-evaluation
+    path: protocols/reasoning/prompt-portability-evaluation.md
+    description: >
+      Systematic methodology for evaluating prompt portability across
+      LLM models. Decomposes outputs into atomic claims, performs
+      cross-model semantic matching, and classifies consensus levels
+      to identify fragile prompt language.
+
 formats:
   - name: requirements-doc
     path: formats/requirements-doc.md
@@ -845,6 +853,14 @@ formats:
       notes, a presentation timeline, and an optional demo plan.
       All artifacts form a cohesive presentation kit.
 
+  - name: portability-report
+    path: formats/portability-report.md
+    produces: portability-report
+    description: >
+      Output format for prompt portability evaluation reports. Structures
+      cross-model comparison results as claim-level consensus analysis
+      with portability scoring and hardening recommendations.
+
 taxonomies:
   - name: stack-lifetime-hazards
     path: taxonomies/stack-lifetime-hazards.md
@@ -1195,6 +1211,18 @@ templates:
     protocols: [anti-hallucination, self-verification, prompt-determinism-analysis]
     format: investigation-report
 
+  - name: evaluate-prompt-portability
+    path: templates/evaluate-prompt-portability.md
+    description: >
+      Evaluate a PromptKit-assembled prompt's portability across LLM
+      models. Runs the prompt against multiple models with a golden
+      input, decomposes outputs into semantic claims, performs
+      cross-model consensus analysis, and produces a portability
+      report with hardening recommendations.
+    persona: specification-analyst
+    protocols: [anti-hallucination, self-verification, prompt-portability-evaluation]
+    format: portability-report
+
 standards:
   - name: extract-rfc-requirements
     path: templates/extract-rfc-requirements.md
