Add evaluate-prompt-portability template, protocol, and format #239
Merged
Alan-Jowett merged 5 commits into microsoft:main on Apr 9, 2026
Conversation
Adds three new PromptKit components for cross-LLM prompt portability evaluation (addresses microsoft#127 Phase 1):

- Protocol: `prompt-portability-evaluation` — a 7-phase claim-level consensus analysis methodology (output collection, claim extraction, semantic matching, consensus classification, divergence analysis, scoring, and hardening recommendations)
- Format: `portability-report` — a 9-section structured report covering evaluation context, per-model summaries, consensus core, majority claims, divergent claims (singular + contradictory), scorecard, hardening recommendations, and model notes
- Template: `evaluate-prompt-portability` — an interactive template that orchestrates fan-out execution across multiple LLM models, collects outputs, decomposes them into atomic semantic claims, performs cross-model consensus analysis, and produces a portability report

Key design: comparison is semantic (claim-level), not textual. Two models producing the same assertions in different words score as Consensus. Contradictory claims (mutually exclusive assertions) are the highest-priority signal, traced to specific ambiguous prompt language with concrete rewrite recommendations.

Complements the existing lint-prompt template: lint statically first, then evaluate empirically.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Adds a new PromptKit evaluation workflow to empirically assess cross-LLM prompt portability by running the same assembled prompt + golden input across multiple models, extracting semantic claims, and reporting consensus/divergence with a portability score and hardening rewrites (Phase 1 of #127’s evaluation framework).
Changes:
- Introduces an interactive `evaluate-prompt-portability` template that orchestrates multi-model execution, claim extraction, semantic matching, scoring, and reporting.
- Adds a `prompt-portability-evaluation` reasoning protocol defining the 7-phase methodology and scoring rubric.
- Adds a `portability-report` output format and registers the new components in `manifest.yaml`.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| templates/evaluate-prompt-portability.md | New interactive template to execute the evaluation workflow end-to-end. |
| protocols/reasoning/prompt-portability-evaluation.md | New protocol specifying claim extraction/matching, consensus classification, scoring, and hardening steps. |
| formats/portability-report.md | New structured 9-section report format for portability evaluation outputs. |
| manifest.yaml | Registers the new protocol, format, and template in the PromptKit manifest. |
- Fix the Majority threshold from >=50% to >50% to avoid ties with even model counts
- Normalize the portability score from [-1, 1] to [0, 1] via (raw_weighted_mean + 1.0) / 2.0 so contradictory claims cannot produce negative scores
- Define an explicit Manual Review bucket for uncertain claim matches: excluded from scoring, reported in a new Uncertain / Needs Review section in the portability-report format
- Add fail-stop behavior when <2 models succeed: produce an abbreviated report documenting the failures instead of a misleading partial analysis
- Add the UR- claim ID prefix to the formatting rules

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
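The threshold and normalization fixes above can be sketched in a few lines. This is an illustrative reading of the rubric, not the protocol's actual implementation; the `CLUSTER_WEIGHTS` values and function names are assumptions:

```python
from statistics import mean

# Hypothetical weights per cluster classification; the protocol's
# scoring rubric defines the real values. Contradictions score -1,
# full consensus +1, so the raw weighted mean lies in [-1, 1].
CLUSTER_WEIGHTS = {"consensus": 1.0, "majority": 0.5,
                   "singular": 0.0, "contradictory": -1.0}

def classify_cluster(support: int, total_models: int) -> str:
    """Classify a claim cluster by how many models assert it.

    Majority requires STRICTLY more than half (>50%), so an even
    model count cannot produce a tie at exactly 50%."""
    if support == total_models:
        return "consensus"
    if support > total_models / 2:  # >50%, deliberately not >=50%
        return "majority"
    return "singular"

def portability_score(cluster_types: list[str]) -> float:
    """Map the raw weighted mean from [-1, 1] onto [0, 1] so that
    contradictory claims cannot drive the score negative."""
    raw_weighted_mean = mean(CLUSTER_WEIGHTS[t] for t in cluster_types)
    return (raw_weighted_mean + 1.0) / 2.0
```

With 4 models, a claim supported by exactly 2 is now `singular` rather than `majority`, which is the tie the first bullet eliminates.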
Extends the portability evaluation with a new mode: when a reference model is designated, its claims become the ground-truth baseline and each cheaper model is scored on how well it reproduces that baseline.

Protocol (Phase 8 - Model Sufficiency Analysis):

- Per-model sufficiency rate = reproduced / total baseline claims
- Missing claims classified as critical miss vs minor miss
- Extra claims classified as valid addition / hallucination / noise
- Three-tier sufficiency status: sufficient, conditionally sufficient, insufficient (based on threshold, critical misses, contradictions)
- Identifies the minimum sufficient model (cheapest that meets the threshold with zero critical misses)

Format (Section 10 - Model Sufficiency Matrix):

- Reference model + threshold display
- Per-model sufficiency table with tier, rates, and status
- Missing and extra claim detail tables
- Cost-efficiency recommendation

Template:

- New params: reference_model (optional), sufficiency_threshold (default 90%)
- Input validation ensures the reference model is in the model list
- Step 7 conditionally applies Phase 8 when a reference model is set

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Steps 4 and 5 were inconsistent with the protocol's Manual Review bucket rules for uncertain matches. Step 4 only flagged uncertain matches instead of placing them in a Manual Review bucket. Step 5 classified all clusters without excluding Manual Review clusters from scoring.

Fixed:

- Step 4: uncertain matches are now placed in the Manual Review bucket, excluded from scored classification, and reported under the Uncertain / Needs Review section
- Step 5: classifies only non-Manual-Review clusters and includes the Manual Review count in the summary presented to the user

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix the 6c heading from ## to ### to match the 6a/6b nesting under section 6 (Divergent Claims)
- Section 10 (Model Sufficiency Matrix) is now always included per the format's "do not omit any section" rule, with a placeholder when no reference model is designated
- Add a canonical cluster type rule for scoring: when models assign different types to semantically matched claims, use the highest-weight type, break ties by majority, then the arbiter decides. Original per-model types are preserved for transparency.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
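The canonical cluster type rule in the last bullet (highest weight, then majority, then arbiter) can be sketched like this. The `TYPE_WEIGHT` values are illustrative assumptions, not the protocol's actual weights:

```python
from collections import Counter
from typing import Callable, Optional

# Hypothetical weights; the protocol's rubric defines the real ones.
# Contradictory is weighted highest because it is the strongest signal.
TYPE_WEIGHT = {"contradictory": 3, "divergent": 2, "consensus": 1}

def canonical_type(per_model_types: list[str],
                   arbiter: Optional[Callable[[list[str]], str]] = None) -> str:
    """Pick one canonical type for a semantically matched cluster:
    the highest-weight type wins; ties are broken by majority among
    the tied types; a remaining tie goes to the arbiter. The original
    per-model types are kept elsewhere for transparency."""
    top = max(TYPE_WEIGHT[t] for t in per_model_types)
    tied = [t for t in per_model_types if TYPE_WEIGHT[t] == top]
    ranked = Counter(tied).most_common()
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]  # unique top weight, or clear majority
    return arbiter(sorted(set(tied))) if arbiter else ranked[0][0]
```

So a cluster typed `consensus` by one model and `contradictory` by another scores as `contradictory`, which matches the principle that contradictions are the highest-priority signal.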
Summary
Adds three new PromptKit components for cross-LLM prompt portability evaluation, addressing Phase 1 of #127 (Evaluation Framework).
Problem
PromptKit assembles prompts that are model-agnostic in theory, but different LLMs interpret the same instructions with different fidelity. There is no mechanism to measure, identify, or fix these cross-model divergences.
Solution
A new interactive template that evaluates prompt portability empirically by running the same prompt against multiple models and comparing outputs at the semantic claim level (not textual level).
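Claim-level comparison means outputs are first decomposed into atomic claims and then grouped across models by meaning, not wording. A minimal sketch, where `same_meaning` is a trivial stand-in for the protocol's semantic matcher (the real matcher is an LLM-driven judgment, not string normalization):

```python
def same_meaning(a: str, b: str) -> bool:
    # Placeholder semantic matcher: case/punctuation-insensitive
    # equality, purely for illustration.
    return a.lower().strip(". ") == b.lower().strip(". ")

def cluster_claims(outputs: dict[str, list[str]]) -> list[dict]:
    """Group each model's atomic claims into cross-model clusters.
    Two models phrasing the same assertion differently should land
    in one cluster (Consensus), not two separate ones."""
    clusters: list[dict] = []
    for model, claims in outputs.items():
        for claim in claims:
            for cluster in clusters:
                if same_meaning(claim, cluster["text"]):
                    cluster["models"].add(model)
                    break
            else:
                clusters.append({"text": claim, "models": {model}})
    return clusters
```

Clusters asserted by every model form the consensus core; the rest feed the majority/divergent classification.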
New Components
- `prompt-portability-evaluation`
- `portability-report`
- `evaluate-prompt-portability`

Methodology
Complementary Workflow
Designed to pair with `lint-prompt`: lint statically first, then evaluate empirically.

Validation

`python tests/validate-manifest.py` passes.