Add evaluate-prompt-portability template, protocol, and format#239

Merged
Alan-Jowett merged 5 commits into microsoft:main from Alan-Jowett:add-prompt-portability-evaluation
Apr 9, 2026

Conversation


@Alan-Jowett Alan-Jowett commented Apr 9, 2026

Summary

Adds three new PromptKit components for cross-LLM prompt portability evaluation, addressing Phase 1 of #127 (Evaluation Framework).

Problem

PromptKit assembles prompts that are model-agnostic in theory, but different LLMs interpret the same instructions with different fidelity. There is no mechanism to measure, identify, or fix these cross-model divergences.

Solution

A new interactive template that evaluates prompt portability empirically by running the same prompt against multiple models and comparing outputs at the semantic claim level (not textual level).

New Components

| Component | Type | Purpose |
| --- | --- | --- |
| prompt-portability-evaluation | Protocol (reasoning) | 8-phase claim-level consensus analysis methodology |
| portability-report | Format | 10-section comparison report with scoring, sufficiency matrix, and hardening recommendations |
| evaluate-prompt-portability | Template (interactive) | Orchestrates fan-out, claim extraction, consensus analysis, and reporting |

Methodology

  1. Fan out the prompt + golden input to N models
  2. Extract claims — decompose each output into atomic assertions
  3. Semantic matching — pairwise match claims across models
  4. Consensus classification — Consensus, Majority, Singular, Contradictory
  5. Divergence root cause — trace divergent claims to specific prompt language
  6. Portability score — weighted mean, normalized to [0, 1]
  7. Hardening recommendations — concrete prompt rewrites
  8. Model sufficiency analysis (optional) — when a reference model is designated, identifies the minimum sufficient cheaper model
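
The consensus-classification step (4) can be sketched as follows. The function name and the simplification of treating sub-majority multi-model support as Singular are mine; the strict >50% Majority threshold and the four buckets come from the protocol:

```python
def classify_cluster(supporting_models: int, total_models: int,
                     contradictory: bool = False) -> str:
    """Classify one matched claim cluster into the protocol's four buckets."""
    if contradictory:  # mutually exclusive assertions: highest-priority signal
        return "Contradictory"
    if supporting_models == total_models:
        return "Consensus"
    if supporting_models > total_models / 2:  # strict >50% avoids ties with even model counts
        return "Majority"
    return "Singular"

# 4-model run: 2-of-4 support is NOT a majority under the strict rule
print(classify_cluster(4, 4))  # Consensus
print(classify_cluster(3, 4))  # Majority
print(classify_cluster(2, 4))  # Singular
print(classify_cluster(1, 4, contradictory=True))  # Contradictory
```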

Complementary Workflow

Designed to pair with lint-prompt: lint statically first, then evaluate empirically.

Validation

python tests/validate-manifest.py passes.

Adds three new PromptKit components for cross-LLM prompt portability
evaluation (addresses microsoft#127 Phase 1):

- Protocol: prompt-portability-evaluation — 7-phase claim-level
  consensus analysis methodology (output collection, claim extraction,
  semantic matching, consensus classification, divergence analysis,
  scoring, and hardening recommendations)

- Format: portability-report — 9-section structured report covering
  evaluation context, per-model summaries, consensus core, majority
  claims, divergent claims (singular + contradictory), scorecard,
  hardening recommendations, and model notes

- Template: evaluate-prompt-portability — interactive template that
  orchestrates fan-out execution across multiple LLM models, collects
  outputs, decomposes them into atomic semantic claims, performs
  cross-model consensus analysis, and produces a portability report

Key design: comparison is semantic (claim-level), not textual. Two
models producing the same assertions in different words score as
Consensus. Contradictory claims (mutually exclusive assertions) are
the highest-priority signal, traced to specific ambiguous prompt
language with concrete rewrite recommendations.
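
To illustrate the matching step, here is a toy stand-in: real semantic matching would use LLM judgment, not token overlap, and the thresholds below are invented. What it does show is the three-way outcome, including the Manual Review bucket for uncertain pairs:

```python
def jaccard(a: str, b: str) -> float:
    """Crude token-overlap proxy for semantic similarity (illustration only)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def match_claims(c1: str, c2: str, hi: float = 0.6, lo: float = 0.3) -> str:
    """Three-way outcome: match, no-match, or Manual Review for uncertain pairs."""
    s = jaccard(c1, c2)
    if s >= hi:
        return "match"
    if s >= lo:
        return "manual-review"  # excluded from scoring, reported separately
    return "no-match"

print(match_claims("the cache is invalidated on write",
                   "the cache is invalidated on every write"))  # match
```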

Complements the existing lint-prompt template — lint statically first,
then evaluate empirically.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 9, 2026 16:37

Copilot AI left a comment


Pull request overview

Adds a new PromptKit evaluation workflow to empirically assess cross-LLM prompt portability by running the same assembled prompt + golden input across multiple models, extracting semantic claims, and reporting consensus/divergence with a portability score and hardening rewrites (Phase 1 of #127’s evaluation framework).

Changes:

  • Introduces an interactive evaluate-prompt-portability template that orchestrates multi-model execution, claim extraction, semantic matching, scoring, and reporting.
  • Adds a prompt-portability-evaluation reasoning protocol defining the 7-phase methodology and scoring rubric.
  • Adds a portability-report output format and registers the new components in manifest.yaml.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| templates/evaluate-prompt-portability.md | New interactive template to execute the evaluation workflow end-to-end. |
| protocols/reasoning/prompt-portability-evaluation.md | New protocol specifying claim extraction/matching, consensus classification, scoring, and hardening steps. |
| formats/portability-report.md | New structured 9-section report format for portability evaluation outputs. |
| manifest.yaml | Registers the new protocol, format, and template in the PromptKit manifest. |

Alan Jowett and others added 2 commits April 9, 2026 09:58
- Fix Majority threshold from >=50% to >50% to avoid ties with even
  model counts
- Normalize portability score from [-1,1] to [0,1] via
  (raw_weighted_mean + 1.0) / 2.0 so contradictory claims cannot
  produce negative scores
- Define explicit Manual Review bucket for uncertain claim matches:
  excluded from scoring, reported in new Uncertain / Needs Review
  section in the portability-report format
- Add fail-stop behavior when <2 models succeed: produce abbreviated
  report documenting failures instead of misleading partial analysis
- Add UR- claim ID prefix to formatting rules

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
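
The normalization fix above can be made concrete. The `(raw + 1.0) / 2.0` mapping is from the commit; the per-type weights here are illustrative placeholders (the protocol defines the actual rubric):

```python
# Illustrative weights in [-1, 1]; the protocol defines the real rubric.
WEIGHTS = {"Consensus": 1.0, "Majority": 0.5, "Singular": 0.0, "Contradictory": -1.0}

def portability_score(cluster_types: list[str]) -> float:
    """Weighted mean over scored clusters, normalized from [-1, 1] to [0, 1]."""
    raw = sum(WEIGHTS[t] for t in cluster_types) / len(cluster_types)
    return (raw + 1.0) / 2.0  # contradictory claims can no longer push the score negative

print(portability_score(["Consensus"] * 4))               # 1.0
print(portability_score(["Contradictory"] * 4))           # 0.0
print(portability_score(["Consensus", "Contradictory"]))  # 0.5
```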
Extends the portability evaluation with a new mode: when a reference
model is designated, its claims become the ground-truth baseline and
each cheaper model is scored on how well it reproduces that baseline.

Protocol (Phase 8 - Model Sufficiency Analysis):
- Per-model sufficiency rate = reproduced / total baseline claims
- Missing claims classified as critical miss vs minor miss
- Extra claims classified as valid addition / hallucination / noise
- Three-tier sufficiency status: sufficient, conditionally sufficient,
  insufficient (based on threshold, critical misses, contradictions)
- Identifies the minimum sufficient model (cheapest that meets
  threshold with zero critical misses)

Format (Section 10 - Model Sufficiency Matrix):
- Reference model + threshold display
- Per-model sufficiency table with tier, rates, and status
- Missing and extra claim detail tables
- Cost-efficiency recommendation

Template:
- New params: reference_model (optional), sufficiency_threshold
  (default 90%)
- Input validation ensures reference model is in the model list
- Step 7 conditionally applies Phase 8 when reference model is set

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 9, 2026 17:05

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Steps 4 and 5 were inconsistent with the protocol's Manual Review
bucket rules for uncertain matches. Step 4 only flagged uncertain
matches instead of placing them in a Manual Review bucket. Step 5
classified all clusters without excluding Manual Review clusters
from scoring.

Fixed:
- Step 4: uncertain matches now placed in Manual Review bucket,
  excluded from scored classification, reported under Uncertain /
  Needs Review section
- Step 5: classifies only non-Manual-Review clusters, includes
  Manual Review count in the summary presented to user

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
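
In miniature, the corrected Step 5 behavior looks like this (data shape is hypothetical; the point is that Manual Review clusters are excluded from scoring but still counted in the summary):

```python
def score_and_summarize(clusters: list[dict]) -> dict:
    """Score only non-Manual-Review clusters; surface the excluded count."""
    scored = [c for c in clusters if c["bucket"] != "manual-review"]
    manual = len(clusters) - len(scored)
    # Classification of the scored clusters happens downstream; here we just
    # report the split the way Step 5's summary does.
    return {"scored": len(scored), "manual_review": manual}

clusters = [{"bucket": "matched"}, {"bucket": "manual-review"}, {"bucket": "matched"}]
print(score_and_summarize(clusters))  # {'scored': 2, 'manual_review': 1}
```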

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

- Fix 6c heading from ## to ### to match 6a/6b nesting under
  section 6 (Divergent Claims)
- Section 10 (Model Sufficiency Matrix) now always included per
  the format's 'do not omit any section' rule, with a placeholder
  when no reference model is designated
- Add canonical cluster type rule for scoring: when models assign
  different types to semantically matched claims, use the
  highest-weight type, break ties by majority, then arbiter decides.
  Original per-model types preserved for transparency.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
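
The canonical cluster type rule can be sketched as below. The claim-type taxonomy and weights are invented for illustration (the protocol defines the real ones); the precedence order — highest weight, then majority, then arbiter — follows the commit:

```python
# Illustrative claim-type taxonomy; the protocol defines the real one.
WEIGHTS = {"requirement": 3, "recommendation": 2, "caveat": 2, "observation": 1}

def canonical_type(per_model_types: list[str]) -> str:
    """Highest-weight type wins; weight ties break by majority; leftover ties
    go to the arbiter. Original per-model types stay in the report."""
    top_weight = max(WEIGHTS[t] for t in per_model_types)
    candidates = {t for t in per_model_types if WEIGHTS[t] == top_weight}
    if len(candidates) == 1:
        return candidates.pop()
    counts = {t: per_model_types.count(t) for t in candidates}
    best = max(counts.values())
    majority = [t for t in candidates if counts[t] == best]
    return majority[0] if len(majority) == 1 else "arbiter-decides"

print(canonical_type(["requirement", "observation", "requirement"]))  # requirement
```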
@Alan-Jowett Alan-Jowett merged commit bb7babe into microsoft:main Apr 9, 2026
4 checks passed