Add evaluate-prompt-portability template, protocol, and format#239

Merged
Alan-Jowett merged 5 commits into microsoft:main from Alan-Jowett:add-prompt-portability-evaluation
Apr 9, 2026

Conversation


@Alan-Jowett Alan-Jowett commented Apr 9, 2026

Summary

Adds three new PromptKit components for cross-LLM prompt portability evaluation, addressing Phase 1 of #127 (Evaluation Framework).

Problem

PromptKit assembles prompts that are model-agnostic in theory, but different LLMs interpret the same instructions with different fidelity. There is no mechanism to measure, identify, or fix these cross-model divergences.

Solution

A new interactive template that evaluates prompt portability empirically by running the same prompt against multiple models and comparing outputs at the semantic claim level (not textual level).

New Components

| Component | Type | Purpose |
| --- | --- | --- |
| prompt-portability-evaluation | Protocol (reasoning) | 8-phase claim-level consensus analysis methodology |
| portability-report | Format | 10-section comparison report with scoring, sufficiency matrix, and hardening recommendations |
| evaluate-prompt-portability | Template (interactive) | Orchestrates fan-out, claim extraction, consensus analysis, and reporting |

Methodology

  1. Fan out the prompt + golden input to N models
  2. Extract claims — decompose each output into atomic assertions
  3. Semantic matching — pairwise match claims across models
  4. Consensus classification — Consensus, Majority, Singular, Contradictory
  5. Divergence root cause — trace divergent claims to specific prompt language
  6. Portability score — weighted mean, normalized to [0, 1]
  7. Hardening recommendations — concrete prompt rewrites
  8. Model sufficiency analysis (optional) — when a reference model is designated, identifies the minimum sufficient cheaper model
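
The consensus-classification step (4) can be sketched as follows. The function name and the simplification of treating sub-majority multi-model support as Singular are mine; the strict >50% Majority threshold and the four buckets come from the protocol:

```python
def classify_cluster(supporting_models: int, total_models: int,
                     contradictory: bool = False) -> str:
    """Classify one matched claim cluster into the protocol's four buckets."""
    if contradictory:  # mutually exclusive assertions: highest-priority signal
        return "Contradictory"
    if supporting_models == total_models:
        return "Consensus"
    if supporting_models > total_models / 2:  # strict >50% avoids ties with even model counts
        return "Majority"
    return "Singular"

# 4-model run: 2-of-4 support is NOT a majority under the strict rule
print(classify_cluster(4, 4))  # Consensus
print(classify_cluster(3, 4))  # Majority
print(classify_cluster(2, 4))  # Singular
print(classify_cluster(1, 4, contradictory=True))  # Contradictory
```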

Complementary Workflow

Designed to pair with lint-prompt: lint statically first, then evaluate empirically.

Validation

python tests/validate-manifest.py passes.

Adds three new PromptKit components for cross-LLM prompt portability
evaluation (addresses microsoft#127 Phase 1):

- Protocol: prompt-portability-evaluation — 7-phase claim-level
  consensus analysis methodology (output collection, claim extraction,
  semantic matching, consensus classification, divergence analysis,
  scoring, and hardening recommendations)

- Format: portability-report — 9-section structured report covering
  evaluation context, per-model summaries, consensus core, majority
  claims, divergent claims (singular + contradictory), scorecard,
  hardening recommendations, and model notes

- Template: evaluate-prompt-portability — interactive template that
  orchestrates fan-out execution across multiple LLM models, collects
  outputs, decomposes them into atomic semantic claims, performs
  cross-model consensus analysis, and produces a portability report

Key design: comparison is semantic (claim-level), not textual. Two
models producing the same assertions in different words score as
Consensus. Contradictory claims (mutually exclusive assertions) are
the highest-priority signal, traced to specific ambiguous prompt
language with concrete rewrite recommendations.
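
To illustrate the matching step, here is a toy stand-in: real semantic matching would use LLM judgment, not token overlap, and the thresholds below are invented. What it does show is the three-way outcome, including the Manual Review bucket for uncertain pairs:

```python
def jaccard(a: str, b: str) -> float:
    """Crude token-overlap proxy for semantic similarity (illustration only)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def match_claims(c1: str, c2: str, hi: float = 0.6, lo: float = 0.3) -> str:
    """Three-way outcome: match, no-match, or Manual Review for uncertain pairs."""
    s = jaccard(c1, c2)
    if s >= hi:
        return "match"
    if s >= lo:
        return "manual-review"  # excluded from scoring, reported separately
    return "no-match"

print(match_claims("the cache is invalidated on write",
                   "the cache is invalidated on every write"))  # match
```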

Complements the existing lint-prompt template — lint statically first,
then evaluate empirically.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 9, 2026 16:37

Copilot AI left a comment


Pull request overview

Adds a new PromptKit evaluation workflow to empirically assess cross-LLM prompt portability by running the same assembled prompt + golden input across multiple models, extracting semantic claims, and reporting consensus/divergence with a portability score and hardening rewrites (Phase 1 of #127’s evaluation framework).

Changes:

  • Introduces an interactive evaluate-prompt-portability template that orchestrates multi-model execution, claim extraction, semantic matching, scoring, and reporting.
  • Adds a prompt-portability-evaluation reasoning protocol defining the 7-phase methodology and scoring rubric.
  • Adds a portability-report output format and registers the new components in manifest.yaml.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| templates/evaluate-prompt-portability.md | New interactive template to execute the evaluation workflow end-to-end. |
| protocols/reasoning/prompt-portability-evaluation.md | New protocol specifying claim extraction/matching, consensus classification, scoring, and hardening steps. |
| formats/portability-report.md | New structured 9-section report format for portability evaluation outputs. |
| manifest.yaml | Registers the new protocol, format, and template in the PromptKit manifest. |

Alan Jowett and others added 2 commits April 9, 2026 09:58
- Fix Majority threshold from >=50% to >50% to avoid ties with even
  model counts
- Normalize portability score from [-1,1] to [0,1] via
  (raw_weighted_mean + 1.0) / 2.0 so contradictory claims cannot
  produce negative scores
- Define explicit Manual Review bucket for uncertain claim matches:
  excluded from scoring, reported in new Uncertain / Needs Review
  section in the portability-report format
- Add fail-stop behavior when <2 models succeed: produce abbreviated
  report documenting failures instead of misleading partial analysis
- Add UR- claim ID prefix to formatting rules

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
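
The normalization fix above can be made concrete. The `(raw + 1.0) / 2.0` mapping is from the commit; the per-type weights here are illustrative placeholders (the protocol defines the actual rubric):

```python
# Illustrative weights in [-1, 1]; the protocol defines the real rubric.
WEIGHTS = {"Consensus": 1.0, "Majority": 0.5, "Singular": 0.0, "Contradictory": -1.0}

def portability_score(cluster_types: list[str]) -> float:
    """Weighted mean over scored clusters, normalized from [-1, 1] to [0, 1]."""
    raw = sum(WEIGHTS[t] for t in cluster_types) / len(cluster_types)
    return (raw + 1.0) / 2.0  # contradictory claims can no longer push the score negative

print(portability_score(["Consensus"] * 4))               # 1.0
print(portability_score(["Contradictory"] * 4))           # 0.0
print(portability_score(["Consensus", "Contradictory"]))  # 0.5
```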
Extends the portability evaluation with a new mode: when a reference
model is designated, its claims become the ground-truth baseline and
each cheaper model is scored on how well it reproduces that baseline.

Protocol (Phase 8 - Model Sufficiency Analysis):
- Per-model sufficiency rate = reproduced / total baseline claims
- Missing claims classified as critical miss vs minor miss
- Extra claims classified as valid addition / hallucination / noise
- Three-tier sufficiency status: sufficient, conditionally sufficient,
  insufficient (based on threshold, critical misses, contradictions)
- Identifies the minimum sufficient model (cheapest that meets
  threshold with zero critical misses)

Format (Section 10 - Model Sufficiency Matrix):
- Reference model + threshold display
- Per-model sufficiency table with tier, rates, and status
- Missing and extra claim detail tables
- Cost-efficiency recommendation

Template:
- New params: reference_model (optional), sufficiency_threshold
  (default 90%)
- Input validation ensures reference model is in the model list
- Step 7 conditionally applies Phase 8 when reference model is set

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 9, 2026 17:05

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Steps 4 and 5 were inconsistent with the protocol's Manual Review
bucket rules for uncertain matches. Step 4 only flagged uncertain
matches instead of placing them in a Manual Review bucket. Step 5
classified all clusters without excluding Manual Review clusters
from scoring.

Fixed:
- Step 4: uncertain matches now placed in Manual Review bucket,
  excluded from scored classification, reported under Uncertain /
  Needs Review section
- Step 5: classifies only non-Manual-Review clusters, includes
  Manual Review count in the summary presented to user

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
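
In miniature, the corrected Step 5 behavior looks like this (data shape is hypothetical; the point is that Manual Review clusters are excluded from scoring but still counted in the summary):

```python
def score_and_summarize(clusters: list[dict]) -> dict:
    """Score only non-Manual-Review clusters; surface the excluded count."""
    scored = [c for c in clusters if c["bucket"] != "manual-review"]
    manual = len(clusters) - len(scored)
    # Classification of the scored clusters happens downstream; here we just
    # report the split the way Step 5's summary does.
    return {"scored": len(scored), "manual_review": manual}

clusters = [{"bucket": "matched"}, {"bucket": "manual-review"}, {"bucket": "matched"}]
print(score_and_summarize(clusters))  # {'scored': 2, 'manual_review': 1}
```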

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

- Fix 6c heading from ## to ### to match 6a/6b nesting under
  section 6 (Divergent Claims)
- Section 10 (Model Sufficiency Matrix) now always included per
  the format's 'do not omit any section' rule, with a placeholder
  when no reference model is designated
- Add canonical cluster type rule for scoring: when models assign
  different types to semantically matched claims, use the
  highest-weight type, break ties by majority, then arbiter decides.
  Original per-model types preserved for transparency.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
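
The canonical cluster type rule can be sketched as below. The claim-type taxonomy and weights are invented for illustration (the protocol defines the real ones); the precedence order — highest weight, then majority, then arbiter — follows the commit:

```python
# Illustrative claim-type taxonomy; the protocol defines the real one.
WEIGHTS = {"requirement": 3, "recommendation": 2, "caveat": 2, "observation": 1}

def canonical_type(per_model_types: list[str]) -> str:
    """Highest-weight type wins; weight ties break by majority; leftover ties
    go to the arbiter. Original per-model types stay in the report."""
    top_weight = max(WEIGHTS[t] for t in per_model_types)
    candidates = {t for t in per_model_types if WEIGHTS[t] == top_weight}
    if len(candidates) == 1:
        return candidates.pop()
    counts = {t: per_model_types.count(t) for t in candidates}
    best = max(counts.values())
    majority = [t for t in candidates if counts[t] == best]
    return majority[0] if len(majority) == 1 else "arbiter-decides"

print(canonical_type(["requirement", "observation", "requirement"]))  # requirement
```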
@Alan-Jowett Alan-Jowett merged commit bb7babe into microsoft:main Apr 9, 2026
4 checks passed