
Add metric versionning #113

Merged
gabegma merged 8 commits into main from ggm/add-metric-versionning on May 13, 2026
Conversation


@gabegma commented May 12, 2026 (Collaborator)

This will help, when rerunning, to track whether all records have been computed with the latest version.

gabegma and others added 2 commits May 12, 2026 11:17
Add an optional `version` field to MetricScore and wire turn_taking to
populate it from a `version = "v0.1"` class variable at every output site
(main score, missed-turn early return, sub-metrics). This lets us tell,
across partial metric reruns, which computation logic produced a given
row — bump the class var when the algorithm changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
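The first commit's shape can be sketched as follows. This is a minimal, hypothetical sketch: the real MetricScore is a Pydantic model with more fields, and the metric class and record shape here are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricScore:
    # Minimal stand-in for the real (Pydantic) MetricScore.
    value: float
    version: Optional[str] = None  # None for historical rows written before this change

class TurnTakingMetric:
    # Bump this class variable whenever the computation logic changes.
    version = "v0.1"

    def compute(self, record) -> MetricScore:
        # Every output site (main score, missed-turn early return,
        # sub-metrics) stamps the current class-level version.
        return MetricScore(value=1.0, version=self.version)

score = TurnTakingMetric().compute(record=None)
print(score.version)  # v0.1
```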
Generalize the turn_taking-only version stamp to all metrics, and add an
automatic per-judge prompt_hash so prompt edits are detectable even
without a manual version bump. Stamping happens centrally via a Pydantic
model_validator that reads two contextvars set by MetricsRunner before
each metric.compute() call — metric authors only declare
`version = "v0.1"` on the class and the rest is automatic at every
MetricScore call site (no per-site `version=self.version` plumbing).

The contextvar approach is per-asyncio-task, so concurrent metrics in
the same record don't bleed values into each other. On partial reruns,
metrics that aren't recomputed keep whatever version/prompt_hash was on
disk — the validator only fills when the field is unset, so deserialized
historical rows are preserved.
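A stdlib-only sketch of the stamping idea, using a dataclass `__post_init__` in place of the actual Pydantic model_validator (names and shapes are assumptions, not the PR's code):

```python
import asyncio
import contextvars
from dataclasses import dataclass
from typing import Optional

# Stand-in for the contextvar MetricsRunner would set before metric.compute().
_current_version: contextvars.ContextVar[Optional[str]] = contextvars.ContextVar(
    "_current_version", default=None
)

@dataclass
class MetricScore:
    value: float
    version: Optional[str] = None

    def __post_init__(self) -> None:
        # Analogue of the Pydantic model_validator: fill only when the
        # field is unset, so deserialized historical rows keep whatever
        # version they already carry.
        if self.version is None:
            self.version = _current_version.get()

async def run_metric(version: str) -> MetricScore:
    _current_version.set(version)  # the runner stamps the contextvar
    await asyncio.sleep(0)         # yield, letting tasks interleave
    return MetricScore(value=1.0)  # version is picked up automatically

async def main():
    # Each asyncio task gets its own copy of the context, so concurrent
    # metrics on the same record don't bleed versions into each other.
    return await asyncio.gather(run_metric("v0.1"), run_metric("v0.2"))

a, b = asyncio.run(main())
print(a.version, b.version)  # v0.1 v0.2
# A row deserialized with an explicit version is preserved:
print(MetricScore(value=0.5, version="v0.0").version)  # v0.0
```

The per-task isolation comes for free: asyncio copies the current `contextvars.Context` into each task at creation, so a `set()` inside one task is invisible to its siblings.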

prompt_hash is the sha256[:12] of the *unrendered* template (so per-
record variable substitutions don't change the hash). PromptManager
gains `get_template(path)` to expose the raw YAML template; BaseMetric.
get_judge_prompt() pushes the hash into the contextvar each call.
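The hashing scheme can be sketched like this (the function name and template text are hypothetical; the PR takes sha256[:12] of the raw YAML template before any rendering):

```python
import hashlib

def prompt_hash(template: str) -> str:
    # Hash the *unrendered* template: per-record substitutions of
    # {transcript} etc. never change the hash; only prompt edits do.
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]

template = "Judge the following transcript for turn-taking:\n{transcript}"
print(prompt_hash(template))                  # stable 12-hex-char digest
print(prompt_hash(template + " Be strict."))  # any edit changes it
```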

Drift test (tests/unit/metrics/test_metric_signatures.py) compares each
concrete metric class's (version, source_hash, prompt_hash) against
tests/fixtures/metric_signatures.json. Authors run
`python scripts/regen_metric_signatures.py` to refresh the fixture after
a deliberate version bump or prompt edit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
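The drift test could look roughly like this. This is a self-contained sketch: the real test iterates every concrete metric class and reads the committed tests/fixtures/metric_signatures.json from disk (and also checks a source_hash); the class and helper names here are illustrative.

```python
import hashlib
import json

# Hypothetical stand-in for a concrete metric class.
class TurnTakingMetric:
    version = "v0.1"
    judge_prompt = "Judge the transcript for turn-taking:\n{transcript}"

def prompt_hash(template: str) -> str:
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]

def signature(cls) -> dict:
    return {"version": cls.version, "prompt_hash": prompt_hash(cls.judge_prompt)}

# Simulates the committed fixture, which authors regenerate with
# scripts/regen_metric_signatures.py after a deliberate change.
fixture_json = json.dumps({"TurnTakingMetric": signature(TurnTakingMetric)})

def test_no_drift():
    fixture = json.loads(fixture_json)
    for cls in (TurnTakingMetric,):
        assert fixture[cls.__name__] == signature(cls), (
            f"{cls.__name__} changed: bump `version` and regenerate the fixture"
        )

test_no_drift()
```

The point of the fixture is to force the failure mode into review: an undeclared prompt or logic edit breaks CI until the author regenerates the signatures, making the change visible in the diff.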
Base automatically changed from ggm/bug-fixes-following-paper to main May 13, 2026 17:22
Comment thread src/eva/utils/prompt_manager.py
Comment thread src/eva/models/versioning.py Outdated
Comment thread tests/unit/metrics/test_metric_signatures.py

@JosephMarinier left a comment (Collaborator)


Cool! Thanks for taking care of this!

The implementation seems a bit complex, but I don't expect this code to change often, so I think it's fine.


gabegma commented May 13, 2026

> The implementation seems a bit complex, but I don't expect this code to change often, so I think it's fine.

I agree, it's somewhat ugly. Thanks for your cleanups!!

Comment thread src/eva/metrics/versioning.py
@gabegma gabegma added this pull request to the merge queue May 13, 2026
Merged via the queue into main with commit 598f7c1 May 13, 2026
1 check passed
@gabegma gabegma deleted the ggm/add-metric-versionning branch May 13, 2026 21:24
