fix(core): capture judge/evaluator token usage in reports by cemde · Pull Request #63 · parameterlab/MASEval

cemde · 2026-05-23T16:12:27Z

Description

Fixes #60. In Benchmark._execute_task_repetition, collect_all_usage() was called before evaluate(). Models registered in setup_evaluators (LLM judges and other evaluator-owned models) were in the registry but had empty _usage_records at snapshot time, so their tokens showed up as zero in report["usage"] and were never folded into Benchmark.usage / Benchmark.usage_by_component. Cost totals reported $0.0000 when agents used a model with no LiteLLM price entry (for example self-hosted vLLM) and judges produced the only billable tokens.
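The $0.0000 symptom can be illustrated with a little arithmetic. The sketch below is illustrative only: the price table, model names, and helper functions are hypothetical and are not MASEval or LiteLLM code. It assumes unpriced models contribute nothing to the total, which matches the behaviour described above.

```python
from typing import Optional

# Hypothetical price table (USD per token); not real LiteLLM pricing data.
PRICES = {"judge-model": 2e-6}


def cost_usd(model: str, tokens: int) -> Optional[float]:
    """Return the cost of `tokens` for `model`, or None when no price is known."""
    price = PRICES.get(model)
    return None if price is None else tokens * price


def total_cost(usage: dict[str, int]) -> float:
    """Sum costs over models, treating unpriced models as contributing 0."""
    return sum(cost_usd(model, tokens) or 0.0 for model, tokens in usage.items())


# Snapshot taken before evaluate(): the judge has recorded nothing yet.
before_eval = {"self-hosted-vllm": 12_000, "judge-model": 0}
# Snapshot taken after evaluate(): the judge's tokens are present.
after_eval = {"self-hosted-vllm": 12_000, "judge-model": 3_000}

print(f"${total_cost(before_eval):.4f}")  # $0.0000
print(f"${total_cost(after_eval):.4f}")   # $0.0060
```

With the early snapshot, the only priced model has zero tokens, so the total is zero even though the judge was billed.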

The fix moves the usage snapshot to run after evaluate(). Configs and traces stay where they are, because evaluate() consumes traces. The change is a pure lifecycle reordering inside one worker thread, with no API, schema, or threading-model changes. It affects every benchmark that registers a model in setup_evaluators, including ConVerse and MultiAgentBench.
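The reordering can be sketched with a toy lifecycle. Everything here (ToyModel, collect_usage, run_task) is a hypothetical stand-in written for illustration, not the actual Benchmark._execute_task_repetition code. It only shows why the position of the snapshot relative to evaluation matters.

```python
class ToyModel:
    """Hypothetical stand-in for a model that records token usage per call."""

    def __init__(self, name):
        self.name = name
        self.usage_records = []

    def call(self, tokens):
        self.usage_records.append(tokens)


def collect_usage(registry):
    """Snapshot the total tokens recorded so far by each registered model."""
    return {model.name: sum(model.usage_records) for model in registry}


def run_task(snapshot_after_evaluate):
    agent, judge = ToyModel("agent"), ToyModel("judge")
    registry = [agent, judge]  # both models are registered up front

    agent.call(100)  # execution phase: the agent uses tokens

    usage = None
    if not snapshot_after_evaluate:
        usage = collect_usage(registry)  # old order: judge has no records yet

    judge.call(40)  # evaluation phase: the judge is only invoked here

    if snapshot_after_evaluate:
        usage = collect_usage(registry)  # new order: judge records are present
    return usage


print(run_task(snapshot_after_evaluate=False))  # {'agent': 100, 'judge': 0}
print(run_task(snapshot_after_evaluate=True))   # {'agent': 100, 'judge': 40}
```

In both orders the judge is in the registry; only the late snapshot sees its tokens.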

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Code quality improvement (refactoring, formatting, etc.)

Checklist

Contribution

I have read the CONTRIBUTING.md guide.
Commits follow "How to write a good git commit message"

Documentation

Added/updated docstrings for new/modified functions as instructed in CONTRIBUTING.md
Updated relevant documentation in docs/ (if applicable)
Tag GitHub issue with this PR (if applicable)

Changelog

Added entry to CHANGELOG.md under [Unreleased] section
- Use Added section for new features
- Use Changed section for modifications to existing functionality
- Use Fixed section for bug fixes
- Use Removed section for deprecated/removed features
OR this is a documentation-only change (no changelog needed)

Architecture (if applicable)

Core/Interface separation: Changes in maseval/core/ do NOT import from maseval/interface/
Dependencies: New core dependencies added sparingly; framework integrations go to optional dependencies

Additional Notes

New tests in TestBenchmarkJudgeUsage (file tests/test_core/test_benchmark/test_usage_collection.py):

2 bug-reproduction tests that fail on main and pass on this branch.
2 regression guards covering the evaluation-failure and execution-failure paths, asserting that pre-failure agent-side usage still lands in the report.
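A regression guard of this kind might look like the sketch below. It is an assumption-laden toy, not the tests in TestBenchmarkJudgeUsage: run_task, raising_evaluator, and the report keys "status" and "error" are hypothetical names. The point is that usage collection sits outside the failing call, so agent-side tokens recorded before the failure still reach the report.

```python
def run_task(evaluator, registry):
    """Toy task runner: execute, evaluate, then collect usage unconditionally."""
    report = {"status": "ok", "eval": None}
    registry["agent"].append(100)  # agent-side usage recorded during execution

    try:
        report["eval"] = evaluator(registry)
    except Exception as exc:
        report["status"] = "evaluation_failed"
        report["error"] = str(exc)

    # Usage is collected after evaluation whether or not evaluation succeeded.
    report["usage"] = {name: sum(records) for name, records in registry.items()}
    return report


def raising_evaluator(registry):
    """Hypothetical evaluator that fails before the judge records any tokens."""
    raise RuntimeError("judge unavailable")


report = run_task(raising_evaluator, {"agent": [], "judge": []})
print(report["status"])  # evaluation_failed
print(report["usage"])   # {'agent': 100, 'judge': 0}
```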

Verified with just all (ruff format and check, ty check, full pytest). Result: 2635 passed, 1 skipped, 2 xfailed.

Closes #60.

Tests assert that a judge model registered in setup_evaluators and invoked during evaluate() has non-zero tokens in both report["usage"] and Benchmark.usage. Both tests fail on current main because the usage snapshot is taken before evaluate() runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
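The shape of such a test can be sketched as follows. ToyJudge and ToyBenchmark are hypothetical classes invented for this example; the real tests exercise Benchmark with an evaluator registered in setup_evaluators, whose exact signatures are not shown here. The sketch checks both the per-task report and the benchmark-level aggregate.

```python
from collections import Counter


class ToyJudge:
    """Hypothetical judge that records one usage entry per scoring call."""

    def __init__(self):
        self.records = []

    def score(self, text):
        # Word count stands in for input tokens; one output token per verdict.
        self.records.append({"input_tokens": len(text.split()), "output_tokens": 1})
        return 1.0


class ToyBenchmark:
    """Hypothetical benchmark that folds per-task usage into a running total."""

    def __init__(self):
        self.judge = ToyJudge()
        self.usage = Counter()

    def run_task(self, answer):
        score = self.judge.score(answer)  # evaluation runs first
        usage = Counter()
        for record in self.judge.records:  # the snapshot is taken afterwards
            usage.update(record)
        usage["total_tokens"] = usage["input_tokens"] + usage["output_tokens"]
        self.judge.records.clear()
        self.usage.update(usage)
        return {"score": score, "usage": dict(usage)}


def test_judge_usage_is_reported():
    bench = ToyBenchmark()
    report = bench.run_task("the answer is 42")
    assert report["usage"]["input_tokens"] > 0
    assert report["usage"]["output_tokens"] > 0
    assert report["usage"]["total_tokens"] == 5
    assert bench.usage["total_tokens"] == report["usage"]["total_tokens"]


test_judge_usage_is_reported()
```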

Address code review on ad115b9: remove implementation-narrative comments, drop unused state from _JudgeEvaluator, move conftest import to module scope, condense docstrings, assert total_tokens. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

(#60) In Benchmark._execute_task_repetition the usage snapshot was taken in step 3, before evaluate() ran. Models registered in setup_evaluators (LLM judges) were in the registry but had empty _usage_records at that point, so their tokens showed up as zero in report["usage"] and were never folded into Benchmark.usage / usage_by_component. Move collect_all_usage() into its own step 4b that runs after evaluate(). Configs and traces remain in step 3 because evaluate() consumes traces. The change is a pure lifecycle reordering inside one worker thread; no new shared state, no new locks, no API/schema change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Address code review on da99acd: shorten verbose step-3 and step-5 header comments (now single lines describing only the load-bearing "why"), and renumber the new usage-collection block from "4b" to "5", shifting build-report to "6", so the step sequence is uniform. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Locks in that the post-evaluate usage collection (step 5) still runs on the eval-failure and execution-failure paths, so pre-failure agent-side usage lands in the report rather than the usage field becoming an error dict. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Address code review on a928c54: assert output_tokens (not just input_tokens), drop the redundant isinstance(usage, dict) check, mark RaisingEvaluator's unused constructor args explicit with _, and move the AgentError import to module scope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-23T16:19:00Z

Coverage report

File: maseval/core/benchmark.py
Lines missing: 1264, 1311-1312

This report was generated by python-coverage-comment-action

cemde and others added 8 commits May 23, 2026 17:19

docs: changelog entry for judge usage fix (#60)

cfb0eea

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs: fill in PR number for changelog entry

a5fa704

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cemde added the bug and core labels May 23, 2026

cemde merged commit 93171ca into main May 23, 2026
27 checks passed

cemde merged 8 commits into main from fix-eval-cost-tracking
