
fix(core): capture judge/evaluator token usage in reports#63

Merged
cemde merged 8 commits into main from fix-eval-cost-tracking on May 23, 2026
Conversation

Collaborator

@cemde cemde commented May 23, 2026

Description

Fixes #60. In Benchmark._execute_task_repetition, collect_all_usage() was called before evaluate(). Models registered in setup_evaluators (LLM judges and other evaluator-owned models) were in the registry but had empty _usage_records at snapshot time, so their tokens showed up as zero in report["usage"] and were never folded into Benchmark.usage / Benchmark.usage_by_component. Cost totals reported $0.0000 when agents used a model with no LiteLLM price entry (for example self-hosted vLLM) and judges produced the only billable tokens.
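
To make the $0.0000 symptom concrete, here is a minimal sketch of how a cost total can collapse to zero. The price table, model names, and `total_cost` helper are hypothetical and are not the maseval or LiteLLM API; the sketch only assumes that a model without a price entry contributes nothing to the total.

```python
from typing import Dict, Optional

# Hypothetical per-token USD prices. None stands for "no price entry",
# as might be the case for a self-hosted model.
PRICES: Dict[str, Optional[float]] = {
    "judge-model": 0.000002,
    "self-hosted-agent": None,
}


def total_cost(usage: Dict[str, int]) -> float:
    """Sum cost over models, skipping models that have no price entry."""
    total = 0.0
    for model, tokens in usage.items():
        price = PRICES.get(model)
        if price is not None:
            total += price * tokens
    return total


# Snapshot taken before evaluation: the judge has recorded no tokens yet.
early_snapshot = {"self-hosted-agent": 5000, "judge-model": 0}
# Snapshot taken after evaluation: the judge tokens are present.
late_snapshot = {"self-hosted-agent": 5000, "judge-model": 1200}

print(f"${total_cost(early_snapshot):.4f}")
print(f"${total_cost(late_snapshot):.4f}")
```

With the early snapshot, the only priced model has zero tokens, so the total is zero even though the judge was billed.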

The fix moves the usage snapshot to run after evaluate(). Configs and traces are still collected before evaluate(), because evaluate() consumes traces. The change is a pure lifecycle reordering inside one worker thread, with no API, schema, or threading-model changes. It affects every benchmark that registers a model in setup_evaluators, including ConVerse and MultiAgentBench.
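
The reordering can be illustrated with a toy model of the lifecycle. This is a sketch under stated assumptions, not the actual body of Benchmark._execute_task_repetition: `ToyModel`, `run_task`, and the local `collect_all_usage` function are hypothetical stand-ins that only mimic the order of operations described above.

```python
from typing import Dict, List, Optional


class ToyModel:
    """Hypothetical stand-in for a registered model that records token usage."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.usage_records: List[int] = []

    def call(self, tokens: int) -> None:
        self.usage_records.append(tokens)


def collect_all_usage(registry: List[ToyModel]) -> Dict[str, int]:
    """Toy snapshot: total tokens per registered model at call time."""
    return {model.name: sum(model.usage_records) for model in registry}


def run_task(snapshot_after_evaluate: bool) -> Dict[str, int]:
    agent, judge = ToyModel("agent"), ToyModel("judge")
    registry = [agent, judge]  # the judge is registered before it is ever called

    agent.call(100)            # task execution
    traces = ["agent trace"]   # traces are gathered before evaluation

    usage: Optional[Dict[str, int]] = None
    if not snapshot_after_evaluate:
        usage = collect_all_usage(registry)  # old order: judge still at zero

    judge.call(40 * len(traces))             # evaluation consumes the traces

    if snapshot_after_evaluate:
        usage = collect_all_usage(registry)  # new order: judge tokens included
    assert usage is not None
    return usage


print(run_task(snapshot_after_evaluate=False))
print(run_task(snapshot_after_evaluate=True))
```

Only the position of the snapshot differs between the two runs; the registry contents and the calls themselves are identical.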

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code quality improvement (refactoring, formatting, etc.)

Checklist

Contribution

Documentation

  • Added/updated docstrings for new/modified functions as instructed in CONTRIBUTING.md
  • Updated relevant documentation in docs/ (if applicable)
  • Tagged the GitHub issue with this PR (if applicable)

Changelog

  • Added entry to CHANGELOG.md under [Unreleased] section
    • Use Added section for new features
    • Use Changed section for modifications to existing functionality
    • Use Fixed section for bug fixes
    • Use Removed section for deprecated/removed features
  • OR this is a documentation-only change (no changelog needed)

Architecture (if applicable)

  • Core/Interface separation: Changes in maseval/core/ do NOT import from maseval/interface/
  • Dependencies: New core dependencies added sparingly; framework integrations go to optional dependencies

Additional Notes

New tests in TestBenchmarkJudgeUsage (file tests/test_core/test_benchmark/test_usage_collection.py):

  • 2 bug-reproduction tests that fail on main and pass on this branch.
  • 2 regression guards covering the evaluation-failure and execution-failure paths, asserting that pre-failure agent-side usage still lands in the report.
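
As a rough illustration of what the evaluation-failure guard checks, the following self-contained sketch uses hypothetical helpers (`ToyModel`, `summarize`, `failing_evaluate`, `run_repetition`) rather than the real fixtures in test_usage_collection.py. It assumes usage is collected after the evaluation step regardless of whether that step raised.

```python
from typing import Any, Dict, List


class ToyModel:
    """Hypothetical model stub that records token usage per call."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.records: List[Dict[str, int]] = []

    def call(self, input_tokens: int, output_tokens: int) -> None:
        self.records.append(
            {"input_tokens": input_tokens, "output_tokens": output_tokens}
        )


def summarize(model: ToyModel) -> Dict[str, int]:
    """Aggregate recorded usage into input, output, and total token counts."""
    inp = sum(r["input_tokens"] for r in model.records)
    out = sum(r["output_tokens"] for r in model.records)
    return {"input_tokens": inp, "output_tokens": out, "total_tokens": inp + out}


def failing_evaluate(traces: List[str]) -> None:
    """Hypothetical evaluator that always fails."""
    raise RuntimeError("judge unavailable")


def run_repetition() -> Dict[str, Any]:
    agent = ToyModel("agent")
    report: Dict[str, Any] = {"status": "success"}
    agent.call(input_tokens=100, output_tokens=20)  # usage before the failure
    try:
        failing_evaluate(["agent trace"])
    except RuntimeError as exc:
        report["status"] = "evaluation_failed"
        report["error"] = str(exc)
    # Usage is collected after evaluation, on the failure path too.
    report["usage"] = {agent.name: summarize(agent)}
    return report


def test_usage_survives_evaluation_failure() -> None:
    report = run_repetition()
    assert report["status"] == "evaluation_failed"
    usage = report["usage"]["agent"]
    assert usage["input_tokens"] == 100
    assert usage["output_tokens"] == 20
    assert usage["total_tokens"] == 120


test_usage_survives_evaluation_failure()
```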

Verified with just all (ruff format and check, ty check, full pytest). Result: 2635 passed, 1 skipped, 2 xfailed.

Closes #60.

cemde and others added 8 commits May 23, 2026 17:19
Tests assert that a judge model registered in setup_evaluators and
invoked during evaluate() has non-zero tokens in both report["usage"]
and Benchmark.usage. Both tests fail on current main because the usage
snapshot is taken before evaluate() runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address code review on ad115b9: remove implementation-narrative
comments, drop unused state from _JudgeEvaluator, move conftest
import to module scope, condense docstrings, assert total_tokens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#60)

In Benchmark._execute_task_repetition the usage snapshot was taken in
step 3 — before evaluate() ran. Models registered in setup_evaluators
(LLM judges) were in the registry but had empty _usage_records at that
point, so their tokens showed up as zero in report["usage"] and were
never folded into Benchmark.usage / usage_by_component.

Move collect_all_usage() into its own step 4b that runs after
evaluate(). Configs and traces remain in step 3 because evaluate()
consumes traces. The change is a pure lifecycle reordering inside one
worker thread; no new shared state, no new locks, no API/schema change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address code review on da99acd: shorten verbose step-3 and step-5
header comments (now single lines describing only the load-bearing
"why"), and renumber the new usage-collection block from "4b" to
"5", shifting build-report to "6", so the step sequence is uniform.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Locks in that the post-evaluate usage collection (step 5) still runs
on the eval-failure and execution-failure paths, so pre-failure
agent-side usage lands in the report rather than the usage field
becoming an error dict.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address code review on a928c54: assert output_tokens (not just
input_tokens), drop the redundant isinstance(usage, dict) check,
mark RaisingEvaluator's unused constructor args explicit with _,
and move the AgentError import to module scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cemde cemde added the bug (Something isn't working) and core (In regards to the core package `maseval/core`) labels May 23, 2026
@github-actions

Coverage report


File | Statements | Missing | Coverage | Coverage (new stmts) | Lines missing
maseval/core/benchmark.py | | | | | 1264, 1311-1312
Project Total | | | | |

This report was generated by python-coverage-comment-action

@cemde cemde merged commit 93171ca into main May 23, 2026
27 checks passed

Labels

bug: Something isn't working
core: In regards to the core package `maseval/core`

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: evaluator token usage is silently dropped from per-task reports

1 participant