fix(core): capture judge/evaluator token usage in reports#63
Merged
Conversation
Tests assert that a judge model registered in setup_evaluators and invoked during evaluate() has non-zero tokens in both report["usage"] and Benchmark.usage. Both tests fail on current main because the usage snapshot is taken before evaluate() runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address code review on ad115b9: remove implementation-narrative comments, drop unused state from _JudgeEvaluator, move conftest import to module scope, condense docstrings, assert total_tokens. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#60) In Benchmark._execute_task_repetition the usage snapshot was taken in step 3 — before evaluate() ran. Models registered in setup_evaluators (LLM judges) were in the registry but had empty _usage_records at that point, so their tokens showed up as zero in report["usage"] and were never folded into Benchmark.usage / usage_by_component. Move collect_all_usage() into its own step 4b that runs after evaluate(). Configs and traces remain in step 3 because evaluate() consumes traces. The change is a pure lifecycle reordering inside one worker thread; no new shared state, no new locks, no API/schema change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address code review on da99acd: shorten verbose step-3 and step-5 header comments (now single lines describing only the load-bearing "why"), and renumber the new usage-collection block from "4b" to "5", shifting build-report to "6", so the step sequence is uniform. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Locks in that the post-evaluate usage collection (step 5) still runs on the eval-failure and execution-failure paths, so pre-failure agent-side usage lands in the report rather than the usage field becoming an error dict. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address code review on a928c54: assert output_tokens (not just input_tokens), drop the redundant isinstance(usage, dict) check, mark RaisingEvaluator's unused constructor args explicit with _, and move the AgentError import to module scope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Coverage reportClick to see where and how coverage changed
This report was generated by python-coverage-comment-action |
||||||||||||||||||||||||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Fixes #60. In
Benchmark._execute_task_repetition,collect_all_usage()was called beforeevaluate(). Models registered insetup_evaluators(LLM judges and other evaluator-owned models) were in the registry but had empty_usage_recordsat snapshot time, so their tokens showed up as zero inreport["usage"]and were never folded intoBenchmark.usage/Benchmark.usage_by_component. Cost totals reported$0.0000when agents used a model with no LiteLLM price entry (for example self-hosted vLLM) and judges produced the only billable tokens.The fix moves the usage snapshot to run after
evaluate(). Configs and traces stay where they are, becauseevaluate()consumes traces. The change is a pure lifecycle reordering inside one worker thread, with no API, schema, or threading-model changes. Affects every benchmark that registers a model insetup_evaluators, including ConVerse and MultiAgentBench.Type of Change
Checklist
Contribution
Documentation
docs/(if applicable)Changelog
CHANGELOG.mdunder[Unreleased]sectionArchitecture (if applicable)
maseval/core/do NOT import frommaseval/interface/Additional Notes
New tests in
TestBenchmarkJudgeUsage(filetests/test_core/test_benchmark/test_usage_collection.py):mainand pass on this branch.Verified with
just all(ruff format and check, ty check, full pytest). Result: 2635 passed, 1 skipped, 2 xfailed.Closes #60.