feat(evaluation): add static JSON scorer #369
DhavalRepo18
left a comment
We need to see how this pipeline gets invoked. It has to be aligned with the LLM-as-judge pipeline for compatibility.
I see you are using a .csv, but in the ideal case:
- an agent will have a trajectory
- we extract the answer and the scenario id from it
- we get the ground truth
- we run scoring on that
- we may aggregate at benchmark level
- we report.
So let us see what the current workflow is.
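The steps listed above can be sketched roughly as follows. This is only an illustration, not the existing AssetOpsBench workflow: the trajectory layout (`scenario_id`, `model`, `final_answer` keys), the function names, and the exact-match stand-in scorer are all assumptions made for the example.

```python
from collections import defaultdict


def extract_answer(trajectory):
    # Assumed trajectory layout (hypothetical):
    # {"scenario_id": ..., "model": ..., "final_answer": ...}
    return trajectory["scenario_id"], trajectory["final_answer"]


def evaluate(trajectories, ground_truth, scorer):
    """Score each trajectory against the ground truth for its scenario."""
    rows = []
    for trajectory in trajectories:
        scenario_id, answer = extract_answer(trajectory)
        expected = ground_truth[scenario_id]  # raises KeyError if missing
        rows.append({
            "model": trajectory["model"],
            "scenario_id": scenario_id,
            "score": scorer(answer, expected),
        })
    return rows


def aggregate(rows):
    """Average the per-scenario scores for each model."""
    scores = defaultdict(list)
    for row in rows:
        scores[row["model"]].append(row["score"])
    return {model: sum(vals) / len(vals) for model, vals in scores.items()}


def exact_match(answer, expected):
    # Stand-in scorer; a structured JSON scorer would be plugged in here.
    return 1.0 if answer == expected else 0.0


if __name__ == "__main__":
    ground_truth = {"s1": "[1, 2]", "s2": "3"}
    trajectories = [
        {"scenario_id": "s1", "model": "model-a", "final_answer": "[1, 2]"},
        {"scenario_id": "s2", "model": "model-a", "final_answer": "4"},
    ]
    rows = evaluate(trajectories, ground_truth, exact_match)
    print(aggregate(rows))  # report at the model level
```

Passing the scorer in as a parameter would let the same loop serve both a deterministic scorer and an LLM-as-judge call, which is one way to keep the two pipelines compatible.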
@DhavalRepo18, The CSV script was only intended as a helper for the current manual scenario-verification sheet, which itself was created from trajectory outputs. I agree the main evaluator should follow the existing benchmark workflow - read saved agent trajectories, extract the final answer and scenario id, load the ground-truth answer for that scenario, run the static JSON scorer, then aggregate results at the benchmark/model level. I will update the PR to make |

Summary
Adds a DeepSynth-style static evaluator for structured AssetOpsBench answers.
Changes
- Added src/evaluation/static_json_score.py
- Added src/evaluation/static_json_table_eval.py
- Added unit tests under src/evaluation/tests/
Motivation
Many AssetOpsBench scenarios expect structured answers such as JSON objects, arrays, dictionaries, tuple lists, or numeric counts. Exact string matching is too brittle when agents return correct structured content with different formatting, markdown fences, prefixes, or Python-style literals. This evaluator provides deterministic structured-output scoring without relying only on LLM-as-judge.
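To illustrate the kind of normalization this motivation calls for, here is a minimal sketch. It is not the code in `static_json_score.py`; the function names (`parse_structured_answer`, `normalize`, `answers_match`) are hypothetical. It assumes the answer contains at most one markdown fence and that any prose prefix ends before the first bracket or brace.

```python
import ast
import json
import re

# Matches a markdown fence with an optional language tag on its first line.
FENCE_RE = re.compile(r"```(?:[A-Za-z]+\n)?(.*?)```", re.DOTALL)


def normalize(value):
    """Convert tuples to lists and dict keys to strings, recursively."""
    if isinstance(value, dict):
        return {str(k): normalize(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalize(v) for v in value]
    return value


def parse_structured_answer(text):
    """Best-effort parse of an answer string; returns None on failure."""
    text = text.strip()
    fence = FENCE_RE.search(text)
    if fence:
        text = fence.group(1).strip()
    # Drop any prose prefix before the first opening bracket or brace.
    starts = [i for i in (text.find(c) for c in "[{(") if i != -1]
    if starts:
        text = text[min(starts):]
    try:
        return normalize(json.loads(text))  # strict JSON first
    except ValueError:
        pass
    try:
        return normalize(ast.literal_eval(text))  # Python-style literals
    except (ValueError, SyntaxError, TypeError):
        return None


def answers_match(predicted, expected):
    """Compare two answers after parsing; unparseable input never matches."""
    pred = parse_structured_answer(predicted)
    gold = parse_structured_answer(expected)
    # Note: a literal JSON null also parses to None and is treated as no match.
    return pred is not None and pred == gold


if __name__ == "__main__":
    print(answers_match("Final answer: [('x', 1), ('y', 2)]",
                        '[["x", 1], ["y", 2]]'))
```

Comparing parsed values rather than strings means key order, whitespace, quote style, and tuple-versus-list differences do not affect the result. Text trailing after the structure is not handled in this sketch.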
Usage
For a scenario verification CSV with columns such as:
Run:
This writes:
To explicitly choose model columns:
Testing