feat(evaluation): add static JSON scorer by ChathurangiShyalika · Pull Request #369 · IBM/AssetOpsBench

ChathurangiShyalika · 2026-06-13T23:23:20Z

Summary

Adds a DeepSynth-style static evaluator for structured AssetOpsBench answers.

Changes

Added src/evaluation/static_json_score.py
- Parses JSON objects, JSON arrays, Python literals, tuple lists, markdown code blocks, noisy answer prefixes, and count-only answers
- Normalizes scalar values
- Flattens nested structures into dot/index keys
- Computes strict exact match, partial exact match, partial similarity, precision, recall, F1, missing keys, extra keys, and detailed key-level comparisons
Added src/evaluation/static_json_table_eval.py
- Evaluates scenario verification CSV files
- Compares each model-output column against the final verified ground-truth answer
- Produces row-level and model-level performance tables
Added unit tests under src/evaluation/tests/
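
The parsing, normalization, and flattening steps listed above can be illustrated with a minimal sketch. This is not the code in static_json_score.py: the function names (`parse_answer`, `normalize_scalar`, `flatten`), the `modes.0` key format, and the case-folding rule are all assumptions made for illustration.

```python
import ast
import json
import re

# Matches a markdown code fence with an optional language tag on the opening line.
FENCE_RE = re.compile(r"```(?:[\w+-]*\n)?(.*?)```", re.DOTALL)


def parse_answer(text):
    """Best-effort parse of a model answer into a Python value (illustrative only)."""
    text = text.strip()
    fenced = FENCE_RE.search(text)
    if fenced:
        text = fenced.group(1).strip()
    # Skip a noisy prefix such as "Answer:" by starting at the first bracket.
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    candidate = text[min(starts):] if starts else text
    # Try strict JSON first, then Python literals (tuples, single quotes).
    for loader in (json.loads, ast.literal_eval):
        try:
            return loader(candidate)
        except (ValueError, SyntaxError, TypeError):
            continue
    # Count-only answers: fall back to the first number found in the text.
    number = re.search(r"-?\d+(?:\.\d+)?", text)
    if number:
        token = number.group()
        return float(token) if "." in token else int(token)
    return text


def normalize_scalar(value):
    """Assumed normalization: numbers become floats, strings are trimmed and case-folded."""
    if isinstance(value, bool) or value is None:
        return value
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        stripped = value.strip()
        try:
            return float(stripped)
        except ValueError:
            return stripped.casefold()
    return value


def flatten(value, prefix=""):
    """Flatten nested dicts/lists/tuples into {"a.b.0": scalar} form."""
    if isinstance(value, dict):
        items = [(str(k), v) for k, v in value.items()]
    elif isinstance(value, (list, tuple)):
        items = [(str(i), v) for i, v in enumerate(value)]
    else:
        return {prefix: normalize_scalar(value)}
    flat = {}
    for key, child in items:
        flat.update(flatten(child, f"{prefix}.{key}" if prefix else key))
    return flat
```

With these helpers, a fenced JSON answer and the same content written as a Python literal flatten to the same dictionary, which is what makes key-level comparison possible.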

Motivation

Many AssetOpsBench scenarios expect structured answers such as JSON objects, arrays, dictionaries, tuple lists, or numeric counts. Exact string matching is too brittle when agents return correct structured content with different formatting, markdown fences, prefixes, or Python-style literals. This evaluator provides deterministic structured-output scoring without relying only on LLM-as-judge.
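
To make the contrast with exact string matching concrete, here is a hedged sketch of key-level scoring over two already-flattened answers. The metric definitions (a key counts as matched only when it is present on both sides with an equal value) and the function name `score_flat` are assumptions; the PR's scorer may define partial match and similarity differently.

```python
def score_flat(gold, pred):
    """Compare two flat {key: value} dicts and return key-level metrics."""
    gold_keys, pred_keys = set(gold), set(pred)
    # A key is matched when both sides have it and the values are equal.
    matched = {k for k in gold_keys & pred_keys if gold[k] == pred[k]}
    precision = len(matched) / len(pred_keys) if pred_keys else 0.0
    recall = len(matched) / len(gold_keys) if gold_keys else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "exact_match": gold == pred,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "missing_keys": sorted(gold_keys - pred_keys),
        "extra_keys": sorted(pred_keys - gold_keys),
    }


# Hypothetical flattened answers for one scenario.
gold = {"asset": "chiller", "modes.0": "leak", "modes.1": "fouling"}
pred = {"asset": "chiller", "modes.0": "leak", "extra": 1.0}
result = score_flat(gold, pred)
```

In this example two of three keys match on each side, so precision, recall, and F1 are all 2/3, while a plain string comparison would only report a mismatch.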

Usage

For a scenario verification CSV with columns such as:

id
Original question
final verified answer (ground truth)
Revised question
litellm_proxy/aws/claude-opus-4-8
Azure/gpt-5-mini-2025-08-07

Run:

uv run python -m evaluation.static_json_table_eval \
  --input scenario_verification.csv \
  --gold-column "final verified answer (ground truth)" \
  --id-column id \
  --output-dir reports/static_json_eval

This writes:

reports/static_json_eval/static_json_eval_details.csv
reports/static_json_eval/static_json_eval_summary.csv

To explicitly choose model columns:

uv run python -m evaluation.static_json_table_eval \
  --input scenario_verification.csv \
  --gold-column "final verified answer (ground truth)" \
  --id-column id \
  --model-column "litellm_proxy/aws/claude-opus-4-8" \
  --model-column "Azure/gpt-5-mini-2025-08-07" \
  --output-dir reports/static_json_eval

Testing

uv run pytest src/evaluation/tests/test_static_json_score.py src/evaluation/tests/test_static_json_table_eval.py -v

DhavalRepo18

We need to see how this pipeline gets invoked. It has to be aligned with the LLM-As-Judge pipeline for compatibility.

I see you are using .csv, but in the ideal case:

an agent produces a trajectory
we extract the answer and the scenario id from it
we get the ground truth
we run scoring on that
we may aggregate at the benchmark level
we report.

So let us look at what the current workflow is.
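
The steps above can be sketched roughly as follows. The trajectory layout used here (one JSON file per run with `scenario_id`, `model`, and `final_answer` fields) is a hypothetical stand-in, not the format AssetOpsBench actually writes, and `exact_scorer` is only a placeholder for whichever scorer is plugged in.

```python
import json
from collections import defaultdict
from pathlib import Path


def exact_scorer(gold, answer):
    """Placeholder scorer: 1.0 on an exact (whitespace-trimmed) string match."""
    return 1.0 if str(gold).strip() == str(answer).strip() else 0.0


def evaluate_trajectories(trajectory_dir, ground_truth, scorer=exact_scorer):
    """Score every saved trajectory that has a ground-truth entry.

    Assumes each *.json file holds a dict with "scenario_id" and
    "final_answer" keys (hypothetical layout).
    """
    rows = []
    for path in sorted(Path(trajectory_dir).glob("*.json")):
        trajectory = json.loads(path.read_text(encoding="utf-8"))
        scenario_id = str(trajectory["scenario_id"])
        if scenario_id not in ground_truth:
            continue  # no ground truth for this scenario, skip it
        rows.append({
            "scenario_id": scenario_id,
            "model": trajectory.get("model", "unknown"),
            "score": scorer(ground_truth[scenario_id], trajectory["final_answer"]),
        })
    return rows


def aggregate(rows):
    """Average the row-level scores per model."""
    per_model = defaultdict(list)
    for row in rows:
        per_model[row["model"]].append(row["score"])
    return {model: sum(scores) / len(scores) for model, scores in per_model.items()}
```

Keeping the scorer as a plain callable argument means the same loop could serve an LLM judge or a static scorer without changing the extraction or aggregation code.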

ChathurangiShyalika · 2026-06-13T23:48:10Z

@DhavalRepo18, the CSV script was only intended as a helper for the current manual scenario-verification sheet, which itself was created from trajectory outputs. I agree the main evaluator should follow the existing benchmark workflow: read saved agent trajectories, extract the final answer and scenario id, load the ground-truth answer for that scenario, run the static JSON scorer, then aggregate results at the benchmark/model level. I will update the PR to make static_json_score.py the reusable scorer and add a trajectory-based evaluation path aligned with the current LLM-as-Judge workflow. The CSV evaluator can remain optional or be removed if we want to keep the PR focused.

DhavalRepo18 · 2026-06-14T17:40:59Z

CSV is not something we intend to support, so we should either support it across all the evaluators or not at all. I suggest we drop it and open a separate issue for it.
The file structure also needs to be adjusted, especially where the new set of evaluators should live.

The file names are not consistent, and my educated guess is that they are also not in the right folder; they should go inside scores.
Please check the Evaluation section in the README.

export AGENT_TRAJECTORY_DIR=$(pwd)/traces/trajectories
uv run claude-agent "List all failure modes of asset Chiller." --scenario-id 101

# 2. Score with LLM-As-Judge

uv run evaluate \
  --trajectories traces/trajectories \
  --scenarios groundtruth/101.json \
  --scorer-default llm_judge \
  --judge-model litellm_proxy/aws/claude-opus-4-6

I should be able to specify your incoming scorer in place of llm_judge.
It needs a little bit of thinking, but please accommodate these changes.
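
One way to make scorers selectable by name, as requested above, is a small registry. This is only a sketch under stated assumptions: the names `SCORERS`, `register_scorer`, `get_scorer`, and the scorer key `static_json` are hypothetical, and nothing here reflects how the existing `evaluate` command resolves `--scorer-default` internally.

```python
import json
from typing import Callable, Dict

Scorer = Callable[[str, str], float]

# Hypothetical registry mapping a scorer name to its implementation.
SCORERS: Dict[str, Scorer] = {}


def register_scorer(name: str):
    """Decorator that registers a scorer function under the given name."""
    def decorator(fn: Scorer) -> Scorer:
        SCORERS[name] = fn
        return fn
    return decorator


@register_scorer("exact")
def exact(gold: str, answer: str) -> float:
    return 1.0 if gold.strip() == answer.strip() else 0.0


@register_scorer("static_json")  # assumed name for the new scorer
def static_json(gold: str, answer: str) -> float:
    # Simplified stand-in: compare parsed JSON values instead of raw strings.
    try:
        return 1.0 if json.loads(gold) == json.loads(answer) else 0.0
    except json.JSONDecodeError:
        return 0.0


def get_scorer(name: str) -> Scorer:
    """Look up a scorer by name, with a readable error for unknown names."""
    try:
        return SCORERS[name]
    except KeyError:
        raise ValueError(
            f"unknown scorer {name!r}; available: {sorted(SCORERS)}"
        ) from None
```

A command-line flag would then only need to pass its string value to `get_scorer`, so adding a scorer does not require touching the CLI parsing code.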

feat(evaluation): add static JSON scorer

7083a91

ChathurangiShyalika requested a review from DhavalRepo18 June 13, 2026 23:23

DhavalRepo18 reviewed Jun 13, 2026


DhavalRepo18 requested a review from ShuxinLin June 13, 2026 23:42

docs(evaluation): add static JSON evaluation guide

6e898e0

ChathurangiShyalika wants to merge 2 commits into IBM:main from ChathurangiShyalika:add-static-json-evaluator

2 participants
