feat(evaluation): add static JSON scorer #369

Open
ChathurangiShyalika wants to merge 2 commits into IBM:main from ChathurangiShyalika:add-static-json-evaluator
@ChathurangiShyalika

Collaborator

Summary

Adds a DeepSynth-style static evaluator for structured AssetOpsBench answers.

Changes

  • Added src/evaluation/static_json_score.py

    • Parses JSON objects, JSON arrays, Python literals, tuple lists, markdown code blocks, noisy answer prefixes, and count-only answers
    • Normalizes scalar values
    • Flattens nested structures into dot/index keys
    • Computes strict exact match, partial exact match, partial similarity, precision, recall, F1, missing keys, extra keys, and detailed key-level comparisons
  • Added src/evaluation/static_json_table_eval.py

    • Evaluates scenario verification CSV files
    • Compares each model-output column against the final verified ground-truth answer
    • Produces row-level and model-level performance tables
  • Added unit tests under src/evaluation/tests/
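
The flattening and key-level comparison listed above can be sketched roughly as follows. This is a minimal illustration, not the code in static_json_score.py: the names `flatten` and `key_level_scores` are hypothetical, values are compared with plain equality (no scalar normalization), and empty containers simply contribute no keys.

```python
from typing import Any


def flatten(value: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts/lists/tuples into dot/index keys (hypothetical helper)."""
    if isinstance(value, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in value.items()]
    elif isinstance(value, (list, tuple)):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(value)]
    else:
        # Leaf value: store it under the key path built so far.
        return {prefix: value}
    flat: dict[str, Any] = {}
    for key, child in items:
        flat.update(flatten(child, key))
    return flat


def key_level_scores(gold: Any, pred: Any) -> dict[str, Any]:
    """Compare flattened keys and report precision, recall, F1, missing and extra keys."""
    g, p = flatten(gold), flatten(pred)
    # A key counts as matched only if it exists on both sides with an equal value.
    matched = [k for k in g if k in p and g[k] == p[k]]
    precision = len(matched) / len(p) if p else 0.0
    recall = len(matched) / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "strict_exact_match": g == p,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "missing_keys": sorted(set(g) - set(p)),
        "extra_keys": sorted(set(p) - set(g)),
    }
```

For example, a prediction that drops one list element and adds an unexpected field loses both recall and precision, and the two offending keys show up in `missing_keys` and `extra_keys` respectively.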

Motivation

Many AssetOpsBench scenarios expect structured answers such as JSON objects, arrays, dictionaries, tuple lists, or numeric counts. Exact string matching is too brittle when agents return correct structured content with different formatting, markdown fences, prefixes, or Python-style literals. This evaluator provides deterministic structured-output scoring without relying only on LLM-as-judge.
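
To illustrate why tolerant parsing helps, here is a minimal sketch of recovering a structured value from a noisy answer. It assumes a hypothetical helper `parse_answer` and is not the parser in this PR: it unwraps a markdown fence if present, tries JSON, falls back to a Python literal, and retries from the first opening bracket to skip a prefix such as "Answer:". Trailing text after the structure and prefixed count-only answers are not handled here.

```python
import ast
import json
import re
from typing import Any

# Matches a markdown fence (three backticks), an optional language tag, and the body.
_FENCE = re.compile(r"`{3}(?:\w+)?\s*(.*?)`{3}", re.DOTALL)


def parse_answer(text: str) -> Any:
    """Best-effort parse of a model answer into a Python object (hypothetical helper)."""
    fenced = _FENCE.search(text)
    candidate = (fenced.group(1) if fenced else text).strip()

    attempts = [candidate]
    start = re.search(r"[\[{(]", candidate)
    if start:
        # Second attempt: drop any noisy prefix before the first bracket.
        attempts.append(candidate[start.start():])

    for attempt in attempts:
        for loader in (json.loads, ast.literal_eval):
            try:
                return loader(attempt)
            except (ValueError, SyntaxError, TypeError):
                continue
    # Nothing parsed: return the cleaned string so the caller can decide what to do.
    return candidate
```

With this sketch, a JSON object behind an "Answer:" prefix, a Python-style tuple list, and a fenced JSON block all parse to native objects that can then be compared structurally instead of as strings.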

Usage

For a scenario verification CSV with columns such as:

id
Original question
final verified answer (ground truth)
Revised question
litellm_proxy/aws/claude-opus-4-8
Azure/gpt-5-mini-2025-08-07

Run:

uv run python -m evaluation.static_json_table_eval \
  --input scenario_verification.csv \
  --gold-column "final verified answer (ground truth)" \
  --id-column id \
  --output-dir reports/static_json_eval

This writes:

reports/static_json_eval/static_json_eval_details.csv
reports/static_json_eval/static_json_eval_summary.csv

To explicitly choose model columns:

uv run python -m evaluation.static_json_table_eval \
  --input scenario_verification.csv \
  --gold-column "final verified answer (ground truth)" \
  --id-column id \
  --model-column "litellm_proxy/aws/claude-opus-4-8" \
  --model-column "Azure/gpt-5-mini-2025-08-07" \
  --output-dir reports/static_json_eval

Testing

uv run pytest src/evaluation/tests/test_static_json_score.py src/evaluation/tests/test_static_json_table_eval.py -v

@DhavalRepo18 DhavalRepo18 left a comment

Collaborator

We need to see how these pipelines get invoked. This has to be aligned with the LLM-As-Judge pipeline for compatibility.

I see you are using .csv, but in the ideal case:

  • an agent produces a trajectory
  • we extract the answer and the scenario id from it
  • we get the ground truth
  • we run scoring on that
  • we may aggregate at the benchmark level
  • we report.
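
The steps above could be sketched like this. It is only an illustration under stated assumptions: the trajectory fields (`scenario_id`, `model`, `final_answer`), the function `evaluate_trajectories`, and the placeholder `exact_match` scorer are all hypothetical and do not reflect the repository's actual trajectory schema or evaluate command.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Callable, Iterable


def exact_match(gold: Any, pred: Any) -> float:
    """Placeholder scorer; a structured JSON scorer would be plugged in here."""
    return 1.0 if gold == pred else 0.0


def evaluate_trajectories(
    trajectories: Iterable[dict],
    ground_truth: dict[str, Any],
    scorer: Callable[[Any, Any], float] = exact_match,
) -> tuple[list[dict], dict[str, float]]:
    """Score each trajectory against its scenario's ground truth, then aggregate per model."""
    rows = []
    for traj in trajectories:
        scenario_id = str(traj["scenario_id"])  # assumed field name
        if scenario_id not in ground_truth:
            continue  # no ground truth for this scenario, so skip it
        score = scorer(ground_truth[scenario_id], traj["final_answer"])
        rows.append({"scenario_id": scenario_id, "model": traj["model"], "score": score})

    # Benchmark-level aggregation: mean score per model.
    per_model: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        per_model[row["model"]].append(row["score"])
    summary = {model: mean(scores) for model, scores in per_model.items()}
    return rows, summary
```

Keeping the scorer as a plain callable means the same loop can serve LLM-As-Judge and a static scorer alike.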

So let us see what the current workflow is.

@DhavalRepo18 DhavalRepo18 requested a review from ShuxinLin June 13, 2026 23:42
@ChathurangiShyalika

Collaborator Author

@DhavalRepo18, the CSV script was only intended as a helper for the current manual scenario-verification sheet, which itself was created from trajectory outputs. I agree the main evaluator should follow the existing benchmark workflow: read saved agent trajectories, extract the final answer and scenario id, load the ground-truth answer for that scenario, run the static JSON scorer, then aggregate results at the benchmark/model level. I will update the PR to make static_json_score.py the reusable scorer and add a trajectory-based evaluation path aligned with the current LLM-as-Judge workflow. The CSV evaluator can remain optional or be removed if we want to keep the PR focused.

@DhavalRepo18

Collaborator
  • CSV input is not something we intend to support, so we should either support it across all the evaluators or not at all. I suggest we drop it and open a separate issue for it.
  • The file structure also needs to be adjusted, especially where to put the new set of evaluators.
  • File names are not consistent, and my educated guess is that the file is also not in the right folder; it should go inside scores.
  • Please check the Evaluation section in the README:
export AGENT_TRAJECTORY_DIR=$(pwd)/traces/trajectories
uv run claude-agent "List all failure modes of asset Chiller." --scenario-id 101

# 2. Score with LLM-As-Judge

uv run evaluate \
  --trajectories traces/trajectories \
  --scenarios groundtruth/101.json \
  --scorer-default llm_judge \
  --judge-model litellm_proxy/aws/claude-opus-4-6

I should be able to specify your new scorer in place of llm_judge.
It needs a little bit of thinking, but please accommodate these changes.
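
One way to make a scorer selectable by name is a small registry, sketched below. Everything here is hypothetical: the names `SCORERS`, `register_scorer`, `get_scorer`, and the `static_json` key are illustrations only, and the existing evaluate command may resolve `--scorer-default` differently.

```python
from typing import Any, Callable

Scorer = Callable[[Any, Any], float]

# Hypothetical name -> scorer mapping that a CLI option could look up.
SCORERS: dict[str, Scorer] = {}


def register_scorer(name: str) -> Callable[[Scorer], Scorer]:
    """Decorator that registers a scorer function under a name."""
    def decorator(fn: Scorer) -> Scorer:
        SCORERS[name] = fn
        return fn
    return decorator


@register_scorer("static_json")
def static_json(gold: Any, pred: Any) -> float:
    # Placeholder body: a real scorer would parse, flatten, and compare key by key.
    return 1.0 if gold == pred else 0.0


def get_scorer(name: str) -> Scorer:
    """Resolve a scorer by name, failing loudly on unknown names."""
    try:
        return SCORERS[name]
    except KeyError:
        raise ValueError(f"unknown scorer {name!r}; available: {sorted(SCORERS)}") from None
```

With this shape, switching from one scorer to another is a change of the name passed on the command line rather than a change of pipeline.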
