
Conversation


@mikasenghaas mikasenghaas commented Jan 29, 2026

Description

Based on #799. Do not merge before that PR.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Medium Risk
Changes evaluation persistence and progress tracking by introducing resume-from-disk and incremental JSONL/metadata writes, which could affect result completeness/order and file integrity if interrupted.

Overview
Adds resumable evaluations: a new --resume-path/EvalConfig.resume_path lets runs continue from an existing results directory (validated via is_valid_eval_results_path), loading the prior results.jsonl and filtering the remaining inputs per example_id.

Reworks saving to be fully incremental (appending new rollouts and rewriting metadata on each completion) and removes the save_every batching option; the final save now explicitly writes results.jsonl/metadata.json and optionally pushes to the HF Hub via push_results_to_hf_hub. Progress display and metadata computation are updated to account for already-completed rollouts/groups when resuming.
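
For concreteness, a minimal sketch of what such an incremental-save loop could look like: each completed rollout is appended to results.jsonl and metadata.json is rewritten in full after every completion. Every name here (append_jsonl, write_metadata, run_rollouts, the rollout dict shape, the output path) is hypothetical and not the PR's actual API.

import json
from pathlib import Path

def append_jsonl(path: Path, record: dict) -> None:
    # Append one rollout as a single JSON line; earlier lines are never rewritten.
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")

def write_metadata(path: Path, records: list[dict]) -> None:
    # Rewrite the metadata file in full so it always reflects what has been
    # persisted so far, even if the run is interrupted afterwards.
    num_examples = len({r["example_id"] for r in records})
    path.write_text(json.dumps({"num_rollouts": len(records), "num_examples": num_examples}, indent=2))

def run_rollouts():
    # Stub generator standing in for the real evaluation loop.
    yield {"example_id": 0, "rollout_idx": 0, "reward": 1.0}
    yield {"example_id": 0, "rollout_idx": 1, "reward": 0.0}

results_dir = Path("outputs/evals/example-run")  # placeholder directory
results_dir.mkdir(parents=True, exist_ok=True)
completed: list[dict] = []
for rollout in run_rollouts():
    completed.append(rollout)
    append_jsonl(results_dir / "results.jsonl", rollout)
    write_metadata(results_dir / "metadata.json", completed)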

Written by Cursor Bugbot for commit 0aeb614. This will update automatically on new commits.

@mikasenghaas mikasenghaas changed the base branch from main to env-server January 29, 2026 15:17
@mikasenghaas mikasenghaas marked this pull request as ready for review January 29, 2026 17:08
if results_path is not None and is_valid_eval_results_path(results_path):
outputs = load_outputs(results_path)
builder.add_outputs(outputs)
inputs_list = filter_inputs(inputs_list, outputs, rollouts_per_example)

Resume path lacks validation against current config

Medium Severity

When resuming from a previous run, the code loads outputs without validating that they match the current evaluation config. The is_valid_eval_results_path function only checks that metadata.json exists but doesn't compare the stored env_id or model against the current run. If a user accidentally provides a resume path from a different evaluation, outputs from different models or environments are silently mixed together, producing incorrect results with no warning.
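
One way such a guard could look, as a sketch only: compare a few fields from the stored metadata.json against the current run before accepting the resume path. The metadata keys (env_id, model) and the helper name check_resume_compatible are assumptions, not the repository's actual schema or API.

import json
from pathlib import Path

def check_resume_compatible(results_path: Path, env_id: str, model: str) -> None:
    # Hypothetical guard: refuse to resume if the stored run targeted a
    # different environment or model than the current config.
    metadata = json.loads((results_path / "metadata.json").read_text())
    for key, expected in (("env_id", env_id), ("model", model)):
        stored = metadata.get(key)
        if stored is not None and stored != expected:
            raise ValueError(
                f"Resume path {results_path} was created with {key}={stored!r}, "
                f"but the current run uses {key}={expected!r}."
            )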


@mikasenghaas (Member, Author) replied:

Yeah, took the easy route and put it on the user for now to not mess this up.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.


# run evaluation
- results_path = get_eval_results_path(config)
+ results_path = config.resume_path or get_eval_results_path(config)

TUI displays incorrect averages when resuming evaluations

Medium Severity

When resuming an evaluation, the on_progress callback computes incorrect averages. The accumulators (reward_accum, metrics_accum, error_accum) only accumulate values from new_outputs, but completed = len(all_outputs) includes both pre-loaded and new outputs. This causes the displayed reward, metrics, and error rate to be incorrect (too low) during resumed runs because the denominator includes pre-loaded results while the numerator doesn't.
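
A sketch of one possible fix, assuming the callback can see everything completed so far: derive the averages from all_outputs instead of accumulating only from new_outputs, so the numerator and denominator cover the same set. The output field names (reward, error) follow the reviewer's description and are assumptions about the schema.

def on_progress(all_outputs: list[dict]) -> dict:
    # Hypothetical progress callback: compute running stats over everything
    # completed so far (pre-loaded + newly generated), so resumed runs report
    # consistent averages.
    completed = len(all_outputs)
    if completed == 0:
        return {"completed": 0, "avg_reward": 0.0, "error_rate": 0.0}
    avg_reward = sum(o.get("reward", 0.0) for o in all_outputs) / completed
    error_rate = sum(1 for o in all_outputs if o.get("error")) / completed
    return {"completed": completed, "avg_reward": avg_reward, "error_rate": error_rate}

print(on_progress([{"reward": 1.0}, {"reward": 0.0, "error": "timeout"}]))
# -> {'completed': 2, 'avg_reward': 0.5, 'error_rate': 0.5}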


# initialize generate outputs builder
total_rollouts = len(inputs_list)
num_examples = len(set([i["example_id"] for i in inputs_list]))
rollouts_per_example = total_rollouts // num_examples

Division by zero crash with empty inputs

Medium Severity

The calculation rollouts_per_example = total_rollouts // num_examples crashes with ZeroDivisionError when inputs_list is empty (causing num_examples to be 0). The old _compute_metadata method had protection: rollouts_per_example = len(outputs) // num_examples if num_examples > 0 else 1. This protection was removed in the refactor, creating a regression where calling generate() with an empty dataset or list causes an unhandled crash.
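
Restoring the old guard is a one-line change; a sketch, assuming the previous fallback of 1 rollout per example is still the desired behavior for an empty input list:

inputs_list: list[dict] = []  # the empty case that previously crashed

total_rollouts = len(inputs_list)
num_examples = len({i["example_id"] for i in inputs_list})
# fall back to 1 when there are no examples, mirroring the old
# _compute_metadata guard, instead of dividing by zero
rollouts_per_example = total_rollouts // num_examples if num_examples > 0 else 1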


@mikasenghaas mikasenghaas marked this pull request as draft January 29, 2026 17:47
@willccbb willccbb changed the base branch from env-server to main January 30, 2026 03:02