
Conversation


@mikasenghaas mikasenghaas commented Jan 29, 2026

Description

Based on #799. Do not merge before that PR.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Medium Risk
Changes evaluation persistence and progress tracking by introducing resume-from-disk and incremental JSONL/metadata writes, which could affect result completeness/order and file integrity if interrupted.

Overview
Adds resumable evaluations: a new --resume-path/EvalConfig.resume_path lets runs continue from an existing results directory (validated via is_valid_eval_results_path), loading the prior results.jsonl and filtering the remaining inputs per example_id.

Reworks saving to be fully incremental (appending new rollouts and rewriting metadata on each completion) and removes the save_every batching option; the final save now explicitly writes results.jsonl/metadata.json and optionally pushes to the HF Hub via push_results_to_hf_hub. Progress display and metadata computation are updated to account for already-completed rollouts/groups when resuming.
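
For concreteness, a minimal sketch of what such an incremental-save loop could look like: each completed rollout is appended to results.jsonl and metadata.json is rewritten in full after every completion. Every name here (append_jsonl, write_metadata, run_rollouts, the rollout dict shape, the output path) is hypothetical and not the PR's actual API.

import json
from pathlib import Path

def append_jsonl(path: Path, record: dict) -> None:
    # Append one rollout as a single JSON line; earlier lines are never rewritten.
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")

def write_metadata(path: Path, records: list[dict]) -> None:
    # Rewrite the metadata file in full so it always reflects what has been
    # persisted so far, even if the run is interrupted afterwards.
    num_examples = len({r["example_id"] for r in records})
    path.write_text(json.dumps({"num_rollouts": len(records), "num_examples": num_examples}, indent=2))

def run_rollouts():
    # Stub generator standing in for the real evaluation loop.
    yield {"example_id": 0, "rollout_idx": 0, "reward": 1.0}
    yield {"example_id": 0, "rollout_idx": 1, "reward": 0.0}

results_dir = Path("outputs/evals/example-run")  # placeholder directory
results_dir.mkdir(parents=True, exist_ok=True)
completed: list[dict] = []
for rollout in run_rollouts():
    completed.append(rollout)
    append_jsonl(results_dir / "results.jsonl", rollout)
    write_metadata(results_dir / "metadata.json", completed)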

Written by Cursor Bugbot for commit 0aeb614. This will update automatically on new commits.

@mikasenghaas mikasenghaas changed the base branch from main to env-server January 29, 2026 15:17
@mikasenghaas mikasenghaas marked this pull request as ready for review January 29, 2026 17:08
if results_path is not None and is_valid_eval_results_path(results_path):
outputs = load_outputs(results_path)
builder.add_outputs(outputs)
inputs_list = filter_inputs(inputs_list, outputs, rollouts_per_example)

Resume path lacks validation against current config

Medium Severity

When resuming from a previous run, the code loads outputs without validating that they match the current evaluation config. The is_valid_eval_results_path function only checks that metadata.json exists but doesn't compare the stored env_id or model against the current run. If a user accidentally provides a resume path from a different evaluation, outputs from different models or environments are silently mixed together, producing incorrect results with no warning.
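
One way such a guard could look, as a sketch only: compare a few fields from the stored metadata.json against the current run before accepting the resume path. The metadata keys (env_id, model) and the helper name check_resume_compatible are assumptions, not the repository's actual schema or API.

import json
from pathlib import Path

def check_resume_compatible(results_path: Path, env_id: str, model: str) -> None:
    # Hypothetical guard: refuse to resume if the stored run targeted a
    # different environment or model than the current config.
    metadata = json.loads((results_path / "metadata.json").read_text())
    for key, expected in (("env_id", env_id), ("model", model)):
        stored = metadata.get(key)
        if stored is not None and stored != expected:
            raise ValueError(
                f"Resume path {results_path} was created with {key}={stored!r}, "
                f"but the current run uses {key}={expected!r}."
            )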


@mikasenghaas (Member, Author) replied:

Yeah, took the easy route and put it on the user for now to not mess this up.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.


# run evaluation
- results_path = get_eval_results_path(config)
+ results_path = config.resume_path or get_eval_results_path(config)

TUI displays incorrect averages when resuming evaluations

Medium Severity

When resuming an evaluation, the on_progress callback computes incorrect averages. The accumulators (reward_accum, metrics_accum, error_accum) only accumulate values from new_outputs, but completed = len(all_outputs) includes both pre-loaded and new outputs. This causes the displayed reward, metrics, and error rate to be incorrect (too low) during resumed runs because the denominator includes pre-loaded results while the numerator doesn't.
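
A sketch of one possible fix, assuming the callback can see everything completed so far: derive the averages from all_outputs instead of accumulating only from new_outputs, so the numerator and denominator cover the same set. The output field names (reward, error) follow the reviewer's description and are assumptions about the schema.

def on_progress(all_outputs: list[dict]) -> dict:
    # Hypothetical progress callback: compute running stats over everything
    # completed so far (pre-loaded + newly generated), so resumed runs report
    # consistent averages.
    completed = len(all_outputs)
    if completed == 0:
        return {"completed": 0, "avg_reward": 0.0, "error_rate": 0.0}
    avg_reward = sum(o.get("reward", 0.0) for o in all_outputs) / completed
    error_rate = sum(1 for o in all_outputs if o.get("error")) / completed
    return {"completed": completed, "avg_reward": avg_reward, "error_rate": error_rate}

print(on_progress([{"reward": 1.0}, {"reward": 0.0, "error": "timeout"}]))
# -> {'completed': 2, 'avg_reward': 0.5, 'error_rate': 0.5}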


# initialize generate outputs builder
total_rollouts = len(inputs_list)
num_examples = len(set([i["example_id"] for i in inputs_list]))
rollouts_per_example = total_rollouts // num_examples

Division by zero crash with empty inputs

Medium Severity

The calculation rollouts_per_example = total_rollouts // num_examples crashes with ZeroDivisionError when inputs_list is empty (causing num_examples to be 0). The old _compute_metadata method had protection: rollouts_per_example = len(outputs) // num_examples if num_examples > 0 else 1. This protection was removed in the refactor, creating a regression where calling generate() with an empty dataset or list causes an unhandled crash.
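
Restoring the old guard is a one-line change; a sketch, assuming the previous fallback of 1 rollout per example is still the desired behavior for an empty input list:

inputs_list: list[dict] = []  # the empty case that previously crashed

total_rollouts = len(inputs_list)
num_examples = len({i["example_id"] for i in inputs_list})
# fall back to 1 when there are no examples, mirroring the old
# _compute_metadata guard, instead of dividing by zero
rollouts_per_example = total_rollouts // num_examples if num_examples > 0 else 1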


@mikasenghaas mikasenghaas marked this pull request as draft January 29, 2026 17:47
@willccbb willccbb changed the base branch from env-server to main January 30, 2026 03:02