resume evals #803
base: main
Conversation
Force-pushed from 6aa6519 to c72f002
Force-pushed from c72f002 to f7485d4
```python
if results_path is not None and is_valid_eval_results_path(results_path):
    outputs = load_outputs(results_path)
    builder.add_outputs(outputs)
    inputs_list = filter_inputs(inputs_list, outputs, rollouts_per_example)
```
Resume path lacks validation against current config
Medium Severity
When resuming from a previous run, the code loads outputs without validating that they match the current evaluation config. The `is_valid_eval_results_path` function only checks that `metadata.json` exists but doesn't compare the stored `env_id` or `model` against the current run. If a user accidentally provides a resume path from a different evaluation, outputs from different models or environments are silently mixed together, producing incorrect results with no warning.
Additional Locations (1)
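As a rough illustration of the check being requested, the resume path could be validated against the current run before any prior outputs are loaded. This is only a sketch: the helper name and the metadata fields (`env_id`, `model`) are assumed from the comment above, not taken from the PR.

```python
# Hypothetical stricter resume-path check (sketch, not the PR's code).
# Assumes metadata.json stores the env_id and model used for the original run.
import json
from pathlib import Path


def validate_resume_path(results_path: Path, env_id: str, model: str) -> None:
    """Raise if the stored run metadata does not match the current config."""
    metadata = json.loads((results_path / "metadata.json").read_text())
    for key, expected in (("env_id", env_id), ("model", model)):
        stored = metadata.get(key)
        if stored != expected:
            raise ValueError(
                f"resume path {results_path} was produced with {key}={stored!r}, "
                f"but the current run uses {key}={expected!r}"
            )
```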
yea took the easy route and put it on the user for now to not mess this up
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```diff
  # run evaluation
- results_path = get_eval_results_path(config)
+ results_path = config.resume_path or get_eval_results_path(config)
```
TUI displays incorrect averages when resuming evaluations
Medium Severity
When resuming an evaluation, the `on_progress` callback computes incorrect averages. The accumulators (`reward_accum`, `metrics_accum`, `error_accum`) only accumulate values from `new_outputs`, but `completed = len(all_outputs)` includes both pre-loaded and new outputs. This causes the displayed reward, metrics, and error rate to be incorrect (too low) during resumed runs, because the denominator includes pre-loaded results while the numerator doesn't.
Additional Locations (1)
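One way to fix this, sketched below, is to seed the accumulator from the pre-loaded outputs so the numerator covers the same rollouts as the `completed` denominator. The names (`reward_accum`, `all_outputs`) mirror the comment above; the record layout (`{"reward": float}`) and callback shape are assumptions.

```python
# Sketch: keep the displayed average consistent when resuming by seeding the
# accumulator with the rewards that were loaded from disk.
preloaded_outputs = [{"reward": 0.5}, {"reward": 1.0}]  # e.g. loaded from results.jsonl
reward_accum = sum(o["reward"] for o in preloaded_outputs)
all_outputs = list(preloaded_outputs)


def on_progress(new_output: dict) -> None:
    global reward_accum
    all_outputs.append(new_output)
    reward_accum += new_output["reward"]
    completed = len(all_outputs)           # pre-loaded + new
    avg_reward = reward_accum / completed  # denominator now matches the numerator
    print(f"completed={completed} avg_reward={avg_reward:.3f}")


on_progress({"reward": 0.0})  # prints completed=3 avg_reward=0.500
```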
```python
# initialize generate outputs builder
total_rollouts = len(inputs_list)
num_examples = len(set([i["example_id"] for i in inputs_list]))
rollouts_per_example = total_rollouts // num_examples
```
Division by zero crash with empty inputs
Medium Severity
The calculation `rollouts_per_example = total_rollouts // num_examples` crashes with `ZeroDivisionError` when `inputs_list` is empty (causing `num_examples` to be 0). The old `_compute_metadata` method had protection: `rollouts_per_example = len(outputs) // num_examples if num_examples > 0 else 1`. This protection was removed in the refactor, creating a regression where calling `generate()` with an empty dataset or list causes an unhandled crash.
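A minimal sketch of restoring the quoted guard, assuming `inputs_list` is a list of dicts with an `example_id` key:

```python
# Guarded version (sketch): fall back to 1 rollout per example when the
# inputs list is empty, matching the old _compute_metadata behavior.
inputs_list: list[dict] = []  # empty dataset should no longer crash

total_rollouts = len(inputs_list)
num_examples = len({i["example_id"] for i in inputs_list})
rollouts_per_example = total_rollouts // num_examples if num_examples > 0 else 1
```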
Description
Type of Change
Testing
`uv run pytest` locally.
Checklist
Additional Notes
Note
Medium Risk
Changes evaluation persistence and progress tracking by introducing resume-from-disk and incremental JSONL/metadata writes, which could affect result completeness/order and file integrity if interrupted.
Overview
Adds resumable evaluations: a new `--resume-path` / `EvalConfig.resume_path` lets runs continue from an existing results directory (validated via `is_valid_eval_results_path`), loading prior `results.jsonl` and filtering remaining inputs per `example_id`.
Reworks saving to be fully incremental (append new rollouts and rewrite metadata on each completion) and removes the `save_every` batching option; the final save now explicitly writes `results.jsonl` / `metadata.json` and optionally pushes to the HF Hub via `push_results_to_hf_hub`. Progress display and metadata computation are updated to account for already-completed rollouts/groups when resuming.
Written by Cursor Bugbot for commit 0aeb614. This will update automatically on new commits.
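For readers unfamiliar with the flow, here is a rough sketch of what the per-`example_id` filtering could look like. `filter_inputs`'s actual signature and logic are not shown on this page, so the code below is an assumption for illustration, not the PR's implementation.

```python
# Sketch: drop inputs whose example_id already has enough completed rollouts
# in the loaded results, so a resumed run only generates what is missing.
from collections import Counter


def filter_inputs(inputs_list: list[dict], outputs: list[dict],
                  rollouts_per_example: int) -> list[dict]:
    done = Counter(o["example_id"] for o in outputs)
    remaining: list[dict] = []
    for inp in inputs_list:
        if done[inp["example_id"]] < rollouts_per_example:
            done[inp["example_id"]] += 1  # reserve a slot for this rollout
            remaining.append(inp)
    return remaining
```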