vec_log per-agent population mean - average over agents instead of over completed episodes#459
Open
eugenevinitsky wants to merge 11 commits into
Open
vec_log per-agent population mean - average over agents instead of over completed episodes#459eugenevinitsky wants to merge 11 commits into
eugenevinitsky wants to merge 11 commits into
Conversation
Replaces the existing 'sum-over-completed-episodes / completed-episode-count' aggregation in vec_log with a two-stage per-agent mean: each agent slot contributes one window-mean (mean over its completions since last emit), then those means are averaged across agents. An agent that completes 10 short infraction-terminated episodes now has the same weight in the metric as an agent that completes one full clean episode. drive.h: - New per-env state: per_agent_log_sum[], per_agent_log_count[], per_agent_log_capacity. Grow-only realloc handles REPLAY scenarios where active_agent_count can vary across c_reset. - add_log now stamps the completing agent's slot instead of env->log, and bumps per_agent_log_count[i] instead of env->log.n. - New prepare_log: drains each agent's slot to (sum / count), sums across agents into env->log, sets env->log.n = #agents-with-data. Resets the per-agent buffers in place. - env_static_car_count and static_car_count are folded into each agent's slot so the population mean recovers the env's per-scenario value. drive/binding.c: - New vec_prepare_log binding (MY_METHODS) that calls prepare_log on every env in a VecEnv. drive/drive.py: - Calls binding.vec_prepare_log right before binding.vec_log every report_interval steps. env_binding.h: - Relaxes the vec_log gate from `aggregate.n < num_agents` (wait until enough completed episodes) to `aggregate.n < 1` (emit if anyone has data). For Drive this means "emit if any env has at least one agent with a window-mean"; for other ocean envs sharing this header it means smaller batches emitted more often, which the Python-side mean_and_log in pufferl.py already amortizes via its 0.25s rate-limit. - num_agents arg kept on the C signature for caller-API compatibility, but is now unused. Validated locally: `python setup.py build_ext --inplace --force` clean, existing scenario-length and single-agent-yaml tests pass, and a 64-agent smoke run emits one log dict per scenario_length window with n=num_agents. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR changes how Drive environment metrics are aggregated in vec_log to remove completion-rate bias by switching from “mean over completed episodes” to a two-stage “per-agent window mean, then population mean across agent slots” approach. It also adds a Drive-specific vec_prepare_log step to drain per-agent accumulators into env->log immediately before vec_log.
Changes:
- Add per-agent log accumulators in Drive and a
prepare_log()drain step to produce per-agent population means. - Add a new C binding
vec_prepare_logand call it fromdrive.pybeforevec_logat eachreport_interval. - Relax
vec_log’s emission gate inenv_binding.hfromaggregate.n < num_agentstoaggregate.n < 1(API arg retained but unused).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pufferlib/ocean/env_binding.h | Changes vec_log emission gating to emit with any data (ignores num_agents arg). |
| pufferlib/ocean/drive/drive.py | Calls new binding.vec_prepare_log() prior to binding.vec_log(). |
| pufferlib/ocean/drive/drive.h | Adds per-agent log sum/count buffers; introduces ensure_per_agent_log_capacity() and prepare_log(); rewrites add_log() to accumulate per-agent. |
| pufferlib/ocean/drive/binding.c | Exposes vec_prepare_log Python binding to call prepare_log() across VecEnv envs. |
Comments suppressed due to low confidence (1)
pufferlib/ocean/drive/binding.c:55
- vec_prepare_log_py doesn’t validate its argument count. Most other vec_* entrypoints in env_binding.h explicitly check PyTuple_Size and raise a clear TypeError; without this, accidental extra args will be silently ignored and missing args will lead to confusing errors inside unpack_vecenv.
prepare_log((Drive *) vec->envs[i]);
}
Py_RETURN_NONE;
}
static int my_put(Env *env, PyObject *args, PyObject *kwargs) {
PyObject *obs = PyDict_GetItemString(kwargs, "observations");
if (!PyObject_TypeCheck(obs, &PyArray_Type)) {
PyErr_SetString(PyExc_TypeError, "Observations must be a NumPy array");
return 1;
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+2588
to
+2598
| static void ensure_per_agent_log_capacity(Drive *env, int needed) { | ||
| if (env->per_agent_log_capacity >= needed) { | ||
| return; | ||
| } | ||
| int old_cap = env->per_agent_log_capacity; | ||
| env->per_agent_log_sum = (Log *) realloc(env->per_agent_log_sum, needed * sizeof(Log)); | ||
| env->per_agent_log_count = (int *) realloc(env->per_agent_log_count, needed * sizeof(int)); | ||
| memset(&env->per_agent_log_sum[old_cap], 0, (needed - old_cap) * sizeof(Log)); | ||
| memset(&env->per_agent_log_count[old_cap], 0, (needed - old_cap) * sizeof(int)); | ||
| env->per_agent_log_capacity = needed; | ||
| } |
Comment on lines
+761
to
+769
| // Emit whenever any env has data. With Drive's per-agent prepare_log | ||
| // path, aggregate.n is the cross-env count of agents that contributed | ||
| // a window-mean (not completed-episode count), so the meaningful gate | ||
| // is "at least one contribution." Other ocean envs that don't run | ||
| // prepare_log retain completed-episode-count semantics and now emit | ||
| // smaller batches more often — the Python-side mean_and_log | ||
| // (pufferl.py) re-averages across emissions in its rate-limit window. | ||
| if (aggregate.n < 1) { | ||
| return dict; |
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the window-mean per-agent state (sum + count, reset on every emit) with a per-agent EMA that persists across emissions. Each agent slot tracks slot = alpha*slot + (1 - alpha)*new_episode_log on every completion, with the first completion seeding the slot directly. prepare_log then emits the population mean across agents flagged has_data, never resetting the EMAs. Removes the residual completion-rate bias that the previous PR design still carried: fast-completers no longer dominate the multi-emission average that pufferl.py's mean_and_log produces, because every agent now contributes a single smoothed value to every emission rather than appearing in some windows and not others. New env config knob log_ema_alpha (default 0.95, half-life ~14 episodes per agent) controls smoothing. alpha=0 collapses to "most recent episode only", alpha->1 freezes the slot at its first observation. Files: - drive.h: rename per_agent_log_sum to per_agent_log_ema, add per_agent_has_data flag, add log_ema_alpha field. add_log builds a fresh episode_log snapshot, then seeds the slot or EMA-blends. prepare_log skips no-data slots and sums (no division) into env->log; vec_log divides by env->log.n = num_with_data downstream. - binding.c: unpack log_ema_alpha kwarg. - drive.py: __init__ takes log_ema_alpha=0.95, plumbs through env_kwargs. - drive.ini: log_ema_alpha = 0.95 under [env]. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
T1: gate skips emissions until the first agent completes an episode; first log dict appears exactly when the scenario truncates. T2: once every agent has completed at least one episode, dict["n"] equals num_agents (population fully represented in the cross-agent mean). T3: prepare_log preserves per-agent EMA state across emissions -- two consecutive emissions with no intervening completions produce identical metric values, locking in that the EMA buffers are not reset on emit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
active_agent_count is set once by set_active_agents (called only from init() and never from c_reset), so the grow-only ensure_per_agent_log_capacity helper only ever did real work on its first call and the comment justifying it as defense against varying populations was inaccurate. Allocate the three per-agent arrays in init() right after set_active_agents and drop the helper, the per-step ensure call in add_log, and the now-unused per_agent_log_capacity field. prepare_log iterates active_agent_count directly. Also trim the struct-field, prepare_log, episode_log-snapshot, and env-composition comments that were either describing wiring history or referencing downstream consumers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The eval branch of vec_log only checked env[0]'s log.n before dividing every env's log fields by their own n, so envs other than env[0] with no data would divide by zero. The empty-data short-circuit also returned the unused outer dict instead of the half-built list, leaking it to the GC. Rewrite the eval branch to per-env-check inside the loop: skip the divide-and-emit for envs with n=0 and leave their list slot as an empty dict, so the list shape stays vec->num_envs regardless of which envs have contributed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test_drive_train called load_config which reads sys.argv; under pytest that contains pytest's own flags and the argparser bailed with SystemExit(2). Patch sys.argv to a single-element list inside the test so pytest discovery works (the existing train-ci.yml bare-script invocation is unaffected). Add tests/test_drive_per_agent_logging.py and tests/test_drive_train.py to the utest pytest pipeline. test_drive_train runs in its own pytest step because the test calls os._exit(0) on success to dodge worker-thread hangs, which would otherwise mask any earlier-test failures with exit 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Returning vec->num_envs dicts with empty entries for no-data envs poisons downstream eval aggregation: pufferlib/ocean/benchmark/evaluators/base.py treats a missing "n" key as 1, so each empty dict inflates total_n and the weighted-mean denominator without contributing any numerator -- per-metric means get diluted. Worse, multi_scenario._should_stop counts emissions against num_scenarios, so cold-start steps with all-empty lists could terminate the eval loop early with garbage data. Return a list of only the populated dicts (length <= vec->num_envs). When no env has data the list is empty, drive.py's `if log: info.append(log)` guard skips it, and downstream counters keep ticking on real emissions only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
log_ema_alpha = 0.707 (was 0.95) so the per-agent EMA's effective half-life is ~2 episodes -- responsive enough to track training progress within a few emissions per agent. Drive.__init__ default tracks drive.ini. Also join a split f-string in test_drive_per_agent_logging.py to make ruff-format happy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Previously we were logging anytime N agents finished, which biased the metrics towards agents that finished frequently. Now we report an average over all the agent slots.
Summary
We use per-agent EMA-based aggregation in `vec_log` so the cross-agent mean reported in wandb has one weight per agent regardless of completion frequency — no per-emission bias and no multi-emission residual bias.
For each agent slot the env keeps a smoothed `Log` state:
```
slot ← α · slot + (1 − α) · episode_log # on every completion (after first)
slot ← episode_log # on the slot's first completion
```
`prepare_log` then sums slots flagged `has_data`, sets `env->log.n = num_with_data`, and does not reset. `vec_log`'s sum-across-envs / divide-by-`aggregate.n` step then produces a population mean: every agent that has ever completed contributes one term, every emission.
Why EMA?
We need some way of providing a good estimate of the average performance of the agents. We don't want to accumulate forever, as that'd be slow to update, and we could use a fixed size number of logs per agent, but this is a nice compromise.