Skip to content

vec_log per-agent population mean - average over agents instead of over completed episodes#459

Open
eugenevinitsky wants to merge 11 commits into
emerge/temp_trainingfrom
ev/per-agent-logging
Open

vec_log per-agent population mean - average over agents instead of over completed episodes#459
eugenevinitsky wants to merge 11 commits into
emerge/temp_trainingfrom
ev/per-agent-logging

Conversation

@eugenevinitsky
Copy link
Copy Markdown

@eugenevinitsky eugenevinitsky commented May 30, 2026

Previously we were logging anytime N agents finished, which biased the metrics towards agents that finished frequently. Now we report an average over all the agent slots.

Summary

We use per-agent EMA-based aggregation in `vec_log` so the cross-agent mean reported in wandb has one weight per agent regardless of completion frequency — no per-emission bias and no multi-emission residual bias.

For each agent slot the env keeps a smoothed `Log` state:

```
slot ← α · slot + (1 − α) · episode_log # on every completion (after first)
slot ← episode_log # on the slot's first completion
```

`prepare_log` then sums slots flagged `has_data`, sets `env->log.n = num_with_data`, and does not reset. `vec_log`'s sum-across-envs / divide-by-`aggregate.n` step then produces a population mean: every agent that has ever completed contributes one term, every emission.

Why EMA?

We need some way of providing a good estimate of the average performance of the agents. We don't want to accumulate forever, as that'd be slow to update, and we could use a fixed size number of logs per agent, but this is a nice compromise.

Replaces the existing 'sum-over-completed-episodes / completed-episode-count'
aggregation in vec_log with a two-stage per-agent mean: each agent slot
contributes one window-mean (mean over its completions since last emit), then
those means are averaged across agents. An agent that completes 10 short
infraction-terminated episodes now has the same weight in the metric as an
agent that completes one full clean episode.

drive.h:
- New per-env state: per_agent_log_sum[], per_agent_log_count[],
  per_agent_log_capacity. Grow-only realloc handles REPLAY scenarios where
  active_agent_count can vary across c_reset.
- add_log now stamps the completing agent's slot instead of env->log, and
  bumps per_agent_log_count[i] instead of env->log.n.
- New prepare_log: drains each agent's slot to (sum / count), sums across
  agents into env->log, sets env->log.n = #agents-with-data. Resets the
  per-agent buffers in place.
- env_static_car_count and static_car_count are folded into each agent's
  slot so the population mean recovers the env's per-scenario value.

drive/binding.c:
- New vec_prepare_log binding (MY_METHODS) that calls prepare_log on every
  env in a VecEnv.

drive/drive.py:
- Calls binding.vec_prepare_log right before binding.vec_log every
  report_interval steps.

env_binding.h:
- Relaxes the vec_log gate from `aggregate.n < num_agents` (wait until
  enough completed episodes) to `aggregate.n < 1` (emit if anyone has
  data). For Drive this means "emit if any env has at least one agent
  with a window-mean"; for other ocean envs sharing this header it means
  smaller batches emitted more often, which the Python-side mean_and_log
  in pufferl.py already amortizes via its 0.25s rate-limit.
- num_agents arg kept on the C signature for caller-API compatibility,
  but is now unused.

Validated locally: `python setup.py build_ext --inplace --force` clean,
existing scenario-length and single-agent-yaml tests pass, and a 64-agent
smoke run emits one log dict per scenario_length window with n=num_agents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 30, 2026 17:47
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR changes how Drive environment metrics are aggregated in vec_log to remove completion-rate bias by switching from “mean over completed episodes” to a two-stage “per-agent window mean, then population mean across agent slots” approach. It also adds a Drive-specific vec_prepare_log step to drain per-agent accumulators into env->log immediately before vec_log.

Changes:

  • Add per-agent log accumulators in Drive and a prepare_log() drain step to produce per-agent population means.
  • Add a new C binding vec_prepare_log and call it from drive.py before vec_log at each report_interval.
  • Relax vec_log’s emission gate in env_binding.h from aggregate.n < num_agents to aggregate.n < 1 (API arg retained but unused).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
pufferlib/ocean/env_binding.h Changes vec_log emission gating to emit with any data (ignores num_agents arg).
pufferlib/ocean/drive/drive.py Calls new binding.vec_prepare_log() prior to binding.vec_log().
pufferlib/ocean/drive/drive.h Adds per-agent log sum/count buffers; introduces ensure_per_agent_log_capacity() and prepare_log(); rewrites add_log() to accumulate per-agent.
pufferlib/ocean/drive/binding.c Exposes vec_prepare_log Python binding to call prepare_log() across VecEnv envs.
Comments suppressed due to low confidence (1)

pufferlib/ocean/drive/binding.c:55

  • vec_prepare_log_py doesn’t validate its argument count. Most other vec_* entrypoints in env_binding.h explicitly check PyTuple_Size and raise a clear TypeError; without this, accidental extra args will be silently ignored and missing args will lead to confusing errors inside unpack_vecenv.
        prepare_log((Drive *) vec->envs[i]);
    }
    Py_RETURN_NONE;
}

static int my_put(Env *env, PyObject *args, PyObject *kwargs) {
    PyObject *obs = PyDict_GetItemString(kwargs, "observations");
    if (!PyObject_TypeCheck(obs, &PyArray_Type)) {
        PyErr_SetString(PyExc_TypeError, "Observations must be a NumPy array");
        return 1;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pufferlib/ocean/drive/drive.h Outdated
Comment on lines +2588 to +2598
static void ensure_per_agent_log_capacity(Drive *env, int needed) {
if (env->per_agent_log_capacity >= needed) {
return;
}
int old_cap = env->per_agent_log_capacity;
env->per_agent_log_sum = (Log *) realloc(env->per_agent_log_sum, needed * sizeof(Log));
env->per_agent_log_count = (int *) realloc(env->per_agent_log_count, needed * sizeof(int));
memset(&env->per_agent_log_sum[old_cap], 0, (needed - old_cap) * sizeof(Log));
memset(&env->per_agent_log_count[old_cap], 0, (needed - old_cap) * sizeof(int));
env->per_agent_log_capacity = needed;
}
Comment thread pufferlib/ocean/env_binding.h Outdated
Comment on lines +761 to +769
// Emit whenever any env has data. With Drive's per-agent prepare_log
// path, aggregate.n is the cross-env count of agents that contributed
// a window-mean (not completed-episode count), so the meaningful gate
// is "at least one contribution." Other ocean envs that don't run
// prepare_log retain completed-episode-count semantics and now emit
// smaller batches more often — the Python-side mean_and_log
// (pufferl.py) re-averages across emissions in its rate-limit window.
if (aggregate.n < 1) {
return dict;
Eugene Vinitsky and others added 9 commits May 30, 2026 13:54
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the window-mean per-agent state (sum + count, reset on every emit)
with a per-agent EMA that persists across emissions. Each agent slot tracks
slot = alpha*slot + (1 - alpha)*new_episode_log on every completion, with the
first completion seeding the slot directly. prepare_log then emits the
population mean across agents flagged has_data, never resetting the EMAs.

Removes the residual completion-rate bias that the previous PR design still
carried: fast-completers no longer dominate the multi-emission average that
pufferl.py's mean_and_log produces, because every agent now contributes a
single smoothed value to every emission rather than appearing in some
windows and not others.

New env config knob log_ema_alpha (default 0.95, half-life ~14 episodes per
agent) controls smoothing. alpha=0 collapses to "most recent episode only",
alpha->1 freezes the slot at its first observation.

Files:
- drive.h: rename per_agent_log_sum to per_agent_log_ema, add
  per_agent_has_data flag, add log_ema_alpha field. add_log builds a fresh
  episode_log snapshot, then seeds the slot or EMA-blends. prepare_log skips
  no-data slots and sums (no division) into env->log; vec_log divides by
  env->log.n = num_with_data downstream.
- binding.c: unpack log_ema_alpha kwarg.
- drive.py: __init__ takes log_ema_alpha=0.95, plumbs through env_kwargs.
- drive.ini: log_ema_alpha = 0.95 under [env].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
T1: gate skips emissions until the first agent completes an episode; first
log dict appears exactly when the scenario truncates.

T2: once every agent has completed at least one episode, dict["n"] equals
num_agents (population fully represented in the cross-agent mean).

T3: prepare_log preserves per-agent EMA state across emissions -- two
consecutive emissions with no intervening completions produce identical
metric values, locking in that the EMA buffers are not reset on emit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
active_agent_count is set once by set_active_agents (called only from init()
and never from c_reset), so the grow-only ensure_per_agent_log_capacity
helper only ever did real work on its first call and the comment justifying
it as defense against varying populations was inaccurate. Allocate the three
per-agent arrays in init() right after set_active_agents and drop the helper,
the per-step ensure call in add_log, and the now-unused per_agent_log_capacity
field. prepare_log iterates active_agent_count directly.

Also trim the struct-field, prepare_log, episode_log-snapshot, and
env-composition comments that were either describing wiring history or
referencing downstream consumers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The eval branch of vec_log only checked env[0]'s log.n before dividing every
env's log fields by their own n, so envs other than env[0] with no data
would divide by zero. The empty-data short-circuit also returned the unused
outer dict instead of the half-built list, leaking it to the GC.

Rewrite the eval branch to per-env-check inside the loop: skip the
divide-and-emit for envs with n=0 and leave their list slot as an empty
dict, so the list shape stays vec->num_envs regardless of which envs have
contributed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test_drive_train called load_config which reads sys.argv; under pytest that
contains pytest's own flags and the argparser bailed with SystemExit(2).
Patch sys.argv to a single-element list inside the test so pytest discovery
works (the existing train-ci.yml bare-script invocation is unaffected).

Add tests/test_drive_per_agent_logging.py and tests/test_drive_train.py to
the utest pytest pipeline. test_drive_train runs in its own pytest step
because the test calls os._exit(0) on success to dodge worker-thread hangs,
which would otherwise mask any earlier-test failures with exit 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Returning vec->num_envs dicts with empty entries for no-data envs poisons
downstream eval aggregation: pufferlib/ocean/benchmark/evaluators/base.py
treats a missing "n" key as 1, so each empty dict inflates total_n and the
weighted-mean denominator without contributing any numerator -- per-metric
means get diluted. Worse, multi_scenario._should_stop counts emissions
against num_scenarios, so cold-start steps with all-empty lists could
terminate the eval loop early with garbage data.

Return a list of only the populated dicts (length <= vec->num_envs). When
no env has data the list is empty, drive.py's `if log: info.append(log)`
guard skips it, and downstream counters keep ticking on real emissions
only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
log_ema_alpha = 0.707 (was 0.95) so the per-agent EMA's effective half-life
is ~2 episodes -- responsive enough to track training progress within a few
emissions per agent. Drive.__init__ default tracks drive.ini.

Also join a split f-string in test_drive_per_agent_logging.py to make
ruff-format happy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@eugenevinitsky eugenevinitsky changed the title WIP: vec_log per-agent population mean (remove completion-rate bias) vec_log per-agent population mean - average over agents instead of over completed episodes May 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants