Add nonlinear probes, streaming extraction, and mediation analysis #8

Merged
lgngrvs merged 2 commits into 2.1 from worktree-interpretability-work on Apr 10, 2026

lgngrvs commented Apr 3, 2026

Summary

  • Class reweighting for linear probes (--class-weight balanced) to handle imbalanced binary features
  • Nonlinear MLP probes (2-layer ReLU, configurable hidden dim) -- finding: representations are mostly linear; MLPs do not dramatically outperform linear probes
  • Streaming JSONL extraction with StreamingRecordWriter to prevent OOM on large runs (previous runs crashed the server)
  • Parallel binary cache loader for fast data loading (30+ min -> <1s)
  • Causal mediation analysis via Frisch-Waugh-Lovell residualization -- key finding: n_visible_players retains R²=0.40 after controlling for combat/position/time, despite the model lacking player observations
  • 10x data collection (28k samples vs 1k) with dramatic probe improvement across all features
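
For readers unfamiliar with the residualization step in the mediation bullet, here is a minimal sketch of Frisch-Waugh-Lovell partialling-out on synthetic data. It is not taken from mediation_analysis.py: the helper names `ols` and `residualize`, the synthetic controls, and the 8-dimensional "activations" are all assumptions made for illustration. The idea is to regress the controls out of both the target and the activations, then fit the probe on the residuals only, so that any remaining R² cannot be attributed to the controls.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X (X must already contain any intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def residualize(Y, controls):
    """Remove from Y everything linearly predictable from the controls plus an intercept."""
    C = np.column_stack([np.ones(len(controls)), controls])
    return Y - C @ ols(C, Y)

# Synthetic data (hypothetical stand-ins, not the PR's dataset).
rng = np.random.default_rng(0)
n = 2000
controls = rng.normal(size=(n, 3))                  # e.g. combat / position / time
own = rng.normal(size=n)                            # target variance the controls cannot explain
target = controls @ np.array([1.0, -0.5, 0.3]) + own
acts = (controls @ rng.normal(size=(3, 8))          # activations encode the controls...
        + 0.5 * np.outer(own, rng.normal(size=8))   # ...and the target's unique part
        + rng.normal(size=(n, 8)))                  # plus noise

# Step 1: partial the controls out of both sides.
target_res = residualize(target, controls)
acts_res = residualize(acts, controls)

# Step 2: fit the probe on residuals only and report the partial R^2.
beta_partial = ols(acts_res, target_res)
sse = np.sum((target_res - acts_res @ beta_partial) ** 2)
partial_r2 = 1.0 - sse / np.sum(target_res ** 2)

# FWL check: the same activation coefficients come out of the full regression.
full = np.column_stack([np.ones(n), controls, acts])
beta_full = ols(full, target)[4:]                   # skip intercept + 3 controls

print(f"partial R^2 = {partial_r2:.2f}")
print("FWL coefficients match:", np.allclose(beta_partial, beta_full))
```

The final check is the theorem itself: the coefficients from the residual-on-residual fit equal the activation coefficients from the regression that includes the controls directly.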

See EXPERIMENTS.md for full experiment details and results.

Test plan

  • Linear probes with/without class reweighting tested on baseline_10M
  • Nonlinear probes tested with hidden_factor=4 and hidden_factor=32
  • Streaming extraction validated with smoke test + full 40-episode runs
  • Mediation analysis validated on 10x data
  • Cache loader verified: load + compute_features in <1s
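
As a rough illustration of the streaming approach validated above, the sketch below appends one JSON record per line and flushes periodically, so memory use does not grow with the number of records. It is not the PR's `StreamingRecordWriter`; the class name `JsonlStreamWriter`, the `flush_every` parameter, and the record fields are hypothetical.

```python
import json
import os
import tempfile

class JsonlStreamWriter:
    """Append records to a JSONL file, flushing every `flush_every` records."""

    def __init__(self, path, flush_every=100):
        self._fh = open(path, "a", encoding="utf-8")
        self._flush_every = flush_every
        self._pending = 0
        self.count = 0

    def write(self, record):
        # One JSON document per line; nothing is accumulated in memory.
        self._fh.write(json.dumps(record) + "\n")
        self.count += 1
        self._pending += 1
        if self._pending >= self._flush_every:
            self._fh.flush()
            self._pending = 0

    def close(self):
        self._fh.close()  # closing also flushes any remaining buffered lines

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()

def read_jsonl(path):
    """Yield records back one line at a time."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

# Smoke test with made-up records.
path = os.path.join(tempfile.mkdtemp(), "activations.jsonl")
with JsonlStreamWriter(path, flush_every=2) as writer:
    for step in range(5):
        writer.write({"step": step, "activation": [0.1 * step, 0.2 * step]})

records = list(read_jsonl(path))
print(len(records), "records written")
```

Because each line is a complete JSON document, a run that crashes partway still leaves every flushed record readable.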

🤖 Generated with Claude Code

lgngrvs and others added 2 commits April 3, 2026 23:13
…diation analysis

- Add class reweighting (--class-weight balanced) to train_probes.py for
  handling imbalanced binary features
- Add train_nonlinear_probes.py: 2-layer ReLU MLP probes with configurable
  hidden dim, supporting both discrete and continuous features
- Add streaming JSONL output to extract_activations.py to prevent OOM during
  large-scale data collection (StreamingRecordWriter with periodic flushing)
- Add parallel binary cache builder to analyze_activations.py for fast data
  loading (30+ min -> <1s via .cache.npz files)
- Add mediation_analysis.py: causal mediation via Frisch-Waugh-Lovell
  residualization to identify confounded probe results
- Add EXPERIMENTS.md documenting all experiments and findings
- Key finding: linear probes are sufficient (MLPs don't dramatically
  outperform), and n_visible_players retains R²=0.40 after controlling
  for combat/position/time despite the model lacking player observations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
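
To make the class reweighting in this commit concrete, here is a small sketch of the "balanced" heuristic, assuming `--class-weight balanced` follows the scikit-learn convention of weighting each class by n_samples / (n_classes * class_count). Whether train_probes.py computes it this way is an assumption; the function name `balanced_class_weights` and the label counts are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count), so rare classes count more."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# An imbalanced binary feature: 90 negatives, 10 positives (made-up numbers).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

# After reweighting, each class contributes the same total weight to the loss.
total_per_class = {cls: weights[cls] * labels.count(cls) for cls in weights}

print(weights)          # class 1 gets weight 5.0, class 0 about 0.56
print(total_per_class)  # both classes sum to about 50
```

With these weights a probe can no longer reach a low loss by always predicting the majority class, which is the failure mode reweighting is meant to prevent.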
…ions

Replaces full-activation patching with principled probe-direction
interchange intervention (Geiger et al., 2021). Decomposes activations
into tick-relevant and orthogonal components, swaps only the tick
component, and compares to random-direction and full-swap controls.

Key finding: tick direction has a real causal effect (14-94x random
baseline) but accounts for <1% of full-swap behavioral change in
movement. Largest effect is on economic action heads (gold/price).
Dose-response curve is clean and monotonic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
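
The decomposition described in this commit can be sketched on plain vectors. This is a toy illustration under stated assumptions, not the PR's code: the function names, the 16-dimensional vectors, and the `alpha` dose parameter are hypothetical, and a real run would patch activations inside the model's forward pass and measure the change in action-head outputs.

```python
import numpy as np

def swap_along_direction(base, source, direction, alpha=1.0):
    """Move base's coordinate along `direction` toward source's; the orthogonal part is untouched."""
    d = direction / np.linalg.norm(direction)
    delta = (source - base) @ d          # gap between the two coordinates along d
    return base + alpha * delta * d

def orthogonal_part(x, direction):
    """Component of x orthogonal to `direction`."""
    d = direction / np.linalg.norm(direction)
    return x - (x @ d) * d

rng = np.random.default_rng(0)
dim = 16
probe_dir = rng.normal(size=dim)     # stand-in for a trained tick-probe direction
random_dir = rng.normal(size=dim)    # random-direction control
base = rng.normal(size=dim)          # activation from the run being patched
source = rng.normal(size=dim)        # activation from the run supplying the new value

patched = swap_along_direction(base, source, probe_dir)   # probe-direction interchange
control = swap_along_direction(base, source, random_dir)  # random-direction control
full_swap = source.copy()                                 # full-activation control

# Dose-response: alpha scales how far the probe coordinate moves from base to source.
unit = probe_dir / np.linalg.norm(probe_dir)
for alpha in (0.0, 0.5, 1.0):
    dosed = swap_along_direction(base, source, probe_dir, alpha)
    print(f"alpha={alpha:.1f}  coordinate along probe = {float(dosed @ unit):+.3f}")
```

Comparing the downstream effect of `patched` against `control` and `full_swap` is what separates the effect of the probe direction from the effect of changing the activation at all.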
lgngrvs marked this pull request as ready for review April 10, 2026 18:13
lgngrvs merged commit 84ce39a into 2.1 Apr 10, 2026