Add nonlinear probes, streaming extraction, and mediation analysis by lgngrvs · Pull Request #8 · Metta-AI/NeuralMMO-interpretability

lgngrvs · 2026-04-03T23:14:08Z

Summary

Class reweighting for linear probes (--class-weight balanced) to handle imbalanced binary features
Nonlinear MLP probes (2-layer ReLU, configurable hidden dim) -- finding: representations are mostly linear, MLPs don't dramatically outperform
Streaming JSONL extraction with StreamingRecordWriter to prevent OOM on large runs (previous runs crashed the server)
Parallel binary cache loader for fast data loading (30+ min -> <1s)
Causal mediation analysis via Frisch-Waugh-Lovell residualization -- key finding: n_visible_players retains R²=0.40 after controlling for combat/position/time, despite the model lacking player observations
10x data collection (28k samples vs 1k) with dramatic probe improvement across all features
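
For reviewers unfamiliar with the reweighting option: the sketch below shows the common "balanced" heuristic, where each class gets weight n_samples / (n_classes * class_count). It is an assumption that `--class-weight balanced` follows this formula; `balanced_class_weights` is a hypothetical helper, not code from train_probes.py.

```python
import numpy as np

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = labels.size / (classes.size * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Illustrative binary feature that is active in only 10% of samples.
y = np.array([0] * 90 + [1] * 10)
w = balanced_class_weights(y)

# After weighting, both classes contribute the same total mass to the loss.
mass_negative = w[0] * 90
mass_positive = w[1] * 10
```

With these weights a probe can no longer reach low loss by always predicting the majority class, which is the failure mode reweighting is meant to address.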

See EXPERIMENTS.md for full experiment details and results.
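
The Frisch-Waugh-Lovell residualization mentioned in the summary can be sketched as follows: regress both the activations and the probed feature on the control variables, then fit the probe on the residuals only. This is a minimal illustration on synthetic data, not the contents of mediation_analysis.py; the function names, shapes, and in-sample R² (a real run would score on held-out data) are assumptions.

```python
import numpy as np

def residualize(target, controls):
    """Remove the part of `target` that is linearly predictable from `controls`."""
    X = np.column_stack([np.ones(len(controls)), controls])  # add intercept
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ beta

def partial_r2(activations, feature, controls):
    """In-sample R^2 of a linear probe after residualizing both sides on `controls`."""
    act_res = residualize(activations, controls)
    feat_res = residualize(feature, controls)
    beta, *_ = np.linalg.lstsq(act_res, feat_res, rcond=None)
    ss_res = np.sum((feat_res - act_res @ beta) ** 2)
    ss_tot = np.sum((feat_res - feat_res.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic example: the feature is partly driven by the controls and partly
# by its own independent variation.
rng = np.random.default_rng(0)
n, d = 2000, 8
controls = rng.normal(size=(n, 3))
own = rng.normal(size=n)
feature = controls @ np.array([1.0, -1.0, 0.5]) + own
mix = rng.normal(size=(3, d))

# Case 1: activations encode the feature's own variation as well as the controls.
acts_direct = controls @ mix + np.outer(own, np.ones(d)) + 0.1 * rng.normal(size=(n, d))
# Case 2: activations encode only the controls (a purely confounded probe).
acts_confounded = controls @ mix + 0.1 * rng.normal(size=(n, d))

r2_direct = partial_r2(acts_direct, feature, controls)
r2_confounded = partial_r2(acts_confounded, feature, controls)
```

In the confounded case the partial R² collapses toward zero, while in the direct case it stays high. A feature that keeps substantial R² after residualization, as reported for n_visible_players, is therefore not explained by the chosen controls alone.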

Test plan

Linear probes with/without class reweighting tested on baseline_10M
Nonlinear probes tested with hidden_factor=4 and hidden_factor=32
Streaming extraction validated with smoke test + full 40-episode runs
Mediation analysis validated on 10x data
Cache loader verified: load + compute_features in <1s
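
To make the streaming-extraction item concrete, here is a minimal sketch of a JSONL writer that flushes periodically instead of holding all records in memory. `JsonlStreamWriter`, `read_jsonl`, and the record fields are hypothetical; the actual StreamingRecordWriter in extract_activations.py may differ.

```python
import json
import os
import tempfile

class JsonlStreamWriter:
    """Write one JSON record per line, flushing every `flush_every` records."""

    def __init__(self, path, flush_every=100):
        self._fh = open(path, "w", encoding="utf-8")
        self._flush_every = flush_every
        self._pending = 0
        self.count = 0

    def write(self, record):
        self._fh.write(json.dumps(record) + "\n")
        self.count += 1
        self._pending += 1
        if self._pending >= self._flush_every:
            self._fh.flush()  # push buffered lines to disk
            self._pending = 0

    def close(self):
        self._fh.flush()
        self._fh.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False

def read_jsonl(path):
    """Yield records one at a time so the reader also stays memory-bounded."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

# Smoke test: write a few fake per-tick records and read them back.
path = os.path.join(tempfile.mkdtemp(), "activations.jsonl")
with JsonlStreamWriter(path, flush_every=2) as writer:
    for tick in range(5):
        writer.write({"tick": tick, "activation": [0.1 * tick, -0.1 * tick]})

records = list(read_jsonl(path))
```

Because each record is written as soon as it is produced, peak memory depends on the flush interval rather than on the length of the run.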

🤖 Generated with Claude Code

…diation analysis

- Add class reweighting (--class-weight balanced) to train_probes.py for handling imbalanced binary features
- Add train_nonlinear_probes.py: 2-layer ReLU MLP probes with configurable hidden dim, supporting both discrete and continuous features
- Add streaming JSONL output to extract_activations.py to prevent OOM during large-scale data collection (StreamingRecordWriter with periodic flushing)
- Add parallel binary cache builder to analyze_activations.py for fast data loading (30+ min -> <1s via .cache.npz files)
- Add mediation_analysis.py: causal mediation via Frisch-Waugh-Lovell residualization to identify confounded probe results
- Add EXPERIMENTS.md documenting all experiments and findings
- Key finding: linear probes are sufficient (MLPs don't dramatically outperform), and n_visible_players retains R²=0.40 after controlling for combat/position/time despite the model lacking player observations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ions

Replaces full-activation patching with principled probe-direction interchange intervention (Geiger et al., 2021). Decomposes activations into tick-relevant and orthogonal components, swaps only the tick component, and compares to random-direction and full-swap controls.

Key finding: tick direction has a real causal effect (14-94x random baseline) but accounts for <1% of full-swap behavioral change in movement. Largest effect is on economic action heads (gold/price). Dose-response curve is clean and monotonic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
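
The decomposition described in this commit can be illustrated with a small sketch: project activations onto a unit probe direction, replace only that component with the source run's value, and leave the orthogonal remainder untouched. The function name, the (n, d) shapes, and the `alpha` scaling used here for a dose-response sweep are assumptions for illustration, not the repository's implementation.

```python
import numpy as np

def swap_along_direction(base, source, direction, alpha=1.0):
    """Move `base` toward `source` along `direction` only.

    base, source: arrays of shape (n, d); direction: array of shape (d,).
    alpha=1.0 is a full swap of that component, alpha=0.0 leaves base unchanged.
    """
    u = direction / np.linalg.norm(direction)
    coef = (source - base) @ u  # per-sample difference along u
    return base + alpha * np.outer(coef, u)

rng = np.random.default_rng(0)
n, d = 4, 6
base = rng.normal(size=(n, d))
source = rng.normal(size=(n, d))

probe_direction = rng.normal(size=d)   # stand-in for a trained tick probe's weights
random_direction = rng.normal(size=d)  # control: a direction unrelated to the probe

patched = swap_along_direction(base, source, probe_direction)
control = swap_along_direction(base, source, random_direction)
full_swap = source.copy()              # control: replace the whole activation

u = probe_direction / np.linalg.norm(probe_direction)
orth_before = base - np.outer(base @ u, u)
orth_after = patched - np.outer(patched @ u, u)
```

After patching, the component along the probe direction matches the source run while the orthogonal part is unchanged. Comparing downstream behavior under `patched`, `control`, and `full_swap` is what lets an effect be attributed to the probed direction rather than to perturbation in general.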

lgngrvs and others added 2 commits April 3, 2026 23:13

lgngrvs marked this pull request as ready for review April 10, 2026 18:13

lgngrvs merged commit 84ce39a into 2.1 Apr 10, 2026

lgngrvs merged 2 commits into 2.1 from worktree-interpretability-work
