ChatGPU · May 27, 2026
diff --git a/labs/rl_decision/lab_cql_offline_minigrid/README.md b/labs/rl_decision/lab_cql_offline_minigrid/README.md
@@ -0,0 +1,64 @@
+# lab · CQL on an 8×8 offline-RL gridworld
+
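+> A minimal but faithful reproduction of CQL (Conservative Q-Learning, Kumar et al. 2020). An 8×8 discrete grid, a fixed set of 10,000 offline transitions, and three trainers (BC / DQN / CQL) that share the same MLP critic; the whole notebook runs on CPU in under 2 minutes.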
+
+## Quickstart
+
+```bash
+cd labs/rl_decision/lab_cql_offline_minigrid
+pip install -r requirements.txt
+
+# (Optional) run a single trainer on its own; each takes < 20 s:
+python -m src.dataset      # collect and save data/offline_dataset.pt
+python -m src.trainer_bc   # data/bc.pt
+python -m src.trainer_dqn  # data/dqn.pt
+python -m src.trainer_cql  # data/cql.pt  (alpha=auto)
+python -m src.trainer_cql --alpha-mode fixed --alpha-fixed 5.0   # ablation
+python -m src.trainer_cql --alpha-mode zero                       # = DQN
+
+# End-to-end story (~80 s):
+jupyter nbconvert --execute --to notebook --inplace notebook.ipynb
+```
+
+## What this lab proves
+
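+- **Offline DQN overestimates OOD actions.** The right panel of `q_overestimation.png` (cells 3 / 4) plots `Q_OOD − Q_seen` directly: for DQN it turns positive and reaches +0.45, while CQL pushes it steadily down to −1.3.
+- **CQL's log-sum-exp penalty keeps Q anchored to the data manifold.** The OOD-density plot (cell 5) shows that CQL's greedy actions have higher probability under the behavior policy than DQN's, and that CQL no longer collapses onto just the right and down actions the way DQN does.
+- **Adaptive α beats the fixed α values tested.** The ablation compares α=0 / fixed-5 / auto directly: with α=0 (= DQN) the evaluation curve swings violently between ±0.85; fixed-5 pushes Q_seen up to 3.6 (over-conservative); auto tunes α to ~0.6 on its own, and its Q magnitudes are the closest to the true returns.
+- **CQL ≈ BC on this trivial task**: the environment is so small that BC's mode-cloning already reaches the optimum. This is consistent with the CQL paper's emphasis that CQL's real value shows up on harder tasks such as D4RL or real-world driving. This lab makes the *mechanism* clear; scaling is left to the stretch goals.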
+
+## File layout
+
+```
+.
+├── README.md
+├── notebook.ipynb        ← executable 7-cell story
+├── paper.md              ← 250-token distillation, links to the paper_cql card
+├── requirements.txt
+├── src/
+│   ├── __init__.py
+│   ├── env.py            ← 8×8 sparse-reward grid + 10% slip
+│   ├── dataset.py        ← collector: 50% random + 50% ε-greedy expert
+│   ├── model.py          ← shared QNet MLP (used by BC / DQN / CQL)
+│   ├── trainer_bc.py     ← cross-entropy BC (full dataset by default)
+│   ├── trainer_dqn.py    ← TD(0) + target net, purely offline
+│   ├── trainer_cql.py    ← TD + log-sum-exp penalty, adaptive α (Lagrangian)
+│   ├── viz.py            ← all matplotlib plotting
+│   └── seeds.py
+├── assets/               ← PNGs generated by the notebook
+│   ├── action_histogram.png
+│   ├── q_overestimation.png         ← key figure: DQN diverges / CQL stays conservative
+│   ├── ood_action_density.png       ← density of greedy actions under the behavior policy
+│   ├── eval_returns.png             ← BC / DQN / CQL evaluation curves
+│   ├── ablation_alpha.png           ← α=0 / fixed-5 / auto
+│   └── ablation_alpha_traj.png      ← trajectories of α and the gap
+└── data/
+    ├── offline_dataset.pt           ← 10k offline transitions
+    └── bc.pt / dqn.pt / cql.pt      ← checkpoints of the three trainers
+```
+
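+## Three stretch goals (also listed at the end of the notebook)
+
+1. **Switch the reward to a pure sparse +1 (drop the step penalty) and set `max_steps=200`**: with nothing bounding Q, DQN's divergence can grow unchecked; CQL's `log-sum-exp` penalty should still hold it down.
+2. **Add IQL as a comparison**: reuse `dataset.py` and `model.py` and write `trainer_iql.py` with expectile regression (τ=0.7) and KL-constrained (advantage-weighted) policy extraction. CQL penalizes OOD actions, IQL avoids querying them; this is the core practical contrast between the two approaches.
+3. **Dataset-size sweep**: sweep `n_transitions` over {500, 1k, 2k, 5k, 10k, 20k} and plot each of the three algorithms' final evaluation return against dataset size. Whether CQL has a sample-complexity advantage is the real research question.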
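
Stretch goal 2 in the README mentions expectile regression with τ=0.7. The following is a minimal sketch of that loss only, assuming the usual asymmetric squared-error form. It is written in plain Python rather than torch, and the names `expectile_loss` and `mean_expectile_loss` are hypothetical; no `trainer_iql.py` exists in this PR.

```python
def expectile_loss(target, prediction, tau=0.7):
    """Asymmetric squared error |tau - 1(u < 0)| * u**2 with u = target - prediction."""
    u = target - prediction
    # Under-predictions (u >= 0) are weighted by tau, over-predictions by 1 - tau.
    weight = tau if u >= 0 else 1.0 - tau
    return weight * u * u


def mean_expectile_loss(targets, predictions, tau=0.7):
    """Average the per-sample expectile loss over a batch of scalars."""
    losses = [expectile_loss(t, p, tau) for t, p in zip(targets, predictions)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    print(expectile_loss(1.0, 0.0))  # target above the prediction
    print(expectile_loss(0.0, 1.0))  # target below the prediction
```

With τ=0.7 an under-prediction costs more than an over-prediction of the same size, so the fitted value settles above the mean of its targets. That is how IQL approximates a maximum over in-data actions without evaluating Q on OOD actions.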
diff --git a/labs/rl_decision/lab_cql_offline_minigrid/assets/ablation_alpha.png b/labs/rl_decision/lab_cql_offline_minigrid/assets/ablation_alpha.png
diff --git a/labs/rl_decision/lab_cql_offline_minigrid/assets/action_histogram.png b/labs/rl_decision/lab_cql_offline_minigrid/assets/action_histogram.png
diff --git a/labs/rl_decision/lab_cql_offline_minigrid/assets/eval_returns.png b/labs/rl_decision/lab_cql_offline_minigrid/assets/eval_returns.png
diff --git a/labs/rl_decision/lab_cql_offline_minigrid/assets/ood_action_density.png b/labs/rl_decision/lab_cql_offline_minigrid/assets/ood_action_density.png
diff --git a/labs/rl_decision/lab_cql_offline_minigrid/assets/q_overestimation.png b/labs/rl_decision/lab_cql_offline_minigrid/assets/q_overestimation.png
diff --git a/labs/rl_decision/lab_cql_offline_minigrid/assets/q_overestimation_dqn_only.png b/labs/rl_decision/lab_cql_offline_minigrid/assets/q_overestimation_dqn_only.png
diff --git a/labs/rl_decision/lab_cql_offline_minigrid/notebook.ipynb b/labs/rl_decision/lab_cql_offline_minigrid/notebook.ipynb
diff --git a/labs/rl_decision/lab_cql_offline_minigrid/paper.md b/labs/rl_decision/lab_cql_offline_minigrid/paper.md
@@ -0,0 +1,31 @@
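+# Caging Q-learning inside the data: a minimal CQL reproduction on an 8×8 grid
+
+Linked Atlas card: [paper_cql](../../../docs/data/cards/extended/paper_cql.md).
+
+The core difficulty of offline RL is that the Bellman backup queries actions the dataset does not cover, so $Q(s, a^{\text{OOD}})$ is inflated optimistically with no data to correct it, which in turn pushes the policy toward actions that are catastrophic in the real world. CQL (Kumar et al. 2020) offers a clean fix: add the following term to the ordinary TD loss: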
+$$
+\alpha \, \mathbb{E}_{s \sim D}\!\Big[\log\!\sum_{a}\exp Q(s,a) - \mathbb{E}_{a\sim\hat\pi_\beta(\cdot \mid s)} Q(s, a)\Big],
+$$
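+that is, a log-sum-exp over all actions minus the expectation of $Q$ under the dataset's behavior policy. Minimizing the first term pushes $Q$ down for *all* actions, and the second term pulls $Q$ back up for *in-data* actions; the net effect is that out-of-data actions are pushed down substantially while in-data actions barely move. This is the mechanism behind the paper's conservatism guarantee: with the data term included, the expected value of the learned $Q$ under the policy lower-bounds the true value (the stricter pointwise lower bound on $Q$ is proved for the variant without the data term).
+
+This lab uses an 8×8 discrete gridworld to turn that theoretical story into visible plots:
+
+1. **Data**: 50% random + 50% ε-greedy expert (ε=0.3), 10,000 transitions in total. The expert heads straight for the bottom-right corner while the random policy covers the dead corners; the environment's 10% slip makes evaluation trajectories drift off the deterministic main line of the training data.
+2. **Model**: the same two-layer MLP critic (hidden=128), trained with BC (cross-entropy on actions), offline DQN (TD(0) + target net), and CQL (TD + log-sum-exp term) for 5,000 steps each, batch=256, Adam lr=3e-4.
+3. **CQL implementation details**: α is adapted through a Lagrangian dual (target gap=5.0, dual lr=1e-4); the ablation compares three settings: α=0, α=fixed-5, and α=auto.
+4. **Diagnostics**: every 25 steps, compute the mean of `Q` over in-data (s,a) pairs and over out-of-data (s,a) pairs, and treat their difference `Q_OOD − Q_seen` as the "overestimation gap"; every 100 steps, evaluate with 20 greedy episodes.
+5. **Story**: DQN's overestimation gap turns positive at ~1,500 steps and reaches +0.45; after that, even with the TD loss converged to ≈ 0, greedy evaluation keeps swinging violently between 0.85 and −0.50 (Q is overestimated on OOD actions, so each argmax can point into a new OOD corridor). CQL's gap stays ≤ 0 and keeps falling to −1.3, and its evaluation return settles at 0.85. BC also reaches 0.85 through mode-cloning, but its Q-net has no explicit notion of *return*; it merely copies the mode of the behavior policy. Raising ε to 0.7 or switching to a higher-slip environment should expose BC's inability to *reason* (left as a follow-up experiment).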
+
+```mermaid
+flowchart LR
+    D[(offline dataset / 10k transitions / random + epsilon-greedy expert)]
+    D --> BC[BC / cross-entropy on actions]
+    D --> DQN["DQN / TD(0) with target net"]
+    D --> CQL["CQL / TD + alpha * (logsumexp - E_D Q)"]
+    BC --> Eval[greedy eval / 20 episodes]
+    DQN --> Eval
+    CQL --> Eval
+    CQL <-- alpha (Lagrangian, target gap=5) --> Dual[dual update]
+```
+
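+More broadly, CQL and [IQL](../../../docs/data/cards/extended/paper_iql.md) form a "penalize OOD vs. avoid OOD" contrast, and CQL and [Decision Transformer](../../../docs/data/cards/extended/paper_decision_transformer.md) form a "value function vs. sequence modeling" contrast. This lab is the minimal entry point to [paradigm_offline_rl](../../../docs/data/cards/extended/paradigm_offline_rl.md); the suggested next step is to reproduce IQL on the same dataset for a side-by-side comparison.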
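
The regulariser described in paper.md can be sketched in a few lines. This is a plain-Python illustration, not the code in `trainer_cql.py` (which this excerpt does not show). It approximates the expectation under the behavior policy by the action stored with each transition, and the dual step on log α is only one assumed way to realise the "Lagrangian" adaptation; `cql_penalty` and `dual_step` are hypothetical names, while `target_gap=5.0` and `dual_lr=1e-4` are the values quoted in paper.md.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_rows, data_actions):
    """Batch mean of logsumexp_a Q(s, a) - Q(s, a_data).

    q_rows: list of per-state Q-value lists, one entry per action.
    data_actions: the action index stored in the dataset for each state.
    """
    gaps = [logsumexp(q) - q[a] for q, a in zip(q_rows, data_actions)]
    return sum(gaps) / len(gaps)


def dual_step(log_alpha, penalty, target_gap=5.0, dual_lr=1e-4):
    """One gradient-ascent step on log(alpha) for alpha * (penalty - target_gap).

    ASSUMPTION: alpha is parameterised as exp(log_alpha) to keep it positive,
    and the constrained quantity is the penalty itself; the lab may differ.
    """
    alpha = math.exp(log_alpha)
    return log_alpha + dual_lr * alpha * (penalty - target_gap)


if __name__ == "__main__":
    q_rows = [[1.0, 0.2, -0.5, 0.1], [0.0, 0.0, 2.0, 0.0]]
    data_actions = [0, 2]
    penalty = cql_penalty(q_rows, data_actions)
    print("penalty:", penalty)
    print("next log_alpha:", dual_step(0.0, penalty))
```

Because the log-sum-exp is at least the largest Q-value in a row, the penalty is never negative, and it is smallest when the dataset action already holds the dominant Q-value. In this sketch α grows while the penalty exceeds the target and shrinks otherwise.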
diff --git a/labs/rl_decision/lab_cql_offline_minigrid/requirements.txt b/labs/rl_decision/lab_cql_offline_minigrid/requirements.txt
@@ -0,0 +1,7 @@
+numpy==2.1.3
+matplotlib==3.10.9
+torch==2.12.0
+nbformat==5.10.4
+nbconvert==7.17.1
+ipykernel==6.30.1
+jupyter==1.1.1
diff --git a/labs/rl_decision/lab_cql_offline_minigrid/src/__init__.py b/labs/rl_decision/lab_cql_offline_minigrid/src/__init__.py
@@ -0,0 +1 @@
+"""Offline-RL reproduction lab: BC vs DQN vs CQL on an 8x8 gridworld."""
diff --git a/labs/rl_decision/lab_cql_offline_minigrid/src/dataset.py b/labs/rl_decision/lab_cql_offline_minigrid/src/dataset.py
@@ -0,0 +1,219 @@
+"""Collect and save the offline dataset.
+
+Composition (mirrors common offline-RL benchmark splits):
+  * 50% random policy — uniform over the 4 actions.
+  * 50% epsilon-greedy expert with epsilon=0.3.
+
+Total transitions: ``DEFAULT_TRANSITIONS = 10_000`` (exact: each source is cut
+off at its target). Episodes are short (max 50 steps): roughly 200-300 per source.
+
+The saved file is a torch checkpoint with the per-transition tensors, a
+per-transition source flag and the collection config. Per-state action
+statistics (needed for E[Q(s, a)] under the data distribution and for the
+OOD-action density plot) are derived from these tensors, not stored.
+
+Run standalone::
+
+    python -m src.dataset
+"""
+
+from __future__ import annotations
+
+import argparse
+import os
+from dataclasses import dataclass, field, asdict
+from pathlib import Path
+
+import numpy as np
+import torch
+
+from .env import GRID_SIZE, N_ACTIONS, GridWorld, expert_action, make_env
+from .seeds import seed_everything
+
+
+# ---------------------------------------------------------------------------
+# Config
+# ---------------------------------------------------------------------------
+DEFAULT_TRANSITIONS = 10_000
+EXPERT_EPS = 0.3
+
+
+@dataclass
+class DatasetConfig:
+    n_transitions: int = DEFAULT_TRANSITIONS
+    expert_eps: float = EXPERT_EPS
+    random_fraction: float = 0.5
+    max_episode_steps: int = 50
+    seed: int = 0
+
+
+# ---------------------------------------------------------------------------
+# Rollout helpers
+# ---------------------------------------------------------------------------
+def _rollout(
+    env: GridWorld,
+    policy: str,
+    expert_eps: float,
+    rng: np.random.Generator,
+    max_steps: int,
+) -> list[tuple[np.ndarray, int, float, np.ndarray, bool]]:
+    """Run a single episode under either the random or expert epsilon-greedy policy.
+
+    Returns a list of (s, a, r, s', done) tuples.
+    """
+    transitions: list[tuple[np.ndarray, int, float, np.ndarray, bool]] = []
+    obs = env.reset()
+    for _ in range(max_steps):
+        pos = np.argmax(obs)
+        row, col = int(pos // env.size), int(pos % env.size)
+        if policy == "random":
+            action = int(rng.integers(0, N_ACTIONS))
+        elif policy == "expert":
+            if rng.random() < expert_eps:
+                action = int(rng.integers(0, N_ACTIONS))
+            else:
+                action = int(expert_action(row, col, env.size))
+        else:
+            raise ValueError(policy)
+        next_obs, reward, done, _ = env.step(action)
+        transitions.append((obs.copy(), action, reward, next_obs.copy(), done))
+        obs = next_obs
+        if done:
+            break
+    return transitions
+
+
+@dataclass
+class OfflineDataset:
+    obs: torch.Tensor           # (N, obs_dim) float32
+    actions: torch.Tensor       # (N,) int64
+    rewards: torch.Tensor       # (N,) float32
+    next_obs: torch.Tensor      # (N, obs_dim) float32
+    dones: torch.Tensor         # (N,) float32
+    source: torch.Tensor        # (N,) int64 — 0=random, 1=expert (for analysis)
+    config: dict = field(default_factory=dict)
+
+    def __len__(self) -> int:
+        return int(self.obs.shape[0])
+
+    def save(self, path: str) -> None:
+        Path(os.path.dirname(path)).mkdir(parents=True, exist_ok=True)
+        torch.save(
+            {
+                "obs": self.obs,
+                "actions": self.actions,
+                "rewards": self.rewards,
+                "next_obs": self.next_obs,
+                "dones": self.dones,
+                "source": self.source,
+                "config": self.config,
+            },
+            path,
+        )
+
+    @classmethod
+    def load(cls, path: str) -> "OfflineDataset":
+        ckpt = torch.load(path, map_location="cpu", weights_only=False)
+        return cls(
+            obs=ckpt["obs"],
+            actions=ckpt["actions"],
+            rewards=ckpt["rewards"],
+            next_obs=ckpt["next_obs"],
+            dones=ckpt["dones"],
+            source=ckpt["source"],
+            config=ckpt.get("config", {}),
+        )
+
+
+# ---------------------------------------------------------------------------
+# Public API
+# ---------------------------------------------------------------------------
+def collect_offline_dataset(cfg: DatasetConfig | None = None) -> OfflineDataset:
+    cfg = cfg or DatasetConfig()
+    seed_everything(cfg.seed)
+    rng = np.random.default_rng(cfg.seed)
+
+    obs_buf: list[np.ndarray] = []
+    next_obs_buf: list[np.ndarray] = []
+    action_buf: list[int] = []
+    reward_buf: list[float] = []
+    done_buf: list[float] = []
+    source_buf: list[int] = []
+
+    n_random_target = int(cfg.n_transitions * cfg.random_fraction)
+    n_expert_target = cfg.n_transitions - n_random_target
+
+    env = make_env(seed=cfg.seed)
+    # First fill the random half, then the expert half.
+    for target_count, policy_name, source_id in [
+        (n_random_target, "random", 0),
+        (n_expert_target, "expert", 1),
+    ]:
+        collected = 0
+        while collected < target_count:
+            trans = _rollout(env, policy_name, cfg.expert_eps, rng, cfg.max_episode_steps)
+            for s, a, r, sp, d in trans:
+                if collected >= target_count:
+                    break
+                obs_buf.append(s)
+                action_buf.append(a)
+                reward_buf.append(r)
+                next_obs_buf.append(sp)
+                done_buf.append(float(d))
+                source_buf.append(source_id)
+                collected += 1
+
+    ds = OfflineDataset(
+        obs=torch.from_numpy(np.stack(obs_buf, axis=0).astype(np.float32)),
+        actions=torch.tensor(action_buf, dtype=torch.long),
+        rewards=torch.tensor(reward_buf, dtype=torch.float32),
+        next_obs=torch.from_numpy(np.stack(next_obs_buf, axis=0).astype(np.float32)),
+        dones=torch.tensor(done_buf, dtype=torch.float32),
+        source=torch.tensor(source_buf, dtype=torch.long),
+        config=asdict(cfg),
+    )
+    return ds
+
+
+# ---------------------------------------------------------------------------
+# Analysis helpers (used by the notebook and the plotting cell)
+# ---------------------------------------------------------------------------
+def state_action_support(ds: OfflineDataset, n_states: int, n_actions: int = N_ACTIONS) -> np.ndarray:
+    """Return an (n_states, n_actions) binary mask of which (s,a) appear in the dataset.
+
+    1.0 = at least one transition with that (s, a). 0.0 = OOD for the dataset.
+    """
+    state_idx = ds.obs.argmax(dim=1).cpu().numpy()  # one-hot -> state id
+    actions = ds.actions.cpu().numpy()
+    mask = np.zeros((n_states, n_actions), dtype=np.float32)
+    mask[state_idx, actions] = 1.0
+    return mask
+
+
+def action_histogram(ds: OfflineDataset, by_source: bool = True) -> dict[str, np.ndarray]:
+    """Build per-source action counts (random vs expert) for cell-1 visualisation."""
+    actions = ds.actions.cpu().numpy()
+    source = ds.source.cpu().numpy()
+    out: dict[str, np.ndarray] = {}
+    if by_source:
+        for label, sid in [("random", 0), ("expert", 1)]:
+            mask = source == sid
+            out[label] = np.bincount(actions[mask], minlength=N_ACTIONS)
+    out["all"] = np.bincount(actions, minlength=N_ACTIONS)
+    return out
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--n", type=int, default=DEFAULT_TRANSITIONS)
+    parser.add_argument("--out", type=str, default="data/offline_dataset.pt")
+    parser.add_argument("--seed", type=int, default=0)
+    args = parser.parse_args()
+
+    cfg = DatasetConfig(n_transitions=args.n, seed=args.seed)
+    ds = collect_offline_dataset(cfg)
+    ds.save(args.out)
+    hist = action_histogram(ds)
+    print(f"[dataset] saved {len(ds)} transitions -> {args.out}")
+    print(f"[dataset] action histogram (random) = {hist['random'].tolist()}")
+    print(f"[dataset] action histogram (expert) = {hist['expert'].tolist()}")
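
paper.md defines the overestimation gap as the mean of Q over out-of-data (s, a) pairs minus the mean over in-data pairs. Below is a minimal sketch of that computation from a support mask shaped like the one `state_action_support` returns. It uses nested lists instead of numpy arrays, `overestimation_gap` is a hypothetical helper, and the notebook's actual diagnostic code is not part of this excerpt.

```python
def overestimation_gap(q_table, support):
    """Return mean Q over OOD (s, a) pairs minus mean Q over seen pairs.

    q_table: per-state lists of Q-values, one entry per action.
    support: same shape; truthy where (s, a) appears in the dataset.
    Returns None when either group is empty, since the gap is then undefined.
    """
    seen, ood = [], []
    for q_row, mask_row in zip(q_table, support):
        for q, in_data in zip(q_row, mask_row):
            (seen if in_data else ood).append(q)
    if not seen or not ood:
        return None
    return sum(ood) / len(ood) - sum(seen) / len(seen)


if __name__ == "__main__":
    q_table = [[1.0, 2.0], [3.0, 5.0]]
    support = [[1, 0], [1, 1]]  # only (state 0, action 1) is out of data
    print(overestimation_gap(q_table, support))
```

A positive value means the critic rates unseen actions above seen ones on average, which is the failure mode the README attributes to offline DQN.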