64 changes: 64 additions & 0 deletions labs/rl_decision/lab_cql_offline_minigrid/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
# lab · CQL on an 8×8 offline-RL gridworld

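> A minimal but faithful reproduction of CQL (Conservative Q-Learning, Kumar et al. 2020): an 8×8 discrete grid, a fixed offline dataset of 10,000 transitions, and three trainers (BC / DQN / CQL) sharing the same MLP critic. The whole notebook runs in under 2 minutes on CPU.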

## Quickstart

```bash
cd labs/rl_decision/lab_cql_offline_minigrid
pip install -r requirements.txt

# (Optional) run a single trainer on its own; each takes < 20 s:
python -m src.dataset # collect and save data/offline_dataset.pt
python -m src.trainer_bc # data/bc.pt
python -m src.trainer_dqn # data/dqn.pt
python -m src.trainer_cql # data/cql.pt (alpha=auto)
python -m src.trainer_cql --alpha-mode fixed --alpha-fixed 5.0 # ablation
python -m src.trainer_cql --alpha-mode zero # = DQN

# End-to-end story (~80 s):
jupyter nbconvert --execute --to notebook --inplace notebook.ipynb
```

## What this lab proves

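- **Offline DQN overestimates OOD actions.** The right-hand panel of `q_overestimation.png` (cells 3 / 4) plots `Q_OOD − Q_seen` directly: DQN flips positive to +0.45, while CQL is pushed steadily down to −1.3.
- **CQL's log-sum-exp penalty keeps Q firmly on the data manifold.** The OOD-density plot (cell 5) shows that the greedy actions CQL picks have higher probability under the behavior policy than DQN's, and that CQL no longer collapses onto the two actions right + down the way DQN does.
- **Adaptive α beats the fixed α values tested.** The ablation compares α=0 / fixed-5 / auto head to head: with α=0 (= DQN) the evaluation curve swings violently within ±0.85; fixed-5 pushes Q_seen up to 3.6 (over-conservative); auto tunes α to ~0.6, giving Q magnitudes closest to the true returns.
- **CQL ≈ BC on this trivial task**: the environment is so small that BC's mode-cloning is already enough to reach the optimum. This matches what the CQL paper stresses repeatedly: CQL's real value shows on harder tasks such as D4RL or real-world driving. This lab makes the *mechanism* clear; scaling is left to the stretch goals.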
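
The `Q_OOD − Q_seen` gap quoted in the first bullet can be computed from a table of Q-values and a dataset support mask. The following is a minimal NumPy sketch, assuming the Q-values are available as an `(n_states, n_actions)` array and the mask has the same shape as the output of `state_action_support` in `src/dataset.py`. The function name `overestimation_gap` and the toy numbers are illustrative assumptions, not taken from the lab.

```python
import numpy as np


def overestimation_gap(q_values: np.ndarray, support_mask: np.ndarray) -> float:
    """Mean Q over out-of-dataset (s, a) pairs minus mean Q over in-dataset pairs."""
    seen = support_mask > 0.5
    if seen.all() or not seen.any():
        raise ValueError("need both seen and unseen (s, a) pairs")
    return float(q_values[~seen].mean() - q_values[seen].mean())


# Toy example: 2 states x 4 actions (hypothetical numbers).
q = np.array([[0.2, 0.9, 0.1, 0.4],
              [0.5, 0.3, 1.1, 0.0]])
mask = np.array([[1, 0, 1, 1],
                 [1, 1, 0, 0]], dtype=np.float32)
gap = overestimation_gap(q, mask)
print(f"gap = {gap:+.3f}")  # positive means OOD actions look better than seen ones
```

A positive value means the critic rates unseen actions above seen ones on average, which is the failure mode attributed to offline DQN above.
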

## File contract

```
.
├── README.md
├── notebook.ipynb                  ← 7-cell executable story
├── paper.md                        ← 250-token distillation, links to the paper_cql card
├── requirements.txt
├── src/
│   ├── __init__.py
│   ├── env.py                      ← 8×8 sparse-reward grid + 10% slip
│   ├── dataset.py                  ← 50% random + 50% ε-greedy expert collector
│   ├── model.py                    ← shared QNet MLP (used by BC / DQN / CQL)
│   ├── trainer_bc.py               ← cross-entropy BC (full dataset by default)
│   ├── trainer_dqn.py              ← TD(0) + target net, purely offline
│   ├── trainer_cql.py              ← TD + log-sum-exp penalty, adaptive α (Lagrangian)
│   ├── viz.py                      ← all matplotlib plotting
│   └── seeds.py
├── assets/                         ← PNGs generated by the notebook
│   ├── action_histogram.png
│   ├── q_overestimation.png        ← key figure: DQN diverges / CQL stays conservative
│   ├── ood_action_density.png      ← density of greedy actions under the behavior policy
│   ├── eval_returns.png            ← BC / DQN / CQL evaluation curves
│   ├── ablation_alpha.png          ← α=0 / fixed-5 / auto
│   └── ablation_alpha_traj.png     ← trajectories of α and the gap
└── data/
    ├── offline_dataset.pt          ← 10k offline transitions
    └── bc.pt / dqn.pt / cql.pt     ← checkpoints of the three trainers
```

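## Three stretch goals (also listed at the end of the notebook)

1. **Make the reward purely sparse (+1 only, no step penalty) and set `max_steps=200`**: with nothing bounding Q, DQN's divergence can really take off; CQL's `log-sum-exp` term should still hold it down.
2. **Add IQL as a comparison**: reuse `dataset.py` and `model.py`, write `trainer_iql.py`, and use expectile regression (τ=0.7) with KL-constrained policy extraction. CQL penalizes OOD actions, IQL avoids querying them; this is the core practical contrast between the two approaches.
3. **Dataset-size sweep**: sweep `n_transitions` over {500, 1k, 2k, 5k, 10k, 20k} and plot final evaluation return against dataset size for the three algorithms. CQL's sample-complexity advantage is the real research question.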
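
For stretch goal 2, the expectile regression at the heart of IQL is a small asymmetric squared loss. A minimal NumPy sketch of just that loss follows, assuming `diff = Q(s, a) − V(s)` has already been computed for a batch. The function name `expectile_loss` is a hypothetical helper, and a real `trainer_iql.py` would need a differentiable (e.g. PyTorch) version.

```python
import numpy as np


def expectile_loss(diff: np.ndarray, tau: float = 0.7) -> float:
    """Mean of |tau - 1(diff < 0)| * diff**2, with diff = Q(s, a) - V(s)."""
    # Positive errors (V below Q) are weighted by tau, negative ones by 1 - tau.
    weight = np.where(diff < 0, 1.0 - tau, tau)
    return float((weight * diff ** 2).mean())


under = expectile_loss(np.array([1.0]))    # V underestimates Q: weight tau
over = expectile_loss(np.array([-1.0]))    # V overestimates Q: weight 1 - tau
print(under, over)
```

With τ > 0.5 the loss penalizes underestimating Q more than overestimating it, so V is pulled toward an upper expectile of the in-dataset Q-values without ever evaluating unseen actions.
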
439 changes: 439 additions & 0 deletions labs/rl_decision/lab_cql_offline_minigrid/notebook.ipynb

Large diffs are not rendered by default.

31 changes: 31 additions & 0 deletions labs/rl_decision/lab_cql_offline_minigrid/paper.md
@@ -0,0 +1,31 @@
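# Locking Q-learning inside the cage of the data: a minimal reproduction of CQL on an 8×8 grid

Linked Atlas card: [paper_cql](../../../docs/data/cards/extended/paper_cql.md).

The core difficulty of offline RL is that the Bellman backup queries actions the dataset does not cover, so $Q(s, a^{\text{OOD}})$ is optimistically inflated with no data to correct it, which in turn pushes the policy toward actions that are catastrophic in the real world. CQL (Kumar et al. 2020) offers a clean fix: add to the ordinary TD loss the term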
$$
\alpha \, \mathbb{E}_{s \sim D}\!\Big[\log\!\sum_{a}\exp Q(s,a) - \mathbb{E}_{a\sim\hat\pi_\beta(\cdot \mid s)} Q(s, a)\Big],
$$
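That is, take a log-sum-exp over all actions and subtract the expected $Q$ under the dataset's behavior policy. The first term pushes down the $Q$ of *all* actions; the second pulls the $Q$ of *in-dataset* actions back up. The net effect is that out-of-dataset actions are pushed down substantially while in-dataset actions barely move. This is the mechanism behind the paper's lower-bound guarantees: the bound on the true $Q$ is pointwise for the variant without the data term, and holds for the policy's value (in expectation over actions) once the data term is subtracted.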
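
The penalty inside the brackets is easy to evaluate for a batch. Below is a minimal NumPy sketch of the forward computation only (no gradients), assuming the expectation under $\hat\pi_\beta$ is estimated with the single action stored in each dataset transition. The function name `cql_penalty` is illustrative, and this is not claimed to be the code in `trainer_cql.py`.

```python
import numpy as np


def cql_penalty(q_values: np.ndarray, data_actions: np.ndarray) -> float:
    """Batch mean of logsumexp_a Q(s, a) - Q(s, a_data).

    q_values: (B, n_actions) Q-values for a batch of dataset states.
    data_actions: (B,) integer actions stored in the dataset for those states.
    """
    q_max = q_values.max(axis=1, keepdims=True)  # subtract the max for numerical stability
    lse = q_max[:, 0] + np.log(np.exp(q_values - q_max).sum(axis=1))
    q_data = q_values[np.arange(len(data_actions)), data_actions]
    return float((lse - q_data).mean())


# All actions equal: the penalty is log(n_actions).
uniform = cql_penalty(np.zeros((3, 4)), np.array([0, 1, 2]))
# The dataset action already has the highest Q: penalty close to 0.
peaked = cql_penalty(np.array([[10.0, 0.0, 0.0, 0.0]]), np.array([0]))
# An unseen action has the highest Q: large penalty.
ood_high = cql_penalty(np.array([[10.0, 0.0, 0.0, 0.0]]), np.array([1]))
print(uniform, peaked, ood_high)
```

Because log-sum-exp is at least the maximum over actions, the penalty is never negative, and it is smallest when the dataset action is the one with the highest Q.
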

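This lab turns that theory into visible plots on an 8×8 discrete gridworld:

1. **Data**: 50% random + 50% ε-greedy expert (ε=0.3), 10,000 transitions in total. The expert heads straight for the bottom-right corner while the random policy covers the out-of-the-way cells; the 10% slip guarantees that evaluation trajectories drift off the deterministic main line of the training data.
2. **Model**: the same two-layer MLP critic (hidden=128) is trained for 5,000 steps with batch=256 and Adam lr=3e-4, using BC (cross-entropy on actions), offline DQN (TD(0) + target net), and CQL (TD + log-sum-exp term) respectively.
3. **CQL implementation details**: α is adapted via a Lagrangian dual (target gap=5.0, dual lr=1e-4); the ablation compares α=0, α=fixed-5, and α=auto.
4. **Diagnostics**: every 25 steps, record the mean of `Q` over in-dataset (s,a) pairs and over out-of-dataset (s,a) pairs, and treat their difference `Q_OOD − Q_seen` as the "overestimation gap"; every 100 steps, evaluate with 20 greedy episodes.
5. **Story**: DQN's overestimation gap flips positive to +0.45 at ~1,500 steps; after that, even though the TD loss converges to ≈ 0, greedy evaluation keeps swinging between 0.85 and −0.50 (Q is overestimated on OOD actions, so each argmax may point down a new OOD path). CQL's gap stays ≤ 0 and keeps falling to −1.3, with evaluation holding steady at 0.85. BC also reaches 0.85 through mode-cloning, but its Q-net has no explicit notion of *return*; it merely copies the mode of the behavior policy. Raising ε to 0.7 or switching to a higher-slip environment would expose BC's inability to *reason* (a natural follow-up experiment).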
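
The Lagrangian adaptation in item 3 can be sketched as gradient ascent on `alpha * (penalty − target_gap)`. The version below is a minimal pure-Python sketch under two assumptions that the text does not confirm: α is parameterised as `exp(log_alpha)` to keep it positive, and the step is plain gradient ascent rather than an optimizer such as Adam. The name `dual_alpha_step` is hypothetical.

```python
import math


def dual_alpha_step(log_alpha: float, penalty: float,
                    target_gap: float = 5.0, dual_lr: float = 1e-4) -> float:
    """One gradient-ascent step on alpha * (penalty - target_gap) w.r.t. log_alpha."""
    alpha = math.exp(log_alpha)
    grad = alpha * (penalty - target_gap)  # d/d(log_alpha) of alpha * (penalty - target_gap)
    return log_alpha + dual_lr * grad


log_alpha = 0.0                                       # alpha = 1
raised = dual_alpha_step(log_alpha, penalty=6.0)      # penalty above target: alpha grows
lowered = dual_alpha_step(log_alpha, penalty=1.0)     # penalty below target: alpha shrinks
print(math.exp(raised), math.exp(lowered))
```

The effect is that α rises while the conservative penalty exceeds the target and decays once the penalty falls below it, which is how a setting like `auto` can settle on a moderate α instead of a hand-picked constant.
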

```mermaid
flowchart LR
    D[("offline dataset / 10k transitions / random + epsilon-greedy expert")]
    D --> BC["BC / cross-entropy on actions"]
    D --> DQN["DQN / TD(0) with target net"]
    D --> CQL["CQL / TD + alpha * (logsumexp - E_D Q)"]
    BC --> Eval["greedy eval / 20 episodes"]
    DQN --> Eval
    CQL --> Eval
    CQL <-->|"alpha (Lagrangian, target gap=5)"| Dual["dual update"]
```

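More broadly, CQL and [IQL](../../../docs/data/cards/extended/paper_iql.md) form a "penalize OOD vs. avoid OOD" contrast, and CQL and [Decision Transformer](../../../docs/data/cards/extended/paper_decision_transformer.md) form a "value function vs. sequence modeling" contrast. This lab is the minimal entry point to [paradigm_offline_rl](../../../docs/data/cards/extended/paradigm_offline_rl.md); the suggested next stop is reproducing IQL on the same dataset for a side-by-side comparison.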
7 changes: 7 additions & 0 deletions labs/rl_decision/lab_cql_offline_minigrid/requirements.txt
@@ -0,0 +1,7 @@
numpy==2.1.3
matplotlib==3.10.9
torch==2.12.0
nbformat==5.10.4
nbconvert==7.17.1
ipykernel==6.30.1
jupyter==1.1.1
1 change: 1 addition & 0 deletions labs/rl_decision/lab_cql_offline_minigrid/src/__init__.py
@@ -0,0 +1 @@
"""Offline-RL reproduction lab: BC vs DQN vs CQL on an 8x8 gridworld."""
219 changes: 219 additions & 0 deletions labs/rl_decision/lab_cql_offline_minigrid/src/dataset.py
@@ -0,0 +1,219 @@
"""Collect and save the offline dataset.

Composition (mirrors common offline-RL benchmark splits):
    * 50% random policy: uniform over the 4 actions.
    * 50% epsilon-greedy expert with epsilon=0.3.

Total transitions: ``DEFAULT_TRANSITIONS = 10_000``. Episodes are
short (max 50 steps), so this is roughly 200-300 episodes per source.

The saved file is a torch checkpoint with the per-transition tensors, a
per-transition ``source`` label (0=random, 1=expert) and the collection
config. The empirical per-state action distribution, which the CQL
regulariser (E[Q(s, a)] under the data distribution) and the OOD-action
density plot rely on, is not stored in the file; it can be recomputed from
``obs`` and ``actions`` (see ``state_action_support`` and
``action_histogram`` below).

Run standalone::

    python -m src.dataset
"""

from __future__ import annotations

import argparse
import os
from dataclasses import dataclass, field, asdict
from pathlib import Path

import numpy as np
import torch

from .env import GRID_SIZE, N_ACTIONS, GridWorld, expert_action, make_env
from .seeds import seed_everything


# ---------------------------------------------------------------------------
# Config
# ---------------------------------------------------------------------------
DEFAULT_TRANSITIONS = 10_000
EXPERT_EPS = 0.3


@dataclass
class DatasetConfig:
    n_transitions: int = DEFAULT_TRANSITIONS
    expert_eps: float = EXPERT_EPS
    random_fraction: float = 0.5
    max_episode_steps: int = 50
    seed: int = 0


# ---------------------------------------------------------------------------
# Rollout helpers
# ---------------------------------------------------------------------------
def _rollout(
    env: GridWorld,
    policy: str,
    expert_eps: float,
    rng: np.random.Generator,
    max_steps: int,
) -> list[tuple[np.ndarray, int, float, np.ndarray, bool]]:
    """Run a single episode under either the random or expert epsilon-greedy policy.

    Returns a list of (s, a, r, s', done) tuples.
    """
    transitions: list[tuple[np.ndarray, int, float, np.ndarray, bool]] = []
    obs = env.reset()
    for _ in range(max_steps):
        pos = np.argmax(obs)
        row, col = int(pos // env.size), int(pos % env.size)
        if policy == "random":
            action = int(rng.integers(0, N_ACTIONS))
        elif policy == "expert":
            if rng.random() < expert_eps:
                action = int(rng.integers(0, N_ACTIONS))
            else:
                action = int(expert_action(row, col, env.size))
        else:
            raise ValueError(policy)
        next_obs, reward, done, _ = env.step(action)
        transitions.append((obs.copy(), action, reward, next_obs.copy(), done))
        obs = next_obs
        if done:
            break
    return transitions


@dataclass
class OfflineDataset:
    obs: torch.Tensor  # (N, obs_dim) float32
    actions: torch.Tensor  # (N,) int64
    rewards: torch.Tensor  # (N,) float32
    next_obs: torch.Tensor  # (N, obs_dim) float32
    dones: torch.Tensor  # (N,) float32
    source: torch.Tensor  # (N,) int64: 0=random, 1=expert (for analysis)
    config: dict = field(default_factory=dict)

    def __len__(self) -> int:
        return int(self.obs.shape[0])

    def save(self, path: str) -> None:
        Path(os.path.dirname(path)).mkdir(parents=True, exist_ok=True)
        torch.save(
            {
                "obs": self.obs,
                "actions": self.actions,
                "rewards": self.rewards,
                "next_obs": self.next_obs,
                "dones": self.dones,
                "source": self.source,
                "config": self.config,
            },
            path,
        )

    @classmethod
    def load(cls, path: str) -> "OfflineDataset":
        ckpt = torch.load(path, map_location="cpu", weights_only=False)
        return cls(
            obs=ckpt["obs"],
            actions=ckpt["actions"],
            rewards=ckpt["rewards"],
            next_obs=ckpt["next_obs"],
            dones=ckpt["dones"],
            source=ckpt["source"],
            config=ckpt.get("config", {}),
        )


# ---------------------------------------------------------------------------
# Public API
# ---------------------------------------------------------------------------
def collect_offline_dataset(cfg: DatasetConfig | None = None) -> OfflineDataset:
    """Collect ``cfg.n_transitions`` transitions: the random share first, then the expert share."""
    cfg = cfg or DatasetConfig()
    seed_everything(cfg.seed)
    rng = np.random.default_rng(cfg.seed)

    obs_buf: list[np.ndarray] = []
    next_obs_buf: list[np.ndarray] = []
    action_buf: list[int] = []
    reward_buf: list[float] = []
    done_buf: list[float] = []
    source_buf: list[int] = []

    n_random_target = int(cfg.n_transitions * cfg.random_fraction)
    n_expert_target = cfg.n_transitions - n_random_target

    env = make_env(seed=cfg.seed)
    # First fill the random half, then the expert half.
    for target_count, policy_name, source_id in [
        (n_random_target, "random", 0),
        (n_expert_target, "expert", 1),
    ]:
        collected = 0
        while collected < target_count:
            trans = _rollout(env, policy_name, cfg.expert_eps, rng, cfg.max_episode_steps)
            for s, a, r, sp, d in trans:
                # Truncate the last episode so each source hits its target exactly.
                if collected >= target_count:
                    break
                obs_buf.append(s)
                action_buf.append(a)
                reward_buf.append(r)
                next_obs_buf.append(sp)
                done_buf.append(float(d))
                source_buf.append(source_id)
                collected += 1

    ds = OfflineDataset(
        obs=torch.from_numpy(np.stack(obs_buf, axis=0).astype(np.float32)),
        actions=torch.tensor(action_buf, dtype=torch.long),
        rewards=torch.tensor(reward_buf, dtype=torch.float32),
        next_obs=torch.from_numpy(np.stack(next_obs_buf, axis=0).astype(np.float32)),
        dones=torch.tensor(done_buf, dtype=torch.float32),
        source=torch.tensor(source_buf, dtype=torch.long),
        config=asdict(cfg),
    )
    return ds


# ---------------------------------------------------------------------------
# Analysis helpers (used by the notebook and the plotting cell)
# ---------------------------------------------------------------------------
def state_action_support(ds: OfflineDataset, n_states: int, n_actions: int = N_ACTIONS) -> np.ndarray:
    """Return an (n_states, n_actions) binary mask of which (s,a) appear in the dataset.

    1.0 = at least one transition with that (s, a). 0.0 = OOD for the dataset.
    """
    state_idx = ds.obs.argmax(dim=1).cpu().numpy()  # one-hot -> state id
    actions = ds.actions.cpu().numpy()
    mask = np.zeros((n_states, n_actions), dtype=np.float32)
    mask[state_idx, actions] = 1.0
    return mask


def action_histogram(ds: OfflineDataset, by_source: bool = True) -> dict[str, np.ndarray]:
    """Build per-source action counts (random vs expert) for cell-1 visualisation."""
    actions = ds.actions.cpu().numpy()
    source = ds.source.cpu().numpy()
    out: dict[str, np.ndarray] = {}
    if by_source:
        for label, sid in [("random", 0), ("expert", 1)]:
            mask = source == sid
            out[label] = np.bincount(actions[mask], minlength=N_ACTIONS)
    out["all"] = np.bincount(actions, minlength=N_ACTIONS)
    return out


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--n", type=int, default=DEFAULT_TRANSITIONS)
    parser.add_argument("--out", type=str, default="data/offline_dataset.pt")
    parser.add_argument("--seed", type=int, default=0)
    args = parser.parse_args()

    cfg = DatasetConfig(n_transitions=args.n, seed=args.seed)
    ds = collect_offline_dataset(cfg)
    ds.save(args.out)
    hist = action_histogram(ds)
    print(f"[dataset] saved {len(ds)} transitions -> {args.out}")
    print(f"[dataset] action histogram (random) = {hist['random'].tolist()}")
    print(f"[dataset] action histogram (expert) = {hist['expert'].tolist()}")