DFP directly updates a one-step generative policy in action space through a drifting-field objective, avoiding ODE trajectory-level credit assignment during offline-to-online RL fine-tuning.
Drifting Field Policy (DFP) is a non-ODE one-step generative policy for offline-to-online reinforcement learning. DFP represents the policy as a single-pass pushforward map from Gaussian noise to actions, and frames policy improvement as a Wasserstein-2 gradient flow toward the soft policy improvement target.
Because the exact soft target is intractable, DFP uses a practical top-K critic-selected action surrogate: it samples candidate actions from the current policy, selects high-value actions with the critic, and trains the drifting field toward those actions. This keeps inference one-step while directly applying online reward signals at the action level.
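The selection step described above can be illustrated with a minimal, dependency-free sketch. It is not the repository's implementation: `one_step_policy`, `toy_critic`, and `select_top_k` are hypothetical stand-ins, and the drifting-field loss itself is omitted. The sketch only shows how candidates sampled from a one-step policy might be ranked by a critic so that the top K become training targets.

```python
import random

def select_top_k(candidates, q_values, k):
    """Return the k candidate actions with the highest critic values."""
    ranked = sorted(zip(q_values, candidates), key=lambda pair: pair[0], reverse=True)
    return [action for _, action in ranked[:k]]

def one_step_policy(noise, shift):
    # Hypothetical stand-in for the single-pass map from Gaussian noise to an action.
    return [z + shift for z in noise]

def toy_critic(action):
    # Hypothetical critic (assumption): prefers actions close to the point (1, 1).
    return -sum((a - 1.0) ** 2 for a in action)

rng = random.Random(0)

# 1. Sample candidate actions from the current policy.
candidates = [
    one_step_policy([rng.gauss(0, 1), rng.gauss(0, 1)], shift=0.0)
    for _ in range(16)
]

# 2. Score every candidate with the critic.
q_values = [toy_critic(a) for a in candidates]

# 3. Keep the top-K actions; these would serve as targets for the drifting field.
targets = select_top_k(candidates, q_values, k=4)
```

In this toy setting, `targets` holds the four sampled actions closest to (1, 1). The real agent would instead use its learned critic and train the policy toward the selected actions.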
This release contains:
- drift: Drifting Field Policy.
- meanflow: Mean Velocity Policy comparison backbone.
- acfql: QC/FQL baseline retained from the original action-chunking codebase.

- Python: 3.10
- CUDA: 12.x
- Benchmarks: Robomimic, OGBench
conda env create -f environment.yml
conda activate dfp

Or install the pip dependencies manually:
conda create -n dfp python=3.10 pip -y
conda activate dfp
pip install -r requirements.txt

Place the low-dimensional Robomimic datasets under the standard Robomimic directory:
~/.robomimic/lift/mh/low_dim_v15.hdf5
~/.robomimic/can/mh/low_dim_v15.hdf5
~/.robomimic/square/mh/low_dim_v15.hdf5
If your datasets live elsewhere, set:
export ROBOMIMIC_DATASET_DIR=/path/to/robomimic

The datasets can be downloaded from the Robomimic dataset page: https://robomimic.github.io/docs/datasets/robomimic_v0.1.html
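To sanity-check the dataset layout before launching a run, a small helper along the following lines may be useful. It is a sketch under assumptions: the function name `robomimic_dataset_path` is hypothetical, and the fallback logic (use `ROBOMIMIC_DATASET_DIR` if set, otherwise `~/.robomimic`) mirrors the description above rather than the repository's actual loader.

```python
import os
from pathlib import Path

def robomimic_dataset_path(task, env=None):
    """Build the expected path of a low-dimensional multi-human (mh) dataset.

    Assumption: ROBOMIMIC_DATASET_DIR, when set, replaces ~/.robomimic as the root.
    """
    env = os.environ if env is None else env
    root = env.get("ROBOMIMIC_DATASET_DIR")
    base = Path(root) if root else Path.home() / ".robomimic"
    return base / task / "mh" / "low_dim_v15.hdf5"

# Report which of the three expected files are present on this machine.
for task in ("lift", "can", "square"):
    path = robomimic_dataset_path(task)
    print(f"{task}: {path} (exists: {path.exists()})")
```

Passing an explicit `env` mapping makes the helper easy to test without touching the real environment variables.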
For cube-quadruple, we use the 100M-size offline dataset:
wget -r -np -nH --cut-dirs=2 -A "*.npz" \
https://rail.eecs.berkeley.edu/datasets/ogbench/cube-quadruple-play-100m-v0/

Pass the downloaded directory with:
--ogbench_dataset_dir=/path/to/cube-quadruple-play-100m-v0

MVP (Mean Velocity Policy) is our main comparison baseline. Since no official implementation was available, we implemented the baseline ourselves. Most hyperparameters follow the MVP paper, but for the cube-triple experiments we set ivc_lambda=0 because it gave the strongest performance in our runs.
We were not able to fully reproduce the performance reported in the paper across all settings, so the MVP results in our experiments use the best-performing configuration we found.
The main results are offline-to-online runs. Each command first trains on the offline dataset and then continues online fine-tuning in the same run.
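The two-phase schedule can be pictured with the following sketch. It is illustrative only: `training_phases`, `offline_steps`, `online_steps`, and `start_step` are hypothetical names, and the actual step counts and loop structure in `main.py` may differ.

```python
def training_phases(offline_steps, online_steps, start_step=0):
    """Yield (step, phase) pairs for one combined offline-to-online run.

    Steps 1..offline_steps are labeled "offline"; later steps are "online".
    A nonzero start_step mimics resuming from a checkpoint saved at that step.
    """
    total = offline_steps + online_steps
    for step in range(start_step + 1, total + 1):
        phase = "offline" if step <= offline_steps else "online"
        yield step, phase

# A full run: three offline steps followed by two online steps.
full_run = list(training_phases(offline_steps=3, online_steps=2))

# Resuming at the end of the offline phase leaves only the online steps.
resumed = list(training_phases(offline_steps=3, online_steps=2, start_step=3))
```

Under this assumption, starting the step counter at the end of the offline phase skips offline training entirely, which is the effect of restoring an offline checkpoint.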
# DFP
MUJOCO_GL=egl python main.py --agent_config=drift --run_group=reproduce --env_name=cube-triple-play-singletask-task2-v0 --sparse=False --horizon_length=5
# MVP
MUJOCO_GL=egl python main.py --agent_config=meanflow --run_group=reproduce --env_name=cube-triple-play-singletask-task2-v0 --sparse=False --horizon_length=5
# QC-BFN
MUJOCO_GL=egl python main.py --run_group=reproduce --agent.actor_type=best-of-n --agent.actor_num_samples=32 --env_name=cube-triple-play-singletask-task2-v0 --sparse=False --horizon_length=5
# QC-FQL
MUJOCO_GL=egl python main.py --run_group=reproduce --agent.alpha=100 --env_name=cube-triple-play-singletask-task2-v0 --sparse=False --horizon_length=5
# BFN
MUJOCO_GL=egl python main.py --run_group=reproduce --agent.actor_type=best-of-n --agent.actor_num_samples=4 --env_name=cube-triple-play-singletask-task2-v0 --sparse=False --horizon_length=1
# FQL
MUJOCO_GL=egl python main.py --run_group=reproduce --agent.alpha=100 --env_name=cube-triple-play-singletask-task2-v0 --sparse=False --horizon_length=1

The default agent is acfql, so the QC-BFN, QC-FQL, BFN, and FQL commands do not need an explicit --agent_config=acfql. Override the environment when needed:
MUJOCO_GL=egl python main.py \
--agent_config=drift \
--run_group=reproduce \
--env_name=cube-quadruple-play-100m-singletask-task3-v0 \
--ogbench_dataset_dir=/path/to/cube-quadruple-play-100m-v0 \
--seed=42

To skip offline training and start online fine-tuning from a saved offline checkpoint, pass the checkpoint with --restore_path and set --restore_epoch to the offline training horizon:
MUJOCO_GL=egl python main.py \
--agent_config=drift \
--run_group=reproduce \
--env_name=cube-triple-play-singletask-task3-v0 \
--restore_path=/path/to/params_offline_final.pkl \
--restore_epoch=1000000 \
--seed=42

| Path | Description |
|---|---|
| agents/ | DFP, MVP, and QC/FQL baseline agents |
| config/ | Main, evaluation, optimizer, and agent configs |
| envs/ | Robomimic, OGBench, and D4RL environment utilities |
| utils/ | Datasets, networks, drifting loss, logging, and Flax utilities |

If you find our work useful, please consider citing:
@article{koo2026drifting,
title={Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow},
author={Koo, Juil and Park, Mingue and Choi, Jiwon and Min, Yunhong and Sung, Minhyuk},
journal={arXiv preprint arXiv:2605.07727},
year={2026}
}

This repository builds on the Q-chunking/FQL codebase.