LabVLA turns a Qwen3-VL-4B-Instruct vision–language backbone into a real-time robot controller through a DiT flow-matching action expert, trained with the π0.5 recipe: FAST action-token pre-training → flow-matching post-training with knowledge insulation → task fine-tuning. This README covers installation and usage — method details are in the paper.
✨ Features • 📋 TODO • 📦 Installation • 🚀 Quick Start • 🎓 Training • 🔧 Fine-tuning • 📡 Deployment • 📝 Citation
🎓 Training — every stage of the recipe, one codebase
| Mode | What it does |
|---|---|
| VLM pre-training | FAST action-token cross-entropy on the VLM backbone. |
| Flow-matching post-training | Trains the DiT action expert to generate 50-step continuous action chunks. |
| Knowledge Insulation (KI) | Stop-gradient between VLM and action expert. |
| Task fine-tuning | Fine-tuning for downstream tasks. |
| Multi-dataset & VQA co-training | π0-style mixture with homogeneous batches. |
| delta / abs action modes | Per-dimension delta_mask — arm joints delta, gripper absolute, in one vector. |
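As an illustration of the per-dimension delta_mask idea in the last row, here is a minimal sketch. It assumes that delta dimensions are expressed relative to the robot state at the start of the chunk; the repository may use a different reference (for example the previous step). The function names and the 3-D action layout are hypothetical.

```python
from typing import List, Sequence


def encode_actions(
    abs_chunk: Sequence[Sequence[float]],
    state: Sequence[float],
    delta_mask: Sequence[bool],
) -> List[List[float]]:
    """Express masked dimensions relative to the current state; keep the rest absolute."""
    return [
        [a - s if is_delta else a for a, s, is_delta in zip(step, state, delta_mask)]
        for step in abs_chunk
    ]


def decode_actions(
    mixed_chunk: Sequence[Sequence[float]],
    state: Sequence[float],
    delta_mask: Sequence[bool],
) -> List[List[float]]:
    """Invert encode_actions: add the state back on delta dimensions only."""
    return [
        [a + s if is_delta else a for a, s, is_delta in zip(step, state, delta_mask)]
        for step in mixed_chunk
    ]


# Hypothetical 3-D action: two arm joints (delta) and one gripper command (absolute).
delta_mask = [True, True, False]
state = [0.5, -0.25, 0.0]
abs_chunk = [[0.75, -0.25, 1.0], [1.0, 0.0, 1.0]]

mixed = encode_actions(abs_chunk, state, delta_mask)
print(mixed)  # [[0.25, 0.0, 1.0], [0.5, 0.25, 1.0]]
restored = decode_actions(mixed, state, delta_mask)
print(restored == abs_chunk)  # True
```

Keeping the mask per dimension means a single action vector can mix both conventions, so the gripper command stays an absolute open/close target while the joints are predicted as offsets.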
⚙️ Engineering
- 🚀 Efficiency: selective gradient checkpointing (only a subset of modules, e.g. the visual encoder or the language model, is checkpointed per stage), Liger-Kernel fused ops, DeepSpeed ZeRO-2, and EMA offload together sustain per-GPU batch sizes of 48 to 64 on 80 GB A100s with minimal speed penalty.
| Stage | # A100 80 GB GPUs | BS / GPU | Global BS | ~ s / step |
|---|---|---|---|---|
| VLM Pre-training | 24 (3 × 8) | 64 | 1 536 | ≈ 7 |
| KI Post-training | 16 (2 × 8) | 64 | 1 024 | ≈ 5 |
| Task Fine-tuning | 4 | 48 | 192 | ≈ 3 |
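The Global BS column follows from the GPU count and the per-GPU batch size. The short check below reproduces it, assuming no gradient accumulation (an assumption, although the products match the table exactly); the function name is illustrative.

```python
def global_batch_size(num_gpus: int, per_gpu_batch: int, grad_accum_steps: int = 1) -> int:
    """Samples consumed per optimizer step across all GPUs."""
    return num_gpus * per_gpu_batch * grad_accum_steps


# (number of GPUs, batch size per GPU) as listed in the table above.
stages = {
    "VLM Pre-training": (24, 64),
    "KI Post-training": (16, 64),
    "Task Fine-tuning": (4, 48),
}

for name, (gpus, per_gpu) in stages.items():
    print(f"{name}: {global_batch_size(gpus, per_gpu)}")
# VLM Pre-training: 1536
# KI Post-training: 1024
# Task Fine-tuning: 192
```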
We are actively organizing the training code and will continue to optimize and maintain it!
Python 3.10 · CUDA 12.6 · PyTorch 2.7.1 — pinned versions in requirements.txt.
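Once the install commands in this section have run, a small check like the following can confirm that the environment matches the pins. This is a sketch using only the standard library; the three pins are taken from the commands in this section, and the helper name check_pins is hypothetical. Wheels from the PyTorch index usually carry a local tag such as +cu126, so only the public part of the version is compared.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Pins copied from the install commands in this README.
PINS = {"torch": "2.7.1", "torchvision": "0.22.1", "flash_attn": "2.8.3"}


def check_pins(pins: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str], bool]]:
    """Return {package: (expected, installed_or_None, matches)} for each pin."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Drop a local version tag such as "+cu126" before comparing.
        ok = installed is not None and installed.split("+")[0] == expected
        report[name] = (expected, installed, ok)
    return report


for pkg, (expected, installed, ok) in check_pins(PINS).items():
    status = "OK" if ok else "MISMATCH" if installed else "MISSING"
    print(f"{pkg}: expected {expected}, found {installed} [{status}]")
```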
conda create -n labvla python=3.10 -y && conda activate labvla
# 1. PyTorch (CUDA 12.6) → 2. FlashAttention (built against it) → 3. everything else
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126
pip install flash_attn==2.8.3 --no-build-isolation
pip install -r requirements.txt

Prerequisite: You need task-specific fine-tuning data, either collected on your own robot or from a simulator (e.g. LabUtopia). Organize it into LeRobot v2 format and register a DatasetSchema under schemas/.
1. Fine-tune: adapt a post-trained checkpoint to your task.

# Edit launch/finetune/train_labutopia.sh:
# - Set PretrainedCkpt to your post-trained checkpoint path
# - Set DataRoot / RepoIds / DatasetSchema to point at your data
# - Set ExternalStatsPath to your dataset's normalization stats
bash launch/finetune/train_labutopia.sh

2. Deploy: start a WebSocket inference server.

PRETRAINED_PATH=/path/to/your/checkpoint bash deployment/deploy.sh

3. Evaluate: connect your robot or simulator client to the server and run rollouts. See Deployment for configuration details.
One entrypoint (scripts/train.py) for all stages, launched through Accelerate + DeepSpeed ZeRO-2 (bf16). Edit the env vars inside each script before running.
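For orientation, a DeepSpeed configuration for ZeRO-2 with bf16 could look roughly like the dictionary built below. The keys are standard DeepSpeed config fields, but the configuration actually shipped with the launch scripts is not shown in this README, so the values and the helper name make_zero2_config are assumptions.

```python
import json


def make_zero2_config(per_gpu_batch: int, grad_accum_steps: int = 1) -> dict:
    """Build a minimal DeepSpeed config: ZeRO stage 2 with bf16 mixed precision."""
    return {
        "train_micro_batch_size_per_gpu": per_gpu_batch,
        "gradient_accumulation_steps": grad_accum_steps,
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": 2},
    }


# 64 per GPU matches the pre-training and post-training rows of the table above.
config = make_zero2_config(per_gpu_batch=64)
print(json.dumps(config, indent=2))
```

ZeRO stage 2 shards optimizer states and gradients across data-parallel ranks while keeping a full copy of the parameters on each GPU.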
python -m data_process scan --root /path/to/dataset --out /tmp/report.json # detect bad episodes
python -m data_process clean --src /path/to/dataset --dst /path/to/clean \
--report /tmp/report.json # apply report (symlink copy)
python -m data_process stats --dataset /path/to/clean --schema robointer_droid # normalization stats

| Subcommand | Purpose |
|---|---|
| scan | Detect bad episodes (corrupt video, decode failures, missing files). |
| clean | Apply a scan report as a renumbered symlink copy; originals untouched. |
| stats | Compute normalization statistics → meta/stats.json. |
| validate | Cross-repo integrity checks. |
| preflight | Gate a launch on HIGH/CRIT issues before it starts. |
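The scan, clean, and stats steps shown above can be chained from Python if that is more convenient than a shell script. The sketch below only uses the flags that appear in the commands above; the helper names are hypothetical, and validate and preflight are left out because their arguments are not documented here. It defaults to a dry run that prints the commands.

```python
import subprocess
import sys
from typing import List


def build_pipeline(root: str, clean_dir: str, report: str, schema: str) -> List[List[str]]:
    """Return the argv lists for scan -> clean -> stats, in execution order."""
    base = [sys.executable, "-m", "data_process"]
    return [
        base + ["scan", "--root", root, "--out", report],
        base + ["clean", "--src", root, "--dst", clean_dir, "--report", report],
        base + ["stats", "--dataset", clean_dir, "--schema", schema],
    ]


def run_pipeline(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it unless dry_run is set."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


commands = build_pipeline(
    root="/path/to/dataset",
    clean_dir="/path/to/clean",
    report="/tmp/report.json",
    schema="robointer_droid",
)
run_pipeline(commands)  # dry run: prints the three commands without executing them
```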
Each dataset is described by an auto-registered DatasetSchema (src/schema/). Add a module under schemas/ or pass --dataset_schema /abs/path.py.
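The real DatasetSchema interface is not documented in this README, so the following is only a toy illustration of the auto-registration pattern, not the repository's API. ToySchema, register_schema, get_schema, and every field name are hypothetical; the feature keys follow common LeRobot naming.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ToySchema:
    """Hypothetical stand-in for a dataset description."""
    name: str
    state_key: str
    action_key: str
    camera_keys: Tuple[str, ...]


_REGISTRY: Dict[str, ToySchema] = {}


def register_schema(schema: ToySchema) -> ToySchema:
    """Add a schema to the registry, rejecting duplicate names."""
    if schema.name in _REGISTRY:
        raise ValueError(f"schema {schema.name!r} is already registered")
    _REGISTRY[schema.name] = schema
    return schema


def get_schema(name: str) -> ToySchema:
    """Look up a schema by name with a helpful error for unknown names."""
    try:
        return _REGISTRY[name]
    except KeyError:
        raise KeyError(f"unknown schema {name!r}; known: {sorted(_REGISTRY)}") from None


# Registration happens at import time, so importing the module is enough.
register_schema(
    ToySchema(
        name="my_robot",
        state_key="observation.state",
        action_key="action",
        camera_keys=("observation.images.front",),
    )
)

print(get_schema("my_robot"))
```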
Train the Qwen3-VL backbone on FAST action-token cross-entropy, with robot state discretized into the prompt.
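To make "robot state discretized into the prompt" concrete, here is one way such a discretization could work. The bin count of 256, the per-dimension bounds, and the "State:" prompt format are assumptions for illustration; the repository's actual tokenization may differ.

```python
from typing import List, Sequence


def discretize_state(
    state: Sequence[float],
    low: Sequence[float],
    high: Sequence[float],
    num_bins: int = 256,
) -> List[int]:
    """Map each state dimension to an integer bin in [0, num_bins - 1]."""
    bins = []
    for x, lo, hi in zip(state, low, high):
        if hi <= lo:
            raise ValueError("each upper bound must exceed its lower bound")
        # Clamp to the bounds, then rescale to [0, 1].
        frac = (min(max(x, lo), hi) - lo) / (hi - lo)
        # The top edge would land in bin num_bins, so cap it at the last bin.
        bins.append(min(int(frac * num_bins), num_bins - 1))
    return bins


def state_to_prompt(bins: Sequence[int]) -> str:
    """Render the bin indices as plain text that can be appended to a prompt."""
    return "State: " + " ".join(str(b) for b in bins)


bins = discretize_state([0.0, -1.0, 1.0], low=[-1.0] * 3, high=[1.0] * 3)
print(bins)                   # [128, 0, 255]
print(state_to_prompt(bins))  # State: 128 0 255
```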
bash launch/vlm_pretrain/train_vlm_pretrain.sh

Train the DiT action expert to generate 50-step continuous action chunks, with stop-gradient between VLM and DiT.

bash launch/ki_posttrain/train_ki_posttrain.sh

Note: LabVLA is not limited to the datasets listed above. You can extend it to any new dataset by adding a DatasetSchema under schemas/; the full training pipeline (VLM pre-training, post-training, and fine-tuning) will work out of the box.
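For readers unfamiliar with the flow-matching objective used in post-training, the sketch below shows its general shape on a 50-step action chunk. It assumes a linear interpolation path with t = 0 at noise and t = 1 at data, and it uses an oracle velocity function in place of the DiT action expert (which in LabVLA is also conditioned on VLM features). All names and the 7-D action size are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Chunk = List[List[float]]  # shape: [horizon][action_dim]


def make_training_pair(actions: Chunk, t: float, rng: random.Random) -> Tuple[Chunk, Chunk, Chunk]:
    """Return (x_t, target_velocity, noise) for the path x_t = (1 - t) * noise + t * actions."""
    noise = [[rng.gauss(0.0, 1.0) for _ in step] for step in actions]
    x_t = [[(1.0 - t) * n + t * a for n, a in zip(n_row, a_row)]
           for n_row, a_row in zip(noise, actions)]
    # The regression target is the constant velocity of the straight path.
    target_v = [[a - n for n, a in zip(n_row, a_row)]
                for n_row, a_row in zip(noise, actions)]
    return x_t, target_v, noise


def euler_sample(velocity_fn: Callable[[Chunk, float], Chunk], noise: Chunk, num_steps: int = 10) -> Chunk:
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = [row[:] for row in noise]
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity_fn(x, i * dt)
        x = [[xi + dt * vi for xi, vi in zip(x_row, v_row)] for x_row, v_row in zip(x, v)]
    return x


rng = random.Random(0)
HORIZON, ACTION_DIM = 50, 7  # 50 steps as in the text; 7 dimensions is a made-up example
actions = [[rng.uniform(-1.0, 1.0) for _ in range(ACTION_DIM)] for _ in range(HORIZON)]


def oracle_velocity(x: Chunk, t: float) -> Chunk:
    """Stand-in for a trained model: the exact velocity toward the known target chunk."""
    return [[(a - xi) / (1.0 - t) for a, xi in zip(a_row, x_row)]
            for a_row, x_row in zip(actions, x)]


x_t, target_v, noise = make_training_pair(actions, t=0.3, rng=rng)
sample = euler_sample(oracle_velocity, noise)
max_err = max(abs(s - a) for s_row, a_row in zip(sample, actions) for s, a in zip(s_row, a_row))
print(f"max reconstruction error: {max_err:.2e}")
```

During training, a network would be regressed onto target_v given x_t and t; at inference, the same Euler loop turns Gaussian noise into an action chunk in a handful of steps.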
Fine-tuning for downstream tasks. Loads a post-trained checkpoint and does not require the data preparation step above.
bash launch/finetune/train_labutopia.sh

Download LabVLA from Hugging Face, then deploy via the script:

bash deployment/deploy.sh

@article{ren2026labvla,
title = {LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories},
author = {Ren, Baochang and Liu, Xinjie and Chen, Xi and Liu, Yanshuo and
Li, Chenxi and Gao, Daqi and Su, Zeqin and Xing, Jintao and
Xue, Zirui and Li, Rui and Zhao, Xiangyu and Qiao, Shuofei and
Pan, Minting and Zuo, Wangmeng and Bai, Lei and Zhou, Dongzhan and
Zhang, Ningyu and Chen, Huajun},
journal = {arXiv preprint arXiv:2606.13578},
year = {2026}
}

Our codebase references LeRobot and Liger-Kernel. We sincerely thank their teams for the outstanding contributions to the open-source community.


