zjunlp/LabVLA

LabVLA

Grounding Vision–Language–Action Models in Scientific Laboratories

LabVLA Framework

LabVLA turns a Qwen3-VL-4B-Instruct vision–language backbone into a real-time robot controller through a DiT flow-matching action expert, trained with the π0.5 recipe: FAST action-token pre-training → flow-matching post-training with knowledge insulation → task fine-tuning. This README covers installation and usage — method details are in the paper.


✨ Features

🎓 Training — every stage of the recipe, one codebase

| Mode | What it does |
| --- | --- |
| VLM pre-training | FAST action-token cross-entropy on the VLM backbone. |
| Flow-matching post-training | Trains the DiT action expert to generate 50-step continuous action chunks. |
| Knowledge Insulation (KI) | Stop-gradient between the VLM and the action expert. |
| Task fine-tuning | Fine-tuning for downstream tasks. |
| Multi-dataset & VQA co-training | π0-style mixture with homogeneous batches. |
| delta / abs action modes | Per-dimension delta_mask: arm joints delta, gripper absolute, in one vector. |
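
The per-dimension delta_mask in the last row can be pictured with the sketch below. It is a minimal illustration, not LabVLA's implementation: the function name `to_mixed_actions`, the 7-dimensional layout (six arm joints plus one gripper), and the choice to take deltas relative to the current state are all assumptions.

```python
from typing import List, Sequence


def to_mixed_actions(
    state: Sequence[float],
    abs_actions: Sequence[Sequence[float]],
    delta_mask: Sequence[bool],
) -> List[List[float]]:
    """Convert absolute targets into a mixed delta/absolute action chunk.

    Dimensions where delta_mask is True are expressed relative to `state`
    (assumed convention); the remaining dimensions pass through unchanged.
    """
    if len(state) != len(delta_mask):
        raise ValueError("state and delta_mask must have the same length")
    chunk = []
    for action in abs_actions:
        if len(action) != len(delta_mask):
            raise ValueError("every action must match delta_mask in length")
        chunk.append([
            a - s if is_delta else a
            for a, s, is_delta in zip(action, state, delta_mask)
        ])
    return chunk


# Hypothetical layout: six arm joints (delta) followed by one gripper (absolute).
DELTA_MASK = [True] * 6 + [False]
state = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0]
targets = [
    [0.75, 0.5, 0.5, 0.5, 0.5, 0.5, 0.0],
    [1.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.0],
]
mixed = to_mixed_actions(state, targets, DELTA_MASK)
print(mixed[0])  # joint 0 becomes a +0.25 offset; the gripper stays at its absolute target 0.0
```

Keeping the mask per dimension means a single action vector can mix both conventions, which is what the table row describes.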

⚙️ Engineering

  • 🚀 Efficiency: selective gradient checkpointing (only a subset of modules, e.g. the visual encoder or the language model, is checkpointed per stage), Liger-Kernel fused ops, DeepSpeed ZeRO-2, and EMA offload together sustain a per-GPU batch size of 48 to 64 on 80 GB A100s with minimal speed penalty.

| Stage | A100 80 GB GPUs | BS / GPU | Global BS | ≈ s / step |
| --- | --- | --- | --- | --- |
| VLM Pre-training | 24 (3 × 8) | 64 | 1 536 | ≈ 7 |
| KI Post-training | 16 (2 × 8) | 64 | 1 024 | ≈ 5 |
| Task Fine-tuning | 4 | 48 | 192 | ≈ 3 |
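
The global batch size in each row is the GPU count multiplied by the per-GPU batch size. The quick check below uses the numbers from the table and assumes no gradient accumulation, which is consistent with the reported values; `global_batch_size` is a hypothetical helper, not part of the codebase.

```python
# (stage, num_gpus, per_gpu_batch_size, reported_global_batch_size)
ROWS = [
    ("VLM Pre-training", 24, 64, 1536),
    ("KI Post-training", 16, 64, 1024),
    ("Task Fine-tuning", 4, 48, 192),
]


def global_batch_size(num_gpus: int, per_gpu_bs: int, grad_accum_steps: int = 1) -> int:
    """Effective number of samples contributing to one optimizer step."""
    return num_gpus * per_gpu_bs * grad_accum_steps


for stage, gpus, per_gpu, reported in ROWS:
    computed = global_batch_size(gpus, per_gpu)
    print(f"{stage}: {gpus} x {per_gpu} = {computed} (reported {reported})")
```

If you train on fewer GPUs, raising `grad_accum_steps` is the usual way to keep the same global batch size.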

📋 TODO

We are actively organizing the training code and will continue to optimize and maintain it!


📦 Installation

Python 3.10 · CUDA 12.6 · PyTorch 2.7.1 — pinned versions in requirements.txt.

conda create -n labvla python=3.10 -y && conda activate labvla

# 1. PyTorch (CUDA 12.6)  →  2. FlashAttention (built against it)  →  3. everything else
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126
pip install flash_attn==2.8.3 --no-build-isolation
pip install -r requirements.txt

🚀 Quick Start

Prerequisite: You need task-specific fine-tuning data — either collected on your own robot or from a simulator (e.g. LabUtopia). Organize it into LeRobot v2 format and register a DatasetSchema under schemas/.

1. Fine-tune — adapt a post-trained checkpoint to your task:

# Edit launch/finetune/train_labutopia.sh:
#   - Set PretrainedCkpt to your post-trained checkpoint path
#   - Set DataRoot / RepoIds / DatasetSchema to point at your data
#   - Set ExternalStatsPath to your dataset's normalization stats
bash launch/finetune/train_labutopia.sh

2. Deploy — start a WebSocket inference server:

PRETRAINED_PATH=/path/to/your/checkpoint bash deployment/deploy.sh

3. Evaluate — connect your robot or simulator client to the server and run rollouts. See Deployment for configuration details.


🎓 Training

One entrypoint (scripts/train.py) for all stages, launched through Accelerate + DeepSpeed ZeRO-2 (bf16). Edit the env vars inside each script before running.

1 · Prepare data

python -m data_process scan  --root /path/to/dataset --out /tmp/report.json     # detect bad episodes
python -m data_process clean --src  /path/to/dataset --dst /path/to/clean \
                             --report /tmp/report.json                          # apply report (symlink copy)
python -m data_process stats --dataset /path/to/clean --schema robointer_droid  # normalization stats

| Subcommand | Purpose |
| --- | --- |
| scan | Detect bad episodes (corrupt video, decode failures, missing files). |
| clean | Apply a scan report as a renumbered symlink copy; originals untouched. |
| stats | Compute normalization statistics → meta/stats.json. |
| validate | Cross-repo integrity checks. |
| preflight | Gate a launch on HIGH/CRIT issues before it starts. |
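
To make the stats step concrete, here is a rough sketch of computing per-dimension normalization statistics and writing them to meta/stats.json. The key names ("mean", "std", "min", "max"), the "action" feature name, and the helpers `compute_stats` and `normalize` are assumptions for illustration; the real layout is whatever `python -m data_process stats` emits.

```python
import json
import statistics
import tempfile
from pathlib import Path
from typing import Dict, List, Sequence


def compute_stats(rows: Sequence[Sequence[float]]) -> Dict[str, List[float]]:
    """Per-dimension mean / std / min / max over equal-length vectors."""
    if not rows:
        raise ValueError("need at least one row")
    columns = list(zip(*rows))  # transpose: one tuple per dimension
    return {
        "mean": [statistics.fmean(c) for c in columns],
        "std": [statistics.pstdev(c) for c in columns],
        "min": [min(c) for c in columns],
        "max": [max(c) for c in columns],
    }


def normalize(vec: Sequence[float], stats: Dict[str, List[float]],
              eps: float = 1e-8) -> List[float]:
    """Z-score a vector; eps guards against constant dimensions (std == 0)."""
    return [(v - m) / (s + eps)
            for v, m, s in zip(vec, stats["mean"], stats["std"])]


# Toy data: three 2-D actions; the second dimension is constant.
actions = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
stats = {"action": compute_stats(actions)}

with tempfile.TemporaryDirectory() as root:
    out = Path(root) / "meta" / "stats.json"
    out.parent.mkdir(parents=True)
    out.write_text(json.dumps(stats, indent=2))
    reloaded = json.loads(out.read_text())

print(reloaded["action"]["mean"])
```

Computing the statistics once and storing them next to the dataset keeps training and deployment on the same normalization.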

Each dataset is described by an auto-registered DatasetSchema (src/schema/). Add a module under schemas/ or pass --dataset_schema /abs/path.py.

2 · VLM Pre-training

Train the Qwen3-VL backbone on FAST action-token cross-entropy, with robot state discretized into the prompt.

bash launch/vlm_pretrain/train_vlm_pretrain.sh
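
"Discretized into the prompt" means each state dimension is mapped to an integer bin and written out as text. The sketch below is one way to picture it. The 256-bin count, the [-1, 1] normalized range, the prompt template, and both function names are assumptions, not LabVLA's actual tokenization.

```python
from typing import List, Sequence

NUM_BINS = 256  # assumed; the README does not state the bin count


def discretize_state(state: Sequence[float], low: float = -1.0,
                     high: float = 1.0, num_bins: int = NUM_BINS) -> List[int]:
    """Map each normalized state value to a bin index in [0, num_bins - 1]."""
    bins = []
    for x in state:
        clipped = min(max(x, low), high)       # out-of-range values saturate
        frac = (clipped - low) / (high - low)  # rescale to [0, 1]
        bins.append(min(int(frac * num_bins), num_bins - 1))
    return bins


def build_prompt(instruction: str, state: Sequence[float]) -> str:
    """Hypothetical prompt template: instruction, then state bins as plain text."""
    tokens = " ".join(str(b) for b in discretize_state(state))
    return f"Task: {instruction}; State: {tokens}; Action:"


prompt = build_prompt("pick up the beaker", [-1.0, 0.0, 0.5, 1.0, 2.0])
print(prompt)
```

Writing the state as text lets the backbone consume it through its ordinary tokenizer, with no extra input head.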

3 · Flow-Matching Post-training (Knowledge Insulation)

Train the DiT action expert to generate 50-step continuous action chunks, with stop-gradient between VLM and DiT.

bash launch/ki_posttrain/train_ki_posttrain.sh
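
A 50-step action chunk is a window of 50 consecutive future actions attached to each observation. The sketch below slices one episode into such training targets. Padding the tail by repeating the last action is an assumption (a common choice, not confirmed for LabVLA), and `make_chunks` is a hypothetical helper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")
CHUNK_SIZE = 50  # from the README: 50-step continuous action chunks


def make_chunks(actions: Sequence[T], chunk_size: int = CHUNK_SIZE) -> List[List[T]]:
    """Return one chunk per timestep t covering actions[t : t + chunk_size].

    Chunks that run past the end of the episode are padded by repeating
    the final action (assumed padding rule).
    """
    if not actions:
        return []
    chunks = []
    for t in range(len(actions)):
        window = list(actions[t:t + chunk_size])
        window.extend([actions[-1]] * (chunk_size - len(window)))
        chunks.append(window)
    return chunks


episode = list(range(120))  # stand-in for 120 action vectors
chunks = make_chunks(episode)
print(len(chunks), len(chunks[0]))
```

Each element of `chunks` is what the action expert would be asked to reproduce for the observation at the same timestep.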

Note: LabVLA is not tied to a fixed set of datasets. To train on a new one, add a DatasetSchema under schemas/; the full training pipeline (VLM pre-training, post-training, and fine-tuning) then works without further changes.


🔧 Fine-tuning

Fine-tuning adapts a post-trained checkpoint to a downstream task. It loads that checkpoint directly and does not require the data preparation step above.

bash launch/finetune/train_labutopia.sh

📡 Deployment

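Download the LabVLA checkpoint from Hugging Face, then start the WebSocket inference server with the deploy script, pointing PRETRAINED_PATH at the downloaded checkpoint as in Quick Start:

PRETRAINED_PATH=/path/to/your/checkpoint bash deployment/deploy.sh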

📝 Citation

@article{ren2026labvla,
  title   = {LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories},
  author  = {Ren, Baochang and Liu, Xinjie and Chen, Xi and Liu, Yanshuo and
             Li, Chenxi and Gao, Daqi and Su, Zeqin and Xing, Jintao and
             Xue, Zirui and Li, Rui and Zhao, Xiangyu and Qiao, Shuofei and
             Pan, Minting and Zuo, Wangmeng and Bai, Lei and Zhou, Dongzhan and
             Zhang, Ningyu and Chen, Huajun},
  journal = {arXiv preprint arXiv:2606.13578},
  year    = {2026}
}

🙏 Acknowledgments

Our codebase references LeRobot and Liger-Kernel. We sincerely thank their teams for their outstanding contributions to the open-source community.
