GitHub - zjunlp/LabVLA: LabVLA: Grounding Vision–Language–Action Models in Scientific Laboratories

Grounding Vision–Language–Action Models in Scientific Laboratories

LabVLA turns a Qwen3-VL-4B-Instruct vision–language backbone into a real-time robot controller through a DiT flow-matching action expert, trained with the π0.5 recipe: FAST action-token pre-training → flow-matching post-training with knowledge insulation → task fine-tuning. This README covers installation and usage — method details are in the paper.

✨ Features • 📋 TODO • 📦 Installation • 🚀 Quick Start • 🎓 Training • 🔧 Fine-tuning • 📡 Deployment • 📝 Citation

✨ Features

🎓 Training — every stage of the recipe, one codebase

Mode	What it does
VLM pre-training	FAST action-token cross-entropy on the VLM backbone.
Flow-matching post-training	Trains the DiT action expert to generate 50-step continuous action chunks.
Knowledge Insulation (KI)	Stop-gradient between the VLM and the action expert, so flow-matching gradients do not update the backbone.
Task fine-tuning	Adapts a post-trained checkpoint to a downstream task.
Multi-dataset & VQA co-training	π0-style mixture with homogeneous batches.
delta / abs action modes	Per-dimension `delta_mask` — arm joints delta, gripper absolute, in one vector.
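
The per-dimension `delta_mask` row can be illustrated with a minimal sketch. This is not the repository's implementation: it assumes a delta dimension is encoded as target minus current state and an absolute dimension is passed through unchanged. The function names and the 6-joint-plus-gripper layout are hypothetical.

```python
from typing import Sequence


def encode_actions(
    target: Sequence[float],
    state: Sequence[float],
    delta_mask: Sequence[bool],
) -> list[float]:
    """Encode one action vector: delta where the mask is True, absolute elsewhere."""
    if not (len(target) == len(state) == len(delta_mask)):
        raise ValueError("target, state and delta_mask must have the same length")
    return [t - s if is_delta else t for t, s, is_delta in zip(target, state, delta_mask)]


def decode_actions(
    encoded: Sequence[float],
    state: Sequence[float],
    delta_mask: Sequence[bool],
) -> list[float]:
    """Invert encode_actions by adding the state back on delta dimensions."""
    return [a + s if is_delta else a for a, s, is_delta in zip(encoded, state, delta_mask)]


# Hypothetical layout: 6 arm joints (delta) followed by 1 gripper (absolute).
delta_mask = [True] * 6 + [False]
state = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.0]
target = [0.15, 0.20, 0.25, 0.40, 0.55, 0.60, 1.0]

encoded = encode_actions(target, state, delta_mask)
decoded = decode_actions(encoded, state, delta_mask)
```

Keeping both modes in one vector means the decoder needs the same mask and the same reference state that were used for encoding; the round trip above recovers `target` up to floating-point error.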

⚙️ Engineering

🚀 Efficiency: selective gradient checkpointing (only a subset of modules, e.g. the visual encoder or the language model, is checkpointed per stage), Liger-Kernel fused ops, DeepSpeed ZeRO-2, and EMA offload together keep the per-GPU batch size at 48 to 64 on 80 GB A100s with minimal speed penalty.

Stage	A100 80 GB GPUs (nodes × GPUs per node)	Batch size / GPU	Global batch size	Time per step (s)
VLM Pre-training	24 (3 × 8)	64	1 536	≈ 7
KI Post-training	16 (2 × 8)	64	1 024	≈ 5
Task Fine-tuning	4	48	192	≈ 3
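
The global batch sizes in the table follow from GPU count times per-GPU batch size. The short check below assumes no gradient accumulation (one accumulation step), which is consistent with the numbers shown; the helper name is illustrative.

```python
def global_batch_size(num_gpus: int, per_gpu_batch: int, grad_accum_steps: int = 1) -> int:
    """Samples consumed per optimizer step across all GPUs."""
    return num_gpus * per_gpu_batch * grad_accum_steps


# (number of GPUs, per-GPU batch size) taken from the table above.
stages = {
    "VLM Pre-training": (3 * 8, 64),
    "KI Post-training": (2 * 8, 64),
    "Task Fine-tuning": (4, 48),
}

global_sizes = {name: global_batch_size(gpus, bs) for name, (gpus, bs) in stages.items()}
for name, size in global_sizes.items():
    print(f"{name}: {size}")
```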

📋 TODO

We are actively organizing the training code and will continue to optimize and maintain it!

📦 Installation

Python 3.10 · CUDA 12.6 · PyTorch 2.7.1 — pinned versions in requirements.txt.

conda create -n labvla python=3.10 -y && conda activate labvla

# 1. PyTorch (CUDA 12.6)  →  2. FlashAttention (built against it)  →  3. everything else
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126
pip install flash_attn==2.8.3 --no-build-isolation
pip install -r requirements.txt

🚀 Quick Start

Prerequisite: You need task-specific fine-tuning data — either collected on your own robot or from a simulator (e.g. LabUtopia). Organize it into LeRobot v2 format and register a DatasetSchema under schemas/.

1. Fine-tune — adapt a post-trained checkpoint to your task:

# Edit launch/finetune/train_labutopia.sh:
#   - Set PretrainedCkpt to your post-trained checkpoint path
#   - Set DataRoot / RepoIds / DatasetSchema to point at your data
#   - Set ExternalStatsPath to your dataset's normalization stats
bash launch/finetune/train_labutopia.sh

2. Deploy — start a WebSocket inference server:

PRETRAINED_PATH=/path/to/your/checkpoint bash deployment/deploy.sh

3. Evaluate — connect your robot or simulator client to the server and run rollouts. See Deployment for configuration details.

🎓 Training

One entrypoint (scripts/train.py) for all stages, launched through Accelerate + DeepSpeed ZeRO-2 (bf16). Edit the env vars inside each script before running.

1 · Prepare data

python -m data_process scan  --root /path/to/dataset --out /tmp/report.json     # detect bad episodes
python -m data_process clean --src  /path/to/dataset --dst /path/to/clean \
                             --report /tmp/report.json                          # apply report (symlink copy)
python -m data_process stats --dataset /path/to/clean --schema robointer_droid  # normalization stats

Subcommand	Purpose
`scan`	Detect bad episodes (corrupt video, decode failures, missing files).
`clean`	Apply a scan report as a renumbered symlink copy; originals untouched.
`stats`	Compute normalization statistics → `meta/stats.json`.
`validate`	Cross-repo integrity checks.
`preflight`	Gate a launch on HIGH/CRIT issues before it starts.
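
For intuition about what the `stats` subcommand produces, here is a minimal sketch of per-dimension normalization statistics. It is not the `data_process` implementation: the key names (`mean`, `std`, `min`, `max`), the top-level `action` key, and the z-score normalization are assumptions, and the real `meta/stats.json` may contain other fields.

```python
import json
import math


def compute_stats(rows: list[list[float]]) -> dict[str, list[float]]:
    """Per-dimension mean, population std, min and max over a list of vectors."""
    if not rows:
        raise ValueError("need at least one row")
    n = len(rows)
    columns = list(zip(*rows))  # transpose: one tuple per dimension
    mean = [sum(col) / n for col in columns]
    std = [
        math.sqrt(sum((x - m) ** 2 for x in col) / n)
        for col, m in zip(columns, mean)
    ]
    return {
        "mean": mean,
        "std": std,
        "min": [min(col) for col in columns],
        "max": [max(col) for col in columns],
    }


def normalize(vec: list[float], stats: dict[str, list[float]], eps: float = 1e-8) -> list[float]:
    """Z-score a vector with precomputed statistics."""
    return [(x - m) / (s + eps) for x, m, s in zip(vec, stats["mean"], stats["std"])]


# Toy data: three 2-dimensional action vectors.
actions = [[0.0, 1.0], [2.0, 1.0], [4.0, 4.0]]
stats = compute_stats(actions)
stats_json = json.dumps({"action": stats}, indent=2)  # hypothetical file layout
```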

Each dataset is described by an auto-registered DatasetSchema (src/schema/). Add a module under schemas/ or pass --dataset_schema /abs/path.py.

2 · VLM Pre-training

Train the Qwen3-VL backbone on FAST action-token cross-entropy, with robot state discretized into the prompt.

bash launch/vlm_pretrain/train_vlm_pretrain.sh
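
"Robot state discretized into the prompt" can be pictured with the sketch below. It assumes the state is already normalized to [-1, 1] and split into 256 uniform bins; the bin count, the clipping range, and the prompt template are all assumptions for illustration, not the format used by the training script.

```python
from typing import Sequence


def discretize_state(
    state: Sequence[float],
    num_bins: int = 256,
    low: float = -1.0,
    high: float = 1.0,
) -> list[int]:
    """Map each normalized state value to an integer bin index in [0, num_bins - 1]."""
    bins = []
    for value in state:
        clipped = min(max(value, low), high)       # clamp out-of-range values
        frac = (clipped - low) / (high - low)      # position in [0, 1]
        bins.append(min(int(frac * num_bins), num_bins - 1))
    return bins


def build_prompt(instruction: str, state: Sequence[float]) -> str:
    """Render the instruction and the discretized state as plain text (hypothetical template)."""
    state_str = " ".join(str(b) for b in discretize_state(state))
    return f"Task: {instruction}, State: {state_str};\nAction: "


prompt = build_prompt("pick up the beaker", [-1.0, 0.0, 1.0])
```

The model would then be trained with cross-entropy to continue such a prompt with FAST action tokens.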

3 · Flow-Matching Post-training (Knowledge Insulation)

Train the DiT action expert to generate 50-step continuous action chunks, with stop-gradient between VLM and DiT.

bash launch/ki_posttrain/train_ki_posttrain.sh
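
As a rough picture of the flow-matching objective and sampler, the sketch below uses a straight-line path from noise (t = 0) to the action chunk (t = 1) and fixed-step Euler integration. The time convention, step count, and function names are assumptions, and an oracle velocity field stands in for the DiT action expert. The stop-gradient between VLM and DiT requires an autograd framework and is not shown.

```python
import random
from typing import Callable, Sequence

Vector = list[float]


def interpolate(noise: Sequence[float], action: Sequence[float], t: float) -> Vector:
    """Point on the straight path from noise (t=0) to action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise: Sequence[float], action: Sequence[float]) -> Vector:
    """Regression target for the velocity field along that path."""
    return [a - n for n, a in zip(noise, action)]


def euler_sample(
    velocity_fn: Callable[[Vector, float], Vector],
    noise: Sequence[float],
    num_steps: int = 10,
) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        v = velocity_fn(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check: with the exact (oracle) velocity, sampling recovers the action.
rng = random.Random(0)
action_chunk = [0.5, -0.25, 1.0]  # flattened stand-in for a 50-step chunk
noise = [rng.gauss(0.0, 1.0) for _ in action_chunk]


def oracle(x: Vector, t: float) -> Vector:
    return target_velocity(noise, action_chunk)


sample = euler_sample(oracle, noise, num_steps=10)
```

During training, a network would be regressed onto `target_velocity` at randomly drawn `t` using `interpolate` as input; at inference, the learned network replaces `oracle`.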

Note: LabVLA is not limited to the datasets listed above. You can extend it to any new dataset by adding a DatasetSchema under schemas/ — the full training pipeline (VLM pre-training, post-training, and fine-tuning) will work out of the box.

🔧 Fine-tuning

Fine-tuning for downstream tasks. Loads a post-trained checkpoint and does not require the data preparation step above.

bash launch/finetune/train_labutopia.sh

📡 Deployment

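Download the LabVLA checkpoint from Hugging Face, then start the WebSocket inference server with the deployment script (set PRETRAINED_PATH to the checkpoint path, as in Quick Start):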

bash deployment/deploy.sh

📝 Citation

@article{ren2026labvla,
  title   = {LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories},
  author  = {Ren, Baochang and Liu, Xinjie and Chen, Xi and Liu, Yanshuo and
             Li, Chenxi and Gao, Daqi and Su, Zeqin and Xing, Jintao and
             Xue, Zirui and Li, Rui and Zhao, Xiangyu and Qiao, Shuofei and
             Pan, Minting and Zuo, Wangmeng and Bai, Lei and Zhou, Dongzhan and
             Zhang, Ningyu and Chen, Huajun},
  journal = {arXiv preprint arXiv:2606.13578},
  year    = {2026}
}

🙏 Acknowledgments

Our codebase references LeRobot and Liger-Kernel. We sincerely thank their teams for the outstanding contributions to the open-source community.
