This directory contains the launch scripts for CapRL++ training with the bundled verl source tree.
- start_reward_serve_rm.sh: starts the reward service and exposes http://<host>:<port>/get_reward.
- train_caprl.sh: launches single-node CapRL++ training with verl.
Both scripts infer VERL_ROOT as CapRL++/train/verl, so run them from
CapRL++/train unless you explicitly set VERL_ROOT.
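As a rough illustration of why the launch directory matters, the default could be resolved as below. This is a minimal sketch, not the scripts' actual code: the function name resolve_verl_root is hypothetical, and it assumes the default is taken relative to the current working directory.

```shell
# Hypothetical sketch: keep VERL_ROOT if the caller set it, otherwise
# fall back to ./verl under the current directory (assumed behavior).
resolve_verl_root() {
    printf '%s\n' "${VERL_ROOT:-$PWD/verl}"
}

# Run from CapRL++/train, this prints <...>/CapRL++/train/verl.
# With VERL_ROOT=/opt/verl set beforehand, it prints /opt/verl.
resolve_verl_root
```

Under this assumption, launching from any other directory without setting VERL_ROOT would point the scripts at a verl path that does not exist.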
Prepare a Python environment with the dependencies required by verl, vLLM, Ray,
PyTorch, and the model family you use. If you want the scripts to activate a
conda environment automatically, set CONDA_ENV before running them.
One possible setup is:
cd CapRL++/train
conda create -n caprl python=3.10 -y
conda activate caprl
pip install -r scripts/requirements.txt
pip install -e ./verl
You also need:
- a caption model, for example a local path or Hugging Face id for Qwen3-VL-4B-Instruct;
- a reward model, for example a local path or Hugging Face id for Qwen3-4B-Instruct when using REWARD_SCORE_MODE=qa;
- a training JSONL file compatible with the CapRL++ data loader (our training data CapRL-Video-QA-20K has been released);
- writable output directories for checkpoints and optional W&B logs.
Start the reward service first. It can run on the same machine as training or on a separate reward node.
cd CapRL++/train
REWARD_MODEL=/path/to/Qwen3-4B-Instruct \
CUDA_VISIBLE_DEVICES=0 \
REWARD_PORT=18889 \
REWARD_NUM_WORKERS=1 \
bash scripts/start_reward_serve_rm.sh
Important reward variables:
- REWARD_MODEL: required. Reward model path or Hugging Face model id.
- REWARD_PORT: master service port. Defaults to 18889.
- REWARD_WORKER_BASE: first worker port. Defaults to REWARD_PORT + 10.
- CUDA_VISIBLE_DEVICES: GPUs used by the reward service. Defaults to 0.
- REWARD_NUM_WORKERS: number of reward workers. Defaults to 1 in the wrapper script.
- REWARD_SCORE_MODE: defaults to qa. Use vl_judge for direct VLM judge scoring.
- REWARD_TASK: defaults to video; set to image for image caption training.
- REWARD_QA_NUM: sampling rounds in qa mode. Defaults to 8.
- FORMAT_REWARD_WEIGHT: video timestamp format reward weight. Defaults to 0.2. For video captions, the unweighted format reward is 0.5 * N_valid / max(N_all, 1) + 0.5 * I_chrono, where N_all is the number of timestamp-like brackets matched by the regex, N_valid is the number that satisfy logical constraints such as valid seconds and t_end >= t_start, and I_chrono is 1 only when valid timestamp start times are monotonically non-decreasing.
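To make the format reward formula concrete, here is a small worked example. It is only an arithmetic sketch of the expression quoted above: the helper name format_reward is hypothetical, and the counts are passed in by hand rather than extracted from a caption with the service's regex.

```shell
# Unweighted format reward: 0.5 * N_valid / max(N_all, 1) + 0.5 * I_chrono
# Arguments: N_valid N_all I_chrono (I_chrono is 0 or 1).
format_reward() {
    awk -v nv="$1" -v na="$2" -v ic="$3" 'BEGIN {
        denom = (na > 1) ? na : 1   # max(N_all, 1) guards against division by zero
        printf "%.4f\n", 0.5 * nv / denom + 0.5 * ic
    }'
}

format_reward 3 4 1   # 3 of 4 timestamps valid, in order     -> prints 0.8750
format_reward 2 4 0   # 2 of 4 timestamps valid, out of order -> prints 0.2500
```

The two halves of the score are independent: one rewards the fraction of well-formed timestamps, the other rewards chronological ordering.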
For REWARD_SCORE_MODE=qa, the reward model can be a text LLM. For
REWARD_SCORE_MODE=vl_judge, use a multimodal VLM.
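Before launching training, it can help to confirm that the reward service is reachable. The helper below is a hypothetical sketch, not part of the bundled scripts. It assumes curl is installed and only checks that something answers HTTP at the URL; it does not validate the response of /get_reward, whose request format is not described here.

```shell
# Hypothetical helper: poll until something answers HTTP at the reward URL.
# Arguments: URL, number of attempts, seconds to sleep between attempts.
wait_for_reward() {
    url=$1; tries=$2; pause=$3
    i=0
    while [ "$i" -lt "$tries" ]; do
        # Without -f, curl exits 0 for any HTTP response (even 4xx/5xx),
        # so this only checks that the server accepts connections.
        if curl -s -o /dev/null --max-time 5 "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$pause"
    done
    return 1
}

# Example: wait up to about a minute for a local service on the default port.
# wait_for_reward "http://127.0.0.1:18889/get_reward" 30 2 && echo "reward service reachable"
```

Model loading can take a while, so a polling loop like this is usually more reliable than a fixed sleep.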
Launch training after the reward service is ready.
cd CapRL++/train
CAPTION_MODEL=/path/to/Qwen3-VL-4B-Instruct \
DATASET=/path/to/video_train.jsonl \
SAVE_DIR=/path/to/output/checkpoints \
REWARD_NODE_IP=127.0.0.1 \
REWARD_PORT=18889 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
bash scripts/train_caprl.sh
If the reward service runs on another node, set REWARD_NODE_IP to that node's
reachable IP address. You can also bypass REWARD_NODE_IP and REWARD_PORT by
setting the full URL directly:
REWARD_REMOTE_URL=http://reward-node.example.com:18889/get_reward \
bash scripts/train_caprl.sh
Required training variables:
- CAPTION_MODEL: initial caption model path or Hugging Face model id.
- DATASET: training JSONL file.
- SAVE_DIR: checkpoint output directory.
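A pre-flight check along the following lines can catch a missing variable before any GPU work starts, and shows how the reward URL relates to REWARD_NODE_IP and REWARD_PORT. This is a hypothetical sketch, not code from train_caprl.sh: the function name check_training_env and the 127.0.0.1 and 18889 fallbacks are assumptions made for illustration.

```shell
# Hypothetical pre-flight check; not part of train_caprl.sh.
check_training_env() {
    for name in CAPTION_MODEL DATASET SAVE_DIR; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing required variable: $name" >&2
            return 1
        fi
    done
    # An explicit REWARD_REMOTE_URL wins; otherwise build it from host and port.
    # The 127.0.0.1 and 18889 fallbacks are assumptions for this sketch.
    echo "${REWARD_REMOTE_URL:-http://${REWARD_NODE_IP:-127.0.0.1}:${REWARD_PORT:-18889}/get_reward}"
}
```

On success the function prints the URL that training would contact, which can be compared against the address the reward service was started on.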
Common training variables:
- BATCH_SIZE: generation and PPO mini-batch size. Defaults to 128.
- ROLLOUT_N: number of responses sampled per prompt. Defaults to 8.
- TOTAL_EPOCHS: total training epochs. Defaults to 3.
- SAVE_FREQ: checkpoint interval in training steps. Defaults to 50.
- ACTOR_LR: actor learning rate. Defaults to 1e-5.
- MAX_PROMPT_LENGTH: data prompt length. Defaults to 4096.
- MAX_RESPONSE_LENGTH: data response length. Defaults to 4096.
- ROLLOUT_PROMPT_LENGTH: vLLM rollout prompt length. Defaults to 13000.
- ROLLOUT_RESPONSE_LENGTH: vLLM rollout response length. Defaults to 4096.
- ROLLOUT_MAX_MODEL_LEN: vLLM max model length. Defaults to 18000.
- ROLLOUT_GPU_MEMORY_UTILIZATION: vLLM GPU memory fraction. Defaults to 0.88.
- ROLLOUT_AGENT_NUM_WORKERS: async rollout workers. Defaults to 8.
- SAVE_HF_MODEL: set to False to skip saving Hugging Face model weights.
- WANDB_MODE: defaults to offline.
- WANDB_PROJECT: defaults to CapRL_video.
- WANDB_DIR: defaults to CapRL++/train/logs/wandb.
- RUN_NAME: defaults to qwen3_vl_4b_video.
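When changing the rollout lengths, it is worth checking that the prompt and response budgets still fit inside the vLLM context window, since vLLM generally bounds prompt plus generated tokens by its max model length. The check below is a hypothetical sketch using the defaults listed above; whether train_caprl.sh performs a similar check itself is not stated here.

```shell
# Hypothetical check: prompt + response budget should fit in the vLLM context.
check_rollout_lengths() {
    prompt=${ROLLOUT_PROMPT_LENGTH:-13000}
    response=${ROLLOUT_RESPONSE_LENGTH:-4096}
    max_len=${ROLLOUT_MAX_MODEL_LEN:-18000}
    total=$((prompt + response))
    if [ "$total" -gt "$max_len" ]; then
        echo "rollout lengths ($total) exceed ROLLOUT_MAX_MODEL_LEN ($max_len)" >&2
        return 1
    fi
    echo "ok: $total <= $max_len"
}

check_rollout_lengths   # with the defaults: 13000 + 4096 = 17096 <= 18000
```

With the defaults there is a margin of 904 tokens, so raising either rollout length by more than that would also require raising ROLLOUT_MAX_MODEL_LEN.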
Length reward variables:
- REWARD_LENGTH_TOKENIZER_PATH: defaults to CAPTION_MODEL.
- REWARD_LENGTH_L1: defaults to 2048.
- REWARD_LENGTH_L2: defaults to 3072.
- REWARD_LENGTH_WEIGHT: defaults to 0.2.
The same reward service can be used for image caption training. Start it with
REWARD_TASK=image:
REWARD_TASK=image \
REWARD_MODEL=/path/to/Qwen3-4B-Instruct \
bash scripts/start_reward_serve_rm.sh
For training, train_caprl.sh currently documents the two image overrides in
comments:
data.input_type=image
data.prompt_key=prompt