Description
Describe the bug
Running GRPO with Qwen3-235B-A22B hits OOM. Any suggestions would be appreciated.
Your hardware and system info
GPUs: 128x H20
CUDA version: 12.9
torch version: 2.8.0+cu129
Training arguments
```shell
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"

torchrun $DISTRIBUTED_ARGS ${SCRIPT_DIR}/swift/cli/rlhf.py \
    --rlhf_type grpo \
    --model ${MODEL_NAME_OR_PATH} \
    --external_plugins ${RUN_DIR}/custom_reward.py \
    --reward_funcs xxx \
    --use_vllm true \
    --vllm_mode colocate \
    --vllm_gpu_memory_utilization 0.7 \
    --vllm_max_model_len 10240 \
    --train_type lora \
    --lora_rank 128 \
    --torch_dtype bfloat16 \
    --dataset ${DATASET_PATH} \
    --max_completion_length 2048 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 1 \
    --eval_steps 200 \
    --save_steps 200 \
    --report_to wandb \
    --sleep_level 1 \
    --offload_optimizer true \
    --offload_model true \
    --vllm_tensor_parallel_size 8 \
    --vllm_enable_expert_parallel true \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 8192 \
    --async_generate false \
    --output_dir ${OUTPUT_PATH} \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 8 \
    --temperature 0.9 \
    --deepspeed zero3_offload \
    --move_model_batches 10 \
    --beta 0.0 \
    --log_completions false \
    --importance_sampling_level sequence
```
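For reference, a rough back-of-envelope sketch of the per-GPU memory budget implied by these flags. This assumes 96 GB H20 cards (the memory size is an assumption, not stated above) and only counts the two largest fixed costs; it suggests the vLLM colocate reservation at 0.7, not the ZeRO-3-sharded base weights, dominates each GPU:

```python
# Rough per-GPU memory sketch for the colocate GRPO run above.
# Assumptions (not from the issue): H20 with 96 GB HBM; 1 GB = 1e9 bytes.

PARAMS = 235e9        # Qwen3-235B-A22B total parameters
BYTES_BF16 = 2        # --torch_dtype bfloat16
N_GPUS = 128          # 128x H20
GPU_MEM_GB = 96       # assumed H20 memory size
VLLM_UTIL = 0.7       # --vllm_gpu_memory_utilization

# DeepSpeed ZeRO-3 shards the base weights across all ranks:
weights_per_gpu_gb = PARAMS * BYTES_BF16 / N_GPUS / 1e9

# vLLM in colocate mode pre-reserves this fraction of each GPU:
vllm_reserved_gb = VLLM_UTIL * GPU_MEM_GB

# What remains for activations, gradients, buffers, fragmentation, etc.:
headroom_gb = GPU_MEM_GB - vllm_reserved_gb - weights_per_gpu_gb

print(f"sharded weights/GPU:  {weights_per_gpu_gb:.1f} GB")  # ~3.7 GB
print(f"vLLM reservation/GPU: {vllm_reserved_gb:.1f} GB")    # 67.2 GB
print(f"remaining headroom:   {headroom_gb:.1f} GB")         # ~25.1 GB
```

Under these assumptions, lowering `--vllm_gpu_memory_utilization` (with `--sleep_level 1` already releasing vLLM memory during training steps) is the first knob to try; the sharded weight footprint itself is small.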
Additional context
Add any other context about the problem here.