Qwen3-235B-A22B GRPO OOM #7125

@qingyuanxingsi

Description

Describe the bug
Running GRPO with Qwen3-235B-A22B hits OOM. Any suggestions on what to adjust?

Your hardware and system info
Write your system info like CUDA version/system/GPU/torch version here:
128× H20 GPUs
CUDA version: 12.9
torch version: 2.8.0+cu129
Training arguments:
```shell
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"

torchrun $DISTRIBUTED_ARGS ${SCRIPT_DIR}/swift/cli/rlhf.py \
    --rlhf_type grpo \
    --model ${MODEL_NAME_OR_PATH} \
    --external_plugins ${RUN_DIR}/custom_reward.py \
    --reward_funcs xxx \
    --use_vllm true \
    --vllm_mode colocate \
    --vllm_gpu_memory_utilization 0.7 \
    --vllm_max_model_len 10240 \
    --train_type lora \
    --lora_rank 128 \
    --torch_dtype bfloat16 \
    --dataset ${DATASET_PATH} \
    --max_completion_length 2048 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 1 \
    --eval_steps 200 \
    --save_steps 200 \
    --report_to wandb \
    --sleep_level 1 \
    --offload_optimizer true \
    --offload_model true \
    --vllm_tensor_parallel_size 8 \
    --vllm_enable_expert_parallel true \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 8192 \
    --async_generate false \
    --output_dir ${OUTPUT_PATH} \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 8 \
    --temperature 0.9 \
    --deepspeed zero3_offload \
    --move_model_batches 10 \
    --beta 0.0 \
    --log_completions false \
    --importance_sampling_level sequence
```
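A rough back-of-envelope estimate suggests why this colocate setup is memory-tight. The numbers below are assumptions, not facts from the issue: 235B total parameters served in bf16, and 96 GB of HBM per H20 (the H20 also ships in other capacities). The TP size (8) and GPU-memory utilization (0.7) are taken from the command above.

```python
# Hedged back-of-envelope estimate of per-GPU memory pressure for the
# colocated vLLM engine. Assumptions: 235e9 total params in bf16
# (2 bytes each) and 96 GB HBM per H20; TP=8 and utilization=0.7 come
# from --vllm_tensor_parallel_size and --vllm_gpu_memory_utilization.
TOTAL_PARAMS = 235e9          # Qwen3-235B-A22B total parameter count
BYTES_PER_PARAM = 2           # bf16
HBM_GB = 96                   # assumed H20 HBM capacity
TP = 8                        # --vllm_tensor_parallel_size
UTIL = 0.7                    # --vllm_gpu_memory_utilization

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9      # 470 GB full copy
per_gpu_weights_gb = weights_gb / TP                   # weight shard per GPU
vllm_budget_gb = UTIL * HBM_GB                         # memory vLLM may claim
kv_headroom_gb = vllm_budget_gb - per_gpu_weights_gb   # left for KV cache

print(f"per-GPU weight shard: {per_gpu_weights_gb:.1f} GB")
print(f"vLLM budget:          {vllm_budget_gb:.1f} GB")
print(f"KV-cache headroom:    {kv_headroom_gb:.1f} GB")
```

Under these assumptions, roughly 58.8 GB of each GPU's 67.2 GB vLLM budget goes to the weight shard, leaving under 10 GB for the KV cache, before the training side's ZeRO-3 shards, LoRA states, and activations compete for the rest. The natural knobs in the command above are lowering `--vllm_gpu_memory_utilization` or `--vllm_max_model_len`, or raising `--vllm_tensor_parallel_size`.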

Additional context
Add any other context about the problem here.
