fix(megatron): honor --[no-]gradient-accumulation-fusion on the megatron.bridge provider by aoshen02 · Pull Request #1999 · THUDM/slime

aoshen02 · 2026-06-01T00:19:52Z

Summary

When the model is built via megatron.bridge (--megatron-to-hf-mode bridge), the GPT provider sets gradient_accumulation_fusion from can_enable_gradient_accumulation_fusion(), which is True whenever APEX fused_weight_gradient_mlp_cuda is importable. The CLI flag --no-gradient-accumulation-fusion is silently ignored.

slime/backends/megatron_utils/model_provider.py builds the provider via bridge.to_megatron_provider(load_weights=False) and overrides several fields from args (parallelism, sequence_parallel, …) but not gradient_accumulation_fusion, so the flag is a no-op on the bridge path. This change propagates the parsed value to the provider:

provider.gradient_accumulation_fusion = args.gradient_accumulation_fusion
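For illustration, the override can be sketched in isolation. This is a minimal stand-in, not the slime or megatron.bridge code: StubProvider, apex_fused_wgrad_available, build_parser, and configure_provider are hypothetical names, and modelling the --[no-]gradient-accumulation-fusion pair with argparse.BooleanOptionalAction is an assumption about how the flag is registered.

```python
import argparse
import importlib.util
from dataclasses import dataclass, field


def apex_fused_wgrad_available() -> bool:
    """Hypothetical stand-in for the availability probe: True when the
    APEX extension module can be found on the import path."""
    return importlib.util.find_spec("fused_weight_gradient_mlp_cuda") is not None


@dataclass
class StubProvider:
    # Stand-in for the bridge provider. Its default follows the probe,
    # which is why the CLI value is lost unless it is copied explicitly.
    gradient_accumulation_fusion: bool = field(
        default_factory=apex_fused_wgrad_available
    )


def build_parser() -> argparse.ArgumentParser:
    """Assumed flag registration: one option with a --no- counterpart."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--gradient-accumulation-fusion",
        action=argparse.BooleanOptionalAction,
        default=True,
    )
    return parser


def configure_provider(provider: StubProvider, args: argparse.Namespace) -> StubProvider:
    """Copy the parsed CLI value onto the provider (the one-line fix)."""
    provider.gradient_accumulation_fusion = args.gradient_accumulation_fusion
    return provider


if __name__ == "__main__":
    cli_args = build_parser().parse_args(["--no-gradient-accumulation-fusion"])
    stub = configure_provider(StubProvider(gradient_accumulation_fusion=True), cli_args)
    print(stub.gradient_accumulation_fusion)
```

In this sketch the provider starts with fusion enabled, as it would when APEX is importable, and ends with the value the user asked for on the command line.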

Motivation (Blackwell / GB200, cu13)

During colocate training on GB200, the APEX fused-wgrad path (Megatron tensor_parallel/layers.py → fused_weight_gradient_mlp_cuda.wgrad_gemm_accum_fp32) raises CUDA error: CUBLAS_STATUS_NOT_INITIALIZED in the backward weight GEMM. APEX uses its own cuBLAS handle, which fails in the colocate process, while torch's cuBLAS in the same process works (the forward GEMM succeeds). --no-gradient-accumulation-fusion is the intended way to avoid the APEX path, but it had no effect under the bridge provider. With this change, wgrad routes through torch.matmul and a 2-node colocate run trains cleanly.
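The routing described above can be sketched as a toy dispatch in pure Python. This is not Megatron's implementation: matmul_t, broken_fused_wgrad, and weight_grad are hypothetical names, nested lists stand in for tensors, and the failing kernel is simulated by raising the error message quoted in this PR.

```python
def matmul_t(grad_output, total_input):
    """Return grad_output^T @ total_input for row-major nested lists.

    grad_output has shape [tokens, out_features] and total_input has
    shape [tokens, in_features]; the result is [out_features, in_features].
    """
    out_features = len(grad_output[0])
    in_features = len(total_input[0])
    grad_weight = [[0.0] * in_features for _ in range(out_features)]
    for g_row, x_row in zip(grad_output, total_input):
        for o, g in enumerate(g_row):
            for i, x in enumerate(x_row):
                grad_weight[o][i] += g * x
    return grad_weight


def broken_fused_wgrad(total_input, grad_output, main_grad):
    """Simulated fused kernel that fails the way the PR reports."""
    raise RuntimeError("CUDA error: CUBLAS_STATUS_NOT_INITIALIZED")


def weight_grad(total_input, grad_output, main_grad, fusion_enabled, fused_kernel):
    """Pick the weight-gradient path based on the fusion flag."""
    if fusion_enabled:
        # Fused path: the kernel accumulates into main_grad in place.
        fused_kernel(total_input, grad_output, main_grad)
        return None
    # Unfused path: an ordinary matmul, never touching the fused kernel.
    return matmul_t(grad_output, total_input)


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0]]  # two tokens, two input features
    g = [[1.0], [1.0]]            # two tokens, one output feature
    print(weight_grad(x, g, None, False, broken_fused_wgrad))
```

With fusion_enabled set to False the simulated kernel is never called, which mirrors why the flag has to reach the provider for the workaround to take effect.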

Test

Qwen3-0.6B 2-node colocate: with this change and --no-gradient-accumulation-fusion, train steps complete (no CUBLAS_STATUS_NOT_INITIALIZED).
No behavior change at the default: fusion is still used when it is enabled and APEX is available.

AI assistance was used to author this change; it has been reviewed.

…e provider The megatron.bridge GPT provider defaults gradient_accumulation_fusion to can_enable_gradient_accumulation_fusion() (True whenever APEX fused_weight_gradient_mlp_cuda imports) and ignores the CLI arg; propagate it so --no-gradient-accumulation-fusion takes effect on the bridge path. See PR description for the Blackwell/GB200 colocate CUBLAS_STATUS_NOT_INITIALIZED motivation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

aoshen02 force-pushed the fix/honor-gradient-accumulation-fusion branch from 467f0df to d412114 Compare June 1, 2026 00:22

Merge branch 'main' into fix/honor-gradient-accumulation-fusion

a63d626

aoshen02 wants to merge 2 commits into THUDM:main from aoshen02:fix/honor-gradient-accumulation-fusion
