fix(megatron): honor --[no-]gradient-accumulation-fusion on the megatron.bridge provider #1999

Open
aoshen02 wants to merge 2 commits into THUDM:main from aoshen02:fix/honor-gradient-accumulation-fusion

aoshen02 commented Jun 1, 2026

Contributor

Summary

When the model is built via megatron.bridge (--megatron-to-hf-mode bridge), the GPT provider sets gradient_accumulation_fusion from can_enable_gradient_accumulation_fusion(), which returns True whenever APEX's fused_weight_gradient_mlp_cuda is importable. As a result, the CLI flag --no-gradient-accumulation-fusion is silently ignored.

slime/backends/megatron_utils/model_provider.py builds the provider via bridge.to_megatron_provider(load_weights=False) and overrides several fields from args (parallelism, sequence_parallel, …) but not gradient_accumulation_fusion, so the flag is a no-op on the bridge path. This PR propagates it:

provider.gradient_accumulation_fusion = args.gradient_accumulation_fusion
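
To illustrate why the one-line override matters, here is a minimal, self-contained sketch. It does not import megatron.bridge or slime: `FakeProvider`, `apex_fused_wgrad_importable`, and `apply_cli_overrides` are hypothetical stand-ins for the real provider, for can_enable_gradient_accumulation_fusion(), and for the override code in model_provider.py.

```python
import argparse
from dataclasses import dataclass, field


def apex_fused_wgrad_importable() -> bool:
    """Hypothetical stand-in for can_enable_gradient_accumulation_fusion()."""
    return True  # pretend APEX's fused_weight_gradient_mlp_cuda imports fine


@dataclass
class FakeProvider:
    """Hypothetical stand-in for the provider returned by the bridge."""
    sequence_parallel: bool = False
    # The default comes from the environment probe, not from the CLI.
    gradient_accumulation_fusion: bool = field(
        default_factory=apex_fused_wgrad_importable
    )


def apply_cli_overrides(provider, args):
    """Copy CLI-controlled fields onto the provider (illustrative only)."""
    provider.sequence_parallel = args.sequence_parallel
    # The line this PR adds: without it the probe's default always wins.
    provider.gradient_accumulation_fusion = args.gradient_accumulation_fusion
    return provider


# Simulates running with --no-gradient-accumulation-fusion.
args = argparse.Namespace(sequence_parallel=True, gradient_accumulation_fusion=False)
provider = apply_cli_overrides(FakeProvider(), args)
print(provider.gradient_accumulation_fusion)
```

In this sketch, dropping the second assignment in `apply_cli_overrides` leaves the field at the probe's value, which mirrors the no-op behavior described above.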

Motivation (Blackwell / GB200, cu13)

In colocate training on GB200, the APEX fused-wgrad path (Megatron tensor_parallel/layers.py calling fused_weight_gradient_mlp_cuda.wgrad_gemm_accum_fp32) raises CUDA error: CUBLAS_STATUS_NOT_INITIALIZED in the backward weight GEMM. APEX uses its own cuBLAS handle, which fails in the colocate process, while torch's cuBLAS in the same process is fine (the forward GEMM succeeds). --no-gradient-accumulation-fusion is the intended way to avoid the APEX path, but it had no effect under the bridge provider. With this change, wgrad routes through torch.matmul and a 2-node colocate run trains cleanly.
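
For readers unfamiliar with the two wgrad paths, the sketch below shows in plain Python that they should produce the same accumulated gradient and differ only in how it is computed. This is an illustration of the arithmetic, not Megatron or APEX code: `wgrad_unfused` and `wgrad_fused` are hypothetical names, and nested lists stand in for tensors.

```python
def matmul_t(grad_out, inp):
    """Return grad_out^T @ inp for row-major nested lists."""
    rows, out_dim, in_dim = len(inp), len(grad_out[0]), len(inp[0])
    return [
        [sum(grad_out[r][o] * inp[r][i] for r in range(rows)) for i in range(in_dim)]
        for o in range(out_dim)
    ]


def wgrad_unfused(inp, grad_out, main_grad):
    """Two steps: materialize the weight gradient, then add it to main_grad."""
    grad_w = matmul_t(grad_out, inp)
    return [
        [m + g for m, g in zip(m_row, g_row)]
        for m_row, g_row in zip(main_grad, grad_w)
    ]


def wgrad_fused(inp, grad_out, main_grad):
    """One step: accumulate the GEMM result directly into main_grad in place."""
    for r in range(len(inp)):
        for o in range(len(main_grad)):
            for i in range(len(main_grad[0])):
                main_grad[o][i] += grad_out[r][o] * inp[r][i]
    return main_grad


inp = [[1.0, 2.0], [3.0, 4.0]]   # 2 tokens, input dim 2
grad_out = [[1.0], [0.5]]        # 2 tokens, output dim 1
unfused = wgrad_unfused(inp, grad_out, [[10.0, 20.0]])
fused = wgrad_fused(inp, grad_out, [[10.0, 20.0]])
print(unfused == fused)
```

Because the two paths agree numerically in this sketch, disabling fusion is expected to cost an extra temporary and an extra add rather than change training results.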

Test

  • Qwen3-0.6B 2-node colocate: with this change and --no-gradient-accumulation-fusion, train steps complete (no CUBLAS_STATUS_NOT_INITIALIZED).
  • No behavior change at the default (fusion is still used when enabled and APEX is available).
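
The default-versus-disabled behavior checked above can be sketched with a paired boolean flag. This is an assumption about how such a `--[no-]` flag could be declared with the standard library's argparse (Python 3.9+); the actual Megatron argument definition may differ.

```python
import argparse

parser = argparse.ArgumentParser()
# BooleanOptionalAction generates both --gradient-accumulation-fusion and
# --no-gradient-accumulation-fusion. The default of True is an assumption
# chosen to match the "still fuses by default" behavior described above.
parser.add_argument(
    "--gradient-accumulation-fusion",
    action=argparse.BooleanOptionalAction,
    default=True,
)

default_args = parser.parse_args([])
disabled_args = parser.parse_args(["--no-gradient-accumulation-fusion"])

print(default_args.gradient_accumulation_fusion)
print(disabled_args.gradient_accumulation_fusion)
```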

AI assistance was used to author this change; it has been reviewed.

…e provider

The megatron.bridge GPT provider defaults gradient_accumulation_fusion to
can_enable_gradient_accumulation_fusion() (True whenever APEX
fused_weight_gradient_mlp_cuda imports) and ignores the CLI arg; propagate it so
--no-gradient-accumulation-fusion takes effect on the bridge path. See PR
description for the Blackwell/GB200 colocate CUBLAS_STATUS_NOT_INITIALIZED motivation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
aoshen02 force-pushed the fix/honor-gradient-accumulation-fusion branch from 467f0df to d412114 on June 1, 2026 00:22