fix(megatron): honor --[no-]gradient-accumulation-fusion on the megatron.bridge provider #1999
Open
aoshen02 wants to merge 2 commits into
…e provider
The megatron.bridge GPT provider defaults gradient_accumulation_fusion to can_enable_gradient_accumulation_fusion() (True whenever APEX fused_weight_gradient_mlp_cuda imports) and ignores the CLI arg; propagate it so --no-gradient-accumulation-fusion takes effect on the bridge path. See PR description for the Blackwell/GB200 colocate CUBLAS_STATUS_NOT_INITIALIZED motivation.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
467f0df to d412114
Summary
When the model is built via megatron.bridge (--megatron-to-hf-mode bridge), the GPT provider sets gradient_accumulation_fusion from can_enable_gradient_accumulation_fusion() (True whenever APEX fused_weight_gradient_mlp_cuda is importable), silently ignoring the CLI --no-gradient-accumulation-fusion. slime/backends/megatron_utils/model_provider.py builds the provider via bridge.to_megatron_provider(load_weights=False) and overrides several fields from args (parallelism, sequence_parallel, …) but not gradient_accumulation_fusion, so the flag is a no-op on the bridge path. This change propagates the CLI value to the provider.
Motivation (Blackwell / GB200, cu13)
In colocate training on GB200, the APEX fused-wgrad path (Megatron tensor_parallel/layers.py → fused_weight_gradient_mlp_cuda.wgrad_gemm_accum_fp32) raises CUDA error: CUBLAS_STATUS_NOT_INITIALIZED in the backward weight GEMM. APEX uses its own cuBLAS handle, which fails in the colocate process, while torch's cuBLAS in the same process is fine (the forward GEMM succeeds). --no-gradient-accumulation-fusion is the intended way to avoid the APEX path, but it had no effect under the bridge provider. With this change, wgrad routes through torch.matmul and a 2-node colocate run trains cleanly.
Test
With --no-gradient-accumulation-fusion, train steps complete (no CUBLAS_STATUS_NOT_INITIALIZED).
AI assistance was used to author this change; it has been reviewed.
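For readers without the diff at hand, the propagation described in the Summary can be sketched roughly as below. This is an illustrative stand-in, not the code in model_provider.py: StubProvider, can_enable_fusion_stub, build_parser, and apply_cli_overrides are hypothetical names, the stub does not import megatron.bridge or APEX, and declaring the flag with argparse.BooleanOptionalAction is an assumption about how a --[no-] pair might be defined.

```python
import argparse
from dataclasses import dataclass, field


def can_enable_fusion_stub() -> bool:
    # Hypothetical stand-in for can_enable_gradient_accumulation_fusion();
    # here we simply pretend the APEX extension imported successfully.
    return True


@dataclass
class StubProvider:
    # Hypothetical stand-in for the object returned by
    # bridge.to_megatron_provider(load_weights=False).
    sequence_parallel: bool = False
    # Mirrors the behaviour described above: the default comes from the
    # capability check, not from the command line.
    gradient_accumulation_fusion: bool = field(default_factory=can_enable_fusion_stub)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Assumption: BooleanOptionalAction yields both --gradient-accumulation-fusion
    # and --no-gradient-accumulation-fusion (requires Python 3.9+).
    parser.add_argument(
        "--gradient-accumulation-fusion",
        action=argparse.BooleanOptionalAction,
        default=True,
    )
    parser.add_argument("--sequence-parallel", action="store_true")
    return parser


def apply_cli_overrides(provider: StubProvider, args: argparse.Namespace) -> StubProvider:
    # Fields that were already copied from args before the fix (illustrative).
    provider.sequence_parallel = args.sequence_parallel
    # The missing override: without this line the CLI flag is a no-op.
    provider.gradient_accumulation_fusion = args.gradient_accumulation_fusion
    return provider


if __name__ == "__main__":
    cli_args = build_parser().parse_args(["--no-gradient-accumulation-fusion"])
    provider = apply_cli_overrides(StubProvider(), cli_args)
    print(provider.gradient_accumulation_fusion)
```

In this sketch, removing the last assignment in apply_cli_overrides reproduces the reported symptom: the provider keeps the capability-check default regardless of what was passed on the command line.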