
Fix no-grad grad-fn lookup in ZeRO hook counting on PyTorch 2.3 (#7830) #7841

Open
tohtana wants to merge 2 commits into deepspeedai:master from tohtana:tohtana/fix-issue7830-zero3-grad-checkpointing-attrerror

Conversation

@tohtana
Collaborator

@tohtana tohtana commented Feb 10, 2026

Fixes #7830

In torch==2.3, _get_grad_fn_or_grad_acc can fail when it is called from a backward hook, where grad mode is disabled: the lookup ends up accessing next_functions on None and raises AttributeError, which breaks count_used_parameters_in_backward.
This PR wraps that lookup in torch.enable_grad() (matching the behavior of newer torch versions) and updates the unit test accordingly.
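
To illustrate the failure mode and the guard, here is a minimal sketch. It does not call PyTorch's private _get_grad_fn_or_grad_acc or any DeepSpeed code; `lookup_grad_acc` is a hypothetical helper that reproduces the view_as() lookup on a leaf parameter, on the assumption that this is the step that returns None when grad mode is off.

```python
import torch


def lookup_grad_acc(param: torch.Tensor):
    """Hypothetical helper (not DeepSpeed or PyTorch API).

    Returns the AccumulateGrad node of a leaf parameter by taking a
    temporary view and following its grad_fn. Grad mode is forced on so
    the view gets a grad_fn even if the caller is under no_grad().
    """
    with torch.enable_grad():
        view = param.view_as(param)
        return view.grad_fn.next_functions[0][0]


param = torch.nn.Parameter(torch.ones(3))

# With grad mode off, the view records no autograd history, so its
# grad_fn is None and `.next_functions` would raise AttributeError.
with torch.no_grad():
    unguarded_grad_fn = param.view_as(param).grad_fn

# With the guard inside the helper, the same lookup succeeds even
# though the caller is still inside a no_grad() region.
with torch.no_grad():
    node = lookup_grad_acc(param)

node_name = type(node).__name__
```

The guard is scoped to the lookup only, so the surrounding hook keeps running with whatever grad mode the caller set.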

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
PyTorch 2.3's _get_grad_fn_or_grad_acc can fail in backward hooks because
it calls view_as() without torch.enable_grad(). DeepSpeed invokes this from
count_used_parameters_in_backward(), so ZeRO stage 1/2/3 can hit
AttributeError: 'NoneType' object has no attribute 'next_functions'.

Wrap the lookup in torch.enable_grad() to match newer PyTorch behavior and
keep hook counting semantics intact. Update the unit test to validate grad
mode is enabled for this lookup path instead of swallowing AttributeError.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>


Development

Successfully merging this pull request may close these issues.

[BUG] AttributeError in ZeRO-3 with gradient_checkpointing: 'NoneType' object has no attribute 'next_functions' (DeepSpeed 0.18.5)
